thoughts on SHARE
This is my response to Library Journal’s ARL Launches Library-Led Solution to Federal Open Access Requirements, which I’m posting here as well because I spent a bit of time on it. Thanks for the heads up, Dorothea:
Let the nastygrams begin. http://t.co/7THbOt8AmZ
— Ondatra libskoolicus (@LibSkrat) June 13, 2013
In principle I like the approach that SHARE is taking, that of leveraging the existing network of institutional repositories, and the amazingly decentralized thing that is the Internet and the World Wide Web. Simply getting article content out on the Web, where it can be crawled, as Harnad suggests, has bootstrapped incredibly useful services like Google Scholar. Scholar works with the Web we have, not some future Web where we all share metadata perfectly using formats that will be preserved for the ages. They don’t use OpenURL, OAI-ORE, SWORD, etc. They do have lots o’ crawlers, and some magical PDF parsing code that can locate citations. I would like to see a plan that’s a bit scruffier and less neat.
Like Dorothea I have big doubts about building what looks to be a centralized system that will then push out to IRs using SWORD, and support some kind of federated search with OpenURL. Most IRs seem more like research experiments than real applications oriented around access, that could sustain the kind of usage you might see if mainstream media or a MOOC happened to reference their content. Rather than a 4 phase plan, with digital library acronym soup, I’d rather see some very simple things that could be done to make sure that federally funded research is deposited in an IR, and that it can be traced back to the grant that funded it. Of course, I can’t resist throwing out a straw man.
Requiring funding agencies to have a URL for each grant, which can be used in IRs seems like it would be the first logical step. Pinging that URL (kind of like a trackback) when there is a resource (article, dataset, etc) associated with the grant would allow the granting institution to know when something was published that referenced that URL. The granting organization could then look at its grants and see which ones lacked a deposit, and follow up with the grantees. They could also examine pingbacks to see which ones are legit or not. Perhaps further on down the line these resources could be integrated into web archiving efforts, but I digress.
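To make the straw man a little more concrete, here is a minimal sketch of the bookkeeping on the funder's side. Everything in it is an assumption made for illustration: the `GrantRegistry` class, the `ping` method and the example.gov/example.edu URLs are hypothetical, and a real service would receive pings over HTTP and vet them before trusting them.

```python
from dataclasses import dataclass, field


@dataclass
class Grant:
    """A grant identified by its URL (the URL scheme is hypothetical)."""
    url: str
    grantee: str
    pingbacks: list = field(default_factory=list)


class GrantRegistry:
    """In-memory stand-in for a funder's list of grants and received pings."""

    def __init__(self):
        self.grants = {}

    def add_grant(self, url, grantee):
        self.grants[url] = Grant(url, grantee)

    def ping(self, grant_url, resource_url):
        """Record that resource_url claims to result from grant_url."""
        grant = self.grants.get(grant_url)
        if grant is None:
            return False  # unknown grant URL: reject the ping
        if resource_url not in grant.pingbacks:
            grant.pingbacks.append(resource_url)
        return True

    def missing_deposits(self):
        """Grants that have not been pinged by any deposited resource."""
        return [g for g in self.grants.values() if not g.pingbacks]


registry = GrantRegistry()
registry.add_grant("https://grants.example.gov/grant/123", "University A")
registry.add_grant("https://grants.example.gov/grant/456", "University B")
registry.ping("https://grants.example.gov/grant/123",
              "https://ir.example.edu/articles/42")

# Grantees the agency would want to follow up with.
followups = [g.grantee for g in registry.missing_deposits()]
print(followups)
```

The point of the sketch is only that the funder needs very little: a stable URL per grant, a way to receive pings, and a query for grants with no pings yet.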
There would probably be a bit of curation of these pingbacks, but nothing a big Federal Agency can’t handle, right? I think putting data curation first, instead of last, as the icing on the 4 phase cake is important. I don’t underestimate the challenge in requiring a URL for every grant; perhaps some agencies already have them. I think this would put the onus on the Federal agencies to make this work, rather than the publishers (who, like it or not, have a commercial incentive to not make it too easy to provide open access) and universities (who must have a way of referencing grants if any of their plan is to work). This would be putting Linked Data first, rather than last, as rainbow sprinkles on the cake.
Sorry if this comes off as a bit ranty or incomprehensible. I wish Aaron were here to help guide us… It is truly remarkable that the OSTP memo was issued, and that we have seen responses from the ARL and the AAP. I hope we’ll see responses from the federal agencies that the memo was actually directed at.
recent Wikipedia citations as JSON
Here is a little webcast about some work in progress to stream recent citations out of Wikipedia. It uses previous work I did on the wikichanges Node library. Beware, I say “um” and “uh” a lot while showing you my terminal window. This idea could very well be brain damaged since it pings the Wikipedia API for the diff of each change in selected Wikipedias, to see if it contains one or more citations. On the plus side, it emits the citations as JSON, which is suitable for downstream apps of some dimensions, which I haven’t thought much about yet. Get in touch if you have some ideas.
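As a rough illustration of the "does this diff contain a citation" step, here is a small Python sketch. It is not the code from the webcast: it assumes you already have the text a change added, and it only recognizes simple, non-nested `{{cite ...}}` templates, which is a simplification of real wikitext. The `citations` function name and the sample text are made up for the example.

```python
import json
import re

# Matches simple {{cite web |...}} style templates. Nested templates and
# other citation styles are deliberately ignored in this sketch.
CITE = re.compile(r"\{\{\s*cite\s+(\w+)\s*\|([^{}]*)\}\}", re.IGNORECASE)


def citations(added_text):
    """Return a list of dicts, one per citation template found in the text."""
    found = []
    for kind, body in CITE.findall(added_text):
        fields = {}
        for part in body.split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                fields[key.strip()] = value.strip()
        found.append({"type": kind.lower(), "fields": fields})
    return found


added = ("Some new text.<ref>{{cite web |url=http://example.org/a "
         "|title=An Example}}</ref>")

# Emit one JSON object per line, which is easy for downstream apps to consume.
for citation in citations(added):
    print(json.dumps(citation, sort_keys=True))
```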
maps on the web with a bit of midlife crisis
TL;DR — I created a JavaScript library for getting GeoJSON out of Wikipedia’s API in your browser (and Node.js). I also created a little app that uses it to display Wikipedia articles for things near you that need a photograph/image or editorial help.
I probably don’t need to tell you how much the state of mapping on the Web has changed in the past few years. I was there. I can remember trying to get MapServer set up in the late 1990s, with limited success. I was there squinting at how Adrian Holovaty reverse engineered a mapping API out of Google Maps at chicagocrime.org. I was there when Google released their official API, which I used some, and then they changed their terms of service. I was there in the late 2000s using OpenLayers and TileCache, which were so much more approachable than MapServer was a decade earlier. I’m most definitely not a mapping expert, or even an amateur–but you can’t be a Web developer without occasionally needing to dabble, and pretend you are.
I didn’t realize until very recently how easy the cool kids have made it to put maps on the Web. Who knew that in 2013 there would be an open source JavaScript library that lets you add a map to your page in a few lines, and that it’s in use by Flickr, FourSquare, CraigsList, Wikimedia, the Wall Street Journal, and others? Even more astounding: who knew there would be an openly licensed source of map tiles and data, that was created collaboratively by a project with over a million registered users, and that it would be good enough to be used by Apple? I certainly didn’t even dream about it.
Ok, hold that thought…
So, Wikipedia recently announced that they were making it easy to use your mobile device to add a photograph to a Wikipedia article that lacked an image.
When I read about this I thought it would be interesting to see what Wikipedia articles there are about my current location, and which lacked images, so I could go and take pictures of them. Before I knew it I had a Web app called ici (French for here) that does just that:
Articles that need images are marked with little red cameras. It was pretty easy to add orange markers for Wikipedia articles that had been flagged as needing edits, or citations. Calling it an app is an overstatement: it is just static HTML, JavaScript and CSS that I serve up. HTML’s geolocation features and Wikipedia’s API (which has GeoData enabled) take care of the rest.
After I created the app I got a tweet from a real geo-hacker, Sean Gillies, who asked:
@edsu I’d love to help Wikipedia get some GeoJSON in their API results. Then you could use leafletjs.com/examples/geojs….
— Sean Gillies (@sgillies) May 8, 2013
Sean is right, it would be really useful to have a GeoJSON output from Wikipedia’s API. But I was on a little bit of a tear, so rather than figuring out how to get GeoJSON into MediaWiki and deployed to all the Wikipedia servers I wondered if I could extract ici’s use of the Wikipedia API into a slightly more generalized JavaScript library, that would make it easy to get GeoJSON out of Wikipedia–at least from JavaScript. That quickly resulted in wikigeo.js which is now getting used in ici. Getting GeoJSON from Wikipedia using wikigeo.js is done in just one line, and then adding the GeoJSON to a map in Leaflet can also be done in one line:
    geojson([-73.94, 40.67], function(data) {
      // add the geojson to a Leaflet map
      L.geoJson(data).addTo(map)
    });
This call results in the callback getting some GeoJSON data that looks something like:
    {
      "type": "FeatureCollection",
      "features": [
        {
          "id": "http://en.wikipedia.org/wiki/New_York_City",
          "type": "Feature",
          "properties": { "name": "New York City" },
          "geometry": { "type": "Point", "coordinates": [ -73.94, 40.67 ] }
        },
        {
          "id": "http://en.wikipedia.org/wiki/Kingston_Avenue_(IRT_Eastern_Parkway_Line)",
          "type": "Feature",
          "properties": { "name": "Kingston Avenue (IRT Eastern Parkway Line)" },
          "geometry": { "type": "Point", "coordinates": [ -73.9422, 40.6694 ] }
        },
        {
          "id": "http://en.wikipedia.org/wiki/Crown_Heights_–_Utica_Avenue_(IRT_Eastern_Parkway_Line)",
          "type": "Feature",
          "properties": { "name": "Crown Heights – Utica Avenue (IRT Eastern Parkway Line)" },
          "geometry": { "type": "Point", "coordinates": [ -73.9312, 40.6688 ] }
        },
        {
          "id": "http://en.wikipedia.org/wiki/Brooklyn_Children's_Museum",
          "type": "Feature",
          "properties": { "name": "Brooklyn Children's Museum" },
          "geometry": { "type": "Point", "coordinates": [ -73.9439, 40.6745 ] }
        },
        {
          "id": "http://en.wikipedia.org/wiki/770_Eastern_Parkway",
          "type": "Feature",
          "properties": { "name": "770 Eastern Parkway" },
          "geometry": { "type": "Point", "coordinates": [ -73.9429, 40.669 ] }
        },
        {
          "id": "http://en.wikipedia.org/wiki/Eastern_Parkway_(Brooklyn)",
          "type": "Feature",
          "properties": { "name": "Eastern Parkway (Brooklyn)" },
          "geometry": { "type": "Point", "coordinates": [ -73.9371, 40.6691 ] }
        },
        {
          "id": "http://en.wikipedia.org/wiki/Paul_Robeson_High_School_for_Business_and_Technology",
          "type": "Feature",
          "properties": { "name": "Paul Robeson High School for Business and Technology" },
          "geometry": { "type": "Point", "coordinates": [ -73.939, 40.6755 ] }
        },
        {
          "id": "http://en.wikipedia.org/wiki/Pathways_in_Technology_Early_College_High_School",
          "type": "Feature",
          "properties": { "name": "Pathways in Technology Early College High School" },
          "geometry": { "type": "Point", "coordinates": [ -73.939, 40.6759 ] }
        }
      ]
    }
There are options for broadening the radius, increasing the number of results, and fetching additional properties of the Wikipedia article such as article summaries, images, categories, templates used. Here’s an example using all the knobs:
    geojson(
      [-73.94, 40.67],
      {
        limit: 5,
        radius: 1000,
        images: true,
        categories: true,
        summaries: true,
        templates: true
      },
      function(data) {
        L.geoJson(data).addTo(map)
      }
    );
Which results in GeoJSON like this (abbreviated):
    {
      "type": "FeatureCollection",
      "features": [
        {
          "id": "http://en.wikipedia.org/wiki/Silver_Spring,_Maryland",
          "type": "Feature",
          "properties": {
            "name": "Silver Spring, Maryland",
            "image": "Downtown_silver_spring_wayne.jpg",
            "templates": [
              "-",
              "Abbr",
              "Ambox",
              "Ambox/category",
              "Ambox/small",
              "Basepage subpage",
              "Both",
              "Category handler",
              "Category handler/blacklist",
              "Category handler/numbered"
            ],
            "summary": "Silver Spring is an unincorporated area and census-designated place (CDP) in Montgomery County, Maryland, United States. It had a population of 71,452 at the 2010 census, making it the fourth most populous place in Maryland, after Baltimore, Columbia, and Germantown.\nThe urbanized, oldest, and southernmost part of Silver Spring is a major business hub that lies at the north apex of Washington, D.C. As of 2004, the Central Business District (CBD) held 7,254,729 square feet (673,986 m2) of office space, 5216 dwelling units and 17.6 acres (71,000 m2) of parkland. The population density of this CBD area of Silver Spring was 15,600 per square mile all within 360 acres (1.5 km2) and approximately 2.5 square miles (6 km2) in the CBD/downtown area. The community has recently undergone a significant renaissance, with the addition of major retail, residential, and office developments.\nSilver Spring takes its name from a mica-flecked spring discovered there in 1840 by Francis Preston Blair, who subsequently bought much of the surrounding land. Acorn Park, tucked away in an area of south Silver Spring away from the main downtown area, is believed to be the site of the original spring.\n\n",
            "categories": [
              "All articles to be expanded",
              "All articles with dead external links",
              "All articles with unsourced statements",
              "Articles to be expanded from June 2008",
              "Articles with dead external links from July 2009",
              "Articles with dead external links from October 2010",
              "Articles with dead external links from September 2010",
              "Articles with unsourced statements from February 2007",
              "Articles with unsourced statements from May 2009",
              "Commons category template with no category set"
            ]
          },
          "geometry": { "type": "Point", "coordinates": [ -77.019, 39.0042 ] }
        },
        ...
      ]
    }
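For readers who want to see the shape of the transformation outside of JavaScript, here is a small Python sketch that builds the same kind of FeatureCollection shown above. It is not wikigeo.js, and it does not call Wikipedia: it assumes you already have rows with `title`, `lat` and `lon` keys (roughly what a geosearch-style API response provides), and the `to_geojson` name is made up for this example.

```python
import json


def to_geojson(pages):
    """Turn rows with title/lat/lon into a GeoJSON FeatureCollection."""
    features = []
    for page in pages:
        features.append({
            # Article URL used as the feature id, as in the output above.
            "id": "http://en.wikipedia.org/wiki/" + page["title"].replace(" ", "_"),
            "type": "Feature",
            "properties": {"name": page["title"]},
            "geometry": {
                "type": "Point",
                # GeoJSON coordinates are ordered longitude, then latitude.
                "coordinates": [page["lon"], page["lat"]],
            },
        })
    return {"type": "FeatureCollection", "features": features}


rows = [
    {"title": "New York City", "lat": 40.67, "lon": -73.94},
    {"title": "770 Eastern Parkway", "lat": 40.669, "lon": -73.9429},
]
collection = to_geojson(rows)
print(json.dumps(collection, indent=2))
```

The one detail worth remembering is the coordinate order: GeoJSON puts longitude first, which is the reverse of how people usually say "lat, lon".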
I guess this is a long way of saying, if you want to put Wikipedia articles on a map, or otherwise need GeoJSON for Wikipedia articles for a particular location, take a look at wikigeo.js. If you do, and have ideas for making it better, please let me know. Oh, by the way you can npm install wikigeo and use it from Node.js.
I guess JavaScript, HTML5, NodeJS, CoffeeScript are like my midlife crisis…my red sports car. But maybe being the old guy, and losing my edge isn’t really so bad?
I’m losing my edge
to better-looking people
with better ideas
and more talent
and they’re actually
really, really nice.
— Jim Murphy
It definitely helps when the kids coming up from behind have talent and are really, really nice. You know?
Everything is Data
Reassembling the Social: An Introduction to Actor-Network-Theory by Bruno Latour
My rating: 4 of 5 stars
I picked this up because folks over on the Philosophy in a Time of Software kicked things off by discussing this book by Latour. So, I’m really not terribly knowledgeable about sociology, but I did a fair bit of reading in the social sciences while getting my library union card studying library/information science. So I wasn’t completely underwater, but I definitely felt like I was swimming in the deep end. I didn’t get the connection to computer programming until quite late in the book, but it was definitely a bit of a lightbulb moment when I did. Latour’s style (at least that of the unmentioned translator) is refreshingly direct, personal, and unabashedly opinionated. He spends much of the book describing just how complicated social science is, and how far it has gone off the tracks…which is quite entertaining at times.
A few things I will take with me from this book and its portrayal of Actor Network Theory:
I will never be able to say or write the word “social” without feeling like I’m glossing over a whole lot of stuff, and that this stuff is what I should actually be researching, talking and writing about. Latour stresses that it’s important not to dumb things down by appealing to established social forces (class, gender, imperialism, etc) but instead to trace the actors, their controversies, and their relations. This work requires discipline because it’s tempting to reduce the complexity by using these familiar abstractions instead of expending the effort to document the scenarios as faithfully as possible. That means letting the actors have a voice, and say what they think they are doing, rather than having the researcher tell the actors what they are actually doing. I work in libraries/archives, so I particularly liked Latour’s insistence on the importance of notebooks, writing, and documentation:
The best way to proceed at this point … is simply to keep track of all our moves, even those that deal with the very production of the account. This is neither for the sake of epistemic reflexivity nor for some narcissist indulgence into one’s own work, but because from now on everything is data: everything from the first telephone call to a prospective interviewee, the first appointment with the advisor, the first corrections made by a client on a grant proposal, the first launching of a search engine, the first list of boxes to tick in a questionnaire. In keeping with the logic of our interest in textual reports and accounting, it might be useful to list the different notebooks one should keep—manual or digital, it no longer matters much. p. 286.
… and that this is the work of “slowciology” — it requires you to slow down, and really describe/dig into things.
The other really interesting thing about this book for me was the insistence that social actors do not need to be human. It is fairly typical for social science research to focus on face-to-face interaction between people as the primary focus. Latour doesn’t dispute the importance of studying human actors, but emphasizes that it’s useful to increase the number of actors under study by studying objects (mediators) as actors. Typically we think of actors as having agency, free will, etc … but objects are typically complex things, with particular affordances, and extensive relations with other things in the field. You get only a very limited view of what is going on if you don’t trace these relations.
Things, quasi-objects, and attachments are the real center of the social world, not the agent, person, member, or participant—nor is it society or its avatars. (p. 237)
As a software developer, I really identified with Latour’s insistence on the role that objects play in our understanding of activities around us; how this view necessarily complicates things a great deal, and requires us to slow down to really understand/describe what is going on. It is hard work. And it’s only when we understand the various actors and their relations, the actual ones, not the abstract ones in the architecture diagram, or in the theory about the software, that we will be in a position to effectively change things or build anew.
#75
When taxes are too high,
people go hungry.
When the government is too intrusive,
people lose their spirit.

Act for the people’s benefit.
Trust them; leave them alone.
python heal thyself
.@adriarichards is currently getting doxed & threatened w/ violence. Search Twitter for her name & report abuse: bit.ly/Y82Ntx
— Gina Trapani (@ginatrapani) March 21, 2013
After seeing Gina’s tweet, I was curious to see if there was any difference by gender in the tweets directed at @adriarichards over the recent controversy at PyCon. I wasn’t confident I would find anything. It was more a feeble attempt to try to make Python make sense of something senseless that happened at PyCon; or to paraphrase Physician, heal thyself…for Python to heal itself.
I used twarc to collect 13,472 tweets that mentioned @adriarichards from the search API. I then added a utility filter that uses genderator to filter the line-oriented JSON based on a guess at the gender (Twitter doesn’t track it). genderator identified 2,433 (18%) tweets from women, 5,268 (39%) from men, and 5,771 (43%) that were of unknown gender. I then added another utility that reads a stream of tweets and generates a tag cloud as a standalone HTML file using d3-cloud.
I put them all together on the command line like this:
    % twarc.py @adriarichards
    % cat @adriarichards-20130321200320.json | utils/gender.py --gender male | utils/wordcloud.py > male.html
    % cat @adriarichards-20130321200320.json | utils/gender.py --gender female | utils/wordcloud.py > female.html
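The filtering step in that pipeline can be sketched in a few lines of Python. This is not the actual utils/gender.py: the `KNOWN_NAMES` table and `guess_gender` function are hypothetical stand-ins for a name-based guesser such as genderator, and the only thing assumed about the input is that each line is a tweet as JSON with a `user.name` field.

```python
import json

# Hypothetical lookup table standing in for a real name-based guesser.
KNOWN_NAMES = {"alice": "female", "bob": "male"}


def guess_gender(full_name):
    """Guess a gender from the first word of a display name."""
    parts = full_name.strip().lower().split()
    if not parts:
        return "unknown"
    return KNOWN_NAMES.get(parts[0], "unknown")


def filter_by_gender(lines, gender):
    """Yield only the JSON lines whose author matches the guessed gender."""
    for line in lines:
        tweet = json.loads(line)
        if guess_gender(tweet["user"]["name"]) == gender:
            yield line


sample = [
    json.dumps({"user": {"name": "Alice Example"}, "text": "a"}),
    json.dumps({"user": {"name": "Bob Example"}, "text": "b"}),
    json.dumps({"user": {"name": "xX_anon_Xx"}, "text": "c"}),
]
female = list(filter_by_gender(sample, "female"))
unknown = list(filter_by_gender(sample, "unknown"))
print(len(female), len(unknown))
```

Because the filter passes matching lines through untouched, it composes with other line-oriented tools on the command line in the way the pipeline above does.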
I realize word clouds probably aren’t the greatest way to visualize the differences in these messages. If you have better ideas let me know. I made the tweet JSON available if you want to try your own visualization.
Looking at these didn’t yield much insight. So instead of visualizing all the words that each gender used, I wondered what the clouds would look like if I limited them to words that were uniquely spoken by each gender. In other words, what words did males use in their tweets which were not used by females, and vice-versa. There were 1,333 (11%) uniquely female words, and 4,767 (39%) uniquely male words, with a shared vocabulary of 5,988 (50%) words.
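The "uniquely spoken" comparison is just set arithmetic, which a short Python sketch can show. The tokenizer here (lowercase, letters, digits and apostrophes) is an assumption and almost certainly differs from whatever the word cloud utility did; the function names are made up for the example. The last lines recompute the percentages from the counts reported above.

```python
import re


def vocabulary(tweets):
    """Return the set of lowercased words used across a list of tweet texts."""
    words = set()
    for text in tweets:
        words.update(re.findall(r"[a-z0-9']+", text.lower()))
    return words


def compare(female_tweets, male_tweets):
    """Split the combined vocabulary into female-only, male-only and shared."""
    female = vocabulary(female_tweets)
    male = vocabulary(male_tweets)
    return {
        "female_only": female - male,
        "male_only": male - female,
        "shared": female & male,
    }


def shares(counts):
    """Convert raw counts into whole-number percentages of their total."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


result = compare(["so sorry this happened"], ["this happened at pycon"])
print(sorted(result["shared"]))

# Counts taken from the paragraph above.
print(shares({"female_only": 1333, "male_only": 4767, "shared": 5988}))
```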
I’m not sure there is much more insight here either. I guess there is some weak comfort in the knowledge that 1/2 of the words used in these tweets were shared by both sexes.
emoji dick and mo tweets
The news about Emoji Dick (the version of Moby Dick translated into Emoji) being acquired by the Library of Congress prompted me to capriciously go to Twitter Search to see who was talking about it. As I drilled backwards I was surprised to see the search results went back to Fred Benenson’s original Tweet about the project.
I am paying 50 cents a sentence to convert from Herman Melville’s Moby Dick into Emoji on Amazon’s Mechanical Turk: http://ping.fm/1cVXy
— Fred Benenson (@fredbenenson) February 10, 2009
That Tweet is from 4 years ago!
Up until recently you could only search back a couple of weeks, tops. The only sad thing is that the Twitter Search API still seems to have the two week window. I used my little twarc utility to drill back in the search results via the API and the earliest it was able to find for the same query was from 2013-02-18.
Hopefully the search window for the API will be opened up at some point, since it is at least theoretically possible now. If you happen to know any of the details about how the search functionality works I would be most grateful to hear from you.
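Drilling back through search results boils down to repeatedly asking for tweets older than the oldest one seen so far, and stopping when a page comes back empty. The Python sketch below shows that loop without touching the network: `fake_search` and `FAKE_INDEX` are hypothetical stand-ins for the real search API, and the sketch is not twarc's actual code.

```python
import io
import json


def page_back(search, query):
    """Yield tweets newest-first, each time asking for ids below the oldest seen."""
    max_id = None
    while True:
        tweets = search(query, max_id)
        if not tweets:
            break  # the search window has run out
        yield from tweets
        max_id = min(t["id"] for t in tweets) - 1


def write_jsonl(tweets, fh):
    """Write one JSON document per line and return how many were written."""
    count = 0
    for tweet in tweets:
        fh.write(json.dumps(tweet) + "\n")
        count += 1
    return count


# A pretend search index holding ten tweets, newest first.
FAKE_INDEX = [{"id": i, "text": "tweet %d" % i} for i in range(10, 0, -1)]


def fake_search(query, max_id=None, page_size=4):
    """Return up to page_size tweets whose id is at most max_id."""
    matching = [t for t in FAKE_INDEX if max_id is None or t["id"] <= max_id]
    return matching[:page_size]


out = io.StringIO()
written = write_jsonl(page_back(fake_search, "emoji dick"), out)
print(written)
```

With a real API the loop ends wherever the service's window ends, which is why the earliest result depends on the API rather than on the client.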
Oh, and of course, I had to request Emoji Dick from the stacks:
PLEASE DO NOT REPLY TO THIS MESSAGE.
STATUS: Your request has been received.
REQUEST ID: 243106235
SEND TO: Adams Charge Station (LA 5244) - Staff
REQUEST RECEIVED: Mon Feb 25 12:56:19 EST 2013
TITLE: Emoji Dick ; or The Whale / by Herman Melville ; Edited and Compiled by Fred Benenson ; Translation by Amazon Mechanical Turk.
AUTHOR: Melville, Herman, 1819-1891.
CALL#: PS2384 .M6 2012
The one-time-cataloger in me thinks that there was a missed opportunity to add a uniform title to the LC catalog record…. But the title statement of responsibility mentioning that it is a translation made by Amazon Turk more than makes up for that!
Thanks Jay for letting me know what is going on at my own place of work.
brief note on Ernst
Although the traditional archive used to be a rather static memory, the notion of the archive in Internet communication tends to move the archive toward an economy of circulation: permanent transformations and updating. The so-called cyberspace is not primarily about memory as cultural record but rather about a performative form of memory as communication. Within this economy of permanent recycling of information, there is less need for emphatic but short-term, updatable memory, which comes close to the operative storage management in the von Neumann architecture of computing. Repositories are no longer final destinations but turn into frequently accessed sites. Archives become cybernetic systems. The aesthetics of fixed order is being replaced by permanent reconfigurability.
Wolfgang Ernst. “Archives in Transition.” Digital Memory and the Archive.
I was reading this and remembering Kevin Kelly’s idea of movage, and the idea of relay supporting archives from Janée et al. I really like the way Ernst works this idea into the way the Internet works, and the ways that the Web transforms the archival function. I’m only halfway through the book, and will likely have more to say when I finish it, so I’m just taking some notes for myself, carry on…
genealogy of a braeburn
It has been observed that when systems break down we get to actually see how they operate. I wonder what this breakage below says about the use of Freebase and Wikipedia data in Google’s Knowledge Graph.
Yes, that’s an image of Braeburn from My Little Pony to the right, and text about the apple to the left. Interestingly it’s fine at Wikipedia:
And it’s not even there in Freebase (according to a search).
I don’t know if this reveals what’s going on in the flow of entities between Wikipedia, Freebase and Google. But I thought it was interesting. I wonder where to report such an anomaly. Is there a place?
Thanks to Jeff Godin in #code4lib for noticing the breakage in Knowledge Graph.
See also Hilary Mason’s post about how her identity got mixed up on Bing. (Thanks Chris).
Update: 2012-02-04
I thought to check a week later, and the Knowledge Graph results got even funnier: now it’s a collage of apples and My Little Pony:

aaronsw
Aaron Swartz left us all a week ago. It’s strange, I only met Aaron once at the Internet Archive, and had a handful of conversations with him via email/irc … but not a day has passed since last Saturday that I haven’t thought about him, and his principled life.
I’ve been asked a few times why Aaron has been on my mind so much, and I’ve struggled to put it into words. Meanwhile, so many thoughtful things have been written about him. The arc of his life, his ideals, and abilities, charisma, and chutzpah, seem larger than life. And yet, he was just a person, a son, a friend, with people who loved him. It’s just heartbreaking.
I work as a software developer in libraryland, trying to bridge the world of information we’ve had with the world we are building on the Web. So for me, Aaron was a role model, a teacher whose lessons weren’t in textbooks or scholarly journals, but in his blog, in his code, in his talks, in his experiments with real world results. He was only 26 when he died, but he was, and remains, as Tim Berners-Lee paradoxically called him, a “wise elder”.
I wanted to write something here, but more than that I wanted to do something.
I noticed that Internet Archive created a collection devoted to online material related to Aaron, and thought I would try to collect together all the Twitter conversations that mention him. Twitter’s search is limited to the last week, so I quickly wrote a command line utility that pages through search results using their API, and writes out the complete data as line-oriented JSON. I also pulled in the tweets that mention #pdftribute since they were largely inspired by Aaron’s efforts in the open access space. I packaged up the data using BagIt and put it up at Internet Archive. Here’s the description from the bag-info.txt:
On January 11, 2013 the Internet activist Aaron Swartz took his own life, and a great deal of grief, anger, and constructive thinking erupted on the Web and in Twitter. In particular the #pdftribute Twitter tag was born, in an attempt to raise awareness about Open Access issues, that Aaron did so much to further during his life.
This package contains Twitter JSON data for two Twitter search queries that were collected in the week following Aaron’s death:
- “Aaron Swartz” OR aaronsw
- #pdftribute
aaronsw.json.gz contains 630,397 tweets, for the period starting with 2013-01-11 16:50:22 and ending 2013-01-18 13:50:02.
pdftribute.json.gz contains 42,277 tweets, for the period starting with Jan 13 02:42:26 and ending Jan 17 03:33:46.
In addition the URLs mentioned in the tweets found in aaronsw.tar.gz were extracted, unshortened, and then aggregated to provide a report of what people linked to. These URLs are available in aaronsw-urls.txt.gz.
It is hoped that this data will help document the Web community’s response to Aaron’s death, and life.
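The aggregation of links described in the package can be sketched in Python as a simple count over line-oriented tweet JSON. This is an illustration, not the script that produced aaronsw-urls.txt.gz: it assumes each tweet carries its links under `entities.urls` with an `expanded_url` field, and it skips the unshortening step entirely, since that requires following redirects over the network.

```python
import json
from collections import Counter


def count_urls(lines):
    """Count how often each linked URL appears across line-oriented tweet JSON."""
    counts = Counter()
    for line in lines:
        tweet = json.loads(line)
        for url in tweet.get("entities", {}).get("urls", []):
            # Prefer the expanded form, falling back to the shortened one.
            counts[url.get("expanded_url") or url["url"]] += 1
    return counts


sample = [
    json.dumps({"entities": {"urls": [
        {"url": "http://t.co/a", "expanded_url": "http://example.org/post"}]}}),
    json.dumps({"entities": {"urls": [
        {"url": "http://t.co/b", "expanded_url": "http://example.org/post"},
        {"url": "http://t.co/c", "expanded_url": None}]}}),
    json.dumps({"text": "no links here"}),
]
counts = count_urls(sample)
for url, n in counts.most_common(50):
    print(n, url)
```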
Below is a list of the top 50 links shared in tweets about Aaron. There were 36,506 in all.
There were 209,839 Twitter users that mentioned Aaron on Twitter in the last week. I was one of them. I wish I could’ve done more to help.
Fielding notes
I’ve been doing a bit of research into the design of the Web for a paper I’m trying to write. In my travels I ran across Jon Udell’s 2006 interview with Roy Fielding. The interview is particularly interesting because of Roy’s telling of how (as a graduate student) he found himself working on libwww-perl which helped him discover the architecture of the Web that was largely documented by Tim Berners-Lee’s libwww HTTP library for Objective-C.
For the purposes of note taking, and giving some web spiders some text to index, here are a few moments that stood out:
Udell: A little later on [in Roy's dissertation] you talk about how systems based on what you call control messages are in a very different category from systems where the decisions that get made are being made by human beings, and that that’s, in a sense, the ultimate rationale for designing data driven systems that are web-like, because people need to interact with them in lots of ways that you can’t declaratively define.
Fielding: Yeah, it’s a little bit easier to say that people need to reuse them, in various unanticipated ways. A lot of people think that when they are building an application that they are building something that’s going to last forever, and almost always that’s false. Usually when they are building an application the only thing that lasts forever is the data, at least if you’re lucky. If you’re lucky the data retains some semblance of archivability, or reusability over time.
…
Udell: There is a meme out there to the effect that what we now call REST architectural style was in a sense discovered post facto, as opposed to having been anticipated from the beginning. Do you agree with that or not?
Fielding: No, it’s a little bit of everything, in the sense that there are core principles involved that Berners-Lee was aware of when he was working on it. I first talked to Tim about what I was calling the HTTP Object Model at the time, which is a terrible name for it, but we talked when I was at the W3C in the summer of 95, about the software engineering principles. Being a graduate student of software engineering, that was my focus, and my interest originally. Of course all the stuff I was doing for the Web that was just for fun. At the time that was not considered research.
Udell: But did you at the time think of what you then called the HTTP object model as being in contrast to more API like and procedural approaches?
Fielding: Oh definitely. The reason for that was that the first thing I did for the Web was statistical analysis software, which turned out to be very effective at helping people understand the value of communicating over the Web. The second thing was a program called MOMSpider. It was one of the first Web spiders, a mechanism for testing all the links that were on the Web.
Udell: And that was when you also worked on libwww-perl?
Fielding: Right, and … at the time it was only the second protocol library available for the Web. It was a combination of pieces from various sources, as well as a lot of my own work, in terms of filling out the details, and providing an overall view of what a Web client should do with an HTTP library. And as a result of that design process I realized some of the things Tim Berners-Lee had designed into the system. And I also found a whole bunch of cases where the design didn’t make any sense, or the way it had been particularly implemented over at NCSA, or one of the other clients, or various history of the Web had turned out to be not-fitting with the rest of the design. So that led to a lot of discussions with the other early protocol developers particularly people like Rob McCool, Tony Sanders and Ari Luotonen–people who were building their own systems and understood both what they were doing with the Web, and also what complaints they were getting from their users. And from that I distilled a model of basically what was the core of HTTP. Because if you look back in the 93/94 time frame, the HTTP specification did not look all that similar to what it does now. It had a whole range of methods that were never used, and a lot of talk about various aspects of object orientation which never really applied to HTTP. And all of that came out of Tim’s original implementation of libwww, which was an Objective-C implementation that was trying to be as portable as possible. It had a lot of the good principles of interface separation and genericity inside the library, and really the same principles that I ended up using in the Perl library, although they were completely independently developed. It was just one of those things where that kind of interaction has a way of leading to a more extensible design.
Udell: So was focusing down on a smaller set of verbs partly driven by the experience of having people starting to use the Web, and starting to experience what URLs could be in a human context as well as in a programmatic context?
Fielding: Well, that was really a combination of things. One that’s a fairly common paradigm: if you are trying to inter-operate with people you’ve never met, try to keep it as simple as possible. There’s also just inherent in the notion of using URIs to identify everything, which is of course really the basis of what the Web is, provides you with that frame of mind where you have a common resource, and you want to have a common resource interface.
spotify vs rdio 2012
Back in August of 2011 I wrote a little utility that pulled down Alf Eaton’s Album of the Year data. AOTY is nice for two reasons: a) I like Alf’s taste in music, so the lists are relevant to me; and b) AOTY is a nice example of layering structured metadata into HTML, for easy processing (aka scraping). With the data in hand it was easy to check whether the albums were available on the streaming services Spotify and Rdio using their respective APIs. I was trying to decide which one to use at the time, and wanted to know if there was any significant difference in their catalogs.
Back then, it looked like 32% of the albums were available on Spotify, and 46% on Rdio. Alf has updated his list for 2012 so I decided to rerun aotycmp, and it appears that coverage on both has improved, with Spotify (41%) closing the gap a bit on Rdio (49%), which still has a comfortable lead. If you want the availability data I’ve updated it on Github.
I’ve been very happy with Rdio, although pieces like Damon Krukowski’s (thanks @dchud) make me wish there was a better way to a) stream music while b) actually putting money in the artists’ pockets. I’d love to have the ability to pay a little bit more if I knew it was going to help support the artist in creating more of their art.
Darth Nader
This may be a bad/shortlived idea, but as part of a New Year’s resolution to write more varied material I’m going to try to use my blog (partly) as a dream journal. This will probably drive the few readers I have away, but I’m hoping it might provide some amusement. I barely remember my dreams these days, and would like to remember more of them, so here goes. Feel free to file under TMI.
Walking into a cafe/restaurant in the morning, in what feels like New York, but I’m not sure…it could be any city. It’s a cosy, narrow setup, with all the seats taken by people quietly chatting. I manage to get a cup of coffee to go, and stand waiting for a table to open up. I discover a staircase and vaguely remember that there is seating upstairs. I go up the stairs carefully balancing my wide bowl-like cup of coffee.
The upstairs area is quite large and sprawling, dimly lit, with comfortable chairs, wider tables, and in the middle is a life sized sculpture of a woman in motion, looking behind, while walking–who apparently is the owner of the establishment. A hostess shows me to a table nearby, and says she can’t remember the name of the server, but that someone would be with me shortly. I sit down with my coffee.
After just a few minutes I notice that it feels like evening. There are lots of conversations going on nearby, which I’m able to hear fairly easily. One man in his early 30s is standing at his table, and in a kind of spotlight. He is talking quietly, as if on stage, not obviously on a cell phone, about a meeting that he has just had, and how they will need to travel to Austin, Texas to help protect some geographic area. I can’t remember the exact details of what he was saying but it is clear he is working for an organization that is trying to save some ecosystem features in Austin.
There is a bookshelf nearby with a disembodied head on it, which looks like Ralph Nader, and also a bit like Darth Vader when Luke takes his helmet off at the end of Return of the Jedi. The head is animated, and seems to be simulating the other half of the conversation. He is saying that this is important work, and is similar to a recent project in Seattle. The conversation ends, and the man walks out of the coffee shop.
I notice three other people with big, thick, Ginsbergian beards also leave their tables at the same time, deep in conversation about something different. There is a counter-culture, occupy-like feeling in the air, of people steadily working to make their corner of the world a better place. It’s a good feeling.
Afterword
Half awake I found myself thinking about the talking head, and how it reminded me of LibraryBox. It was as if the head made it possible to easily tune into public conversations that were going on in the local context of the coffee shop…and it served as an archive or store of these conversations for others to discover later. I don’t know if LibraryBox actually lets any of that happen, but it’s something I’ve been meaning to learn more about in the new year.
By the way, dream interpretations as comments are most welcome…
archiving tweets
If you are an active Twitter user you may have heard that you can now download your complete archive of tweets. The functionality is still being rolled out across the millions of accounts, so don’t be surprised if you don’t see the function yet in your settings.
The WSJ piece kind of joked about the importance of this move on Twitter’s part, which is a bit unfortunate, since it’s a pretty important issue. Yes, you can use third-party apps for downloading your Twitter data, but it says a lot when a company takes “archiving” seriously enough to offer it as a service to its users.
If you work in the digital preservation space it’s kind of fun to take a look at the way that Twitter makes these personal archives available. Luckily (if you don’t have the archive download button yet like me) Dave Winer has started collecting some archives, and making them publicly available for browsing and download off of S3. For example we can look at Sarah Bourne’s (who tipped me off to Dave’s work–thanks Sarah!). After you’ve downloaded the ZIP file you get a directory that looks like:
sarahebourne/
|-- css
|   `-- application.min.css
|-- data
|   |-- csv
|   |   |-- 2008_08.csv
|   |   |-- 2008_09.csv
|   |   |-- 2008_10.csv
|   |   |-- 2008_11.csv
|   |   |-- 2008_12.csv
|   |   |-- 2009_01.csv
|   |   |-- 2009_02.csv
|   |   |-- 2009_03.csv
|   |   |-- 2009_04.csv
|   |   |-- 2009_05.csv
|   |   |-- 2009_06.csv
|   |   |-- 2009_07.csv
|   |   |-- 2009_08.csv
|   |   |-- 2009_09.csv
|   |   |-- 2009_10.csv
|   |   |-- 2009_11.csv
|   |   |-- 2009_12.csv
|   |   |-- 2010_01.csv
|   |   |-- 2010_02.csv
|   |   |-- 2010_03.csv
|   |   |-- 2010_04.csv
|   |   |-- 2010_05.csv
|   |   |-- 2010_06.csv
|   |   |-- 2010_07.csv
|   |   |-- 2010_08.csv
|   |   |-- 2010_09.csv
|   |   |-- 2010_10.csv
|   |   |-- 2010_11.csv
|   |   |-- 2010_12.csv
|   |   |-- 2011_01.csv
|   |   |-- 2011_02.csv
|   |   |-- 2011_03.csv
|   |   |-- 2011_04.csv
|   |   |-- 2011_05.csv
|   |   |-- 2011_06.csv
|   |   |-- 2011_07.csv
|   |   |-- 2011_08.csv
|   |   |-- 2011_09.csv
|   |   |-- 2011_10.csv
|   |   |-- 2011_11.csv
|   |   |-- 2011_12.csv
|   |   |-- 2012_01.csv
|   |   |-- 2012_02.csv
|   |   |-- 2012_03.csv
|   |   |-- 2012_04.csv
|   |   |-- 2012_05.csv
|   |   |-- 2012_06.csv
|   |   |-- 2012_07.csv
|   |   |-- 2012_08.csv
|   |   |-- 2012_09.csv
|   |   |-- 2012_10.csv
|   |   |-- 2012_11.csv
|   |   `-- 2012_12.csv
|   `-- js
|       |-- payload_details.js
|       |-- tweet_index.js
|       |-- tweets
|       |   |-- 2008_08.js
|       |   |-- 2008_09.js
|       |   |-- 2008_10.js
|       |   |-- 2008_11.js
|       |   |-- 2008_12.js
|       |   |-- 2009_01.js
|       |   |-- 2009_02.js
|       |   |-- 2009_03.js
|       |   |-- 2009_04.js
|       |   |-- 2009_05.js
|       |   |-- 2009_06.js
|       |   |-- 2009_07.js
|       |   |-- 2009_08.js
|       |   |-- 2009_09.js
|       |   |-- 2009_10.js
|       |   |-- 2009_11.js
|       |   |-- 2009_12.js
|       |   |-- 2010_01.js
|       |   |-- 2010_02.js
|       |   |-- 2010_03.js
|       |   |-- 2010_04.js
|       |   |-- 2010_05.js
|       |   |-- 2010_06.js
|       |   |-- 2010_07.js
|       |   |-- 2010_08.js
|       |   |-- 2010_09.js
|       |   |-- 2010_10.js
|       |   |-- 2010_11.js
|       |   |-- 2010_12.js
|       |   |-- 2011_01.js
|       |   |-- 2011_02.js
|       |   |-- 2011_03.js
|       |   |-- 2011_04.js
|       |   |-- 2011_05.js
|       |   |-- 2011_06.js
|       |   |-- 2011_07.js
|       |   |-- 2011_08.js
|       |   |-- 2011_09.js
|       |   |-- 2011_10.js
|       |   |-- 2011_11.js
|       |   |-- 2011_12.js
|       |   |-- 2012_01.js
|       |   |-- 2012_02.js
|       |   |-- 2012_03.js
|       |   |-- 2012_04.js
|       |   |-- 2012_05.js
|       |   |-- 2012_06.js
|       |   |-- 2012_07.js
|       |   |-- 2012_08.js
|       |   |-- 2012_09.js
|       |   |-- 2012_10.js
|       |   |-- 2012_11.js
|       |   `-- 2012_12.js
|       `-- user_details.js
|-- img
|   |-- bg.png
|   `-- sprite.png
|-- index.html
|-- js
|   `-- application.min.js
|-- lib
|   |-- bootstrap
|   |   |-- bootstrap-dropdown.js
|   |   |-- bootstrap.min.css
|   |   |-- bootstrap-modal.js
|   |   |-- bootstrap-tooltip.js
|   |   |-- bootstrap-transition.js
|   |   |-- glyphicons-halflings.png
|   |   `-- glyphicons-halflings-white.png
|   |-- hogan
|   |   `-- hogan-2.0.0.min.js
|   |-- jquery
|   |   `-- jquery-1.8.3.min.js
|   |-- twt
|   |   |-- sprite.png
|   |   |-- sprite.rtl.png
|   |   |-- twt.all.min.js
|   |   `-- twt.min.css
|   `-- underscore
|       `-- underscore-min.js
`-- README.txt
So why is this interesting?
The Data
The archive includes data both as CSV and as JavaScript. The CSV is perfect for throwing into a spreadsheet, and doing stuff with it there. The JavaScript is actually a very light shim over some JSON data that is quite a bit richer than the CSV. The JavaScript shim is needed so that it can be used by the app that comes in the archive (more on that later). For example here’s a randomly picked tweet from Sarah:
@monkchips Ouch. Some regrets are harsher than others.
— Sarah Bourne (@sarahebourne) December 19, 2012
Here is how the Tweet shows up in the CSV:
"tweet_id","in_reply_to_status_id","in_reply_to_user_id","retweeted_status_id","retweeted_status_user_id","timestamp","source","text","expanded_urls"
"281405942321532929","281400879465238529","61233","","","2012-12-19 14:29:39 +0000","<a href=""http://janetter.net/"" rel=""nofollow"">Janetter</a>","@monkchips Ouch. Some regrets are harsher than others."
And here’s the archived JSON for the Tweet:
{
  "source" : "<a href=\"http://janetter.net/\" rel=\"nofollow\">Janetter</a>",
  "entities" : {
    "user_mentions" : [ {
      "name" : "James Governor",
      "screen_name" : "monkchips",
      "indices" : [ 0, 10 ],
      "id_str" : "61233",
      "id" : 61233
    } ],
    "media" : [ ],
    "hashtags" : [ ],
    "urls" : [ ]
  },
  "in_reply_to_status_id_str" : "281400879465238529",
  "geo" : { },
  "id_str" : "281405942321532929",
  "in_reply_to_user_id" : 61233,
  "text" : "@monkchips Ouch. Some regrets are harsher than others.",
  "id" : 281405942321532929,
  "in_reply_to_status_id" : 281400879465238529,
  "created_at" : "Wed Dec 19 14:29:39 +0000 2012",
  "in_reply_to_screen_name" : "monkchips",
  "in_reply_to_user_id_str" : "61233",
  "user" : {
    "name" : "Sarah Bourne",
    "screen_name" : "sarahebourne",
    "protected" : false,
    "id_str" : "16010789",
    "profile_image_url_https" : "https://si0.twimg.com/profile_images/638441870/Snapshot-of-sb_normal.jpg",
    "id" : 16010789,
    "verified" : false
  }
}
So there’s quite a bit more structured data in the archived JSON, including any geo coordinates, hashtags, and URLs mentioned. Also, the avatar images are still referenced out on the Web, where they can change, disappear, etc. It’s also interesting to compare the archived JSON against what you get back from the Twitter API for the same Tweet:
{
  "user": {
    "follow_request_sent": false,
    "profile_use_background_image": true,
    "default_profile_image": false,
    "id": 16010789,
    "verified": false,
    "profile_text_color": "080C0C",
    "profile_image_url_https": "https://si0.twimg.com/profile_images/638441870/Snapshot-of-sb_normal.jpg",
    "profile_sidebar_fill_color": "FCFAEF",
    "entities": {
      "url": {
        "urls": [
          {
            "url": "http://www.linkedin.com/in/sarahbourne",
            "indices": [ 0, 38 ],
            "expanded_url": null
          }
        ]
      },
      "description": {
        "urls": []
      }
    },
    "followers_count": 2367,
    "profile_sidebar_border_color": "FFFFFF",
    "id_str": "16010789",
    "profile_background_color": "DAE0D9",
    "listed_count": 331,
    "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/671143407/8544adf04bc3823d306c7f05efef2351.jpeg",
    "utc_offset": -18000,
    "statuses_count": 20090,
    "description": "Internet technology strategist, Accessibility and assistive technologies. Views expressed/implied are my own. See my Twitter lists for more interests.",
    "friends_count": 784,
    "location": "Boston, MA, USA",
    "profile_link_color": "800326",
    "profile_image_url": "http://a0.twimg.com/profile_images/638441870/Snapshot-of-sb_normal.jpg",
    "following": true,
    "geo_enabled": false,
    "profile_banner_url": "https://si0.twimg.com/profile_banners/16010789/1348096060",
    "profile_background_image_url": "http://a0.twimg.com/profile_background_images/671143407/8544adf04bc3823d306c7f05efef2351.jpeg",
    "screen_name": "sarahebourne",
    "lang": "en",
    "profile_background_tile": true,
    "favourites_count": 3147,
    "name": "Sarah Bourne",
    "notifications": null,
    "url": "http://www.linkedin.com/in/sarahbourne",
    "created_at": "Wed Aug 27 12:24:25 +0000 2008",
    "contributors_enabled": false,
    "time_zone": "Eastern Time (US & Canada)",
    "protected": false,
    "default_profile": false,
    "is_translator": false
  },
  "favorited": false,
  "entities": {
    "user_mentions": [
      {
        "id": 61233,
        "indices": [ 0, 10 ],
        "id_str": "61233",
        "screen_name": "monkchips",
        "name": "James Governor"
      }
    ],
    "hashtags": [],
    "urls": []
  },
  "contributors": null,
  "truncated": false,
  "text": "@monkchips Ouch. Some regrets are harsher than others.",
  "created_at": "Wed Dec 19 14:29:39 +0000 2012",
  "retweeted": false,
  "in_reply_to_status_id_str": "281400879465238529",
  "coordinates": null,
  "in_reply_to_user_id_str": "61233",
  "source": "<a href=\"http://janetter.net/\" rel=\"nofollow\">Janetter</a>",
  "in_reply_to_status_id": 281400879465238529,
  "in_reply_to_screen_name": "monkchips",
  "id_str": "281405942321532929",
  "place": null,
  "retweet_count": 0,
  "geo": null,
  "id": 281405942321532929,
  "in_reply_to_user_id": 61233
}
Using json-diff it’s not too difficult to see what the differences are between the archived version and the API version:
 {
+  favorited: false
+  contributors: null
+  truncated: false
+  retweeted: false
+  coordinates: null
+  place: null
+  retweet_count: 0
   entities: {
-    media: [
-    ]
   }
-  geo: {
-  }
+  geo: null
   user: {
+    follow_request_sent: false
+    profile_use_background_image: true
+    default_profile_image: false
+    profile_text_color: "080C0C"
+    profile_sidebar_fill_color: "FCFAEF"
+    entities: {
+      url: {
+        urls: [
+          {
+            url: "http://www.linkedin.com/in/sarahbourne"
+            indices: [
+              0
+              38
+            ]
+            expanded_url: null
+          }
+        ]
+      }
+      description: {
+        urls: [
+        ]
+      }
+    }
+    followers_count: 2367
+    profile_sidebar_border_color: "FFFFFF"
+    profile_background_color: "DAE0D9"
+    listed_count: 331
+    profile_background_image_url_https: "https://si0.twimg.com/profile_background_images/671143407/8544adf04bc3823d306c7f05efef2351.jpeg"
+    utc_offset: -18000
+    statuses_count: 20090
+    description: "Internet technology strategist, Accessibility and assistive technologies. Views expressed/implied are my own. See my Twitter lists for more interests."
+    friends_count: 784
+    location: "Boston, MA, USA"
+    profile_link_color: "800326"
+    profile_image_url: "http://a0.twimg.com/profile_images/638441870/Snapshot-of-sb_normal.jpg"
+    following: true
+    geo_enabled: false
+    profile_banner_url: "https://si0.twimg.com/profile_banners/16010789/1348096060"
+    profile_background_image_url: "http://a0.twimg.com/profile_background_images/671143407/8544adf04bc3823d306c7f05efef2351.jpeg"
+    lang: "en"
+    profile_background_tile: true
+    favourites_count: 3147
+    notifications: null
+    url: "http://www.linkedin.com/in/sarahbourne"
+    created_at: "Wed Aug 27 12:24:25 +0000 2008"
+    contributors_enabled: false
+    time_zone: "Eastern Time (US & Canada)"
+    default_profile: false
+    is_translator: false
   }
 }
To be fair, some of the user profile information has been normalized in the archive (perhaps to save space for the viewing application) out to a user_details.js file, which looks like:
{
  "screen_name" : "sarahebourne",
  "location" : "Boston, MA, USA",
  "full_name" : "Sarah Bourne",
  "bio" : "Internet technology strategist, Accessibility and assistive technologies. Views expressed/implied are my own. See my Twitter lists for more interests.",
  "id" : "16010789",
  "created_at" : "Wed Aug 27 12:24:25 +0000 2008"
}
Notably missing from this is a homepage for the user, their number of favourites, their number of friends, followers, whether geo is enabled, etc.
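The key-level comparison between the archived and API versions can also be approximated without json-diff. What follows is only a minimal sketch, not how json-diff works internally: the function names diffKeys and isPlainObject are made up for illustration, the two sample objects are heavily trimmed stand-ins for the real tweets, and it reports only keys that were added or removed, ignoring changed values (so the geo change from { } to null would not show up).

```javascript
// Returns true for {...} style objects, false for arrays, null and primitives.
function isPlainObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Recursively list keys present in only one of two parsed JSON objects.
// "- key" means the key is only in `a`, "+ key" means it is only in `b`.
function diffKeys(a, b, prefix) {
  prefix = prefix || "";
  const out = [];
  for (const key of Object.keys(a)) {
    if (!(key in b)) {
      out.push("- " + prefix + key);
    } else if (isPlainObject(a[key]) && isPlainObject(b[key])) {
      out.push(...diffKeys(a[key], b[key], prefix + key + "."));
    }
  }
  for (const key of Object.keys(b)) {
    if (!(key in a)) out.push("+ " + prefix + key);
  }
  return out;
}

// Trimmed-down stand-ins for the archived and API versions of the tweet.
const archivedTweet = {
  geo: {},
  entities: { media: [], urls: [] },
  user: { id: 16010789 }
};
const apiTweet = {
  geo: null,
  entities: { urls: [] },
  user: { id: 16010789, lang: "en" },
  retweet_count: 0
};

const keyDiff = diffKeys(archivedTweet, apiTweet);
console.log(keyDiff.join("\n"));
```

Run against the full archived and API JSON, a sketch like this should list the same added and removed keys as the json-diff output, just without the values.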
All these details aside, Twitter deserves a lot of credit for making the data available as CSV for ease of use, and also as JavaScript for programmatic use.
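Since the JavaScript files are only a light shim over JSON, it should be possible to read them without asking a JavaScript engine to evaluate them. Here is a hedged sketch of that idea. It assumes each file is a single assignment of the form `Grailbird.data.tweets_2012_12 = [ ... ]`, which is my guess at the shim and not something Twitter documents; parseShim is a made-up helper name and the sample string is a cut-down version of Sarah’s tweet.

```javascript
// Split a shim file into the assigned name and the parsed JSON payload.
// Assumption: the file is one assignment, `<name> = <JSON>`, optionally
// ending in a semicolon. The first "=" is taken to be the assignment.
function parseShim(source) {
  const eq = source.indexOf("=");
  if (eq === -1) {
    throw new Error("no assignment found in shim");
  }
  const name = source.slice(0, eq).trim();
  const json = source.slice(eq + 1).trim().replace(/;$/, "");
  return { name: name, data: JSON.parse(json) };
}

// A cut-down stand-in for the contents of a file like data/js/tweets/2012_12.js
const sampleShim =
  'Grailbird.data.tweets_2012_12 = [ { "id_str" : "281405942321532929", ' +
  '"text" : "@monkchips Ouch. Some regrets are harsher than others." } ]';

const parsedShim = parseShim(sampleShim);
console.log(parsedShim.name, parsedShim.data[0].id_str);
```

One thing to watch for either way: tweet ids like 281405942321532929 are larger than JavaScript can represent exactly as a number, so it is safer to rely on the id_str fields than on id.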
The Code
So the really, really neat thing about the archive is that it comes with a pure HTML, CSS and JavaScript application that you can open locally in your browser and view your archive. It looks pretty, for example here is Sarah’s archive that Dave Winer mounted up on S3. It even has a keyword search across all your tweets, which takes a bit of time (it interactively loads all your tweet JavaScript files mentioned above), but it works. You can zip the data up, give it to someone else, and it all just works.
The archive uses some third party libraries such as jQuery, Underscore, Twitter Bootstrap and Hogan, which all come minified and bundled statically in the archive. The application itself is called Grailbird and comes minified as well. Grailbird loads the static JavaScript (as needed) and displays it. The only network traffic I saw while it was running was fetching avatar images.
Assuming JavaScript backwards compatibility, and browser support for JavaScript, the Twitter archive’s contextual display for the underlying data could last a long, long time. At least that’s a possible interpretation based on David Rosenthal’s hypothesis about the Web’s effect on format obsolescence. I think it’s safe to say that this app written for the local Web platform is likely to last longer than a GUI application written in another language environment. The separation of code and data, and independence from a particular browser implementation are big wins. These are qualities that we all had to fight and work hard for on the Web, and I think it makes sense to re-purpose them here in an archival context.
I doubt anyone from Twitter has read this far, but if someone has, it would be great to see Grailbird show up with the other great stuff you have released to Github. I found myself wanting to quickly search across tweets looking for things, like geo-enabled tweets (to make sure that they are there). I could look at the minified Grailbird source in Chrome using developer tools, but it wasn’t good enough for me to figure out how to dynamically load data. I resorted to using NodeJS, and evaling the JavaScript files…and was able to confirm that there is geo data in the archives if you have it enabled. Here’s the simplistic script I came up with:
var fs = require('fs');

// the archived tweet files appear to assign their data into this global
var Grailbird = {data: {}};

// load all the tweet data
eval(fs.readFileSync("data/js/tweet_index.js", "utf8"));
for (var i = 0; i < tweet_index.length; i++) {
  eval(fs.readFileSync(tweet_index[i].file_name, "utf8"));
}

// look at each tweet and print out the date and geolocation if it's there
for (var slice in Grailbird.data) {
  for (var j = 0; j < Grailbird.data[slice].length; j++) {
    var tweet = Grailbird.data[slice][j];
    // geo can be empty ({}) in the archive, so guard before reading coordinates
    if (tweet.geo && tweet.geo.coordinates) {
      console.log(tweet.created_at + "," + tweet.geo.coordinates.join(","));
    }
  }
}
and the output for Jeremy Keith’s archive.
% node geo.js Fri Nov 30 13:08:33 +0000 2012,50.8262027605,-0.138112306595 Sat Nov 17 12:09:18 +0000 2012,54.6000387923,-5.9254288673 Fri Nov 16 22:32:03 +0000 2012,54.5925614526,-5.930852294 Thu Nov 15 13:35:35 +0000 2012,54.595909,-5.922033 Sat Nov 10 12:59:37 +0000 2012,50.825832,-0.142381 Fri Nov 09 13:54:51 +0000 2012,50.8262027605,-0.1381123066 Wed Nov 07 18:07:24 +0000 2012,50.825977,-0.138339 Tue Nov 06 16:58:49 +0000 2012,50.8378257671,-1.1800042739 Tue Oct 30 11:19:53 +0000 2012,50.8262027605,-0.1381123066 Thu Oct 18 17:51:22 +0000 2012,43.0733634985,-89.38608062 Tue Oct 16 17:29:20 +0000 2012,43.0872606735,-89.3659955263 Tue Oct 09 18:11:20 +0000 2012,40.7406891129,-74.0076184273 Sun Oct 07 14:27:50 +0000 2012,50.82906975,-0.126056 Sat Oct 06 16:29:30 +0000 2012,50.825832,-0.142381 Thu Oct 04 16:46:56 +0000 2012,50.8262027605,-0.1381123066 Tue Oct 02 17:46:42 +0000 2012,50.826646,-0.136921 Mon Oct 01 10:46:04 +0000 2012,50.8262027605,-0.1381123066 Mon Oct 01 10:43:46 +0000 2012,50.8262027605,-0.1381123066 Mon Oct 01 09:38:01 +0000 2012,50.8236703111,-0.1387184062 Mon Oct 01 08:53:15 +0000 2012,50.8236703111,-0.1387184062 Thu Sep 27 13:05:16 +0000 2012,59.915652,10.749959 Sun Sep 23 12:54:16 +0000 2012,50.8281663943,-0.128531456 Sat Sep 22 13:44:09 +0000 2012,50.87447886,0.017625 Thu Sep 20 13:16:11 +0000 2012,50.8262027605,-0.1381123066 Thu Sep 20 09:27:55 +0000 2012,50.8262027605,-0.1381123066 Mon Sep 17 07:51:20 +0000 2012,47.9952739036,7.8525775405 Sun Sep 16 09:01:28 +0000 2012,51.1599172667,-0.1787844393 Thu Sep 13 12:40:26 +0000 2012,50.822951,-0.136905 Tue Sep 11 18:41:47 +0000 2012,50.822746,-0.142274 Tue Sep 11 17:19:38 +0000 2012,50.822219,-0.140802 Tue Sep 11 13:05:59 +0000 2012,50.8262027605,-0.1381123066 Tue Sep 11 13:03:35 +0000 2012,50.8262027605,-0.1381123066 Tue Sep 11 12:48:51 +0000 2012,50.8262027605,-0.1381123066 Tue Sep 11 12:06:36 +0000 2012,50.8262027605,-0.1381123066 Tue Sep 11 08:23:00 +0000 2012,50.8262027605,-0.1381123066 Sun 
Sep 09 19:10:21 +0000 2012,50.826646,-0.136921 Tue Sep 04 17:33:44 +0000 2012,50.826646,-0.136921 Tue Sep 04 12:57:16 +0000 2012,50.822951,-0.136905 Mon Sep 03 16:03:37 +0000 2012,50.8262027605,-0.1381123066 Mon Sep 03 15:26:41 +0000 2012,50.8262027605,-0.1381123066 Sun Sep 02 19:40:38 +0000 2012,50.8229428584,-0.1390289018 Sun Sep 02 19:24:45 +0000 2012,50.8229428584,-0.1390289018 Sun Sep 02 19:08:55 +0000 2012,50.825977,-0.138339 Sun Sep 02 18:25:08 +0000 2012,50.825449,-0.137123 Sun Sep 02 17:04:15 +0000 2012,50.825449,-0.137123 Sun Sep 02 15:34:31 +0000 2012,50.8229428584,-0.1390289018 Fri Aug 31 17:33:20 +0000 2012,50.8291396274,-0.133923449 Fri Aug 31 09:20:04 +0000 2012,50.8311581116,-0.1335176435 Tue Aug 28 20:44:32 +0000 2012,41.8844650304,-87.6257600109 Mon Aug 27 13:57:24 +0000 2012,41.8844650304,-87.6257600109 Sat Aug 25 18:45:51 +0000 2012,41.8851594291,-87.6232355833 Wed Aug 22 12:32:45 +0000 2012,50.824415,-0.134691 Tue Aug 21 11:39:46 +0000 2012,50.8262027605,-0.1381123066 Mon Aug 20 11:01:28 +0000 2012,51.535132,-0.069309 Fri Aug 17 12:03:40 +0000 2012,50.8262027605,-0.1381123066 Sat Aug 11 16:08:13 +0000 2012,50.826646,-0.136921 Fri Aug 10 14:25:15 +0000 2012,50.8262027605,-0.1381123066 Wed Aug 08 11:51:45 +0000 2012,50.8262027605,-0.1381123066 Tue Aug 07 15:45:49 +0000 2012,50.8262027605,-0.1381123066 Fri Aug 03 16:38:55 +0000 2012,50.8262027605,-0.1381123066 Fri Aug 03 14:33:04 +0000 2012,50.8262027605,-0.1381123066 Sat Jul 28 14:57:52 +0000 2012,50.825449,-0.137123 Sat Jul 28 12:09:01 +0000 2012,50.828404,-0.137435 Thu Jul 26 17:17:22 +0000 2012,50.8266230357,-0.1367429505 Tue Jul 24 15:07:39 +0000 2012,50.8262027605,-0.1381123066 Mon Jul 23 12:25:35 +0000 2012,50.823104,-0.139515 Sat Jul 21 12:46:25 +0000 2012,50.827943,-0.136033 Fri Jul 20 13:21:41 +0000 2012,50.8262027605,-0.1381123066 Mon Jul 16 19:28:01 +0000 2012,50.825449,-0.137123 Sun Jul 15 10:48:44 +0000 2012,51.4714930776,-0.4883337021 Sat Jul 14 23:08:27 +0000 
2012,41.974037,-87.890239 Tue Jul 10 13:44:08 +0000 2012,30.2655234842,-97.7385378752 Mon Jul 09 19:32:48 +0000 2012,30.2655234842,-97.7385378752 Mon Jul 09 14:40:21 +0000 2012,30.2656095537,-97.7385592461 Sat Jul 07 15:08:12 +0000 2012,51.4726745412,-0.4817537462 Fri Jun 29 10:55:03 +0000 2012,50.8262027605,-0.1381123066 Wed Jun 20 10:23:29 +0000 2012,51.488197,-0.120692 Mon Jun 18 12:12:01 +0000 2012,50.8262027605,-0.1381123066 Mon Jun 18 12:02:43 +0000 2012,50.8262027605,-0.1381123066 Sat Jun 16 15:51:15 +0000 2012,50.8244773427,-0.1387893509 Sat Jun 16 15:10:29 +0000 2012,50.827972412,-0.136271402 Fri Jun 15 22:15:44 +0000 2012,50.947306,0.090209 Fri Jun 15 12:58:27 +0000 2012,50.947306,0.090209 Wed Jun 13 12:12:49 +0000 2012,50.822951,-0.136905 Mon Jun 11 14:05:50 +0000 2012,50.825977,-0.138339 Wed Jun 06 16:31:48 +0000 2012,51.50361668,-0.683839 Wed Jun 06 15:38:45 +0000 2012,51.50361668,-0.683839 Sat Jun 02 15:40:48 +0000 2012,50.825449,-0.137123 Fri Jun 01 13:29:40 +0000 2012,50.8262027605,-0.1381123066 Thu May 31 16:37:18 +0000 2012,50.8262027605,-0.1381123066 Wed May 30 14:58:46 +0000 2012,50.8262027605,-0.1381123066 Wed May 30 12:45:33 +0000 2012,50.8262027605,-0.1381123066 Wed May 30 12:32:27 +0000 2012,50.8262027605,-0.1381123066 Tue May 29 12:12:15 +0000 2012,50.8242644595,-0.1329624653 Tue May 29 08:12:24 +0000 2012,50.8307708894,-0.1330473622 Sun May 27 21:06:57 +0000 2012,47.5608179303,-52.70936785 Mon May 21 19:15:05 +0000 2012,50.824975,3.26387 Mon May 21 13:56:02 +0000 2012,51.0541040608,3.7238935404 Mon May 21 12:19:17 +0000 2012,51.055163,3.720835 Sat May 19 15:52:22 +0000 2012,50.821309,-0.1434404 Sat May 19 14:19:38 +0000 2012,50.822215,-0.154896 Sun May 13 14:08:33 +0000 2012,50.8244462443,-0.139321602 Sun May 13 13:29:30 +0000 2012,50.8192217888,-0.1411056519 Sat May 12 19:32:13 +0000 2012,50.820359,-0.14243 Sat May 12 17:51:57 +0000 2012,50.822623,-0.142676 Fri May 11 09:22:05 +0000 2012,52.366239,4.894655 Tue May 08 12:39:36 +0000 
2012,50.8287188784,-0.1423922896 Sun May 06 20:38:27 +0000 2012,50.871762,0.011501 Fri May 04 14:35:37 +0000 2012,50.8262027605,-0.1381123066 Thu May 03 16:03:52 +0000 2012,50.8262027605,-0.1381123066 Thu May 03 12:05:08 +0000 2012,50.8242644595,-0.1329624653 Wed May 02 12:43:38 +0000 2012,50.8262027605,-0.1381123066 Tue May 01 14:50:47 +0000 2012,50.8244094849,-0.1399479955 Tue May 01 13:17:36 +0000 2012,50.8262027605,-0.1381123066 Tue May 01 12:01:59 +0000 2012,50.826779,-0.138462 Tue May 01 11:22:41 +0000 2012,50.8262027605,-0.1381123066 Mon Apr 30 15:58:14 +0000 2012,50.8262027605,-0.1381123066 Fri Apr 27 17:26:19 +0000 2012,50.825449,-0.137123 Thu Apr 26 12:44:54 +0000 2012,50.8262027605,-0.1381123066 Tue Apr 24 11:30:25 +0000 2012,50.8262027605,-0.1381123066 Sat Apr 21 14:37:59 +0000 2012,50.8244773427,-0.1387893509 Wed Apr 18 11:05:28 +0000 2012,51.514461,-0.15415 Tue Apr 17 11:38:39 +0000 2012,50.8262027605,-0.1381123066 Mon Apr 16 17:28:09 +0000 2012,50.825449,-0.137123 Fri Apr 13 17:35:30 +0000 2012,50.825449,-0.137123 Fri Apr 13 11:39:01 +0000 2012,50.8262027605,-0.1381123066 Thu Apr 12 20:59:46 +0000 2012,50.8284865994,-0.1406764984 Thu Apr 12 20:43:24 +0000 2012,50.8284865994,-0.1406764984 Thu Apr 12 12:38:06 +0000 2012,50.8262027605,-0.1381123066 Wed Apr 04 17:35:46 +0000 2012,50.829236,-0.130433 Wed Apr 04 11:20:06 +0000 2012,50.8262027605,-0.1381123066 Wed Mar 28 19:51:57 +0000 2012,50.82533,-0.1371919 Wed Mar 28 17:41:06 +0000 2012,50.8266230357,-0.1367429505 Sat Mar 24 15:24:22 +0000 2012,50.82578,-0.139591 Sat Mar 24 14:42:14 +0000 2012,50.8244773427,-0.1387893509 Thu Mar 22 20:33:36 +0000 2012,50.821049,-0.140416 Thu Mar 15 16:00:20 +0000 2012,32.8975517297,-97.0442533493 Wed Mar 14 15:41:13 +0000 2012,30.265426,-97.740498 Tue Mar 13 19:52:43 +0000 2012,30.2647199679,-97.7443528175 Tue Mar 13 16:29:12 +0000 2012,30.2653850259,-97.7383099888 Mon Mar 12 02:03:53 +0000 2012,30.2669212002,-97.745683415 Sun Mar 11 17:45:31 +0000 
2012,30.2626071693,-97.739803791 Sun Mar 11 15:18:53 +0000 2012,30.2647199679,-97.7443528175 Fri Mar 09 15:11:51 +0000 2012,30.2671521557,-97.7396624407 Mon Mar 05 10:56:37 +0000 2012,50.8262027605,-0.1381123066 Thu Mar 01 09:55:16 +0000 2012,50.8304057758,-0.1329698575 Wed Feb 22 23:56:59 +0000 2012,-33.8782765912,151.221249511 Wed Feb 22 02:00:43 +0000 2012,-41.328228677,174.809947014 Thu Feb 16 01:13:27 +0000 2012,-41.2890508786,174.777774995 Wed Feb 15 21:39:06 +0000 2012,-41.2893031956,174.777374268 Wed Feb 15 18:50:42 +0000 2012,-41.2893031956,174.777374268 Wed Feb 15 02:10:18 +0000 2012,-41.29336192,174.776485 Mon Feb 13 04:07:07 +0000 2012,-41.2893031956,174.777374268 Mon Feb 13 03:36:49 +0000 2012,-41.2924914456,174.776140451 Mon Feb 13 03:00:13 +0000 2012,-41.293314,174.776395 Mon Feb 13 02:40:18 +0000 2012,-41.2934345895,174.775958061 Mon Feb 13 01:22:04 +0000 2012,-41.2939726591,174.775840044 Sat Feb 11 23:39:04 +0000 2012,-36.405247,174.65600431 Sat Feb 11 07:32:16 +0000 2012,-36.405247,174.65600431 Sat Feb 11 06:49:42 +0000 2012,-36.405247,174.65600431 Wed Feb 08 23:20:25 +0000 2012,-33.878302,151.221256 Sat Feb 04 11:14:52 +0000 2012,50.828205,-0.1378011703 Thu Feb 02 13:41:42 +0000 2012,50.8262027605,-0.1381123066 Wed Feb 01 16:57:16 +0000 2012,50.8262027605,-0.1381123066 Sat Jan 28 16:57:35 +0000 2012,50.827062,-0.135349 Sat Jan 28 15:55:49 +0000 2012,50.828295,-0.138769 Thu Jan 26 12:42:08 +0000 2012,50.8262027605,-0.1381123066 Mon Jan 23 12:34:45 +0000 2012,50.822219,-0.140802 Sun Jan 22 15:18:32 +0000 2012,50.825832,-0.142381 Sat Jan 21 14:27:51 +0000 2012,50.8213,-0.1409 Fri Jan 20 12:45:34 +0000 2012,51.9479484763,-0.5020558834 Thu Jan 19 20:49:09 +0000 2012,52.9556027724,-1.1504852772 Thu Jan 19 12:38:47 +0000 2012,52.954584773,-1.1563324928 Wed Jan 18 16:42:24 +0000 2012,52.954584773,-1.1563324928 Wed Jan 18 16:39:09 +0000 2012,52.954584773,-1.1563324928 Tue Jan 17 15:00:09 +0000 2012,50.8262027605,-0.1381123066 Mon Jan 16 10:03:12 +0000 
2012,50.8303548561,-0.1329055827 Sat Jan 14 16:11:55 +0000 2012,50.824838842,-0.1516896486 Wed Jan 11 21:07:19 +0000 2012,51.522789913,-0.0784921646 Wed Jan 11 19:27:24 +0000 2012,51.5237223711,-0.0770612686 Sat Jan 07 14:49:09 +0000 2012,50.824424,-0.138875 ... Fri Apr 09 01:52:12 +0000 2010,47.4412234282,-122.3010026978 Fri Apr 09 00:00:15 +0000 2010,47.4432422071,-122.3010595342 Thu Apr 08 01:29:11 +0000 2010,47.6873506139,-122.3341637453 Wed Apr 07 00:16:03 +0000 2010,47.6109922102,-122.3480262842 Sun Apr 04 18:47:33 +0000 2010,47.7083958758,-122.3272574643 Sat Apr 03 18:06:54 +0000 2010,47.6687063559,-122.3942997359 Sat Apr 03 18:05:00 +0000 2010,47.6687063559,-122.3942997359
I guess it's kind of scary that you can do this, and is perhaps why Twitter doesn't let you export anyone's account, even if it is public. But returning to the issue of Grailbird being on Github, I imagine there would be people that would write code that uses Grailbird as an API to the archive data, to provide extensions that would display a map of where you've been over time for example, or an analysis of your friendship network, or a view on hashtags you've used, events you've been at etc.
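As a rough idea of what one of those extensions might look like, here is a small sketch of a hashtag tally over an array of archived tweets. It assumes each hashtag entity carries a `text` field, as hashtag entities do in the Twitter API (the archived example above only shows an empty hashtags list, so I haven’t confirmed it for the archive). countHashtags is a made-up name and the sample tweets are invented.

```javascript
// Tally hashtags (case-insensitively) across an array of tweet objects.
// Assumption: tweet.entities.hashtags is a list of { text: "..." } objects.
function countHashtags(tweets) {
  const counts = {};
  for (const tweet of tweets) {
    const tags = (tweet.entities && tweet.entities.hashtags) || [];
    for (const tag of tags) {
      const key = tag.text.toLowerCase();
      counts[key] = (counts[key] || 0) + 1;
    }
  }
  return counts;
}

// Invented sample data, shaped like the archived JSON shown earlier.
const sampleTweets = [
  { entities: { hashtags: [ { text: "code4lib" }, { text: "Archives" } ] } },
  { entities: { hashtags: [ { text: "archives" } ] } },
  { entities: { hashtags: [] } },
  {}
];

const hashtagCounts = countHashtags(sampleTweets);
console.log(hashtagCounts);
```

The same loop over Grailbird.data in the script above could feed real tweets into a function like this.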
I think from an archival perspective, it would be really useful to be able to receive something like a Tweet archive from a donor, and overlay functionality on top of it. The model of using the Web as a local application platform for this sort of archival content seems like it could be a growth area.
Inside Out Libraries
Peter Brantley tells a sad tale about where public library leadership is at, as we plunge headlong into the ebook future, that has been talked about for what seems like forever, and which is now upon us. It’s not pretty.
The general consensus among participants was that public libraries have two, maybe three years to establish their relevance in the digital realm, or risk fading from the central place they have long occupied in the world’s literary culture.
The fact that a bunch of big-wigs invited by IFLA were seemingly unable to find inspiration and reason to hope that public libraries will continue to exist is not surprising in the least I guess. I’m not sure that libraries were ever the center of the world’s literary culture. But for the sake of argument let’s assume they were, and that now they’re increasingly not. Let us also assume that the economic landscape around ebooks is in incredible turmoil, and that there will continue to be sea changes in technologies, and people’s use of them in this area for the foreseeable future.
What can libraries do to stay relevant? I think part of the answer is: stop being libraries…well, sorta.
The HyperLocal
The most serious threat facing libraries does not come from publishers, we argued, but from e-book and digital media retailers like Amazon, Apple, and Google. While some IFLA staff protested that libraries are not in the business of competing with such companies, the library representatives stressed that they are. If public libraries can’t be better than Google or Amazon at something, then libraries will lose their relevance.
In my mind the thing that libraries have to offer, which these big corporations cannot, is authentic, local context for information about a community’s past, present and future. But in the past century or so libraries have focused on collecting mass produced objects, and sharing data about said objects. The mission of collecting hyper-local information has typically been a side task, that has fallen to special collections and archives. If I were invited to that IFLA meeting I would’ve said that libraries need to shift their orientation to caring more about the practices of archives and manuscript collections, by collecting unique, valued, at risk local materials, and adapting collection development and descriptive practices to the realities of more and more of this information being available as data.
As Mark Matienzo indicated (somewhat indirectly in Twitter) after I published this blog post, a lot of this work involves focusing less on hoarding items like books, and focusing more on the functions, services, and actions that public libraries want to document and engage with in their communities. Traditionally this orientation has been a strength area for archivists in their practice and theory of appraisal where:
… considerations … include how to meet the record-granting body’s organizational needs, how to uphold requirements of organizational accountability (be they legal, institutional, or determined by archival ethics), and how to meet the expectations of the record-using community. Wikipedia
I think this represents a pretty significant cognitive shift for library professionals, and would in fact take some doing. But perhaps that’s just because my exposure to archival theory in “library school” was pretty pathetic. Be that as it may here are some practical examples of growth areas for public libraries that I wish came up at the IFLA meeting.
Web Archiving
The Internet Archive and national libraries that are part of the International Internet Preservation Consortium don’t have the time, resources and often the mandate to collect web content that is of interest at the local level. What if the tooling and expertise existed for public libraries to perform some of this work, and to have the results fed into larger aggregations of web archives?
Municipality Reports and Data
Increasing amounts of data are being collected as part of the daily working of our local governments. What if your public library had the resources to be a repository for this data? Yeah, I said the R word. But I’m not suggesting that public libraries get the expertise to set up Fedora instances with Hydra heads, or something. I’m thinking about approaches to allowing data to easily flow into an organization, where it is backed up, and made available in a clearinghouse manner similar to public.resource.org on the Web, for search engines to pick up. Perhaps even services like LibraryBox offer another lens to look at the opportunities that lie in this area.
Born Digital Manuscript Collections
Public libraries should be aggressively collecting the “papers” of local people who have made significant contributions to their communities. Increasingly, these aren’t paper at all, but are born digital content. For example: email correspondence, document archives, digital photograph collections. I think that librarians and archivists know, in theory, that this born digital content is out there, but the reality is it’s not flowing into the public library/archive. How can we change this? Efforts such as Personal Digital Archiving are important for two reasons: they help set up the right conditions for born digital collections to be donated, and they also make professionals think about how they would like to receive materials so that they are easier to process. Think more things like AIMS, training and tooling for both professionals and citizens.
Licensing
It’s not unusual for archives and special collections to have all sorts of donor gift agreements that place restrictions on how their donated materials can be used. To some extent needing to visit the collection, request it, and not being able to leave the room with it, has mitigated some of this special-snowflakism. But when things are online things change a bit. We need to normalize these agreements so that content can flow online, and be used online in clearer ways. What if we got donors to think about Creative Commons licenses when they donated materials? How can we make sure donated material can become a usable part of the Web?
Persistence
We all know that things come and go on the Web. But it doesn’t need to be that way for everything on the Web. Libraries and archives have an opportunity to show how focusing on being a clearinghouse for data assets can allow for things to live persistently on the Web. Thinking about our URLs as identifiers for things we are taking care of is important. Practical strategies for achieving that are possible, and repeatable. What if public libraries were safe harbors for local content on the World Wide Web? This might sound hard to do, but I think it’s not as hard as people think.
Metrics
As libraries/archives make more local content available publicly on the Web it becomes important to track how this content is accessed and used online. Web analytics tools (Google Analytics) are quick wins for seeing what is being accessed and from where. Seeing how content is cited in social media applications like Facebook, Twitter, Pinterest and Wikipedia is important for reporting on the value of online collections. But encouraging professionals to use this information to become part of the conversations is equally important. Good metrics are also essential for collection development purposes, seeing what content is of interest, and what is not.
Inside Out Libraries
So, no I don’t think public libraries need a new open source Overdrive. The ebook market will likely continue to take care of itself. I also am not really convinced we need some overarching organization like the Digital Public Library of America to serve as a single point of failure when the funding runs dry. We need distributed strategies for documenting our local communities, so that this information can take its rightful place on the Web, and be picked up by Google so that people can find it when they are on the other side of the world. Things will definitely keep changing, but I think libraries and archives need to invest in the Web as an enduring delivery platform for information.
I’ve never been before but I was so excited to read the call for the European Library Automation Group (ELAG) this year.
The theme of this year’s conference is ‘The INSIDE-OUT Library’. This theme was chosen at last year’s conference, because we concluded:
- Libraries have been focusing on bringing the world to their users. Now information is globally available.
- Libraries have been producing metadata for the same publications in parallel. Now they are faced with deduplicating redundancy.
- Libraries have been selecting things for their users. Now the users select things themselves.
- Libraries have been supporting users by indexing things locally. Now everything is being indexed in global, shared indexes.
Instead of being an OUTSIDE-IN library, libraries should try and stay relevant by shifting their paradigm 180 degrees. Instead of only helping users to find what is available globally, they should also focus on making local collections and production available to the world. Instead of doing the same thing everywhere, libraries should focus on making unique information accessible. Instead of focusing on information trapped in publications, libraries should try and give the world new views on knowledge.
This blog post is really just a somewhat shabby rephrasing of that call. Maybe IFLA could use some of the folks on the ELAG program committee at their next meeting about the future of public libraries? Hopefully 2013 will be a year I can make it to ELAG.
I expect public libraries will continue to exist, but there isn’t going to be some magical technical solution to their problems. Their future will be forged by each local relationship they make, which leads to them better documenting their place on the Web. We may not call these places public libraries at first, but that’s what they will be.
linkrot: use your illusion
Mike Giarlo wrote a bit last week about the issues of citing datasets on the Web with Digital Object Identifiers (DOI). It’s a really nice, concise characterization of why libraries and publishers have promoted and used the DOI, and indirect identifiers more generally. Mike defines indirect identifiers as
… identifiers that point at and resolve to other identifiers.
I might be reading between the lines a bit, but I think Mike is specifically talking about any identifier that has some documented or ad-hoc mechanism for turning it into a Web identifier, or URL. A quick look at the Wikipedia identifier category yields lots of these, many of which (but not all) can be expressed as a URI.
The reason why I liked Mike’s post so much is that he was able to neatly summarize the psychology that drives the use of indirect identifier technologies:
… cultural heritage organizations and publishers have done a pretty poor job of persisting their identifiers so far, partly because they didn’t grok the commitment they were undertaking, or because they weren’t deliberate about crafting sustainable URIs from the outset, or because they selected software with brittle URIs, or because they fell flat on some area of sustainability planning (financial, technical, or otherwise), and so because you can’t trust these organizations or their software with your identifiers, you should use this other infrastructure for minting and managing quote persistent unquote identifiers
Mike goes on to get to the heart of the problem, which is that indirect identifier technologies don’t solve the problem of broken links on the Web, they just push it elsewhere. The real problem of maintaining the indirect identifier when the actual URL changes becomes someone else’s problem. Out of sight, out of mind … except it’s not really out of sight right? Unless you don’t really care about the content you are putting online.
We all know that linkrot on the Web is a real thing. I would be putting my head in the sand if I were to say it wasn’t. But I would also be putting my head in the sand if I said that things don’t go missing from our brick and mortar libraries. But still, we should be able to do better than 1/2 the URLs in arXiv going dead right? I make a living as a web developer, I’m an occasional advocate for linked data, and I’m a big fan of the work Henry Thompson and David Orchard did for the W3C analyzing the use of alternate identifier schemes on the Web…so, admittedly, I’m a bit of a zealot when it comes to promoting URLs as identifiers, and taking the Web seriously as an information space.
Mike’s post actually kicked off what I thought was a useful Twitter conversation (yes they can happen), which left me contemplating the future of libraries and archives on (or in) the Web. Specifically, it got me thinking that perhaps libraries and archives of the not too distant future will be places that take special care in how they put content on the Web, so that it can be accessed over time, just like a traditional physical library or archive. The places where links and the content they reference are less likely to go dead will be the new libraries and archives. These may not be the same institutions we call libraries today. Just like today’s libraries, these new libraries may not necessarily be free to access. You may need to be part of some community to access them, or to pay some sort of subscription fee. But some of them, and I hope most, will be public assets.
So how to make this happen? What will it look like? Rather than advocating a particular identifier technology I think these new libraries need to think seriously about providing Terms of Service documents for their content services. I think these library ToS documents will do a few things.
- They will require the library to think seriously about the service they are providing. This will involve meetings, more meetings, power lunches, and likely lawyers. The outcome will be an organizational understanding of what the library is putting on the Web, and the commitment they are entering into with their users. It won’t simply be a matter of a web development team deciding to put up some new website…or take one down. This will likely be hard, but I think it’s getting easier all the time, as the importance of the Web as a publishing platform becomes more and more accepted, even in conservative organizations like libraries and archives.
- The ToS will address the institution’s commitment for continued access to the content. This will involve a clear understanding of the URL namespaces that the library manages, and a statement about how they will be maintained over time. The Web has built in mechanisms for content moving from place to place (HTTP 301), and for when resources are removed (HTTP 410), so URLs don’t need to be written in stone. But the library needs to commit to how resources will redirect permanently to new locations, and for how long–and how they will be removed.
- The ToS will explicitly state the licensing associated with the content, preferably with Creative Commons licenses (hey I’m daydreaming here) so that it can be confidently used.
- Libraries and archives will develop a shared palette of ToS documents. Each institution won’t have its own special snowflake ToS that nobody reads. There will be some normative patterns for different types of libraries. They will be shared across consortia, and among peer institutions. Maybe they will be incorporated into, or reflect shared principles found in documents like ALA’s Library Bill of Rights or SAA’s Code of Ethics.
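To make the second point above a bit more concrete, here is a minimal sketch of how a library might decide what to say about a URL it is responsible for. The paths and the `MOVED` and `GONE` tables are made up for illustration; the only things taken from the Web itself are the status codes: 301 for content that has moved permanently, 410 for content that was removed on purpose.

```python
from http import HTTPStatus

# Hypothetical records of what happened to URLs the library once published.
MOVED = {"/old/findingaid/123": "/collections/123"}
GONE = {"/exhibits/temporary-2009"}


def resolve(path, live_paths):
    """Return (status, headers) for a requested path.

    301 tells clients and crawlers where the content lives now,
    410 says it was deliberately removed, 404 means we never knew it.
    """
    if path in MOVED:
        return HTTPStatus.MOVED_PERMANENTLY, {"Location": MOVED[path]}
    if path in GONE:
        return HTTPStatus.GONE, {}
    if path in live_paths:
        return HTTPStatus.OK, {}
    return HTTPStatus.NOT_FOUND, {}


if __name__ == "__main__":
    live = {"/collections/123"}
    for p in ["/old/findingaid/123", "/exhibits/temporary-2009", "/collections/123"]:
        status, headers = resolve(p, live)
        print(p, int(status), headers)
```

The point of keeping `MOVED` and `GONE` as explicit tables is that they become the record of the commitment: a ToS could say how long entries stay in them.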
I guess some of this might be a bit reminiscent of the work that has gone into what makes a trusted repository. But I think a Terms of Service between a library/archive and its researcher is something a bit different. It’s more outward looking, less interested in certification and compliance and more interested in entering into and upholding a contract with the user of a collection.
As I was writing this post, Dan Brickley tweeted about a recent talk Tony Ageh (head of the archive development team at the BBC) gave at the recent Economies of the Commons conference. He spoke about his ideas for a future Digital Public Space, and the role that archives and organizations like the BBC play in helping create it.
Things no longer ‘need’ to disappear after a certain period of time. Material that once would have flourished only briefly before languishing under lock and key or even being thrown away — can now be made available forever. And our Licence Fee Payers increasingly expect this to be the way of things. We will soon need to have a very, *very* good reason for why anything at all disappears from view or is not permanently accessible in some way or other.
That is why the Digital Public Space has placed the continuing and permanent availability of all publicly-funded media, and its associated information, as the default and founding principle.
I think Tony and Mike are right. Cultural heritage organizations need to think more seriously, and more long term about the content they are putting on the Web. They need to put this thought into clear, and succinct contracts with their users. The organizations that do will be what we call libraries and archives tomorrow. I guess I need to start by getting my own house in order eh?
level 0 linked archival data
TLDR; let’s see if we can share structured archival data better by adding HTML <link> elements that point at our EAD XML files.
A few weeks ago I attended a small meeting of DC museums, archives and libraries that were discussing what Linked Data means for Archives. Hillel Arnold and I took collaborative notes in Pirate Pad. For a good part of the time we went around the room talking about how we describe archival collections with various workflows using Encoded Archival Description (EAD), and how this was mostly working (or not).
Some good work has already been done imagining how Linked Data can transform archival description by the LOCAH (now Linking Lives) as well as the Social Networks and Archival Context project. I think tools like Editors’ Notes, CWRC Writer, and Google’s Research Pane could provide really useful models for how the work of an archivist could benefit from linking to external resources such as Wikipedia, dbpedia, VIAF, etc. But we really didn’t talk about that in too much detail. The focus instead was on various tools people used in their EAD workflows: Archivists’ Toolkit, Oxygen, ExistDB, Access databases, etc … and the hope that Archives Space could possibly improve matters. We did touch briefly on what it means to make finding aids available on the Web, but not in a very satisfactory way.
I was really struck by how everyone was using EAD, even if their tools were different. I was also left with the lingering suspicion that not much of this EAD data was linked to from the HTML presentation of the finding aid. After some conversations it was also my understanding that even after 20 years of work on EAD, there is not a listing of websites that make EAD finding aids available. It seems particularly sad that institutions have invested a lot of time and effort in putting EAD into practice, and yet we still aren’t really sharing them very well with each other.
So in a bit of a fit of frustration I did some hacking to see if I could use Google and ArchiveGrid to identify websites that serve up finding aids either as HTML or as EAD XML. I wanted to:
- Get a list of websites that made HTML and EAD XML finding aids available. We can rely on Google to index the Web, but maybe we could index the archival web a bit better ourselves if we had a better understanding of where the EAD data was available. The idea is that this initial list could be used to bootstrap a list of websites making EAD finding aids available in the Wikipedia entry for EAD.
- To see which websites have HTML representations that link to an EAD XML representation. The rationale here is to encourage a very simple best practice for linking to structured archival data when it is available. More on that below.
I was able to identify 201 hosts that served up finding aids either as HTML or XML. You should be able to see them here in this spreadsheet. I also collected URLs for finding aids (both HTML and XML) that I was able to locate, which can be seen in this JSON file.
With the URLs in hand I wrote a little script to examine which of the 156 hosts serving up HTML representations of finding aids had a link to an XML EAD document. I looked for a very simple kind of link that was popularized by the RSS and Atom syndication community for autodiscovery of blog feeds. A <link> tag that has a rel attribute of alternate and a type attribute set to application/xml. Out of the 156 websites serving up HTML representations of finding aids I could only find two websites that used this link pattern: Princeton University and Emory University.
For example if you view the HTML source for the Einstein Collection finding aid at Princeton you’ll see this link:
<link rel="alternate" type="application/xml" href="http://findingaids.princeton.edu/collections/C1022.xml" />
Similarly the finding aid for the Salman Rushdie collection at Emory University has this link:
<link rel="alternate" type="application/xml" href="/documents/rushdie1000/EAD/" />
As the title of this blog post suggests, I’m calling this pattern level 0 linked data. Linked Data purists would probably say this isn’t Linked Data at all since it doesn’t involve an RDF serialization. And I guess they would be right. But it does express a graph of HTML and EAD data that is linked, and it serves a real need. If you are interested in Linked Data and archives I encourage you to add these links to your HTML finding aids today.
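The check I described above can be sketched in a few lines of Python. This is not the script I actually ran, just a minimal illustration using only the standard library; the class and function names are invented for this example, and `example.org` stands in for a real finding aid host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EADLinkFinder(HTMLParser):
    """Collect hrefs from <link rel="alternate" type="application/xml">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.ead_urls = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are routed here as well.
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        mime = (a.get("type") or "").lower()
        href = a.get("href")
        if "alternate" in rels and mime == "application/xml" and href:
            # Resolve relative hrefs against the page's own URL.
            self.ead_urls.append(urljoin(self.base_url, href))


def find_ead_links(html, base_url):
    """Return absolute URLs of XML alternates advertised in the HTML."""
    finder = EADLinkFinder(base_url)
    finder.feed(html)
    return finder.ead_urls


if __name__ == "__main__":
    page = (
        '<html><head><link rel="alternate" type="application/xml" '
        'href="/documents/rushdie1000/EAD/" /></head><body></body></html>'
    )
    print(find_ead_links(page, "http://example.org/documents/rushdie1000/"))
```

Fetching the HTML is left out on purpose; any HTTP client will do, and the interesting part is how little it takes to discover the EAD once the link is there.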
So why are these links important?
The main reason is they are found in HTML documents, which are the representations that matter most on the Web. HTML documents are read by people. They are hypertext documents that link to and from other places on an archives website and elsewhere on the Web at large. They are well understood technically by the Web development community…if you hire a developer they might have strong feelings about using PHP or Ruby, but they will know HTML backwards and forwards. They are crawled and indexed by search engine bots so that researchers around the world can discover our collections. They are cited in social environments like Twitter, Facebook, blog posts, etc. We have a responsibility to create stable homes (URLs) for our archival descriptions that fit into the Web.
The other reason these links are important is that they make our investment in EAD visible on the Web for anyone who is looking. Nobody but ArchiveGrid actively crawls EAD XML data. They are the only ones that can find them, because they have been told where they are. If we did a better job of advertising the availability of our EAD documents I think we would see more tools and services around them. ArchiveGrid is a good example of the sort of tool that could be built on top of a web of EAD data. But what about archival collections in your local area? Perhaps it would be useful to have a service that let you look across the archival holdings of institutions in a consortium you belong to. Or perhaps you might want to create an alerting service that lets researchers know what new archival collections are being made available. Or maybe you need to collaborate with archives in a specific domain, and need tools that provide a custom experience for that distributed collection. I imagine there would be lots of ideas for apps if there were just a teensy bit more thought put into how finding aids (both the HTML and the XML) are put on the Web, and how we shared information about their availability.
Going forward I think HTML5 microdata and RDFa present some excellent opportunities for Linked Data representations of finding aids. Especially when you consider some of the vocabulary development being done around them; as well as some of the work being done by Tim Sherratt on using linked data to create new user experiences around archival data. But if your institution has already invested in creating EAD documents I think trying this link pattern with data you already have could be a good first step towards introducing linked data into your archive. I hope it is a first baby step that archives can take in merging some of the structured data found in the EAD XML document into the HTML they publish about their collections.
I’m planning on getting the list of EAD publishers into the Wikipedia article for EAD, and putting out a call for others to add their website if it is missing. I also think that a simple crawling and aggregation service that uses the links in some fashion could encourage more linking. A lot of this blog post has been mental preparation for my involvement in an IMLS funded project run out of Tufts that will be looking at Linked Archival Metadata, which is about to be kicked off this winter. If you’ve read this far, and have any thoughts or suggestions about this I’d enjoy hearing them either here, on Twitter or via email.
who creates the LCNAF (part 2)
I ended my A Look at Who Creates the LCNAF post with a hunch that the Library of Congress Name Authority File is increasingly supported by participants in the Name Authority Cooperative (NACO) rather than by the Library of Congress itself. It didn’t occur to me until a few days later that I missed a pretty obvious opportunity to graph the number of records created by LC compared with all the other members of the collective. So, here it is:
It looks like this has been a trend since about 1996 or so. I think it validates the cooperative aspect of the PCC and NACO. Not that it needs any validating. It’s just nice to see libraries and librarians working together to build something. I guess the name Library of Congress Name Authority File is also increasingly ironic…
Update: thanks to Kevin Ford (who emailed me privately) it seems that LC has been quite aware of this trend, and highlighted the event in 1996 when NACO members began contributing more records than LC with a press release.
Always Already New
Always Already New: Media, History, And The Data Of Culture by Lisa Gitelman
My rating: 3 of 5 stars
I enjoyed this book, mainly for the author’s technique of exploring what media means in our culture by using two examples, separated in time: the phonograph and the Internet. She admits that in some ways this amounts to comparing apples to oranges, and there is definitely a creative tension in the book. Gitelman’s emphasis is not that media technologies change society and culture, but that a technology is introduced and is in turn shaped by its particular social and historical context, which then reshapes society and culture.
I define media as socially realized structures of communication, where structures include both technological forms and their associated protocols, and where communication is a cultural practice, a ritualized collocation of different people on the same mental map, sharing or engaged with popular ontologies of representation. As such, media are unique and complicated historical subjects.
It’s tempting to talk about media technologies as if their ultimate use is somehow inevitable. For example, Gitelman discusses how the initial commercial placement of the phonograph centered largely around the idea that it would transform dictation and the office. Early demonstrations intended to increase sales of the device focused on recording and playback, rather than simply playback. They didn’t initially see the market for recorded music, which would so transform the device. To some extent we’ve cynically come to expect this out of marketing and “evangelism” about media technologies all the time. But this mode of thinking is also present in purely technical discussions, which don’t account for the placement of the technology in a particular social context.
Getting a sense of the social context you are in the middle of, as opposed to one you are historically removed from, presents some challenges. I think this difficulty is more evident in the second part of the book which focuses on the Internet and the World Wide Web against a backdrop of libraries and bibliography. Like many others I imagine, my knowledge of JCR Licklider’s influence on the development of ARPAnet, and the Internet was largely culled from Where Wizards Stay Up Late. I had no idea, until reading Always Already New, that Licklider contracted with the Council on Library Resources (now Council on Library and Information Resources) to write a report Libraries of the Future on the topic of how computing would change libraries.
I enjoyed the discussion of the role that the Request for Comment (RFC) played on the Internet. How these documents that were initially shared via the post, helped bootstrap the technologies that would create the Internet that allowed them to be shared as electronic documents or text. I didn’t know about the RFC-Online project that Jon Postel started right before his death, to recover the earliest RFCs that had been already lost. Gitelman’s study of linking, citation and “publishing” on the Web was also really enjoyable, mainly because of her orientation to these topics:
I will argue that far from making history impossible, the interpretive space of the World Wide Web can prompt history in exciting new ways.
All this being said, I finished the book with the sneaking feeling that I needed to reread it. Gitelman’s thesis was subtle enough that it was only when I got to the end that I felt like I understood it: the strange loop that thinking and media participate in, and how difficult (and yet fruitful) it is to talk about media and their social context. Maybe this was also partly the effect of reading it on a Kindle.
learning from people that do
Anil Dash recently wrote a nice piece about the need for what he calls a Hi-Tech Vo-tech in the technology sector. If you are not familiar with it already, Vo-Tech is shorthand in the US for vocational-technical schools, which provide focused training in specific areas, often on a part time basis. The Vo-Tech experience is markedly different from the typical 4 year university experience, which tends to be focused more on theory than practice.
I totally agree.
But if you are looking to work as a software developer, and to help build this amazing information space we call the World Wide Web, you don’t need to wait for this dream of a better high school curriculum for computer programming, or Hi-Tech Vo-Techs to come to your town. I don’t want to minimize the effort involved in finding your way into the workplace…it’s hard, especially when there is competition from “qualified” candidates, and the skill sets seem to be constantly shifting. But here are some relatively simple steps you can take to get started.
Look at Job Ads
Go to the Craigslist for your area, look at what jobs are available under the internet engineers and software / qa / dba sections. I suggest Craigslist because of their local flavor, and the low cost to advertise, which typically means the jobs are at smaller companies who are less interested in finding someone with the right college degree, and more interested in finding someone who can get things done. Look for jobs that focus on what you can do rather than schooling. Don’t apply for any of the jobs just yet. Note down the tools they want people to know: computer languages, operating systems, web frameworks, etc. Research them on Wikipedia. Focus on tools that seem to pop up a lot, are opensource, and can be downloaded and used without cost. You don’t need to do anything with them just yet though.
Go To User Group Meetings
I say opensource because opensource tools often have open communities around them. You should be able to find user groups in your area where people present on how they use these tools at their place of work. You might have to drive a while, or take a long bus/train ride — but it’s worth it. To find the meetings do some searches by technology and location on Meetup. Alternatively you can Google for whatever the technology is + “user group” + your area (e.g. Philadelphia) and go through a few pages of results. At a user group meeting you will not only learn about the details of the technology, but you will meet actual, real people who are using it. There are often subtle differences in the cultures and communities of practice around software tools. Some user groups will feel more comfortable than others. Pay attention to your gut reactions–they are indicators of how much you would like a job working with the technology, and the people who like it. If you get a bad vibe, don’t take it personally, try another meeting. Finding a job is often a matter of who you know, not what you know … and user groups are a great place to get to know people working in the software development field. There’s no online substitute for meeting people in real life.
Use Social Networks
At user group meetings you meet people who you can learn from. See if they have a blog, are on Twitter or Facebook. Maybe they use a social bookmarking tool you can follow. Or perhaps there are email discussion lists you can subscribe to. It’s not stalking, these people are your mentors, learn from them. Take a dip into sites like Hacker News or Programming Reddit. Watch the trends, you aren’t being a fanboy/girl, you are learning about what people care about in the field. Don’t feel bad if it’s overwhelming (it’s overwhelming to “experts” too), focus on what seems interesting. Also, cultivate your own online identity by posting stuff that you are interested in, or have questions about. Stay positive, and try not to bash things: people (and potential employers) are watching you the same way you are watching them.
Read
Sometimes the speakers at User Group meetings will also be authors of books. You will see books reviewed on sites like Hacker News. People you follow may mention the books they read, or have accounts on sites like GoodReads. See if a library or a bookstore has them, and go skim them. Buy or borrow the ones you like. Take notes about them online, so people can see your interests. Get a Google Reader account and follow blogs related to tools you would like to use. Look for tools that have approachable/readable tutorials. Try out the examples, and get a feel for how well the theory of the tutorial translates into practice. If tools don’t install or seem to work the way they are described, don’t feel like you did something wrong…move on to tools that work more smoothly, and fit your brain better. The benefit to focusing on opensource projects is that you will find more content about them online. You can read code. Reading the source code for Ruby or GoLang is definitely not for the faint of heart, though it’s nice you can do it. It’s more important that you look at code that uses these tools. Go to GitHub and see what projects there are that use the tool. Browse the source online, or clone the repositories to your workstation. See if you can help out with some low hanging fruit tasks in their issue queue.
Find a Niche
You are probably interested in things other than programming. For example I like libraries and archives, and the cultural heritage sector. I’ve found a virtual community of software developers in this area called code4lib, which helps me learn more about new projects and tools in the field, and is a way to get to know people. You may be surprised to find a similar community around something you are interested in: be it astrophysics, cartoons, music, maps, real estate, etc. If you don’t find one, maybe think about starting one up; you might be surprised by how many people turn up. Sometimes there are collaborative projects that need your help, like Wikipedia or OpenStreetMap, where the ability to automate mundane tasks is needed. You might not get paid for this work, but it will broaden your circle of contacts, deepen your technical skills, build your self-confidence, and give you something to put on your resume. The key thing that finding a niche can do is make your job search a bit easier, since technology skills cut across domains. You will also find that your niche has a particular set of tools that it likes to use. These typically aren’t hard and fast rules about using X instead of Y, but are norms. Pay attention to them, and learn about things that interest you.
Be Confident
I don’t mean to imply any of this is easy. It can be extremely difficult to get out of your comfort zone and explore things you don’t know. But you will be rewarded for your efforts, by learning from people who actually do things in the world. I’ve worked with some really excellent software developers who didn’t have a compsci degree, and some who I wasn’t even sure had graduated from high school. Sometimes I wonder if I even graduated from high school. So be confident in your ability to learn and do this thing we call software development. Show that you are humble about what you don’t know, and that you are hungry to learn it. Above all, don’t buy into the cult of the “real programmer” … she doesn’t exist. There are just people to learn from, and if you are doing it right, you never stop learning.









