Web Archives’ Photoshop Moment
Joy Reid is why we need multiple web archives.

Meme credit: Michael Nelson.

“Web archives are going to be weaponized to alter existing trustworthy information and to inject fake, untrustworthy information into the context.”

This was computer scientist Michael Nelson at the recent National Forum on Ethics & Archiving the Web, organized by Rhizome and Documenting the Now, and hosted at the New Museum. His words rang true; throughout the conference, panelists had spoken of archiving in high-stakes, adversarial environments where the content of web archives has serious effects on people’s lives, making them a ripe target for manipulation.

Indeed, only a few minutes previously, Ada Lerner had finished summarizing their paper (co-authored with Tadayoshi Kohno and Franziska Roesner) describing successful strategies for manipulating content held in the Internet Archive’s Wayback Machine, and thereby manipulating the historical record. (Lerner had shared their paper with the IA, and the organization acted quickly to address all of the potential compromises described therein.)

Video from Ethics & Archiving the Web. Lerner’s presentation starts at 1:14; Nelson’s at 1:43.

The concerns that Nelson, Lerner, and others raised would seem, then, to lend credence to Joy Reid’s recent claim that she may have been the victim of a Wayback Machine hacker. But in two blog posts yesterday, the Internet Archive and Nelson have cast serious doubt on that idea.

To back up for a moment: last December, Twitter user @Jamie_Maz unearthed a series of homophobic posts from Reid’s old blog. Subsequently, she apologized; the apology was largely well-received by liberal media outlets. Last week, though, @Jamie_Maz unearthed further posts from Reid’s blog using the Wayback Machine. These were far worse, and Reid denied responsibility, claiming that she was the victim of a malicious hacker, and that she had requested that the posts in question be removed from the Wayback Machine and Google.

Yesterday, the Internet Archive revealed that Reid’s lawyers had contacted them back in December, at the time of the original apology. Their response was unequivocal:

When we reviewed the archives, we found nothing to indicate tampering or hacking of the Wayback Machine versions. At least some of the examples of allegedly fraudulent posts provided to us had been archived at different dates and by different entities.

We let Reid’s lawyers know that the information provided was not sufficient for us to verify claims of manipulation. Consequently, and due to Reid’s being a journalist (a very high-profile one, at that) and the journalistic nature of the blog archives, we declined to take down the archives.

Reid and her lawyers apparently found a workaround, though: they added a robots.txt exclusion to the site. A robots.txt file is a short text file hosted on a given website which includes instructions to web crawlers, such as those used by Google and the Internet Archive to automatically capture content from the web. The handling of robots.txt exclusions has been another hot topic in web archiving, but the IA’s current policy is to stop replaying captures from the Wayback Machine if the live site disallows crawling. It’s one of the few ways in which websites can opt out of being archived.
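For readers unfamiliar with how such an exclusion is read, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. Everything specific in it is an assumption made for illustration: the robots.txt contents, the crawler name `example-archive-bot`, and the `blog.example.com` URLs are hypothetical, and none of it reflects what was actually placed on Reid's site or how the Internet Archive implements its replay policy. It only shows the crawler-side check that a robots.txt file is meant to drive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one (made-up) archive crawler entirely,
# and blocks every other crawler from /private/ only.
ROBOTS_TXT = """\
User-agent: example-archive-bot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    post = "https://blog.example.com/2006/post.html"  # hypothetical URL
    print(is_allowed(ROBOTS_TXT, "example-archive-bot", post))
    print(is_allowed(ROBOTS_TXT, "some-other-bot", post))
```

Note that a check like this only governs whether a well-behaved crawler fetches a page. Whether an archive also hides captures it already holds is a separate policy decision made by each archive.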

This meant that, for the general public, @Jamie_Maz’s recent claims were unverifiable. But, as Michael Nelson pointed out in another post yesterday, there is more than one web archive. He was able to source a number of the homophobic posts unearthed last week in the web archives of the Library of Congress, which does not follow the robots.txt removal policy.

Nelson concludes:

In summary, of the many examples that @Jamie_Maz provides, I can find five copies in the Library of Congress's web archive.  These crawls were probably performed on behalf of the Library of Congress by the Internet Archive (for election-based coverage); even though there are many different (and independent) web archives now, in 2006 the Internet Archive was pretty much the only game in town.  Even though these mementos are not independent observations, there is no plausible scenario for these copies to have been hacked in multiple web archives or at the original blog 10+ years ago. 

In short, as Nelson argues: this is why we need multiple web archives.

This post originally indicated that web captures are removed from the Wayback Machine if there is a robots.txt exclusion on the live version of a given site. It has been updated to reflect the Internet Archive’s policy to stop replaying such captures on the Wayback Machine, not to delete them.