Keeper of Expired Web Pages Is Sued Because Archive Was Used in Another Suit

July 13, 2005

Type: discussion

http://www.nytimes.com/2005/07/13/technology/13suit.html?eiP88&en77b4b470d4593e0&ex78907200&adxnnl=6&partner=rssnyt&emc=rss&adxnnlx21290900-rouEQzAGo8GqbZlwwjPI6A

By TOM ZELLER Jr.
Published: July 13, 2005

The Internet Archive was created in 1996 as the institutional memory of
the online world, storing snapshots of ever-changing Web sites and
collecting other multimedia artifacts. Now the nonprofit archive is on the
defensive in a legal case that represents a strange turn in the debate
over copyrights in the digital age.

Beyond its utility for Internet historians, the Web page database,
searchable with a form called the Wayback Machine, is also routinely used
by intellectual property lawyers to help learn, for example, when and how
a trademark might have been historically used or violated.

That is what brought the Philadelphia law firm of Harding Earley Follmer &
Frailey to the Wayback Machine two years ago. The firm was defending
Health Advocate, a company in suburban Philadelphia that helps patients
resolve health care and insurance disputes, against a trademark action
brought by a similarly named competitor.

In preparing the case, representatives of Harding Earley used the Wayback
Machine to turn up old Web pages - some dating to 1999 - originally posted
by the plaintiff, Healthcare Advocates of Philadelphia.

Last week Healthcare Advocates sued both the Harding Earley firm and the
Internet Archive, saying the access to its old Web pages, stored in the
Internet Archive's database, was unauthorized and illegal.

The lawsuit, filed in Federal District Court in Philadelphia, seeks
unspecified damages for copyright infringement and violations of two
federal laws: the Digital Millennium Copyright Act and the Computer Fraud
and Abuse Act.

"The firm at issue professes to be expert in Internet law and intellectual
property law," said Scott S. Christie, a lawyer at the Newark firm of
McCarter & English, which is representing Healthcare Advocates. "You would
think, of anyone, they would know better."

But John Earley, a member of the firm being sued, said he was not
surprised by the action, because Healthcare Advocates had tried to add
similar charges to its original suit against Health Advocate, but the
judge denied the motion. Mr. Earley called the action baseless, adding:
"It's a rather strange one, too, because Wayback is used every day in
trademark law. It's a common tool."

The Internet Archive uses Web-crawling "bot" programs to make copies of
publicly accessible sites on a periodic, automated basis. Those copies are
then stored on the archive's servers for later recall using the Wayback
Machine.

The archive's repository now has approximately one petabyte - roughly one
million gigabytes - worth of historical Web site content, much of which
would have been lost as Web site owners deleted, changed and otherwise
updated their sites.

The suit contends, however, that representatives of Harding Earley should
not have been able to view the old Healthcare Advocates Web pages - even
though they now reside on the archive's servers - because the company,
shortly after filing its suit against Health Advocate, had placed a text
file on its own servers designed to tell the Wayback Machine to block
public access to the historical versions of the site.

Under popular Web convention, such a file - known as robots.txt - dictates
what parts of a site can be examined for indexing in search engines or
storage in archives.

Most search engines program their Web crawlers to recognize a robots.txt
file, and follow its commands. The Internet Archive goes a step further,
allowing Web site administrators to use the robots.txt file to control the
archiving of current content, as well as block access to any older
versions already stored in the archive's database before a robots.txt file
was put in place.
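
As a rough illustration of the convention described above, the sketch below uses Python's standard urllib.robotparser module to show how a cooperating crawler might consult a robots.txt file before fetching a page. The file contents, the example.com URLs and the helper name is_allowed are hypothetical, and the user-agent name "ia_archiver" is an assumption made for illustration. It is not a description of how the Internet Archive's own software works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. The agent name "ia_archiver" and the
# paths are assumptions for illustration, not taken from the article.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/index.html"
    for agent in ("ia_archiver", "SomeSearchBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

The check runs entirely inside the crawler: the file states the site owner's wishes, and it is up to each robot to read the file and comply.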

But on at least two dates in July 2003, the suit states, Web logs at
Healthcare Advocates indicated that someone at Harding Earley, using the
Wayback Machine, made hundreds of rapid-fire requests for the old versions
of the Web site. In most cases, the robots.txt file blocked the request.
But in 92 instances, the suit states, it appears to have failed, allowing
access to the archived pages.

In so doing, the suit claims, the law firm violated the Digital Millennium
Copyright Act, which prohibits the circumventing of "technological
measures" designed to protect copyrighted materials. The suit further
contends that among other violations, the firm violated copyright by
gathering, storing and transmitting the archived pages as part of the
earlier trademark litigation.

The Internet Archive, meanwhile, is accused of breach of contract and
fiduciary duty, negligence and other charges for failing to honor the
robots.txt file and allowing the archived pages to be viewed.

Brewster Kahle, the director and a founder of the Internet Archive, was
unavailable for comment, and no one at the archive was willing to talk
about the case - although Beatrice Murch, Mr. Kahle's assistant and a
development coordinator, said the organization had not yet been formally
served with the suit.

Mr. Earley, the lawyer whose firm is named along with the archive,
however, said no breach was ever made. "We wouldn't know how to, in
effect, bypass a block," he said.

Even if they had, it is unclear that any laws would have been broken.

"First of all, robots.txt is a voluntary mechanism," said Martijn Koster,
a Dutch software engineer and the author of a comprehensive tutorial on
the robots.txt convention (robotstxt.org). "It is designed to let Web site
owners communicate their wishes to cooperating robots. Robots can ignore
robots.txt."

William F. Patry, an intellectual property lawyer with Thelen Reid &
Priest in New York and a former Congressional copyright counsel, said that
violations of the copyright act and other statutes would be extremely hard
to prove in this case.

He said that the robots.txt file is part of an entirely voluntary system,
and that no real contract exists between the nonprofit Internet Archive
and any of the historical Web sites it preserves.

"The archive here, they were being the good guys," Mr. Patry said,
referring to the archive's recognition of robots.txt commands. "They
didn't have to do that."

Mr. Patry also noted that despite Healthcare Advocates' desire to prevent
people from seeing its old pages now, the archived pages were once posted
openly by the company. He asserted that gathering them as part of fending
off a lawsuit fell well within the bounds of fair use.

Whatever the circumstances behind the access, Mr. Patry said, the sole
result "is that information that they had formerly made publicly available
didn't stay hidden."

Jessica Ivins