The Internet Archive will soon stop honoring robots.txt files

One of the fixtures of the modern web is the robots.txt file—a file intended to tell web-crawling robots which parts of a site are off-limits to them, for example to avoid indexing duplicate content or bandwidth-intensive large files. A number of search engines, such as Google, honor robots.txt restrictions, though there’s no technical reason they have to.
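To see how these restrictions work in practice, here is a minimal sketch using Python’s standard-library `urllib.robotparser` against a hypothetical robots.txt (the `example.com` URLs and the `/private/` rule are illustrative, not from any real site):

```python
from urllib import robotparser

# A hypothetical robots.txt that bars all crawlers from /private/
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)  # parse() accepts the file's lines directly

# A well-behaved crawler checks can_fetch() before requesting a URL
print(rp.can_fetch("*", "https://example.com/private/report.pdf"))  # False
print(rp.can_fetch("*", "https://example.com/about.html"))          # True
```

Note that nothing enforces this check: a crawler that skips `can_fetch()` can still request the disallowed URL, which is exactly why honoring robots.txt is a convention rather than a technical barrier.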
Until recently, the Internet Archive has also honored the instructions in robots.txt files—but that is about to change. On the Internet Archive’s announcement blog, Mark Graham explains that robots.txt rules written with search engines in mind are increasingly at odds with the site’s mission to archive the web as it was.
Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge […]

evolves with new website design, enhanced services

The granddaddy of internet public domain archives has recently been upping its game after years in more or less the same venerable, trusty, but slightly fusty format. As announced, “the new version of the site has been evolving over the past 6 months in response to the feedback we’ve received from thousands of […]

The post evolves with new website design, enhanced services appeared first on TeleRead: News and views on e-books, libraries, publishing and related topics.
