One of the fixtures of the modern web is the robots.txt file—a file that tells web-crawling robots which parts of a website are off-limits to them, so they avoid indexing duplicate content or fetching bandwidth-intensive large files. A number of search engines, including Google, honor robots.txt restrictions, though there’s no technical reason they have to.
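To see how a well-behaved crawler consults these restrictions, here is a minimal sketch using Python’s standard-library `urllib.robotparser`. The rules and the `MyCrawler` user agent are hypothetical, and the rules are parsed directly rather than fetched from a live site:

```python
# Sketch: checking hypothetical robots.txt rules with the standard library.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block everyone from two directories.
rules = """\
User-agent: *
Disallow: /private/
Disallow: /downloads/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler asks before fetching each URL.
print(parser.can_fetch("MyCrawler", "https://example.com/index.html"))   # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/data")) # False
```

Note that `can_fetch` is purely advisory: nothing stops a crawler from ignoring the answer, which is why honoring robots.txt is a policy choice rather than a technical constraint.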
Until now, the Internet Archive has also honored the instructions in robots.txt files—but that is about to change. On the Internet Archive’s announcement blog, Mark Graham explains that robots.txt’s search-indexing function is increasingly at odds with the site’s mission to archive the web as it was.
Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge…
Original URL: https://teleread.org/2017/04/24/the-internet-archive-will-soon-stop-honoring-robots-txt-files/