The first rule of web crawling is you do not harm the website. The second rule of web crawling is you do NOT harm the website. We’re supporters of the democratization of web data, but not at the expense of the website’s owners.
In this post we’re sharing a few tips for our platform and Scrapy users who want polite and considerate web crawlers.
Whether you call them spiders, crawlers, or robots, let’s work together to create a world of Baymaxs, WALL-Es, and R2-D2s rather than an apocalyptic wasteland of HAL 9000s, T-1000s, and Megatrons.
A polite crawler respects robots.txt
A polite crawler never degrades a website’s performance
A polite crawler identifies its creator with contact information
A polite crawler is not a pain in the buttocks of system administrators
robots.txt
Always make sure that your crawler follows the rules defined in the website’s robots.txt file. This file is usually available at the root of a website (www.example.com/robots.txt) and
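
To make the robots.txt check concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content, the "mybot" user agent, and the helper names build_parser and is_allowed are all made up for illustration; a real crawler would download the file from the target site instead of parsing a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# A real crawler would fetch it from the site's root instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "mybot" is a placeholder user agent name (an assumption).
for url in (
    "https://www.example.com/blog/",
    "https://www.example.com/private/page",
):
    verdict = "allowed" if is_allowed(parser, "mybot", url) else "disallowed"
    print(url, "->", verdict)

# Crawl-delay is a non-standard directive; crawl_delay() returns None
# when the file does not set one for this user agent.
print("crawl delay:", parser.crawl_delay("mybot"))
```

Scrapy users do not need to write this check by hand: setting ROBOTSTXT_OBEY = True in the project's settings.py tells Scrapy to consult robots.txt before it requests a page.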