Web scraping is the extraction of data from web pages. But most web pages aren’t designed to accommodate automated data extraction; instead, they’re designed to be easily read by humans, with colors and fonts and pictures and all sorts of junk. This makes web scraping tricky. There are two predominant techniques for web scraping: HTML parsing and browser automation.
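To make the first of those techniques a little more concrete, here is a minimal sketch of HTML parsing using Python's standard-library `html.parser` module. Everything specific in it is an assumption made up for illustration: the sample page, the `ListItemCollector` class, and the `extract_list_items` helper. A real scraper would fetch the page over the network and would probably lean on a more forgiving parsing library.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished <li> texts, in document order
        self._in_item = False  # True while the parser is inside an <li>
        self._buffer = []      # text fragments of the current <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_item = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item:
            self.items.append("".join(self._buffer).strip())
            self._in_item = False

    def handle_data(self, data):
        # Text outside <li> (headings, whitespace, other junk) is ignored.
        if self._in_item:
            self._buffer.append(data)


def extract_list_items(html):
    """Return the text of each <li> in the given HTML string."""
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up page: the data we want is buried among presentational markup.
SAMPLE_PAGE = """
<html><body>
  <h1 style="color: red">Prices</h1>
  <ul>
    <li>Apples: <b>3</b></li>
    <li>Pears: <b>5</b></li>
  </ul>
</body></html>
"""

print(extract_list_items(SAMPLE_PAGE))
```

The sketch does not handle nested lists or malformed markup, which is a fair preview of why this approach gets tricky on real pages.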
Before going on, I must confess a shameful secret: I don’t understand HTML very well. It’s just too ugly to get me interested. Every so often I’ll try to sit down and read about HTML, and I usually get bored and quit right around the time they get to unordered lists (<ul>). Why couldn’t they just use S-expressions? Do the brackets and explicit close tags actually add anything? Whatever, it doesn’t matter. The bottom line is that I hate dealing with HTML and I’d prefer to avoid it if I possibly can.