I launched my startup this week. It was on a very modest scale, with no debauched parties or penthouse offices on scads of VC money, just me and my laptop in my dad’s living room, popping open a bottle of homemade cider to mark the event.
Language Spy is the tangible fruit of a seven- or eight-year side project, creating a searchable corpus of political language. It’s driven by a pair of Raspberry Pis doing the number crunching and uploading data to Google Cloud Storage buckets, from which the site is served by Google App Engine.
You can see events unfolding through the words used about them, for example the correlation between “Hillary Clinton” and “Email” in the last week of US politics. If, like me, you’re a news junkie, it’s compelling viewing.
Unfortunately, if you click the link as I write this, you’ll see a Google App Engine quota exceeded message. The site won’t work, because I have reached the point at which I can’t afford the traffic it’s serving and it has exceeded my daily budget.
This traffic spike would be no problem if it were generated by real site users, as then I’d be able to monetise the traffic, but sadly it isn’t. Instead it’s generated by Googlebot. That’s right, being indexed by a search engine has taken my site down. The bot looks at the site, decides it’s on some very fast infrastructure, and issues millions of requests per hour.
When I examine my problem, it becomes clear that it has several aspects:
- It’s a language analysis site, so it has a *lot* of pages for the spider to crawl.
- Because it’s a language analysis site, there are no pieces of language I can exclude using robots.txt, so I can’t reduce the load by conventional means. How do you decide which pieces of language are more important than others? You can’t, at least not when your aim is to have it all open for analysis.
- I can’t tell Google to slow down a little. Where I’d expect to be able to do this in Webmaster Tools, I get the message “Your site has been assigned special crawl rate settings. You will not be able to change the crawl rate.” I see this as the sticking point. If I could restrict Googlebot’s rate, I’d be able to keep the site running and take the hit of not being so well indexed.
This means that the spider is eating through my daily Google App Engine quota very quickly indeed. I find myself racking up a hundred instance hours in a very short time as GAE spins up loads of new instances to deal with the spider. Pretty soon the site hits its daily quota and goes down. I could keep it going by feeding in more money, but I’d have to put hundreds of dollars a day into it with no end in sight, and I am not made of money.
Right now my only hope lies with a crawl issue report I filed with the Webmaster Tools team. If they can give me control over my indexing rate, I’ll be good to go. But I can’t say when they’ll come back to me, if ever, so I may just have to come up with a Plan B.
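One shape such a Plan B could take is throttling the crawler inside the application itself. The Python below is a minimal sketch under stated assumptions, not what Language Spy actually runs. The names `TokenBucket`, `is_crawler` and `throttle`, the rate and capacity values, and the one-hour `Retry-After` are all hypothetical. It leans on the documented behaviour that Googlebot treats 503 and 429 responses as a signal to back off, and it identifies the bot by a simple User-Agent match, which can be spoofed.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Refill by elapsed time, then spend one token if one is available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def is_crawler(user_agent):
    # Naive assumption: trust the User-Agent header as sent.
    return "googlebot" in (user_agent or "").lower()


def throttle(user_agent, bucket, now=None):
    """Return (status, headers): 503 for over-budget crawler hits, else 200."""
    if is_crawler(user_agent) and not bucket.allow(now):
        # Hypothetical back-off hint of one hour.
        return 503, {"Retry-After": "3600"}
    return 200, {}
```

A request handler would call `throttle(request_user_agent, bucket)` before doing any expensive work and return early on a 503. Two caveats apply: the bucket lives in one process, so on App Engine each instance would keep its own count unless the state were shared, and serving 503s for a long stretch risks pages dropping out of the index, which is the same trade-off as accepting a slower crawl.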
Is there a moral to this story? Perhaps it’s a cautionary tale for a small startup tempted to use cloud hosting. Google Cloud Storage has proved very cost-effective for a huge language database, but the sting in the tail has turned out to be how Googlebot behaves when it sees a cloud server and how per-instance billing on App Engine handles unexpected traffic surges. The fact that it’s Google who are causing me to use up my budget with Google is annoying but not sinister. However, being given the option neither to limit my GAE instance count nor to slow down the crawl rate doesn’t leave me the happiest of customers.
So yes, I’ve launched a startup. It’s live for an hour or two a day while it has budget, in the morning UK time. Perhaps that will be my epitaph.