As part of a recent HumanGeo effort, I was faced with the challenge of detecting patterns and anomalies in large geospatial datasets using various statistics and machine learning methods. Given the size of the datasets and the speed at which they needed to be processed, along with other project constraints, I knew I had to develop a scalable solution that could easily be deployed to AWS. I preferred to use Apache Spark, given my personal and HumanGeo's positive experiences with it. In addition, we needed to develop a solution quickly, so naturally I turned to Python 3.4. It was already part of our tech stack, and let's be real, Python makes life easier (for the most part). Given these requirements, my quest to discover the best solution quickly led me to Amazon's Elastic MapReduce (EMR) service.
For those who aren't familiar with EMR, it's essentially a scalable, Hadoop-based Amazon web service that makes it easy to spin up clusters for processing large amounts of data with frameworks like Spark.
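To make that concrete, here's a minimal sketch of provisioning a Spark-enabled EMR cluster with the AWS CLI. The cluster name, key pair, release label, and instance settings are all illustrative placeholders, not values from this project:

```shell
# Sketch: launch a small EMR cluster with Spark installed.
# "my-key-pair" and the instance/release choices are hypothetical examples.
aws emr create-cluster \
    --name "spark-cluster" \
    --release-label emr-4.6.0 \
    --applications Name=Spark \
    --ec2-attributes KeyName=my-key-pair \
    --instance-type m4.large \
    --instance-count 3 \
    --use-default-roles
```

The command returns a cluster ID that can then be used to submit Spark jobs as steps or to SSH into the master node.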