One of Databricks’ best-known blog posts describes joining a billion rows in a second on a laptop. Since this is a fairly easy benchmark to replicate, we thought, why not try it on SnappyData and see what happens? We found that for joining two columns with a billion rows, SnappyData is nearly 20x faster than Apache Spark.
Let’s start with the same benchmark as in the original post. The machine is a Dell Latitude E7450 laptop with a Core(TM) i7-5600U CPU @ 2.60GHz and 16GB of RAM.
Start the SnappyData shell with enough driver memory for the data load test:
(The GC options are similar to those of a default SnappyData cluster.)
Define a simple benchmark utility function:
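The original listing is not reproduced here, so the following is only a minimal sketch of what such a helper could look like. The name `benchmark`, its parameters `times` and `warmups`, and their defaults are assumptions for illustration. It runs the block a few times to let the JIT warm up, then reports the average wall-clock time over the measured runs.

```scala
// Hypothetical timing helper (names and defaults are assumptions, not the post's code).
// Runs `body` `warmups` times untimed, then `times` times timed, and
// returns the average wall-clock time per timed run in milliseconds.
def benchmark(name: String, times: Int = 10, warmups: Int = 6)(body: => Unit): Double = {
  // Warmup runs give the JIT a chance to compile the hot path.
  for (_ <- 1 to warmups) body
  val start = System.nanoTime()
  for (_ <- 1 to times) body
  // Long nanoseconds divided by a Double yields milliseconds per run.
  val avgMillis = (System.nanoTime() - start) / (times * 1e6)
  println(f"Average time taken in $name for $times runs: $avgMillis%.2f millis")
  avgMillis
}

// Usage: time a small local computation.
var acc = 0L
val ms = benchmark("local loop", times = 3, warmups = 1) { acc += (1L to 1000L).sum }
```

Returning the average as a `Double` in addition to printing it makes it easy to compare runs programmatically.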
Let’s do a warmup first that will also initialize some Spark components and sum a billion numbers:
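As a rough sketch of this warmup, assuming a local Spark 2.x session: the application name and the manual `System.nanoTime` timing are illustrative choices, not taken from the original post. Inside `spark-shell`, `getOrCreate()` simply reuses the existing session.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() returns it.
// The master and appName values here are assumptions for standalone use.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sum-a-billion")
  .getOrCreate()

val start = System.nanoTime()

// Generate the ids 0 until 1,000,000,000 and sum them in one aggregation.
val total = spark.range(1000L * 1000 * 1000)
  .selectExpr("sum(id)")
  .first()
  .getLong(0)

val seconds = (System.nanoTime() - start) / 1e9
println(s"sum = $total, took $seconds seconds")
```

The first run includes session and code generation startup costs, which is why a warmup is worth doing before taking measurements.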
> Time taken in Spark 2.0 (sum of a billion): 0.70 seconds
Very impressive sub-second timing.
Let’s try the same using SnappySession (SnappyData’s entry point
Original URL: http://feedproxy.google.com/~r/feedsapi/BwPx/~3/-tZbI0QCqvY/joining-billion-rows-faster-than-apache-spark