Joining a billion rows 20x faster than Apache Spark

One of Databricks’ best-known blog posts describes joining a billion rows in a second on a laptop. Since this is a fairly easy benchmark to replicate, we thought: why not try it on SnappyData and see what happens? We found that for joining two columns of a billion rows, SnappyData is nearly 20x faster.

Let’s start with the benchmark as in the original post. The machine is a:

Dell Latitude E7450 laptop with a Core(TM) i7-5600U CPU @ 2.60GHz and 16GB of RAM.

Start the SnappyData shell with some decent memory (required for the data load test):
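The exact launch command was lost in this copy of the post; a minimal sketch, assuming the Spark shell bundled with SnappyData accepts the standard Spark options, with the heap size and CMS flags below standing in for whatever the original used:

```bash
# Sketch: launch the bundled shell with extra driver memory for the data load test.
# The 10g heap and the specific GC flags are assumptions, not the post's exact values.
./bin/spark-shell --driver-memory 10g \
  --conf "spark.driver.extraJavaOptions=-XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
```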

(the GC options are similar to a default SnappyData cluster)

Define a simple benchmark utility function:
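The function body is missing from this copy; the sketch below follows the shape of the timing utility used in the original Databricks post (time a block with System.nanoTime and print the result), which is presumably what was reused here:

```scala
// Simple timing utility: run the block once and print the elapsed wall-clock time.
def benchmark(name: String)(f: => Unit): Unit = {
  val startTime = System.nanoTime
  f
  val endTime = System.nanoTime
  println(s"Time taken in $name: " + (endTime - startTime).toDouble / 1000000000 + " seconds")
}
```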

Let’s do a warmup first that will also initialize some Spark components and sum a billion numbers:
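The warmup snippet itself was also lost in extraction; a plausible reconstruction that matches the output label below, using the shell's predefined `spark` session:

```scala
// Warm up the JVM and Spark's code generation by summing the ids 0 .. 10^9 - 1.
benchmark("Spark 2.0 (sum of a billion)") {
  spark.range(1000L * 1000 * 1000).selectExpr("sum(id)").show()
}
```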

> Time taken in Spark 2.0 (sum of a billion): 0.70 seconds

Very impressive sub-second timing.

Let’s try the same using SnappySession (SnappyData’s entry point, an extension of SparkSession):
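The snippet is cut off at this point in the scraped copy; a hedged sketch of what the equivalent SnappySession run likely looked like (creating the session over the shell's existing SparkContext is an assumption based on SnappyData's documented API):

```scala
import org.apache.spark.sql.SnappySession

// Assumption: a SnappySession can be created over the shell's existing SparkContext.
val snappy = new SnappySession(spark.sparkContext)

// Run the same billion-row sum through SnappyData's session for comparison.
benchmark("SnappyData (sum of a billion)") {
  snappy.range(1000L * 1000 * 1000).selectExpr("sum(id)").show()
}
```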

