Many Apache Spark concepts are straightforward and easy to understand, but a few are badly misunderstood. One of the most widespread misconceptions is the belief that « Spark is only relevant for datasets that fit into memory, otherwise it will crash ».
This is a misunderstanding. It is easy to see where it comes from, since Spark is often described as « Hadoop using RAM more efficiently », but it is still a mistake.
By default, Spark does its best to keep the datasets it handles in memory. When those datasets are too large to fit in memory, they are automatically (or should I say auto-magically) spilled to disk. This is one of the main features of Spark, known as « graceful degradation », and it is very well illustrated by these two charts in Matei Zaharia’s dissertation, An Architecture for Fast and General Data Processing on Large Clusters:
The first chart shows something interesting: Spark's execution time changes pretty much linearly as you give it more or less RAM. In other words, the more RAM Spark can use, the faster your computation runs. As you give it less and less RAM, Spark ends up behaving like Hadoop, flushing to disk as much as possible.
The second chart also helps debunk the urban legend that « Spark will only work if your datasets fit in RAM ». It shows how Spark handles larger and larger datasets: once again, for a given computation, the time it takes grows practically linearly with the size of the dataset.
In the end, not only can Spark handle large datasets, it also gracefully adapts to the amount of memory you give it.
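To make the idea of spilling more concrete, here is a toy sketch in plain Python. It is not Spark code and does not use any Spark API: the function name `count_with_spill` and the parameter `max_keys_in_memory` are hypothetical, and a cap on the number of distinct keys stands in for a real memory budget. The sketch counts records in an in-memory map, writes that map to disk as a sorted run whenever the cap is reached, and merges the runs at the end. The result is the same whatever the cap is; only the amount of disk traffic changes.

```python
import heapq
import itertools
import os
import pickle
import tempfile
from collections import Counter


def _write_run(counts, spill_dir):
    """Write the in-memory counts to disk as one run sorted by key."""
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for key in sorted(counts):
            pickle.dump((key, counts[key]), f)
    return path


def _read_run(path):
    """Stream (key, count) pairs back from a spilled run, in key order."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def count_with_spill(records, max_keys_in_memory):
    """Count records while holding at most max_keys_in_memory keys in memory.

    Toy model only: the key cap is an assumed stand-in for a memory budget.
    Returns (counts as a dict, number of spills to disk).
    """
    counts = Counter()
    runs = []
    with tempfile.TemporaryDirectory() as spill_dir:
        for record in records:
            # Spill before admitting a new key that would exceed the cap.
            if record not in counts and len(counts) >= max_keys_in_memory:
                runs.append(_write_run(counts, spill_dir))
                counts.clear()
            counts[record] += 1

        # Merge the sorted runs on disk with what is still in memory.
        streams = [_read_run(path) for path in runs]
        streams.append(iter(sorted(counts.items())))
        merged = heapq.merge(*streams, key=lambda kv: kv[0])

        # For simplicity the final result is collected in a dict; a real
        # engine would stream it onward instead.
        result = {}
        for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
            result[key] = sum(count for _, count in group)
    return result, len(runs)
```

Calling `count_with_spill(words, max_keys_in_memory=2)` and `count_with_spill(words, max_keys_in_memory=1000)` on the same list of words returns the same counts; the first call simply reports more spills. That is the spirit of graceful degradation: less memory means more disk work and a slower run, not a crash.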