Many of the concepts of Apache Spark are pretty straightforward and easy to understand; however, a few can be badly misunderstood. One of the greatest misunderstandings of all is the belief that « Spark is only relevant for datasets that can fit into memory, otherwise it will crash ».
It is an easy mistake to make, since Spark is readily pictured as « a Hadoop that uses RAM more efficiently », but it is a mistake nonetheless.
By default, Spark does its best to keep the datasets it handles in memory. Still, when a dataset is too large to fit into memory, it is automatically (or should I say auto-magically) spilled to disk. This is one of the main features of Spark, coined by the expression « graceful degradation », and it is very well illustrated by these two charts in Matei Zaharia’s dissertation, An Architecture for Fast and General Data Processing on Large Clusters:

Behaviour of Spark with less/more RAM, extracted from http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
The first chart shows something interesting for us: the execution time of a Spark job varies pretty much linearly with the amount of RAM you give it. In other words, the more RAM Spark can use, the faster your computation will run; give it less and less RAM and Spark will end up behaving like Hadoop, spilling to disk as much as it needs to.
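For what it is worth, the amount of memory Spark works with is just a configuration knob. Here is a minimal sketch (the application name and the values are made up, not taken from the dissertation) of how one could re-run the same job with different executor memory sizes to reproduce that trade-off:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical settings, purely to illustrate the knob: the same application
// can be resubmitted with more or less memory per executor.
val spark = SparkSession.builder()
  .appName("memory-vs-runtime")
  .config("spark.executor.memory", "4g")  // try 1g, 2g, 4g, 8g and compare run times
  .config("spark.memory.fraction", "0.6") // share of the heap used for execution and storage
  .getOrCreate()
```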
The second chart is also interesting for debunking the urban legend that « Spark will only work if your datasets fit in RAM »: it shows Spark handling larger and larger datasets, and once again the relationship between the execution time and the size of the dataset (for a given computation) is practically linear.
In the end, not only can Spark handle datasets larger than memory, but it will gracefully adapt to the amount of memory you give it.
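To make this concrete, here is a minimal sketch (the input path and the word-count job are made up) of processing a dataset that may be much larger than the cluster’s RAM. Explicitly asking for the MEMORY_AND_DISK storage level tells Spark to keep what it can in memory and spill the remaining partitions to local disk rather than fail:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object GracefulDegradationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("graceful-degradation-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: a text dataset potentially much larger than the cluster's aggregate RAM.
    val lines = sc.textFile("hdfs:///data/very_large_dataset")

    // MEMORY_AND_DISK keeps as many partitions in RAM as possible and
    // spills the rest to local disk instead of failing the job.
    val cached = lines.persist(StorageLevel.MEMORY_AND_DISK)

    // The computation is written exactly the same way whether or not the data
    // fit in memory; only the execution time changes.
    val wordCounts = cached
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    println(s"distinct words: ${wordCounts.count()}")
    spark.stop()
  }
}
```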
Dear all,
All I have found so far are recommender engines that build and deploy everything in memory, using CSV files as datasets; so what if I have about 1 M records and about 3,700 users per day?
In my case, my company has about 1 M active items, about 4,000 active users per day (on average) and about 4.5 M page visits per week (on average).
Building, training and recommending items entirely in memory seems like a bad idea, so I’m thinking of building a recommender engine that is kind of real-time instead. How? That’s what I’m looking for: maybe train the model in batch and deploy the results to an indexer like Elasticsearch, or something similar, to serve the recommendations (a rough sketch of what I mean is below).
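For example, just a rough sketch of the kind of periodic batch job I picture (column names, paths and parameters are placeholders): train a Spark MLlib ALS model offline on the visit logs, precompute top-N recommendations per user, and write them out so they could be bulk-indexed into Elasticsearch for low-latency serving:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object BatchTrainRecommender {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("batch-als-training").getOrCreate()

    // Placeholder input: one row per (user, item, weight) interaction, e.g. page visits.
    val events = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/page_visits.csv")

    // Collaborative filtering on implicit feedback, trained offline.
    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("itemId")
      .setRatingCol("weight")
      .setImplicitPrefs(true)
      .setRank(20)
    val model = als.fit(events)

    // Precompute the top 10 items per user; this table could then be bulk-indexed
    // into a serving store such as Elasticsearch for real-time lookups.
    val topN = model.recommendForAllUsers(10)
    topN.write.mode("overwrite").parquet("hdfs:///output/recommendations")

    spark.stop()
  }
}
```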
How do I build a real-time recommender system with Apache Spark? Any suggestions?
Many thanks