With the introduction of window operations in Spark 1.4, you can finally port pretty much any relevant piece of Pandas’ Dataframe computation to the Apache Spark parallel computation framework using Spark SQL’s Dataframe. If you’re not yet familiar with Spark’s Dataframe, don’t hesitate to check out my last article, RDDs are the new bytecode of Apache Spark, and […]


The Apache Spark 1.3 release introduced the Dataframe API for Spark SQL. For those of you who missed the big announcements, I’d recommend reading the article Introducing Dataframes in Spark for Large Scale Data Science on the Databricks blog. Dataframes are very popular among data scientists; personally, I’ve mainly been using them with […]

Apache Spark’s default serialization relies on Java, using the default readObject(…) and writeObject(…) methods for all Serializable classes. This is a perfectly fine default behavior as long as you don’t rely on it too much… Why? Because Java’s serialization framework is notoriously inefficient, consuming too much CPU, RAM and space to be a suitable large-scale serialization […]

Ever wanted to try out Apache Spark without actually having to install anything? Well, if you’ve got Docker, I’ve got a Christmas present for you: a Docker image you can pull to try running Spark commands in the Spark shell REPL. The image has been pushed to the Docker Hub here and can be […]

Many of the concepts of Apache Spark are pretty straightforward and easy to understand; however, a lucky few can be badly misunderstood. One of the greatest misunderstandings of all is the belief some still hold that “Spark is only relevant for datasets that can fit into memory, otherwise it will crash”. This is a misconception, […]

Apache Spark is a distributed computation engine that aims to replace Hadoop and to provide higher-level APIs for simply solving problems where Hadoop shows its limitations and complexity. This post is part of a series of posts on Apache Spark that explore certain aspects of the system in depth, from development through optimization to deployment. A […]

I recently had the occasion to try out Play 2 in Java, and I must say the Play 2 Framework actually looks really good in Java too. But, of course, there is a but: one of the few things that strikes you first, and I must say with great intensity, is the mandatory static methods that […]

After a few hours of searching through the Play 2 documentation, the play-framework Google group and other blogs and sources, I finally found this piece of code, which I decided to share with you. So if, like me, you wanted to remove the Scaladoc generation and packaging inside the ProductionDist that you can create from […]

For once, I’ll start this article with a photo from our latest Timeoff LT. It probably sounds cliché to say so, but every timeoff is different, and this one was no exception. I was much more involved in organizing the previous ones (the ones I went to :p ), so for this one […]

I’m using Lucene more and more these days and digging deeper into a few subjects. Today I’m going to talk about how to handle the new highlighting features available in Lucene 4.1. One of the main achievements of this new version is the creation of the great PostingsHighlighter. Michael McCandless wrote a […]