Catégorie Data

From Pandas to Apache Spark’s Dataframe

With the introduction in Spark 1.4 of Window operations, you can finally port pretty much any relevant piece of Pandas’ Dataframe computation to Apache Spark parallel computation framework using Spark SQL’s Dataframe. If you’re not yet familiar with Spark’s Dataframe, don’t hesitate to checkout my last article RDDs are the new bytecode of Apache Spark and […]

RDDs are the new bytecode of Apache Spark

With the Apache Spark 1.3 release the Dataframe API for Spark SQL got introduced, for those of you who missed the big announcements, I’d recommend to read the article : Introducing Dataframes in Spark for Large Scale Data Science from the Databricks blog. Dataframes are very popular among data scientists, personally I’ve mainly been using them with […]

Changing Spark’s default java serialization to Kryo

Apache Spark’s default serialization relies on Java with the default readObject(…) and writeObject(…)  methods for all Serializable classes. This is a very fine default behavior as long as you don’t rely on it too much… Why ? Because Java’s serialization framework is notoriously inefficient, consuming too much CPU, RAM and size to be a suitable large scale serialization […]

Highlighting field in memory-based Lucene indexes

I’m using more and more Lucene these days, and getting in depth on a few subjects, today i’m going to talk to you about how to handle the new Highlighting features available with Lucene 4.1. One of the main achievements with this new version is the creation of the great PostingsHighlighter. Michael McCandless wrote a […]

How to test and understand custom analyzers in Lucene

I’ve began to work more and more with the great « low-level » library Apache Lucene created by Doug Cutting. For those of you that may not know, Lucene is the indexing and searching library used by great entreprise search servers like Apache Solr and Elasticsearch. When you start to index and search data, most of the […]

Elasticsearch is the way

Don’t get me wrong, i love Apache Solr, i think it’s a wonderful project and the versions 4.x are definitely something you should check out when building a proper search engine. But Elasticsearch, at least for me, is now the way to the future. If you need a few reasons why, read on : Out of […]

Sharing PyPi/Maven dependency data

As time is always running out, i don’t think i’ll have the time in a while to work again on the data I collected for the last three articles, Going offline with Maven, State of the Maven/Java dependency graph and State of the PyPi/Python dependency graph. So, as it took me a long time to build […]