State of the Maven/Java dependency graph

So here it comes: the second part of a three-part series of articles on dependencies in different ecosystems. The first part was about Python/PyPi dependencies, and given the size of that graph (20,661 nodes, 14,047 edges), I was able to show it to you in an interactive JavaScript app using SigmaJS. This time it's different: after extracting the metadata from Maven repositories, the raw data file weighs 273M, and the whole directed dependency graph has 186,384 nodes and 1,229,083 edges. In other words, it's going to be tough to show you the whole graph interactively, but the raw data, the graph file and the Gephi file are available on the GitHub project.
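To make the node and edge counts above concrete, here is a minimal Python sketch of how such figures could be derived from a list of "A depends on B" pairs. The edge-list format and the artifact identifiers are assumptions made for illustration; they are not the actual format of the raw data file. Note also that this only counts nodes that appear in at least one edge, so libraries with no dependency links at all would have to be added separately.

```python
from typing import Iterable, Set, Tuple

# One directed edge: (dependent artifact, artifact it depends on).
Edge = Tuple[str, str]


def graph_size(edges: Iterable[Edge]) -> Tuple[int, int]:
    """Return (node count, edge count) of a directed graph, ignoring duplicate edges."""
    nodes: Set[str] = set()
    unique_edges: Set[Edge] = set()
    for source, target in edges:
        nodes.update((source, target))
        unique_edges.add((source, target))
    return len(nodes), len(unique_edges)


# Hypothetical sample; these identifiers are illustrative, not taken from the data set.
sample_edges = [
    ("com.example:app", "com.example:core"),
    ("com.example:app", "org.sample:logging"),
    ("com.example:core", "org.sample:logging"),
]

node_count, edge_count = graph_size(sample_edges)
print(f"{node_count} nodes, {edge_count} edges")
```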

Handling that much data comes with a cost that your machine must be prepared to pay. For example, when I tried to export the whole graph to an SVG file, Gephi tried to use more than my 16G of RAM and eventually failed. Fortunately, exporting it to a high-definition PNG was no problem, so here come the pictures.

Using Yifan Hu’s layout

As you can see, the graph is, as expected, pretty dense, and this spatialization is not exactly the best one to see what's going on. I tried it first anyway (even though it's not suitable for large-scale graph processing) in order to compare it with the last article's results. As we can see below, the Java and Python ecosystems are really different in terms of dependencies and library usage:


Maven/Java dependency graph using Yifan Hu’s layout


PyPi dependency graph using Yifan Hu’s layout

So Yifan Hu's layout is really great for sparse/simple graphs, and even though it took me more than 2 hours to compute it properly with a million edges, it was worth it just to compare the two graphs visually. But now, if we want to analyse the Maven metadata and get something out of it, let's use a visualisation better suited to large-scale graphs. For that we need to choose a parallelizable graph spatialization algorithm.

A more suitable approach using ForceAtlas 2

ForceAtlas 2 is a much more suitable algorithm for processing a large number of nodes and edges, firstly because it allowed me to parallelize the computation over 7 CPUs, but also because it gives us a clearer overview of what's going on, such as which library is the most popular and other metrics like that:


Maven dependency graph using ForceAtlas 2
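One simple way to put a number on "most popular library" is the in-degree of a node, that is, how many other artifacts depend on it. The sketch below is only an illustration under that assumption: the article does not say which popularity metric was used, and the function name and sample identifiers are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# One directed edge: (dependent artifact, artifact it depends on).
Edge = Tuple[str, str]


def most_depended_on(edges: Iterable[Edge], top: int = 3) -> List[Tuple[str, int]]:
    """Rank artifacts by in-degree, i.e. by how many distinct artifacts depend on them."""
    in_degree = Counter(target for _source, target in set(edges))
    return in_degree.most_common(top)


# Hypothetical sample; these identifiers are illustrative, not taken from the data set.
sample_edges = [
    ("com.example:app", "com.example:core"),
    ("com.example:app", "org.sample:logging"),
    ("com.example:core", "org.sample:logging"),
    ("com.example:tool", "com.example:core"),
    ("com.example:tool", "org.sample:logging"),
]

for artifact, dependents in most_depended_on(sample_edges, top=2):
    print(f"{artifact}: {dependents} dependents")
```

In-degree only counts direct dependents; a library pulled in transitively by many projects would need a different measure.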

So…

So now we have the processed data. Just at first sight we can see that the Java ecosystem is much more connected than the Python one. There's no judgment here: we will analyse this data in more depth in the next article, not to conclude who's the best but to gain a clearer understanding of a way forward for both worlds.
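As a rough sanity check on "much more connected", we can compare the number of edges per node using the figures quoted at the start of this article. This is just arithmetic on those totals, a crude proxy for connectivity rather than a real graph analysis, and the function name is made up for the example.

```python
def edges_per_node(node_count: int, edge_count: int) -> float:
    """Average number of outgoing dependency links per node in a directed graph."""
    return edge_count / node_count


# Node and edge totals as quoted in this article.
pypi_ratio = edges_per_node(node_count=20_661, edge_count=14_047)
maven_ratio = edges_per_node(node_count=186_384, edge_count=1_229_083)

print(f"PyPi:  {pypi_ratio:.2f} edges per node")
print(f"Maven: {maven_ratio:.2f} edges per node")
```

With these totals the ratio comes out at roughly 0.68 for PyPi against roughly 6.59 for Maven, so close to ten times more dependency links per node on the Java side.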

Vale

Comments

  1. Hey man. Cool project you are doing there. I am working on something similar. On http://www.versioneye.com we show interactive dependency graphs for java, too. For example, here is one for Spring-Core: http://www.versioneye.com/package/org~springframework–spring-core. Currently we are working on a new API for VersionEye, to make the data available for other developers. Maybe you want to take advantage of it.

