Handling that much data comes at a cost that your machine must be prepared to pay… For example, when I tried to export the whole graph to an SVG file, Gephi attempted to use more than my 16 GB of RAM and eventually failed. Fortunately, exporting to PNG (in high definition) was no problem, so here come the pictures.
Using Yifan Hu’s layout
As you can see, the graph is, as expected, pretty dense, and this spatialization is not exactly the best one to see what's going on. But I tried it first (even though it's not suitable for large-scale graph processing) in order to compare it with the last article's results. As we can see below, the Java and Python ecosystems are really different in terms of dependency and library usage:
So Yifan Hu's layout is really great for sparse/simple graphs, and even though it took me more than 2 hours to properly compute it with a million edges, it's worth it just to compare the two parts visually. But now, if we want to analyse the Maven metadata and get something out of it, let's use a more « Large Scale Graph » oriented visualisation. For that, we need to choose a parallelizable graph spatialization algorithm.
A more suitable approach using ForceAtlas 2
ForceAtlas2 is a much more suitable algorithm for processing large quantities of nodes and edges: firstly because it allowed me to parallelize the computation over 7 CPU cores, but also because it gives us a clearer overview of what's going on, which library is the most popular, and other metrics like that:
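To give an idea of what the algorithm actually computes, here is a minimal, simplified sketch of one ForceAtlas2 iteration in pure Python. It only shows the force model described in the FA2 paper (degree-weighted repulsion, linear attraction along edges); the real Gephi implementation adds Barnes-Hut approximation and adaptive speeds, which is what makes it practical and parallelizable at a million edges. The function name and parameters are my own, for illustration:

```python
import math

def forceatlas2_step(pos, edges, degree, k_r=10.0, step=0.01):
    """One simplified ForceAtlas2 iteration (illustrative sketch).

    pos    : dict node -> [x, y] (mutated in place)
    edges  : list of (u, v) pairs
    degree : dict node -> degree

    Repulsion is degree-weighted ((deg+1)*(deg+1)/d), attraction is
    linear in distance, as in the FA2 force model. This naive O(n^2)
    pairwise loop is exactly the part the real implementation replaces
    with a Barnes-Hut tree and splits across CPU cores.
    """
    disp = {n: [0.0, 0.0] for n in pos}
    nodes = list(pos)
    # Repulsion between every pair of nodes, stronger for hubs.
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = k_r * (degree[u] + 1) * (degree[v] + 1) / (d * d)
            disp[u][0] += dx * f; disp[u][1] += dy * f
            disp[v][0] -= dx * f; disp[v][1] -= dy * f
    # Linear attraction along edges pulls connected nodes together.
    for u, v in edges:
        dx = pos[u][0] - pos[v][0]
        dy = pos[u][1] - pos[v][1]
        disp[u][0] -= dx; disp[u][1] -= dy
        disp[v][0] += dx; disp[v][1] += dy
    # Apply the accumulated displacements.
    for n in nodes:
        pos[n][0] += step * disp[n][0]
        pos[n][1] += step * disp[n][1]
    return pos
```

Because the repulsion loop is independent per node pair, it can be chunked and distributed across worker processes, which is why FA2 scales where Yifan Hu's layout struggled.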
So now that we have the processed data, we can see at first sight that the Java ecosystem is much more connected than the Python one. There's no judgment here: we will analyse this data in more depth in the next article, not to conclude who's the best, but rather to gain a clearer understanding of a way forward for both worlds.
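"More connected" can be made precise with graph density: the share of possible edges that actually exist. A quick sketch of that metric (the counts below are made up for illustration, not the real numbers extracted from the Maven or PyPI metadata):

```python
def density(n_nodes, n_edges, directed=True):
    """Share of possible edges that actually exist in the graph."""
    possible = n_nodes * (n_nodes - 1)
    if not directed:
        possible //= 2
    return n_edges / possible

# Hypothetical counts, purely for illustration -- dependency graphs
# are directed (A depends on B), hence directed=True by default.
java_density = density(100_000, 900_000)
python_density = density(100_000, 150_000)
```

Comparing the two densities in the next article is one way to quantify what the ForceAtlas2 picture already suggests visually.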