State of the Python/PyPI dependency graph

I usually work in a Java/Maven environment, so when I tell people that Python also has a package manager (a bit lighter than Maven) and that it works pretty well, I always get the same question: "OK, but how does it solve transitive dependency hell?"

This is the historic DLL hell / JAR hell problem. In short: you depend on A and C, A depends on B (version 1.2), and C depends on B (version 1.5). Which version of B do you pick?

I ended up trying to answer not exactly that question, but a related one: why have I never really had that problem in Python? This article is the first of a three-part series you could call "Dependency as a liability".

In this part, I analyse the Python library world as a full dependency graph: how libraries depend on one another.

I talked with Tarek Ziadé about this, and he told me how complicated things currently are. As it stands, the only complete and reliable way to know what a package needs in terms of dependencies seems to be running its installation on every operating system. That was out of my scope for now, so I took another route, just to see where it would lead me.

Analyzing setup.py files

For recent packages that follow the Hitchhiker's Guide to Packaging, the package metadata is stored in a file called setup.py that looks like this:

# install_requires is a setuptools option; plain distutils does not handle it
from setuptools import setup

setup(
    name='TowelStuff',
    version='0.1.0',
    author='J. Random Hacker',
    author_email='jrh@example.com',
    packages=['towelstuff', 'towelstuff.test'],
    scripts=['bin/stowe-towels.py','bin/wash-towels.py'],
    url='http://pypi.python.org/pypi/TowelStuff/',
    license='LICENSE.txt',
    description='Useful towel-related stuff.',
    long_description=open('README.txt').read(),
    install_requires=[
        "Django >= 1.1.1",
        "caldav == 0.1.4",
    ],
)

You can spot a few things such as author, version, author_email, url and license, plus the one I was focusing on: the install_requires parameter, where you declare all your dependencies. It sounds simple, but setup.py is itself a Python script, so the value of install_requires can be computed or changed while the script runs.
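To make the static/dynamic distinction concrete, here is a minimal sketch that inspects a setup.py with Python's ast module instead of executing it. This is not the meta-deps implementation; the helper name static_install_requires is made up for illustration. It returns the list when install_requires is written as a plain literal, and None when the value is computed at run time or missing.

```python
import ast


def static_install_requires(source):
    """Return install_requires if it is a literal in a setup(...) call, else None.

    Hypothetical helper: the source is parsed, never executed.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        # Match both setup(...) and setuptools.setup(...)
        func = node.func
        name = getattr(func, "id", None) or getattr(func, "attr", None)
        if name != "setup":
            continue
        for keyword in node.keywords:
            if keyword.arg != "install_requires":
                continue
            try:
                # literal_eval only accepts plain literals (lists, strings, ...)
                return ast.literal_eval(keyword.value)
            except (ValueError, TypeError):
                # A variable, function call, etc.: a "dynamic" declaration
                return None
    return None
```

A file that is not valid syntax for the running interpreter (for example an old Python 2 setup.py parsed under Python 3) makes ast.parse raise SyntaxError, so a real crawler would have to handle that case separately.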

So I took my chances and created a project that extracts the dependencies of every package on PyPI from the install_requires parameter, to see whether it is mainly used statically or dynamically. Here is what the meta-deps project does:

  • extract the list of all packages from PyPI using the XML-RPC API;
  • download the releases and extract the install_requires dependencies from each setup.py file;
  • store the results in a CSV file, pypi-deps.csv.
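The second step, pulling setup.py out of a downloaded release, might look like the sketch below. It assumes the release is a .tar.gz source distribution already held in memory and that setup.py sits directly under the archive's single top-level directory. The helper name read_setup_py is hypothetical, and the download itself is left out.

```python
import io
import tarfile


def read_setup_py(sdist_bytes):
    """Return the text of the top-level setup.py in a .tar.gz archive, or None."""
    with tarfile.open(fileobj=io.BytesIO(sdist_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            # Assumed layout: "<name>-<version>/setup.py"
            parts = member.name.split("/")
            if member.isfile() and len(parts) == 2 and parts[1] == "setup.py":
                raw = archive.extractfile(member).read()
                # Old packages are not always UTF-8; replace undecodable bytes
                return raw.decode("utf-8", errors="replace")
    return None
```

Only the text is read; nothing from the archive is written to disk or executed.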

If you want to reuse the raw data, you don't need to rerun the process (and overload the PyPI servers in the meantime): just download the pypi-deps.csv file. It contains only these columns:

  • name of the dependency
  • version extracted
  • the list of dependencies, stored as a base64-encoded JSON string, so you just need to call json.loads(b64decode(…))
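Reading the file back could look like the following sketch. It assumes a comma-delimited file with no header row and the three columns in the order listed above; neither assumption was checked against the actual file, so adjust the csv.reader arguments if needed. The helper name iter_dependencies is made up for illustration.

```python
import csv
import io
import json
from base64 import b64decode, b64encode


def iter_dependencies(lines):
    """Yield (name, version, dependencies) for each row of the CSV."""
    for name, version, encoded in csv.reader(lines):
        # Third column: base64-encoded JSON holding the dependency list
        yield name, version, json.loads(b64decode(encoded).decode("utf-8"))


# Usage with a hand-made row; a real run would pass the opened CSV file instead
encoded_deps = b64encode(json.dumps(["Django >= 1.1.1"]).encode("utf-8")).decode("ascii")
sample = io.StringIO("TowelStuff,0.1.0," + encoded_deps + "\n")
for name, version, deps in iter_dependencies(sample):
    print(name, version, deps)
```

Any iterable of text lines works as input, so a compressed copy of the file can be opened in text mode and passed in directly.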

Results

So what comes out of all this? This graph:

Figure: PyPI dependency graph, generated using Gephi (the original image links to the interactive version).

OK, seen like that, you must think it looks like a huge jellyfish and that I'm just kidding you. So I spent a little time creating and optimizing an interactive graph of the PyPI dependencies (it seems to work best in Chrome), where you can scroll around and see all the dependencies, along with the metrics and explanations you need.

The next steps will be to do the same with Maven dependencies in the Java world, and to compute the metrics needed to compare the two.

Vale

7 Comments

  1. […] here it comes, the second part of a three-part series of articles on dependencies in different worlds. The first part was about Python/PyPi dependencies, and considering the size of the graph (20661 nodes, 14047 edges), I was able to show you the […]

  2. […] As time is always running out, I don't think I'll have the time for a while to work again on the data I collected for the last three articles, Going offline with Maven, State of the Maven/Java dependency graph and State of the PyPi/Python dependency graph. […]

  3. Fascinating post, thanks! I’d love to use that pypi-deps.csv file in a project I’m doing, but looks like the link is dead. Do you have a live version anywhere?

    1. You’re right, I’ve just updated the link : https://github.com/ogirardot/meta-deps/blob/master/pypi-deps.csv.lzma is the right place to look 🙂

  4. […] the dependency graph has been analyzed before; from this blog article by Olivier Girardot comes this fantastic […]

  5. […] Of course, the dependency graph has been analyzed before; from this blog article by Olivier Girardot comes this amazing image: […]
