At their core, package repositories sound like a dream: with a simple command one gains access to countless pieces of software, libraries and more to make using an operating system or developing software a snap. Yet the rather obvious flip side to this is that someone has to maintain all of these packages, and those who make use of the repository have to trust that whatever their package manager fetches from the repository is what they intended to obtain.
In short: who can tell when a package is truly ‘abandoned’, who can guarantee that a package is free from malware, and how does one begin to insure against a package being pulled and half the internet collapsing along with it?
NPM As Case Study
In Andrew Sampson’s Twitter thread, they describe how during the process of getting packages for a project called bebop published in various software repositories, they found that this name was unclaimed except in NPM. As this package hadn’t been updated in a long time, they assumed it was likely abandoned, and found an entry on package name disputes in the NPM documentation. This entry has since been heavily altered, because of what happened next.
Following the steps in the NPM documentation, Andrew emailed the author of the existing bebop package and CC’d NPM support as requested. After four weeks had passed, Andrew got an email from NPM support indicating that Andrew now owned the NPM package.
It all seemed fine and dandy, until it turned out that not only was bebop still being actively developed by one Zach Kelling, but that at least thirty different packages on NPM depended on it. During the subsequent communication, Zach and Andrew realized that the email address produced by the npm owner ls command was not even associated with the package, explaining why Zach never got a message about the ownership transfer.
The obvious failures here are many: from NPM failing to ascertain that it had an active communication channel to a package owner, to there being no clear way of finding out whether a package is truly abandoned, to NPM apparently failing to do basic dependency checking before dropping a package. Perhaps most astounding here is the resulting “solution” by NPM, with Zach not getting ownership of the package restored, but only a GitHub Pro subscription and a $100 coupon to buy merchandise from the GitHub Shop. Andrew ended up compensating Zach for the package name.
In their thoughts on this whole experience, Andrew makes it clear that they don’t feel that a software repository should have the right to change ownership of a package, that this responsibility should always lie with the owner. That said, as a matter of practicality, one could argue that a package could be considered abandoned if it has not been downloaded in a long time and no other software depends on it.
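The practical rule suggested above (no recent downloads and no dependents) can be written down as a small check. The following is only a sketch under stated assumptions: the PackageStats record, its field names, and the one-year idle threshold are hypothetical and do not correspond to any real NPM API or policy.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PackageStats:
    """Hypothetical snapshot of registry data for one package."""
    name: str
    last_download: date    # most recent recorded download
    dependent_count: int   # number of other packages that depend on it


def looks_abandoned(stats: PackageStats, today: date,
                    idle_threshold: timedelta = timedelta(days=365)) -> bool:
    """Return True only if the package is both idle and has no dependents."""
    idle_for = today - stats.last_download
    return idle_for >= idle_threshold and stats.dependent_count == 0


if __name__ == "__main__":
    today = date(2021, 1, 1)
    # The figures below are made up for illustration.
    in_use = PackageStats("in-use-pkg", date(2020, 12, 30), 30)
    dormant = PackageStats("dormant-pkg", date(2018, 6, 1), 0)
    print(looks_abandoned(in_use, today))   # False
    print(looks_abandoned(dormant, today))  # True
```

Under a rule like this, a package with thirty dependents would never be classified as abandoned, however long ago its last release was.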
But is NPM really an outlier? How does their policy compare to the more maintainer-centric models used by other repositories, such as those provided with the various Linux and BSD distributions?
The Burden of Convenience
A feature of the NPM software repository is that it’s highly accessible, in the sense that anyone can create an account and publish their own packages with very little in the way of prerequisites. This contrasts heavily with the Debian software repository. Here the procedure is that in order to add a package to the Debian archive, you have to be a Debian Developer, or have one sponsor you and upload your packages on your behalf.
While it’s still possible to create packages for Debian and distribute them without either of these prerequisites, it means that a user of your software has to either manually download the DEB file and install it, or add the URL of your archive server to the configuration files of their package manager as a third-party APT source (the same idea as Ubuntu’s Personal Package Archives, or PPAs) to enable installation and updating of the package along with packages from the official Debian archive.
The basic principle behind the Debian software repository and those of other distributions is that of integrity through what is essentially a chain of trust. By ensuring that everyone who contributes something (e.g. a package) to the repository is trusted by at least one person along this chain of contributors, it’s virtually assured that all contributions are legitimate. Barring security breaches, users of these official repositories know that software installed through any of the available packages is as its developers intended it to be.
This contrasts heavily with specialty software repositories that target a specific programming language. PyPI as the official Python software repository has similar prerequisites to NPM, in that only a user account is required to start publishing. Other languages like Rust (Crates.io) and Java/Kotlin (Sonatype Maven) follow a similar policy. This is different from TeX (CTAN) and Perl (CPAN), which appear to provide some level of validation by project developers. Incidentally, CPAN’s policy when it comes to changing a package’s maintainer is that this is done only after much effort and time, and even then it’s preferred to add a co-maintainer rather than drop or alter the package contents.
Many of these differences can seemingly be summarized by the motto “Move fast and break things”. While forgoing the chain of trust can make a project move ahead at breakneck speed, this is likely to come at a cost. Finding the appropriate balance here is paramount. For example in the case of an operating system, this cavalier approach to quality, security, and reliability is obviously highly undesirable.
One might postulate that “break things” is also highly undesirable when deploying a new project to production and having it fall over because of a pulled dependency or worse. Yet this is where opinions seem to differ strongly to the point where one could say that the standard package manager for a given programming language (if any) is a direct reflection of the type of developer who’d be interested in developing with the language, and vice-versa.
Do You Really Need That?
As anyone who has regularly tried to build a Node.js project that’s a few months old or a Maven-based Java 6 project can likely attest, dependencies like to break. A lot. Whether it’s entire packages that vanish, or just older versions of packages, the possibility of building a project without spending at least a few minutes cursing and editing project files will gradually approach zero as more time passes.
In light of what these dependencies sometimes entail, it’s perhaps even more ludicrous that they are dependencies at all. For example, the left-pad package in NPM that caused many projects to fall over consists of only a handful of lines of code that do exactly what it says on the tin. It does raise the question of how many project dependencies can be tossed without noticeably affecting development time while potentially saving a lot of catastrophic downtime and easing maintenance.
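To give a sense of scale, a left-padding helper can be sketched in a few lines. This is a Python approximation written for illustration, not the code of the NPM package; the function name, parameter names, and the default fill character are assumptions.

```python
def left_pad(text: str, length: int, fill: str = " ") -> str:
    """Pad text on the left with fill until it is at least length characters long."""
    if len(fill) != 1:
        raise ValueError("fill must be a single character")
    missing = length - len(text)
    if missing <= 0:
        # Already long enough: return the input unchanged.
        return text
    return fill * missing + text


if __name__ == "__main__":
    print(left_pad("5", 3, "0"))  # 005
    print(left_pad("abc", 2))     # abc
```

In Python this particular helper is not even needed, since the built-in str.rjust does the same job, which only underlines how small such a dependency can be.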
When your file browser hangs for a few seconds or longer when parsing the node_modules directory of a Node.js project because of how many folders and files are in it, this might be indicative of a problem. Some folk have taken it upon themselves to cut back on this bloat, such as in this post by Adam Polak, who describes reducing the size of the node_modules folder for a project from 700 MB to 104 MB. This was accomplished by finding replacements for large dependencies (which often pull in many dependencies of their own), removing unneeded dependencies, and using custom code instead of certain dependencies.
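A first step in such a clean-up is finding out which dependencies take up the most space. The script below is a minimal sketch of one way to do that and is not the method used in the post referenced above; the function names are made up for this example, and it assumes it is run from a project root that contains a node_modules directory.

```python
import os
from pathlib import Path


def directory_size(path: Path) -> int:
    """Total size in bytes of all regular files below path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def largest_packages(node_modules: Path, top: int = 10) -> list[tuple[str, int]]:
    """Return the top-level entries of node_modules with their sizes, largest first."""
    sizes = []
    for entry in node_modules.iterdir():
        # A scoped directory such as @scope is counted as a single entry.
        if entry.is_dir() and not entry.is_symlink():
            sizes.append((entry.name, directory_size(entry)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    root = Path("node_modules")
    if root.is_dir():
        for name, size in largest_packages(root):
            print(f"{size / 1_000_000:8.1f} MB  {name}")
    else:
        print("No node_modules directory found here.")
```

The entries at the top of the resulting list are the natural candidates for replacement, removal, or a few lines of custom code.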
Another good reason to cut back on dependencies has to do with start-up time, as noted by Stefan Baumgartner in a recent blog post. While obviously it’s annoying to type npm install and have enough time before it finishes to take a lunch break, he references Mikhail Shilkov’s work comparing cold start times with cloud-based service offerings. Increasing the total size of the deployed application also increased cold start times significantly, on the order of seconds. These are billed-for seconds that are essentially wasted money, with large applications wasting tens of seconds doing literally nothing useful while starting up and getting the dependencies sorted out.
This extra time needed is also reflected in areas such as continuous integration (CI) and deployment (CD), with developers noting increased time required for building e.g. a Docker image. Clearly, reducing the dependencies and their size to a minimum in a project can have very real time and monetary repercussions.
KISS Rather Than Breaking Things
There’s a lot to be said for keeping things as simple as reasonably possible within a software project. While it’s undoubtedly attractive to roll out the dump truck with dependencies and get things done fast, this is an approach that should ideally be reserved for quick prototypes and proofs of concept, rather than production-level code.
At the end of the day, making a project bullet-proof is something that should be appreciated more. That includes decreasing the reliance on code and infrastructure provided by others, especially if said code and/or infrastructure is provided free of cost. If your business plan includes the continued provision of certain free services and software, any sane investor should think twice before investing in it.