Thoughts on software sustainability
This is motivated by the latest Microbinfie podcast, episode #24, “Experimental projects lead to experimental software”, which discussed the issue of “scientific software sustainability”.
As background, in our group, we have made a public commitment to supporting tools for at least 5 years, starting from the date of publication.¹
I have become increasingly convinced that code made available for reproducibility, what I call Extended Methods, is very different from code released as a tool for others to use. Both should be encouraged and rewarded, but the criteria for evaluating the two modes of code release should be very different, and the failure to distinguish between them plagues the whole discussion in the discipline. For example, hardcoded internal paths may be acceptable in Extended Methods, but should be an immediate Nope for tools: if a manuscript tried to sell code with internal paths as a tool, I would consider that immediate cause to recommend rejection (see the sketch below). On the other hand, it should also be more accepted to present code that is explicitly marked as Extended Methods and not necessarily intended for widespread use. Too often, there is pressure to pretend otherwise and blur these concepts.
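To make the internal-paths criterion concrete, here is a minimal, hypothetical sketch (the paths, file names, and toy task are invented for illustration). Extended Methods code can get away with a path that only resolves on the lab's own machines; a tool must take its inputs as explicit, documented parameters:

```python
import argparse
from pathlib import Path

# Extended-Methods style (hypothetical): a hardcoded internal path.
# Acceptable for reproducing one paper's analysis, but an immediate
# Nope in anything presented as a tool:
#     contigs = Path("/share/ourlab/project42/contigs.fna")

def main():
    # Tool style: every input is an explicit, documented parameter
    parser = argparse.ArgumentParser(
        description="Count sequences in a FASTA file (toy example)")
    parser.add_argument("fasta", type=Path, help="input FASTA file")
    args = parser.parse_args()
    n_seqs = sum(1 for line in args.fasta.open() if line.startswith(">"))
    print(n_seqs)

if __name__ == "__main__":
    main()
```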
Our commitment applies only to tools, which we advertise as being for others to use. Code that accompanies a results-driven paper does not come with the same guarantees! How do you tell the difference? If we made a bioconda release, it’s a tool.
Our long-term commitment has the most impact before publication! In fact, ideally, it should have almost no impact on what we do after publication. This may sound like a paradox, but if tools are written by students/postdocs who will move on, the lab as a whole needs to avoid a situation where, after the student has moved on, someone else (including me!) is forced to debug unreadable, untested code written in 5 different programming languages with little documentation. On the other hand, if we have it set up so that it runs on a continuous integration (CI) system with an exhaustive test suite and a new version of NumPy breaks one of the tests, it is not so difficult to find the offending function call, add a fix, check with CI that it builds and passes the tests, and push out a new release. Similarly, having bioconda releases cuts down on support requests considerably: users can simply install with bioconda.
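As a sketch of the kind of test that makes this workable (the `normalize` function and its test are hypothetical, not taken from any of our tools), a tiny pytest-style suite run on every push is enough to catch, say, a NumPy behavioural change:

```python
import numpy as np

def normalize(counts):
    """Scale a vector of counts so that it sums to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def test_normalize():
    # If a new NumPy release changes casting or reduction behaviour,
    # this fails on CI and points straight at the offending call
    result = normalize([1, 1, 2])
    assert result.shape == (3,)
    assert np.allclose(result, [0.25, 0.25, 0.5])
```

When a dependency update breaks the build, the failing test names the function that needs fixing, which is exactly what makes the fix-and-release cycle cheap.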
¹ In practice, this will often be a longer period, on both ends: tools are supported pre-publication (e.g., macrel is only preprinted, but we support it) and post-publication (I am still fixing the occasional issue in mahotas, even though the paper came out in 2013).