Perfect reproducibility using ngless [2/5]
NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.
This is the second of a series of five posts introducing ngless.
Perfect reproducibility [this post]
Extending and interacting with other projects
Miscellaneous
Perfect reproducibility using ngless
With ngless, your analysis is perfectly reproducible forever. In particular, this is achieved by the use of explicit version strings in the ngless scripts. Let us look at the first two lines of the example we used before:
    ngless "0.0"
    import OceanMicrobiomeReferenceGeneCatalog version "1.0"
The first line notes the version of ngless that is to be used to run this script. At the moment, ngless is in a pre-release phase and so the version string is "0.0". The second line imports the reference gene catalog, again with an explicit version ("1.0"). In the future, this will enable ngless to keep improving while still allowing all past scripts to work exactly as intended. No more "I updated some software package and now all my scripts give me different results." Everything is explicitly versioned.
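To make the idea concrete, here is a minimal Python sketch of what "everything is explicitly versioned" means for someone reading a script. It is not part of ngless: the helper `pinned_versions` and its regular expressions are hypothetical, and they only assume that the header looks like the two lines shown above.

```python
import re

# Hypothetical helper, NOT part of ngless. It only illustrates that every
# version a result depends on can be read straight from the script header.
VERSION_LINE = re.compile(r'^ngless\s+"([^"]+)"\s*$')
IMPORT_LINE = re.compile(r'^import\s+(\w+)\s+version\s+"([^"]+)"\s*$')


def pinned_versions(script_text):
    """Return (ngless_version, {module_name: module_version})."""
    ngless_version = None
    modules = {}
    for raw in script_text.splitlines():
        line = raw.strip()
        match = VERSION_LINE.match(line)
        if match:
            ngless_version = match.group(1)
            continue
        match = IMPORT_LINE.match(line)
        if match:
            modules[match.group(1)] = match.group(2)
    if ngless_version is None:
        # A script without a version declaration cannot be pinned.
        raise ValueError("script does not declare an ngless version")
    return ngless_version, modules


header = '''ngless "0.0"
import OceanMicrobiomeReferenceGeneCatalog version "1.0"
'''
print(pinned_versions(header))
```

A reader (or an archive) could run something like this over a script to list exactly which ngless version and which catalog version the analysis was written against.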
There are several command line options for ngless, which can change the way that it works internally (e.g., where it keeps its temporary files, whether it is verbose or not, how many cores it should use, &c). You can also use a configuration file to set these options. However, no command line or configuration option changes the final output of the analysis. Everything you need to know about the results is in the script.
Reproducible, extendable, and reviewable
It's not just about reproducibility. In fact, reproducibility is often not that great per se: if you have a big virtual machine image with 50 dependencies, which runs a 10,000 line hairy script to reproduce the plots in a paper, that's reproducible, but not very useful (except if you want to really dig in). Ngless scripts, however, are easily extendable and even inspectable. Recall the rest of the script (the bits that do actual work):
    input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
    preprocess(input) using |read|:
        read = substrim(read, min_quality=30)
        if len(read) < 45:
            discard
    mapped = map(input, reference='omrgc')
    summary = count(mapped, features=['ko', 'cog'])
    write(summary, ofile='output/functional.summary.txt')
If you have ever worked a bit with NGS data, you can probably get the gist of what is going on. Except for maybe some of the details of what substrim does (it trims the read by finding the largest substring where all nucleotides are of at least the given quality, see docs), your guess of what is going on would be pretty accurate. It is easily extendable: if you want to add another functional table, perhaps using KEGG modules, you just need to edit the features argument to the count function (you'd need to know which ones are available, but after looking that up, it's a trivial change). If you now wanted to perform a similar analysis on your data, I bet you could easily adapt the script for your settings.
A few weeks ago, I asked on twitter/facebook:
Science people: anyone know of any data on how often reviewers check submitted software/scripts if they are available? Thanks. (Luis Pedro Coelho, @luispedrocoelho, January 11, 2016)
I didn't get an answer to the original question, but there was a bit of discussion and, as far as I know, nobody really checks code submitted along with papers (which is not, by itself, enough of a reason not to demand code). However, if you were reviewing a paper and the supplemental material had the little script above, you could easily check it out and make sure the authors used the right settings and databases. The resulting code is inspectable and reviewable.
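As a footnote on substrim: the rule described earlier (keep the largest substring in which every nucleotide has at least the given quality) can be sketched in a few lines of Python. This is an illustration only, not the ngless implementation. The function name `longest_good_run`, the use of a plain list of integer qualities, and the tie-breaking rule (earliest run wins) are all assumptions; the docs remain the reference for the real behaviour.

```python
def longest_good_run(qualities, min_quality):
    """Return (start, end) of the longest run where every quality >= min_quality.

    The slice [start:end] is the part of the read that would be kept.
    Ties go to the earliest run (an assumption); (0, 0) means nothing survives.
    """
    best_start, best_end = 0, 0
    run_start = 0
    for i, quality in enumerate(qualities):
        if quality >= min_quality:
            # The current run covers [run_start, i + 1); keep it if it is longer.
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            # A low-quality base ends the run; the next run starts after it.
            run_start = i + 1
    return best_start, best_end


# Made-up read and qualities, for illustration only.
sequence = "ACGTACGTAC"
qualities = [12, 35, 36, 20, 38, 39, 40, 37, 15, 33]
start, end = longest_good_run(qualities, min_quality=30)
print(sequence[start:end])
```

In the script, the trimmed read is then compared against a length of 45 and discarded if it is shorter, so a toy read like this one would not survive preprocessing.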