Fast and useful errors with ngless [3/5]

Feb 16, 2016

NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the third of a series of five posts introducing ngless.

Introduction to ngless
Perfect reproducibility using ngless
Fast and high quality error detection [this post]
Extending and interacting with other projects
Miscellaneous

If you are the rare person who just writes code without bugs (if your favorite editor is cat), then you can skip this post as it only concerns those of us who make mistakes. Otherwise, I will assume that /your code will have bugs/. Your code will have silly typos and large mistakes.

Too many tools work well, but fail badly. That is, if all their dependencies are there, all the files are exactly perfect, and the user specifies all the right options, then the tool will work perfectly; but make any mistake and you will get a bizarre error, which will be hard to fix. Thus, the tool is bad at failing. Ngless promises to work well and fail well.

Make some errors impossible

Let us recall our running example:

    ngless "0.0"
    import OceanMicrobiomeReferenceGeneCatalog version "1.0"

    input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
    preprocess(input) using |read|:
        read = substrim(read, min_quality=30)
        if len(read) < 45:
            discard

    mapped = map(input, reference='omrgc')
    summary = count(mapped, features=['ko', 'cog'])
    write(summary, ofile='output/functional.summary.txt')

Note that we do not specify paths for the 'omrgc' reference or the functional map file. We also do not specify paths for intermediate files. This is all implicit and you cannot mess it up. The Fastq encoding is auto-detected, removing one more opportunity for you to mess up (although you can specify the encoding if you really want to).

Ngless always uses the three-step safe output writing pattern:

    1. write the output to a temp file,
    2. sync the file and its directory to disk,
    3. rename the temp file to the final output name.

The final step is atomic. That is, the operating system guarantees that it either fully completes or never executes, even if there is an error, so that you never get a partial file. Thus, if there is an output file, you know that ngless finished without errors (up to that point, at least) and that the output is correct. No more asking "did the cluster crash affect this file? Maybe I need to recompute, or maybe I should count the number of lines to make sure it's complete". None of that: if the file is there, it is complete.

Side-note: programming languages (or their standard libraries) should have support for this safe output writing pattern. I do not know of any language that does.

Make error detection fast

Have you ever run a multi-step pipeline where the very last step (often saving the results) has a silly typo and everything fails disastrously at that point, wasting hours of compute time? I know I have. Ngless tries as hard as possible to make sure that doesn't happen.

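A brief aside on the three-step safe output pattern described earlier in this post, since it is easy to try in a general purpose language. The following is a minimal Python sketch, not ngless's actual implementation. The names atomic_write and _sync_directory are made up for illustration, the directory sync is only attempted on POSIX systems, and the temp file is placed next to the final file (an assumption of this sketch) so that the rename stays on one filesystem.

```python
import os
import tempfile


def _sync_directory(directory):
    """Flush a directory to disk (hypothetical helper; POSIX only)."""
    if os.name != "posix":
        return
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def atomic_write(path, data):
    """Write bytes to `path` so that `path` is never seen half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    # Step 1: write the output to a temp file in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            # Step 2: sync the file, then its directory, to disk.
            tmp.flush()
            os.fsync(tmp.fileno())
        _sync_directory(directory)
        # Step 3: rename the temp file to the final output name.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, clean up the temp file; `path` itself is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling atomic_write('output/functional.summary.txt', data) either leaves the previous state of the output file in place or replaces it with the complete new contents. Some implementations also sync the directory again after the rename so that the rename itself survives a crash; that is left out here to follow the three steps as listed.
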
Although ngless is interpreted rather than compiled, it performs an initial pass over your script to check for all sorts of possible errors. Ngless is a typed language and all types are checked, so if you try to run the count function without first mapping, you will get an error message. All arguments to functions are also checked. This even checks some rules that would be hard to impose using a more general purpose programming language: for example, when you call count, either (1) you are using a built-in reference which has its own annotation files or (2) you have to pass in the path to a GTF or gene map file so that the output of the mapping can be annotated and summarized. This constraint would be hard to express in, for example, Java or C++, but ngless can check this type of condition easily.

The initial check makes sure that all necessary input files exist and can be read, and even that any directories used for output are present and can be written to (in the script above, if a directory named output is not present, you will get an immediate error). If you are using your own functional map, it will read the file header to check that any features you use are indeed present (in the example above, it checks that the 'ko' and 'cog' features exist in the built-in ocean microbiome catalog).

All typos and other similar errors are caught immediately. If you mistype the name of your output directory, ngless will let you know in 0.2 seconds rather than after hours of computation.

You can also run just the checks with ngless -n script-name.ngl: it does nothing except run all the validation steps.

Again, this is an idea that could be interesting to explore in the context of general purpose languages.

Make error messages helpful

[Figure: An unhelpful error message]

As much as possible, when an error is detected, the message should help you make sense of it and fix it. A tool cannot always read your mind, but as much as possible, ngless error messages are descriptive.

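To illustrate what an up-front check with descriptive messages might look like in a general purpose language, here is a minimal Python sketch. It is not how ngless is implemented: the LEGAL_ARGS table, the function names, and the (function, keyword arguments) representation of a script are all assumptions made for this example, and the message wording is invented.

```python
import os

# Hypothetical table of functions and their legal keyword arguments,
# covering only a subset of the running example.
LEGAL_ARGS = {
    "map": {"reference"},
    "count": {"features"},
    "write": {"ofile"},
}


def check_call(func, kwargs):
    """Return descriptive messages for unknown functions or arguments."""
    legal = LEGAL_ARGS.get(func)
    if legal is None:
        known = ", ".join(sorted(LEGAL_ARGS))
        return [f"Unknown function '{func}'. Known functions: {known}."]
    legal_list = ", ".join(sorted(legal))
    return [
        f"Function '{func}' has no argument '{name}'. "
        f"Legal arguments: {legal_list}."
        for name in kwargs
        if name not in legal
    ]


def check_output_path(path):
    """Check that the directory of an output file exists and is writable."""
    directory = os.path.dirname(path) or "."
    if not os.path.isdir(directory):
        return [f"Cannot write '{path}': directory '{directory}' does not exist."]
    if not os.access(directory, os.W_OK):
        return [f"Cannot write '{path}': directory '{directory}' is not writable."]
    return []


def validate(steps):
    """Check every step before running any of them; return all messages."""
    errors = []
    for func, kwargs in steps:
        errors.extend(check_call(func, kwargs))
        if func == "write" and "ofile" in kwargs:
            errors.extend(check_output_path(kwargs["ofile"]))
    return errors
```

A driver would call validate on the whole script first and refuse to start any computation if the returned list is non-empty, so a mistyped argument name or a missing output directory is reported before the expensive steps run.
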
For example, if you used an illegal argument/parameter to a function, ngless will remind you of what the legal arguments are. If it cannot write to an output file, it will say it cannot write to an output file (and not just "IO Error"). If a file is missing, it will tell you which file (and it will tell you in about 0.2 seconds).

Summary

Ngless is designed to make some errors impossible, while trying hard to give you good error messages for the errors that will inevitably crop up.

