NGless Miscellanea [5/5]

Apr 25, 2016

NOTE: As of Apr 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the last in a series of five posts introducing ngless.

Introduction to ngless
Perfect reproducibility using ngless
Fast and high quality error detection
Extending and interacting with other projects
Miscellaneous [this post]

NGless has a few less visible details that can come in handy.

Local installation

ngless relies on a few third-party utilities (bwa and samtools, besides any other modules you install) and possibly on reference information as well. However, it requires neither (1) a superuser install nor (2) fiddling with PATH variables or the like. It is happy to install its data into your home directory and run from there.

You can also install it globally, of course, but in many academic settings, you need to ask permission to install a package globally, while you can do whatever you want in your home directory. NGless is designed with this in mind.

On the fly QC (quality control)

All FastQ files are automatically passed through a QC analysis when you load them and again after any preprocessing step. You do not need to specify QC as a separate step; it just happens. In fact, whenever possible, ngless runs it on the fly, while the data is being read, for efficiency reasons.
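To illustrate what "on the fly" means here, the following is a minimal Python sketch of collecting QC statistics while reads stream past, instead of in a separate pass over the file. It is not how ngless is implemented; the names QCStats and with_qc, and the choice of statistics, are assumptions made up for this example.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical in-memory read: (name, sequence, quality string).
Read = Tuple[str, str, str]


class QCStats:
    """Running statistics, updated once per read as it streams past."""

    def __init__(self) -> None:
        self.n_reads = 0
        self.n_bases = 0
        self.gc_bases = 0
        self.min_len: Optional[int] = None
        self.max_len = 0

    def update(self, sequence: str) -> None:
        length = len(sequence)
        self.n_reads += 1
        self.n_bases += length
        self.gc_bases += sum(1 for base in sequence.upper() if base in "GC")
        self.min_len = length if self.min_len is None else min(self.min_len, length)
        self.max_len = max(self.max_len, length)

    @property
    def gc_fraction(self) -> float:
        return self.gc_bases / self.n_bases if self.n_bases else 0.0


def with_qc(reads: Iterable[Read], stats: QCStats) -> Iterator[Read]:
    """Pass reads through unchanged, updating `stats` as a side effect."""
    for read in reads:
        stats.update(read[1])
        yield read
```

Because with_qc is a generator, whatever consumes the reads next (trimming, mapping) drives the QC as well, so the input only has to be read once.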

Best practices should be easy and QC is a best practice.

Subsample mode

Subsample mode simply throws away 99% of the data.

Why would anyone ever want to do this?

This allows you to quickly check whether your pipeline works as expected and the output files are as expected. For example:

ngless --subsample script.ngl

will run script.ngl in subsample mode, which will probably run much faster than the full pipeline, allowing you to quickly spot any issues with your code. A 10-hour pipeline will finish in a few minutes when running in subsample mode.
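As a rough illustration of the idea, here is a short Python sketch that keeps one record out of every hundred and drops the rest. How ngless actually selects which reads to keep is not described in this post, so the every-Nth rule and the function name subsample are assumptions of this example.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def subsample(records: Iterable[T], keep_every: int = 100) -> Iterator[T]:
    """Yield one record out of every `keep_every`, discarding the others.

    With the default of 100, about 99% of the input is thrown away.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    # islice with a step reads the input lazily and skips the records in between.
    return islice(records, 0, None, keep_every)
```

Since the records are consumed lazily, every downstream step sees only about 1% of the input, which is where the speed-up comes from.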

Subsample mode also changes all your write() calls so that the output file names include the .subsample extension. That is, a call such as

write(output, ofile='results.txt')

will automatically get rewritten to

write(output, ofile='results.txt.subsample')

This ensures that you do not confuse subsampled results with the real thing. NGless is all about making sure your results are correct, so it tries to avoid confusing you as much as possible (this is similar to how it always writes output files with the atomic protocol so that you never get a partial results file).
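For readers unfamiliar with atomic writes, the usual pattern is to write to a temporary file in the same directory and rename it into place only once it is complete. The Python sketch below shows that pattern together with the renaming rule for subsample mode. It is an illustration under assumptions, not the ngless implementation; output_name and atomic_write are hypothetical helper names.

```python
import os
import tempfile


def output_name(ofile: str, subsample: bool) -> str:
    """Append '.subsample' to the requested file name in subsample mode."""
    return ofile + ".subsample" if subsample else ofile


def atomic_write(path: str, text: str) -> None:
    """Write `text` so that `path` either holds the full content or is untouched."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so that the final
    # rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp, path)  # swaps the complete file into place in one step
    except BaseException:
        # Clean up the partial temporary file; the target was never touched.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

If the process dies halfway through, the only thing left behind is a stray temporary file; results.txt itself never exists in a half-written state.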

Parallel processing & speed

The main goal of ngless is to save bioinformaticians time while improving the results. However, as a side benefit of having a well-defined language, the interpreter can take automatic advantage of multiple processors.

Consider the following script:

ngless '0.0'
input = fastq('input.fq.gz')
preprocess(input) using |r|:
    r = substrim(r, min_quality=45)
    if len(r) < 45:
        discard
mapped = map(input, reference='hg19')
counted = count(mapped, features=['gene'])
write(counted, ofile='genes.txt')

Almost all the steps in the pipeline can take advantage of multiple processors:

QC is performed on the fly as the file 'input.fq.gz' is being read.
preprocess takes advantage of multiple processors by processing reads in parallel.
map calls bwa, which makes use of threads.
count again processes the output of mapping in parallel.

To use more than one core in ngless, just use the option -j with the number of threads you want. For example:

ngless -j8 pipeline.ngl

will run with 8 threads, speeding up processing considerably.
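The preprocess step parallelises well because each read is handled independently of all the others. The Python sketch below mimics the preprocess block from the script (trim to the longest stretch of bases at or above a quality threshold, then discard short reads) and maps it over reads with a worker pool. It is an illustration only, not ngless code: the Read tuple, the function names, and the jobs parameter (standing in for -j) are assumptions. It uses threads to stay short; in standard CPython a process pool would be needed for a real CPU speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List, Optional, Tuple

# Hypothetical in-memory read: (name, sequence, per-base Phred qualities).
Read = Tuple[str, str, List[int]]


def substrim(read: Read, min_quality: int) -> Read:
    """Keep the longest run of consecutive bases with quality >= min_quality."""
    name, seq, quals = read
    best_start, best_len, start = 0, 0, 0
    for i in range(len(quals) + 1):
        # A low-quality base, or the end of the read, closes the current run.
        if i == len(quals) or quals[i] < min_quality:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = i + 1
    end = best_start + best_len
    return name, seq[best_start:end], quals[best_start:end]


def preprocess_one(read: Read, min_quality: int, min_length: int) -> Optional[Read]:
    """Trim one read; return None if it ends up too short (the 'discard' case)."""
    trimmed = substrim(read, min_quality)
    return trimmed if len(trimmed[1]) >= min_length else None


def preprocess(reads: Iterable[Read], jobs: int = 8,
               min_quality: int = 45, min_length: int = 45) -> List[Read]:
    """Apply preprocess_one to every read using `jobs` workers, keeping input order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = pool.map(lambda r: preprocess_one(r, min_quality, min_length), reads)
        return [r for r in results if r is not None]
```

No read needs to know about any other read, so the work can be split across workers without any coordination beyond collecting the results.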
