NGLess timing benchmarks
As part of finalizing a manuscript on NGLess, we have run some basic timing benchmarks comparing NGLess to MOCAT2 (our previous tool) and another alternative for profiling a community based on a gene catalog, namely htseq-count.
The task being profiled is the one performed in the NGLess tutorials for the human gut and the ocean: 3 metagenomes are functionally profiled using a gene catalog as a reference. The time reported is for completing all 3 samples (each run was repeated 3 times to get a measure of variability).
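As a rough illustration of this kind of measurement, here is a minimal Python sketch that times a command over several repeats and summarizes the spread. It is not the harness used for these benchmarks (the actual scripts are in the repository linked at the end of this post); the function names `time_command` and `summarize` are made up for the example, and the command shown is a placeholder that does nothing.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run cmd `repeats` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so a crash is never timed as a success
        subprocess.run(cmd, check=True)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (mean, sample standard deviation) for a list of timings."""
    spread = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return statistics.mean(timings), spread


if __name__ == "__main__":
    # Placeholder command (assumption): replace with the real pipeline invocation
    cmd = [sys.executable, "-c", "pass"]
    mean, spread = summarize(time_command(cmd, repeats=3))
    print(f"mean={mean:.2f}s stdev={spread:.2f}s")
```

With only 3 repeats the standard deviation is a coarse estimate, which is all that "some variability measure" needs to be.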
The result is that NGLess is overall much faster than the alternatives (note that the Y-axis shows the number of seconds on a log scale). For the gut dataset, MOCAT2 takes 2.5x longer than NGLess, while for the ocean (Tara) dataset, it takes 4x longer.
The Full column contains the result of running the whole pipeline, where it is clear that NGLess is much faster than MOCAT2. The other elements are in MOCAT nomenclature:
ReadTrimFilter: preprocessing the FastQ files
Screen: mapping to the catalog
Filter: postprocessing the BAM files
Profile: generating feature counts from the BAM files
Htseq-count works well even in this setting, which is outside of its original domain (it was designed for RNA-seq, where you have thousands of genes, as opposed to metagenomics, where millions are common). NGLess is still much faster, though.
Note too that for MOCAT2, the time for the Full step is simply the sum of the times of the other steps, but in the case of NGLess, the interpreter can save time when running a complete pipeline.
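To make that comparison concrete, here is a small Python sketch that compares the sum of the per-step times with the time of a full run. The helper names are made up for this example, and the numbers are illustrative placeholders, not results from this benchmark.

```python
def time_saved(step_seconds, full_seconds):
    """Seconds saved by running the full pipeline instead of each step separately."""
    return sum(step_seconds.values()) - full_seconds


def fold_slower(tool_seconds, baseline_seconds):
    """How many times longer a tool takes than the baseline."""
    return tool_seconds / baseline_seconds


if __name__ == "__main__":
    # Illustrative placeholder timings (assumption), NOT measured values
    steps = {
        "ReadTrimFilter": 100.0,
        "Screen": 400.0,
        "Filter": 50.0,
        "Profile": 150.0,
    }
    full = 600.0
    print(f"saved by running the full pipeline: {time_saved(steps, full):.0f}s")
    print(f"separate steps are {fold_slower(sum(steps.values()), full):.2f}x the full run")
```

If the full run is simply the steps executed back to back, `time_saved` comes out as zero; a positive value means the full pipeline took less time than the steps run one by one.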
The htseq-count benchmark is still running, so final results will only be available next week.
★
I also tried to profile using featureCounts, but that tool crashed after using up 800GB of RAM. I might still try it on the larger machines (2TiB of RAM), but it seems pointless.
The scripts and preprocessed data for this benchmark are at https://github.com/BigDataBiology/ngless2018benchmark