Use Grep in the C Locale When Working with ASCII Files

Jul 01, 2013

This post is a public service announcement: GNU Grep can be very slow in UTF-8 locale if you are only grepping an ASCII file.

Intro if you don't know non-bioinformatics: FASTA is a text format to store nucleotide sequences like this:

> headers start with ">" actatatatatatata > next header atttata

In this case, "actatatatatatata" is the first sequence and "atttata" is the next one.

Often you want to count the number of sequences in a a file. Given the text-based format, you can just use grep.

However, if you have ever done this:

grep '^>' big_file.fa | wc -l

or (even better because grep can count):

grep -c '^>' big_file.fa

Consider whether you should be doing:

LC_ALL=C grep -c '^>' big_file.fa

Here are measurements on my computer:

$ du -sh big_file.fa 1.3G    big_file.fa

It's a large, but not absurdly large, file.

$ time cat big_file.fa >/dev/null cat big_file.fa > /dev/null  0.01s user 2.04s system 99% cpu 2.056 total

Reading it from disk takes about two seconds (this is a network mounted share).

time grep -c '^>' big_file.fa 8310416 grep -c '^>' big_file.fa 8042.37s user 14.15s system 98% cpu 2:15:43.41 total

Grepping it takes over two hours! CPU usage was 98% through the whole process (the machine was doing a few other things, but nothing special).

Grepping it in the C locale takes about three seconds:

$ time LC_ALL=C grep -c '^>' big_file.fa 8310416 LC_ALL=C grep -c '^>' big_file.fa  1.73s user 1.14s system 99% cpu 2.868 total

So in the C locale, grep took less than one second more when compared to just reading the file from the network disk.

In my case, this was in the middle of a script and it was just computing some statistics. Mapping these same reads to the human genome takes just over two minutes (admitedly, we use 8 cores for that, but that comes out at ca twenty minutes if we had just used one core).

I believe this is a common occurence in sequence based computing pipelines: the mapping step is actually very fast. Then, a small performance bug here or there makes the whole thing take much more than it should.

If you're wondering why grep is so bad outside the C locale:

By default, on my system (check yours), the UTF-8 locale is used. UTF-8 is a wonderful thing, but grep is slow in handling it and we do not need it here.

More information on 503 Service Unavailable [1] It is faster to do grep '>' big_file.fa, but who knows if someone didn't sneak a > into a header

Rabbit Thoughts

Discussion about this post