Splitting a multi-fasta file
Splitting a huge multi-fasta file can be very useful, especially if you want to reduce the memory footprint of your analyses. Here are some ways to do that.
One-liners:
Using awk, we can easily split a file (multi.fa) into chunks of N sequences each (here, N=500) with the following one-liner:
awk 'BEGIN {n=0;} /^>/ {if(n%500==0){file=sprintf("chunk%d.fa",n);} print > file; n++; next;} { print > file; }' < multi.fa
This will result in multiple files: chunk0.fa containing sequences 0 to 499, chunk500.fa containing sequences 500 to 999, and so on. The last file, however, will likely contain fewer sequences (the total number of sequences modulo 500).
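To sanity-check the result, you can count the header lines per chunk with grep:
grep -c "^>" chunk*.fa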
So, if you want to make sure each chunk has approximately the same number of sequences in it, you can fiddle a bit more with awk…
awk -v chunksize=$(grep -c "^>" multi.fasta) 'BEGIN{n=0; chunksize=int(chunksize/10)+1 } /^>/ {if(n%chunksize==0){file=sprintf("chunk%d.fa",n);} print > file; n++; next;} { print > file; }' < multi.fasta
… but as you can see, this quickly becomes a bit messy. It’s pretty fast though, so you may as well use it!
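If you use this often, you could tuck the same logic into a small wrapper script (split_fasta_chunks.sh is just an illustrative name here), computing the chunk size in bash rather than inside awk:
#!/bin/bash
# split_fasta_chunks.sh: split a multi-fasta file into N roughly equal chunks
# (illustrative wrapper around the awk one-liner above; names are hypothetical)
# usage: ./split_fasta_chunks.sh multi.fasta 10
nchunks=${2:-10}
nseq=$(grep -c "^>" "$1")
awk -v chunksize=$(( (nseq + nchunks - 1) / nchunks )) '
  /^>/ { if (n % chunksize == 0) file = sprintf("chunk%d.fa", n); n++ }
  { print > file }
' "$1"
The $(( ... )) arithmetic rounds the chunk size up, so you end up with at most N chunks.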
Another great solution is GenomeTools (gt), which you can find at http://genometools.org/; it offers a simple command for exactly this:
gt splitfasta -numfiles 10 multi.fasta
Biopython solution
The above solutions all work, but perhaps you want a bit more freedom. For that, it’s probably best to learn Biopython, which lets you do all kinds of things with your sequences. Here’s a basic example of how to chop up your multi-fasta file, derived from the official Biopython website.
# split_fasta.py (assumes you have Biopython installed, e.g. with pip install biopython)
import sys
import math
from Bio import SeqIO

def batch_iterator(iterator, batch_size):
    """
    This is a generator function, and it returns lists of the
    entries from the supplied iterator. Each list will have
    batch_size entries, although the final list may be shorter.
    (derived from https://biopython.org/wiki/Split_large_file)
    """
    entry = True  # make sure we loop at least once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None
            if entry is None:
                break  # end of file reached
            batch.append(entry)
        if batch:
            yield batch

if len(sys.argv) != 3:
    sys.exit("usage: split_fasta.py MULTI_FASTA_FILE N_CHUNKS")

ffile = sys.argv[1]        # fasta file
chunks = int(sys.argv[2])  # number of chunks

# Count the sequences first, so we can derive the chunk size.
with open(ffile) as handle:
    nseq = sum(1 for line in handle if line.startswith(">"))
if nseq == 0:
    sys.exit("No sequences found in %s" % ffile)
chunksize = math.ceil(nseq / chunks)

print("Splitting multi-fasta file of", nseq, "sequences into chunks of size", chunksize)

records = SeqIO.parse(ffile, "fasta")
for i, batch in enumerate(batch_iterator(records, chunksize)):
    filename = "chunk_%i.fasta" % (i + 1)
    with open(filename, "w") as handle:
        count = SeqIO.write(batch, handle, "fasta")
    print("Wrote %i sequences to %s" % (count, filename))
print("Done.")
The solution above is of course endlessly versatile, giving you the freedom to do whatever you wish with your multi-fasta file.
Happy coding!