Splitting a multi-fasta file
Splitting a huge multi-fasta file can be very useful, especially if you want to reduce the memory footprint of your analyses. Here are some ways to do that.
One-liners:
Using awk, we can easily split a file (multi.fa) into chunks of N sequences each (here, N=500) with the following one-liner:
awk 'BEGIN {n=0;} /^>/ {if(n%500==0){file=sprintf("chunk%d.fa",n);} print > file; n++; next;} { print > file; }' < multi.fa
This will result in multiple files: chunk0.fa containing sequences 0 to 499, chunk500.fa containing sequences 500 to 999, and so on. The last file, however, will likely contain fewer sequences (the total number of sequences modulo 500).
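To sanity-check the result, you can count the header lines per chunk with grep:
grep -c "^>" chunk*.fa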
So, if you want to make sure each chunk has approximately the same number of sequences in it, you can fiddle a bit more with awk…
awk -v chunksize=$(grep -c "^>" multi.fasta) 'BEGIN{n=0; chunksize=int(chunksize/10)+1 } /^>/ {if(n%chunksize==0){file=sprintf("chunk%d.fa",n);} print > file; n++; next;} { print > file; }' < multi.fasta
… but as you can see, this quickly becomes a bit messy. It’s pretty fast though, so you may as well use it!
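If you use this often, you could tuck the same logic into a small wrapper script (split_fasta_chunks.sh is just an illustrative name here), computing the chunk size in bash rather than inside awk:
#!/bin/bash
# split_fasta_chunks.sh: split a multi-fasta file into N roughly equal chunks
# (illustrative wrapper around the awk one-liner above; names are hypothetical)
# usage: ./split_fasta_chunks.sh multi.fasta 10
nchunks=${2:-10}
nseq=$(grep -c "^>" "$1")
awk -v chunksize=$(( (nseq + nchunks - 1) / nchunks )) '
  /^>/ { if (n % chunksize == 0) file = sprintf("chunk%d.fa", n); n++ }
  { print > file }
' "$1"
The $(( ... )) arithmetic rounds the chunk size up, so you end up with at most N chunks.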
Another great solution is GenomeTools (gt), which you can find at http://genometools.org/; it offers a simple command for exactly this:
gt splitfasta -numfiles 10 multi.fasta
Biopython solution
The above solutions all work, but perhaps you want a bit more freedom. For that, it’s probably best to learn Biopython, which lets you do all kinds of things with your sequences. Here’s a basic example of how to chop up your multi-fasta file, derived from the official Biopython website.
# split_fasta.py (assumes you have Biopython installed, e.g. with pip install biopython)
import sys
import math
from Bio import SeqIO

def batch_iterator(iterator, batch_size):
    """
    This is a generator function, and it returns lists of the
    entries from the supplied iterator. Each list will have
    batch_size entries, although the final list may be shorter.
    (derived from https://biopython.org/wiki/Split_large_file)
    """
    entry = True  # make sure we loop at least once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None
            if entry is None:
                break  # end of file reached
            batch.append(entry)
        if batch:
            yield batch

if len(sys.argv) != 3:
    sys.exit("usage: split_fasta.py MULTI_FASTA_FILE N_CHUNKS")

ffile = sys.argv[1]        # fasta file
chunks = int(sys.argv[2])  # number of chunks

# Count the sequences first, so we can derive the chunk size.
with open(ffile) as handle:
    nseq = sum(1 for line in handle if line.startswith(">"))
if nseq == 0:
    sys.exit("No sequences found in %s" % ffile)
chunksize = math.ceil(nseq / chunks)

print("Splitting multi-fasta file of", nseq, "sequences into chunks of size", chunksize)

records = SeqIO.parse(ffile, "fasta")
for i, batch in enumerate(batch_iterator(records, chunksize)):
    filename = "chunk_%i.fasta" % (i + 1)
    with open(filename, "w") as handle:
        count = SeqIO.write(batch, handle, "fasta")
    print("Wrote %i sequences to %s" % (count, filename))
print("Done.")
The solution above is of course endlessly versatile, giving you the freedom to do whatever you wish with your multi-fasta file.
Happy coding!