[Bio-Linux] Blasting Multiple Fasta Files
Tim Booth
tbooth at ceh.ac.uk
Tue May 5 11:08:20 EDT 2015
Hi Zain,
So, I think you are saying that if you have a directory of files like
this:
seqs_000000_to_000999.fsa
seqs_001000_to_001999.fsa
seqs_002000_to_002999.fsa
seqs_003000_to_003999.fsa
...etc
You want to run:
blastx -db foo -query seqs_000000_to_000999.fsa -out seqs_000000_to_000999.blastx
...then...
blastx -db foo -query seqs_001000_to_001999.fsa -out seqs_001000_to_001999.blastx
...then...
blastx -db foo -query seqs_002000_to_002999.fsa -out seqs_002000_to_002999.blastx
...then...
blastx -db foo -query seqs_003000_to_003999.fsa -out seqs_003000_to_003999.blastx
...etc
This can be done with a shell loop. The tricky bit is generating the output file name:
$ for f in *.fsa ; do
> outname="$(basename "$f" .fsa).blastx"
> blastx -db foo -query "$f" -out "$outname"
> done
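If you want to check the generated names before launching anything, a
dry run along these lines may help. This is only a sketch: the function
name preview_blast_cmds is made up for illustration, 'foo' is the
placeholder database from above, and it assumes your files end in .fsa
as in the listing.

```shell
# Print (but do not run) the blastx command for each input file.
# preview_blast_cmds is a hypothetical helper name; "foo" is the
# placeholder database name used in the examples above.
preview_blast_cmds() {
    for f in "$@"; do
        # Strip the directory and the .fsa suffix, then add .blastx
        outname="$(basename "$f" .fsa).blastx"
        echo "blastx -db foo -query $f -out $outname"
    done
}
```

Calling it as "preview_blast_cmds *.fsa" lists one command per input
file, so you can inspect the output names without starting BLAST.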
A nifty way of running jobs like this is with GNU 'parallel', which is
pre-installed on Bio-Linux 8. It can run multiple jobs at once and even
send them to remote machines for you. Here's the basic invocation
(yes, it's a bit cryptic - it's modelled on the xargs tool). The --res
option saves each job's output under the 'out' directory:
$ ls *.fsa | parallel --res out blastx -db foo -query
Then, to see which output files were produced:
$ find out -name stdout
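Since you also wanted everything in one file at the end, here is a
rough sketch for joining the per-job outputs that parallel saved. The
function name combine_parallel_results and the file name
all_results.blastx are my own placeholders, and it assumes the results
directory is 'out' as above and that plain concatenation is acceptable
for the BLAST output format you chose.

```shell
# Concatenate every saved stdout file under a results directory into
# one file, in sorted path order.
# combine_parallel_results is a hypothetical helper name.
#   $1 = results directory that was given to --res (e.g. out)
#   $2 = combined output file (keep it outside the results directory)
combine_parallel_results() {
    find "$1" -name stdout | sort | while IFS= read -r f; do
        cat "$f"
    done > "$2"
}

# Example call (placeholder output name):
#   combine_parallel_results out all_results.blastx
```

Sorting the paths keeps the combined file in a stable order from one
run to the next; if the order of hits matters to you, check it against
your input file names.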
Hope that helps.
(Just before sending this, I see that Andreas recommended parallel too!)
TIM
On Tue, 2015-05-05 at 15:31 +0100, Zain A Alvi wrote:
> Hi Marty,
>
>
> I apologize for the confusion. I am splitting a fasta file that
> contains approximately 100,000 fasta sequences to 100 fasta files that
> contains 1000 sequences each. I am hoping this will expedite the
> BLASTx process.
>
>
> Kind regards,
>
>
>
> Zain
>
>
>
> ______________________________________________________________________
> From: Martin Gollery <mgollery at unr.edu>
> Sent: Tuesday, May 5, 2015 10:23 AM
> To: Bio-Linux help and discussion
> Subject: Re: [Bio-Linux] Blasting Multiple Fasta Files
>
> Running a million BLASTX jobs on one sequence each is not going to
> save you time. It is better to run one BLASTX job on a million
> sequences.
>
>
> -Marty
>
>
>
>
> On Tue, May 5, 2015 at 7:00 AM, Zain A Alvi
> <zain.alvi at student.shu.edu> wrote:
> Dear Sir or Madam,
>
>
>
> I hope everything is well. I have downloaded all the viral
> protein sequences from the NCBI refseq database using
> their script from their E-book. I have de-novo assembled some
> viral genomes and I know BLASTX takes a long time if the fasta
> is large. I have been able to split the large fasta file
> based on an user specified contig number in each new fasta
> file.
>
>
> I was wondering is there a method to run BLASTX automatically
> on each of the fasta files one at a time so that it will be
> able to complete in a "shorter" amount of time as compared to
> BLASTing the whole large de-novo assembled fasta file. Then I
> was hoping to concatenate all the results into one file.
>
>
>
> Sincerely,
>
>
>
> Zain
>
>
>
>
> _______________________________________________
> Bio-Linux mailing list
> Bio-Linux at nebclists.nerc.ac.uk
> http://nebclists.nerc.ac.uk/mailman/listinfo/bio-linux
>
>
>
>
>
--
Tim Booth <tbooth at ceh.ac.uk>
NERC Environmental Bioinformatics Centre
Centre for Ecology and Hydrology
Maclean Bldg, Benson Lane
Crowmarsh Gifford
Wallingford, England
OX10 8BB
http://environmentalomics.org/bio-linux
+44 1491 69 2297