[Bio-Linux] Blasting Multiple Fasta Files

Wed May 6 00:33:28 EDT 2015

Hi Everyone,

Thank you for all the great and helpful recommendations, especially Tim, Tony, Dr. Beall, and Andreas.  I am trying to do exactly what Tim has shown: have BLASTX run on each FASTA file one at a time, not all at the same time.  It should run BLASTX on one FASTA file, then move on to the next, until there are no FASTA files left in the folder.

Would something like this work as well: 

for input in *.fa; do blastx -db /path_to_db -query "$input" -out "$input.blastx_output"; done

Then concatenate all *.blastx_output > Final_BlastxOutput.blastx_output
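A minimal sketch of the loop-then-concatenate idea above, written as a shell function. The names BLASTX_BIN and run_blastx_sequential are hypothetical and only used for illustration; the blastx options are the ones already shown (-db, -query, -out).

```shell
#!/bin/sh
# Sketch only: BLASTX_BIN and run_blastx_sequential are illustrative names.
BLASTX_BIN="${BLASTX_BIN:-blastx}"

# Usage: run_blastx_sequential DB DIR MERGED_FILE
# Runs blastx on each DIR/*.fa in turn and appends each result to MERGED_FILE.
run_blastx_sequential() {
    : > "$3"                                  # start with an empty merged file
    for input in "$2"/*.fa; do
        [ -e "$input" ] || continue           # no .fa files matched the pattern
        "$BLASTX_BIN" -db "$1" -query "$input" -out "$input.blastx_output" || return 1
        cat "$input.blastx_output" >> "$3"    # append this file's hits
    done
}
```

It could be called as: run_blastx_sequential /path_to_db . Final_BlastxOutput.blastx_output. Appending inside the loop avoids a later "cat *.blastx_output", which on a rerun would also match the merged file itself.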

Thank you for the very interesting information about parallel on Bio-Linux. Would parallel work well for de novo assemblers such as Velvet and SPAdes? I am asking especially about Velvet after reading: https://www.biostars.org/p/86907/ 

Also, would creating multiple copies of the same database under different names/titles get around the problem of several BLASTX runs accessing the same database, and the memory problems that come with it? I know it is not recommended, but would this quasi-method be of any benefit? Or should I just stick with the script above or the script that Tim kindly shared? 

For example: 

Folder A 
blastx -db /path_to_db01 -query input_seq_001-100 -out output_seq_001-100.blastx_output
blastx -db /path_to_db01 -query input_seq_101-200 -out output_seq_101-200.blastx_output
etc to
blastx -db /path_to_db01 -query input_seq_401-499 -out output_seq_401-499.blastx_output

Folder B: 
blastx -db /path_to_db02 -query input_seq_501-600 -out output_seq_501-600.blastx_output
blastx -db /path_to_db02 -query input_seq_601-700 -out output_seq_601-700.blastx_output
etc
blastx -db /path_to_db02 -query input_seq_901-1000 -out output_seq_901-1000.blastx_output

On a side note, how does BLASTX from the BLAST+ package compare with MPI-BLAST? I thought MPI-BLAST is based on the older version of BLAST and hence might return fewer results. This is our major concern, as I am going for tabular output format with all sequence titles and information (-outfmt 6 salltitles). This will be helpful for filtering the contigs for the viral genome using some simple grep -w filtering techniques. 
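A minimal sketch of the kind of grep -w filtering mentioned above. It assumes a tab-separated tabular output in which the query (contig) ID is the first column, as in the default tabular layout, with the subject titles somewhere later on the line. The function name contigs_matching is hypothetical.

```shell
#!/bin/sh
# Sketch only: contigs_matching is an illustrative name.
# Usage: contigs_matching TABULAR_FILE WORD
# Prints the unique first-column IDs of lines that contain WORD as a whole word.
contigs_matching() {
    grep -w -- "$2" "$1" | cut -f 1 | sort -u
}
```

For example, contigs_matching Final_BlastxOutput.blastx_output herpesvirus would list each contig that has at least one hit line containing that word. Note that grep -w is case-sensitive unless -i is added.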

Also, there are some interesting points about using xargs to parallelize BLAST+ (the last example): https://www.biostars.org/p/76009/ Has anyone tried this?
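For reference, a minimal sketch of what an xargs-based run could look like. This is not necessarily the approach in the linked post. It assumes GNU findutils (find -maxdepth/-print0 and xargs -0/-P); BLASTX_BIN and run_blastx_xargs are hypothetical names.

```shell
#!/bin/sh
# Sketch only: BLASTX_BIN and run_blastx_xargs are illustrative names.
BLASTX_BIN="${BLASTX_BIN:-blastx}"

# Usage: run_blastx_xargs DB DIR JOBS
# Starts one blastx per DIR/*.fa file, with at most JOBS running at once.
run_blastx_xargs() {
    find "$2" -maxdepth 1 -name '*.fa' -print0 |
        xargs -0 -P "$3" -I {} "$BLASTX_BIN" -db "$1" -query {} -out {}.blastx_output
}
```

With JOBS set to 1 this behaves like the sequential loop. With a higher value, each job is a separate blastx process, so the memory concern about several instances loading the same database still applies.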

Thank you, Prash, for the recommendation of mpich. It's definitely interesting how it works.  My mentor and I are trying to accomplish this on a 32-thread workstation (Intel Xeon E5-2640v2, 16 cores) with 128 GB of RAM. It is for a viral genome, for which I am planning to run BLASTX against the viral RefSeq protein sequences from NCBI. 

Thank you, Dr. Beall. If you don't mind sharing, I would definitely be interested in taking a look at the script to see what it is like. Many thanks.  If I am able to successfully hack the script, I am more than willing to share it with the rest of the community. 

Thank you again Andreas, Tim, Tony, Dr. Beall, and Prash. I really appreciate all the suggestions and help. 

Kind regards,

Zain 

________________________________________
From: Tony Travis <tony.travis at minke-informatics.co.uk>
Sent: Tuesday, May 5, 2015 12:19 PM
To: bio-linux at nebclists.nerc.ac.uk
Subject: Re: [Bio-Linux] Blasting Multiple Fasta Files

On 05/05/15 16:08, Tim Booth wrote:
> [...]
> You want to run:
>
> blastx -db foo -infile seqs_000000_to_000999.fsa -out seqs_000000_to_000999.blastx
> ...then...
> blastx -db foo -infile seqs_001000_to_001999.fsa -out seqs_001000_to_001999.blastx
> ...then...
> blastx -db foo -infile seqs_002000_to_002999.fsa -out seqs_002000_to_002999.blastx
> ...then...
> blastx -db foo -infile seqs_003000_to_003999.fsa -out seqs_003000_to_003999.blastx
> ...etc
> [...]

Hi, Tim.

It's not good to run multiple instances of BLAST on the same machine
because each invocation of BLAST will have a copy of the same database
stored in memory. MPI-BLAST avoids this by loading different parts of
the database into each worker process.

The time-consuming part of BLAST is the initial exact word match and
both the old and new versions of BLAST allow you to specify how many
threads to run to speed this up:

  BLAST  uses "-a nn"
  BLAST+ uses "-num_threads nn"
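
As an illustration of the BLAST+ form, a minimal sketch that runs a single blastx process with a chosen thread count. BLASTX_BIN and run_blastx_threaded are hypothetical names; -db, -query, -out, and -num_threads are the BLAST+ options mentioned in this thread.

```shell
#!/bin/sh
# Sketch only: BLASTX_BIN and run_blastx_threaded are illustrative names.
BLASTX_BIN="${BLASTX_BIN:-blastx}"

# Usage: run_blastx_threaded DB QUERY_FASTA OUT_FILE THREADS
# Runs one blastx instance and lets it use THREADS threads.
run_blastx_threaded() {
    "$BLASTX_BIN" -db "$1" -query "$2" -out "$3" -num_threads "$4"
}
```

For example: run_blastx_threaded foo seqs_000000_to_000999.fsa seqs_000000_to_000999.blastx 16. A suitable thread count depends on the machine and is left to the caller here.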

I compared "blastall", "blastn", "blat", "pblat" and "bowtie" for
mapping microRNA and mRNA to a custom database in:

Travis, A. J., Moody, J., Helwak, A., Tollervey, D., & Kudla, G. (2013).
Hyb: A bioinformatics pipeline for the analysis of CLASH (crosslinking,
ligation and sequencing of hybrids) data. Methods (San Diego, Calif.).
http://doi.org/10.1016/j.ymeth.2013.10.015

["pblat" is a parallel/multi-threaded version of BLAT]

You will need a script like this one by Jonathan Moody to convert
"bowtie2" alignments to equivalent tabular BLAST output:

  https://github.com/gkudla/hyb/blob/master/bin/sam2blast

Bye,

  Tony.

--
Minke Informatics Limited, Registered in Scotland - Company No. SC419028
Registered Office: 3 Donview, Bridge of Alford, AB33 8QJ, Scotland (UK)
tel. +44(0)19755 63548                    http://minke-informatics.co.uk
mob. +44(0)7985 078324        mailto:tony.travis at minke-informatics.co.uk
_______________________________________________
Bio-Linux mailing list
Bio-Linux at nebclists.nerc.ac.uk
http://nebclists.nerc.ac.uk/mailman/listinfo/bio-linux