[Bio-Linux] Blasting Multiple Fasta Files

Zain A Alvi zain.alvi at student.shu.edu
Wed May 6 12:16:02 EDT 2015


Hi Everyone,

Thank you for the great explanation, Andreas. I apologize for my typographical mistake with the '-' in front of blastx.

So would the two options be something like this?

The Script route:

for input in *.fa; do blastx -db /path_to_db -query "$input" -num_threads 30 -evalue 0.001 -outfmt "6 salltitles" -out "$input.blastx_output"; done
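
A slightly more defensive form of the loop above could look like the sketch below. It assumes the BLAST+ blastx binary is on the PATH; the function name blast_each_fasta is made up for illustration, and /path_to_db is the same placeholder as above. The format string is quoted so that blastx receives "6 salltitles" as a single argument.

```shell
# Sketch only: blast_each_fasta is a hypothetical helper name.
# Usage: blast_each_fasta /path_to_db *.fa
blast_each_fasta() {
    db=$1
    shift
    for input in "$@"; do
        # Skip the literal pattern if the glob matched no files.
        [ -e "$input" ] || continue
        # One blastx run at a time, each allowed up to 30 threads.
        blastx -db "$db" -query "$input" -num_threads 30 \
            -evalue 0.001 -outfmt "6 salltitles" \
            -out "$input.blastx_output" || return 1
    done
}
```

The function stops at the first failing blastx run, so a half-finished batch does not go unnoticed.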

The parallel route, which I have never tried before:

cat input.fa | time parallel  -j+0 --eta --progress --block 100 --recstart '>' --pipe blastx -evalue 0.001 -outfmt 6 salltitles -db path_to_db -query - > final_results.blastx_output

My understanding is that this will break the input into chunks of 100 sequences. How would I use -j to make sure it only uses 30 of the 32 threads? Currently, -j+0 will use all 32 threads. Would something like -j+2 work? I saw this in the GNU parallel videos on YouTube:
https://www.youtube.com/watch?v=OpaiGYxkSuQ&list=PL284C9FF2488BC6D1&index=1&spfreload=10 and in the command breakdown from the Biostars link shared by Martin, Andreas, and Tim: https://www.biostars.org/p/63816/
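
For reference, my reading of the GNU parallel man page is that -j+N means "number of CPU threads plus N" and -j-N means "minus N". On a 32-thread machine -j+2 would therefore ask for 34 jobs, while -j-2 (or simply -j 30) would leave two threads free. Separately, --block takes a size in bytes (optionally with a k or M suffix), not a record count; with --pipe, -N 100 is the option that hands 100 records to each job. The -j arithmetic can be checked with a small sketch; the variable names are mine, and it falls back to getconf where nproc is missing.

```shell
# Work out how many jobs "-j-2" would give on this machine.
total=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
reserve=2
njobs=$(( total - reserve ))
# parallel never goes below one job; mirror that here.
[ "$njobs" -ge 1 ] || njobs=1
echo "threads=$total  -j-$reserve => $njobs jobs  (same as -j $njobs)"
```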

But the worrisome part is the report of parallel losing sequences here: http://seqanswers.com/forums/showthread.php?t=48879 Has anyone here experienced this?
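
One way to gain confidence that no records go missing, whatever tool does the splitting, is to compare the number of '>' header lines before and after the split. The sketch below does the split with plain awk instead of parallel (a substitution on my part, not the method from the thread linked above), and the names count_records and split_fasta are made up for illustration.

```shell
# Count FASTA records (header lines) across one or more files.
count_records() {
    cat "$@" | grep -c '^>'
}

# split_fasta INPUT N PREFIX: write N records per chunk to PREFIX0000.fa, ...
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Start a new chunk file every n records.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s%04d.fa", prefix, count / n)
            }
            count++
        }
        out != "" { print > out }
    ' "$1"
}

# Example (not run here):
#   split_fasta input.fa 100 chunk_
#   [ "$(count_records input.fa)" -eq "$(count_records chunk_*.fa)" ] \
#       && echo "no records lost"
```

Note that the tabular BLAST output itself cannot be used for this check, since queries without hits produce no output lines.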

Would I still use -num_threads with parallel? I have never used parallel before, hence all these questions as I try to teach myself the tools.
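
As far as I understand, the two settings multiply: J concurrent blastx jobs, each started with -num_threads T, can occupy up to J x T threads. A rough budget check, with hypothetical variable names and example values, might be:

```shell
budget=30            # threads we are willing to use (30 of the 32)
threads_per_job=5    # example value that would be passed to -num_threads
njobs=$(( budget / threads_per_job ))
# Always allow at least one job.
[ "$njobs" -ge 1 ] || njobs=1
echo "parallel -j $njobs ... blastx -num_threads $threads_per_job ..."
```

With one thread per job (the blastx default), the same budget simply becomes -j 30.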

The second option uses parallel as kindly shared by Tim, which I have slightly modified for what I am hoping to do:

ls *.fasta | time parallel  -j+0 --eta --progress --res out blastx -evalue 0.001 -outfmt 6 salltitles -db path_to_db -query

>Then to see what files were outputted:

>$ find out -name stdout

I was wondering what --res after parallel indicates. Is there an easier way to concatenate all the files by giving them a common ending, and if so, where would that go? Would it be something like this?

ls *.fasta | time parallel  -j+0 --eta --progress --res out.blastx_output blastx -evalue 0.001 -outfmt 6 salltitles -db path_to_db -query

Then concatenate the *.blastx_output files into final_results.blastx_output.
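
As I understand the man page, --res is short for --results: its argument names a directory in which parallel stores each job's output in a file called stdout (with stderr alongside), which is why the find command above works. It is not a filename suffix. A sketch for gathering those files into one table follows; collect_results is a made-up helper name, and out is the directory name used above.

```shell
# collect_results DIR OUTFILE: concatenate every "stdout" file under DIR.
collect_results() {
    # sort gives a stable, repeatable order of the per-job files
    find "$1" -type f -name stdout | sort | while IFS= read -r f; do
        cat "$f"
    done > "$2"
}

# Example (not run here):
#   collect_results out final_results.blastx_output
#
# If each job instead wrote its own file, e.g. with
#   parallel -j 30 'blastx -db path_to_db -query {} -evalue 0.001 -outfmt "6 salltitles" -out {}.blastx_output' ::: *.fasta
# then plain cat is enough. The narrower glob keeps the combined file
# from matching its own input pattern on a second run:
#   cat *.fasta.blastx_output > final_results.blastx_output
```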

On a smaller note, I received "zsh: command not found" when I typed parallel --version. Do I need to reinstall parallel, or do I need to add the location where parallel is pre-installed to ~/.zshrc? Where is this location? I have checked /usr/bin and there is no parallel, but there are parallel-fasts and parallel-fastq files.
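
A quick way to tell "not installed" from "installed but not on the PATH" is to ask the shell where it would find the command. The sketch below is generic shell that should behave the same in zsh and bash; the helper name locate_tool is made up. On a Debian/Ubuntu-based system such as Bio-Linux, GNU parallel is normally provided by the package named parallel, so installing that package would be my first guess if nothing is found.

```shell
# locate_tool NAME: report where NAME would be run from, if anywhere.
locate_tool() {
    if path=$(command -v "$1" 2>/dev/null); then
        echo "$1 found: $path"
    else
        echo "$1 not found on PATH ($PATH)"
        return 1
    fi
}

locate_tool parallel || echo "try: sudo apt-get install parallel"
```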

Sorry for all these novice questions. I am trying to teach myself all these tools and strategies such as parallel. 

Many thanks to everyone. I sincerely appreciate it. 

Kind regards,

Zain 

________________________________________
From: Andreas Leimbach <aleimba at gwdg.de>
Sent: Wednesday, May 6, 2015 3:58 AM
To: Bio-Linux help and discussion
Subject: Re: [Bio-Linux] Blasting Multiple Fasta Files

Hi,

if you lose the "-" before blastx it will work:
for input in *.fa; do blastx -db /path_to_db -query $input -out $input.blastx_output; done

And as Tony/Martin recommended you should really use '-num_threads'. The
blast+ routines should be faster than legacy blast and you want the
extra output option anyway. Still parallel will be faster than the loop.

Having different databases won't make a difference, all will be held in
memory anyway.

I don't think you can run a single assembly through parallel. The
assembler has to look at the whole data set. In any case, assemblers are
designed for multithreaded use; they all have an *option* for how
many threads you want to use (in the case of velvet through OpenMP). For
Illumina data I'd recommend SPAdes, it has a nice workflow (including
error correction etc.) and thus is quite user-friendly.

The xargs example won't give you anything that parallel can't do.

mpiBLAST is mainly meant for clustered computers (i.e. several servers
being used for a single program run). IMO, it won't give you a speed
advantage on a single computer with several cores in comparison to the
aforementioned possibilities.

HTH,
Andreas

--
Andreas Leimbach
Universität Münster
Institut für Hygiene
Mendelstr. 7
D-48149 Münster
Germany

Tel.: +49 (0)551 39 33843
E-Mail: aleimba at gwdg.de

On 06.05.2015 06:33, Zain A Alvi wrote:
> Hi Everyone,
>
> Thank you for all the great and helpful recommendations, especially Tim, Tony, Dr. Beall, and Andreas.  I am trying to do exactly what Tim has shown: have BLASTX run on each fasta file one at a time, not all at the same time.  It should go through each fasta file in turn, run BLASTX, and then move on to the next fasta file until there are no fasta files left in the folder.
>
> Would something like this work as well:
>
> for input in *.fa; do -blastx -db /path_to_db -query $input -out $input.blastx_output; done
>
> Then concatenate all *.blastx_output > Final_BlastxOutput.blastx_output
>
> Thank you for the very interesting information about parallel on Bio Linux. Would parallel work well for de-novo assemblers like Velvet and Spades (as examples)? Especially Velvet after reading about: https://www.biostars.org/p/86907/
>
> Also, what about creating multiple copies of the same database with different names/titles? Would that get around the problem of accessing the same database and the memory problems when trying to run multiple BLASTX instances?  I know it is not recommended, but would this quasi-method be of any benefit? Or should I just stick with the script above or the script that Tim kindly shared?
>
> For example:
>
> Folder A
> blastx -db /path_to_db01 -infile input_seq_001-100 -out ouput_seq_001-100.blastx_output
> blastx -db /path_to_db01 -infile input_seq_101-200 -out ouput_seq_101-200.blastx_output
> etc to
> blastx -db /path_to_db01 -infile input_seq_401-499 -out ouput_seq_401-499.blastx_output
>
> Folder B:
> blastx -db /path_to_db02 -infile input_seq_501-600 -out ouput_seq_501-600.blastx_output
> blastx -db /path_to_db02 -infile input_seq_601-700 -out ouput_seq_601-700.blastx_output
> etc
> blastx -db /path_to_db02 -infile input_seq_901-1000 -out ouput_seq_901-1000.blastx_output
>
> On a side note, how does BLASTX from the BLAST+ package compare to MPI-BLAST? I thought MPI-BLAST is based on the older version of BLAST, hence it might return fewer results. This is our major concern, as I am going for tabular output format with all sequence titles and information (-outfmt 6 salltitles). This will be helpful for filtering the viral genome using some simple grep -w filtering techniques for the contigs.
>
> Also, there are some interesting points about using xargs to parallelize BLAST+ (the last example): https://www.biostars.org/p/76009/ Has anyone tried this?
>
> Thank you Prash for the recommendation for mpich. It's definitely interesting how it works.  My mentor and I are trying to accomplish this on a 32-thread workstation (Intel Xeon E5-2640v2 (16 cores)) with 128 GB of RAM, for a viral genome that I am planning to BLASTX against the viral RefSeq protein sequences from NCBI.
>
> Thank you Dr. Beall. If you don't mind sharing, I would definitely be interested in taking a look to see what the script is like. Many thanks.  If I am able to successfully hack the script, I am more than willing to share it with the rest of the community.
>
> Thank you again Andreas, Tim, Tony, Dr. Beall, and Prash. I really appreciate all the suggestions and help.
>
> Kind regards,
>
> Zain
>
> ________________________________________
> From: Tony Travis <tony.travis at minke-informatics.co.uk>
> Sent: Tuesday, May 5, 2015 12:19 PM
> To: bio-linux at nebclists.nerc.ac.uk
> Subject: Re: [Bio-Linux] Blasting Multiple Fasta Files
>
> On 05/05/15 16:08, Tim Booth wrote:
>> [...]
>> You want to run:
>>
>> blastx -db foo -infile seqs_000000_to_000999.fsa -out seqs_000000_to_000999.blastx
>> ...then...
>> blastx -db foo -infile seqs_001000_to_001999.fsa -out seqs_001000_to_001999.blastx
>> ...then...
>> blastx -db foo -infile seqs_002000_to_002999.fsa -out seqs_002000_to_002999.blastx
>> ...then...
>> blastx -db foo -infile seqs_003000_to_003999.fsa -out seqs_003000_to_003999.blastx
>> ...etc
>> [...]
>
> Hi, Tim.
>
> It's not good to run multiple instances of BLAST on the same machine
> because each invocation of BLAST will have a copy of the same database
> stored in memory. MPI-BLAST avoids this by loading different parts of
> the database into each worker process.
>
> The time-consuming part of BLAST is the initial exact word match and
> both the old and new versions of BLAST allow you to specify how many
> threads to run to speed this up:
>
>   BLAST  uses "-a nn"
>   BLAST+ uses "-num_threads nn"
>
> I compared "blastall", "blastn", "blat", "pblat" and "bowtie" for
> mapping microRNA and mRNA to a custom database in:
>
> Travis, A. J., Moody, J., Helwak, A., Tollervey, D., & Kudla, G. (2013).
> Hyb: A bioinformatics pipeline for the analysis of CLASH (crosslinking,
> ligation and sequencing of hybrids) data. Methods (San Diego, Calif.).
> http://doi.org/10.1016/j.ymeth.2013.10.015
>
> ["pblat" is a parallel/multi-threaded version of BLAT]
>
> You will need a script like this one by Jonathan Moody to convert
> "bowtie2" alignments to equivalent tabular BLAST output:
>
>   https://github.com/gkudla/hyb/blob/master/bin/sam2blast
>
> Bye,
>
>   Tony.
>
> --
> Minke Informatics Limited, Registered in Scotland - Company No. SC419028
> Registered Office: 3 Donview, Bridge of Alford, AB33 8QJ, Scotland (UK)
> tel. +44(0)19755 63548                    http://minke-informatics.co.uk
> mob. +44(0)7985 078324        mailto:tony.travis at minke-informatics.co.uk
> _______________________________________________
> Bio-Linux mailing list
> Bio-Linux at nebclists.nerc.ac.uk
> http://nebclists.nerc.ac.uk/mailman/listinfo/bio-linux