[Bio-Linux] Blasting Multiple Fasta Files

Clifford Beall cliffbeall at gmail.com
Tue May 5 11:17:10 EDT 2015


I have a bash script, written by a former colleague, that splits up the queries, generates the blast commands, and runs them in parallel through xargs.

It does speed up the process a lot, depending on how many cores you have.
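
To give an idea of the split-and-generate step, here is a minimal sketch in Python rather than bash. It is not our script: the function names, the chunk file naming, and the database name in the usage note are made up for illustration, and it builds blastx commands with tabular output since that is what was asked about.

```python
from itertools import islice
from pathlib import Path


def read_fasta(path):
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record = []
    with open(path) as handle:
        for line in handle:
            # A new header closes the record collected so far.
            if line.startswith(">") and record:
                yield "".join(record)
                record = []
            if line.strip():
                record.append(line)
    if record:
        yield "".join(record)


def split_fasta(path, out_dir, per_file=1000):
    """Write consecutive groups of `per_file` records to numbered chunk files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = read_fasta(path)
    chunks = []
    while True:
        batch = list(islice(records, per_file))
        if not batch:
            break
        chunk = out_dir / f"chunk_{len(chunks):04d}.fasta"
        chunk.write_text("".join(batch))
        chunks.append(chunk)
    return chunks


def blastx_command(chunk, db, threads=1):
    """Build the argument list for one blastx run with tabular output."""
    out = chunk.with_suffix(".tsv")
    return ["blastx", "-query", str(chunk), "-db", db,
            "-outfmt", "6", "-num_threads", str(threads), "-out", str(out)]
```

Something like `[blastx_command(c, "viral_refseq_protein") for c in split_fasta("contigs.fasta", "chunks")]` would then give one command per chunk (both names are placeholders).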

It would need some hacking for your use case: the splitting is kind of idiosyncratic, it does a nucleotide blast, and we then post-process the blast results, which you would not need.

So you might be better off starting from scratch, but let me know if you want to take a look at it.
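
If you do start from scratch, the parallel part can be small. The sketch below is an assumption-laden stand-in for the xargs approach, not what our script does: it uses a Python thread pool to keep a fixed number of external commands running, then joins the per-chunk outputs in order. The names `run_all` and `concatenate` are hypothetical. Plain concatenation is only safe for output formats without a per-file header or footer, such as tabular `-outfmt 6`.

```python
import os
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, workers=None):
    """Run each command (an argument list) with at most `workers` at a time.

    Returns the exit codes in the same order as `commands`.
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads only wait on the child processes, so the GIL is not a limit here.
        results = pool.map(lambda cmd: subprocess.run(cmd).returncode, commands)
        return list(results)


def concatenate(parts, dest):
    """Append the files in `parts`, in order, into the single file `dest`."""
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

Each entry in `commands` would be one blastx invocation per chunk, given as an argument list. Checking that every returned exit code is 0 before concatenating avoids silently merging partial results.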



Clifford Beall, PhD, MSc
cliffbeall at gmail.com
beall.3 at osu.edu
Research Assistant Professor
Division of Biosciences
Ohio State U. College of Dentistry



> 
> Message: 4
> Date: Tue, 5 May 2015 16:54:59 +0200
> From: Andreas Leimbach <aleimba at gwdg.de>
> To: Bio-Linux help and discussion <bio-linux at nebclists.nerc.ac.uk>
> Subject: Re: [Bio-Linux] Blasting Multiple Fasta Files
> Message-ID: <5548D9C3.7050501 at gwdg.de>
> Content-Type: text/plain; charset="windows-1252"
> 
> Hey,
> 
> blast+ is not parallelized all that well. Thus, you might want to try
> GNU parallel to speed up your calculations somewhat, depending on your
> machine. Here are some links:
> 
> https://www.biostars.org/p/63816/
> https://www.biostars.org/p/76009/
> 
> Cheers,
> Andreas
> 
> 
> Andreas Leimbach
> Universität Münster
> Institut für Hygiene
> Mendelstr. 7
> D-48149 Münster
> Germany
> 
> Tel.: +49 (0)551 39 33843
> E-Mail: aleimba at gwdg.de
> 
> On 05.05.2015 16:31, Zain A Alvi wrote:
>> Hi Marty,
>> 
>> I apologize for the confusion. I am splitting a fasta file that contains approximately 100,000 sequences into 100 fasta files that contain 1,000 sequences each. I am hoping this will expedite the BLASTx process.
>> 
>> 
>> Kind regards,
>> 
>> 
>> Zain
>> 
>> ________________________________
>> From: Martin Gollery <mgollery at unr.edu>
>> Sent: Tuesday, May 5, 2015 10:23 AM
>> To: Bio-Linux help and discussion
>> Subject: Re: [Bio-Linux] Blasting Multiple Fasta Files
>> 
>> Running a million BLASTX jobs on one sequence each is not going to save you time. It is better to run one BLASTX job on a million sequences.
>> 
>> -Marty
>> 
>> 
>> 
>> On Tue, May 5, 2015 at 7:00 AM, Zain A Alvi <zain.alvi at student.shu.edu> wrote:
>> 
>> Dear Sir or Madam,
>> 
>> 
>> I hope everything is well. I have downloaded all the viral protein sequences from the NCBI refseq database using the script from their E-book. I have de-novo assembled some viral genomes, and I know BLASTX takes a long time if the fasta file is large. I have been able to split the large fasta file into new fasta files that each contain a user-specified number of contigs.
>> 
>> 
>> I was wondering whether there is a method to run BLASTX automatically on each of the fasta files, one at a time, so that it completes in a "shorter" amount of time than BLASTing the whole large de-novo assembled fasta file. I was then hoping to concatenate all the results into one file.
>> 
>> 
>> Sincerely,
>> 
>> 
>> Zain
>> 
>> _______________________________________________
>> Bio-Linux mailing list
>> Bio-Linux at nebclists.nerc.ac.uk
>> http://nebclists.nerc.ac.uk/mailman/listinfo/bio-linux
>> 
>> 
>> 
>> 
>> --
>> Martin Gollery
>> Senior Bioinformatics Scientist
>> Tahoe Informatics
>> www.bioinformaticist.biz
>> www.hiddenmarkovmodels.com
>> 
>> 
>> 
>> 
>> 
> 
> 
> ------------------------------
> 
> End of Bio-Linux Digest, Vol 80, Issue 3
> ****************************************
