[CD-HIT] clustering nt database

Dan Bolser dan.bolser at gmail.com
Fri Sep 11 20:43:18 EDT 2009


2009/9/11 Ryan Golhar <golharam at umdnj.edu>:
> I'm using cd-hit-est because the documentation says that's the only one that
> works on DNA sequences.  The documentation talks about protein sequences for
> the rest of the programs.  Is this not the case?

Right. What a bad memory I have! It's been some years since I used
cd-hit, and I forgot that it is protein specific.

It may be worth trying to run the plain cd-hit binary on your DNA input
anyway... I'd imagine that the k-mer analysis is still sound on DNA strings.
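
On the original question of what to enlarge MAX_SEQ to: the error message
suggests it is a compile-time macro, so presumably it needs to be at least as
large as the longest sequence in the input. Below is a minimal Python sketch
for finding that length, assuming the input is a plain FASTA file. The
function name longest_fasta_record is only for illustration and is not part
of cd-hit.

```python
import io


def longest_fasta_record(handle):
    """Return (header, length) of the longest record in a FASTA stream.

    Assumes plain FASTA: header lines start with '>', and every other
    line belongs to the most recent header. Returns (None, 0) if the
    stream contains no records.
    """
    best_header, best_len = None, 0
    header, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # Close out the previous record before starting a new one.
            if header is not None and length > best_len:
                best_header, best_len = header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    # The last record has no following header to close it out.
    if header is not None and length > best_len:
        best_header, best_len = header, length
    return best_header, best_len


# Tiny in-memory example; a real run would pass an open file handle.
demo = io.StringIO(">a\nACGT\nAC\n>b\nACGTACGTAC\n>c\nA\n")
print(longest_fasta_record(demo))
```

Passing an open handle on the nt file instead of the in-memory demo would give
the figure for the real data, assuming that path is a FASTA file and not a
formatted BLAST database. Whether cd-hit-est can then hold sequences of that
length in memory is a separate question.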


> Dan Bolser wrote:
>>
>> Hi Ryan,
>>
>> I think you should be using cd-hit, not cd-hit-est (if I guess from
>> the name correctly that cd-hit-est is designed for clustering ESTs).
>>
>> The error would make sense in this case, as protein sequences can be
>> very long (e.g. titin), but ESTs are typically very short.
>>
>> Sorry that I am not up to speed with the latest releases of cd-hit,
>> but why are you not running the 'cd-hit' binary?
>>
>>
>> Dan.
>>
>> 2009/9/10 Ryan Golhar <golharam at umdnj.edu>:
>>>
>>> Hi,
>>>
>>> How do I go about clustering the nt database?
>>>
>>> When I run
>>>
>>> cd-hit-est -i /usr/local/ncbi/db/nt -o /tmp/nt90 -c 0.90 -n 8
>>>
>>> I get the error:
>>>
>>> Fatal Error
>>> Too long sequence found, enlarge Macro MAX_SEQ
>>>
>>> Program halted !!
>>>
>>> What do I enlarge MAX_SEQ to?
>>>
>>> _______________________________________________
>>> CD-HIT-l mailing list
>>> CD-HIT-l at bioinformatics.org
>>> http://www.bioinformatics.org/mailman/listinfo/cd-hit-l
>>>
>>
>