[CD-HIT] clustering nt database

Sun Sep 13 23:46:15 EDT 2009

Dan Bolser wrote:
> 2009/9/14 Ryan Golhar <golharam at umdnj.edu>:
>>>> I'm using cd-hit-est because the documentation says that's the only one
>>>> that works on DNA sequences.  The documentation talks about protein
>>>> sequences for the rest of the programs.  Is this not the case?
>>> Right. What a bad memory I have! It's been some years since I used
>>> cd-hit, and I forgot that it is protein specific.
>>>
>>> It may be worth trying to run it anyway... I'd imagine that the k-mer
>>> analysis is still sound on DNA strings.
>>>
>>>
>> Here is what I am getting when I try to use cd-hit:
>>
>> [golharam at hydrogen cd-hit-2009-0427]$ ./cd-hit -i /tmp/nt.1000 -o /tmp/nt90 -c 0.9
>> total seq: 1000
>>
>> Warning
>> Some seqs longer than 65536, you may define LONG_SEQ
>>
>> It is not fatal, but may affect your results !!
>>
>> longest and shortest : 163353 and 21
>> Total letters: 13083790
>> Sequences have been sorted
>>
>> Fatal Error
>> in diag_test_aapn, MAX_DIAG reached
>>
>> Program halted !!
>>
>> I did define LONG_SEQ in cd-hi.h, but still get the same error.  I suspect
>> the sequences are just too long.
> 
> I guess this could be a problem. It's not a general solution, but IIRC
> only a tiny fraction of protein sequences are > 65536 ... is this true
> with your data? Would it be hard to remove them?
> 
> If you have bioperl installed it should be straightforward to do.
> However, if your dataset is a set of many large nucleotide sequences,
> I guess this isn't an option.
> 
> 
> Dan.
> 
> 
>> Ryan
>>
> 

I'm running this on the nt database.  The more I think about it, the
less this looks like the right approach.  I'll need to give this some
thought.
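
For reference, dropping the over-long sequences as Dan suggests does not strictly require bioperl. Below is a minimal sketch in plain Python, assuming the input is ordinary FASTA and that "too long" means more than the 65536 letters named in the cd-hit warning. The names read_fasta and filter_by_length are hypothetical helpers written for this illustration, not part of cd-hit or bioperl, and the sketch does nothing about the MAX_DIAG error itself.

```python
MAX_LEN = 65536  # length limit quoted in the cd-hit warning (assumed cutoff)


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(src, dst, max_len=MAX_LEN, width=60):
    """Copy records no longer than max_len from src to dst.

    Returns (kept, dropped) so the fraction of over-long sequences
    can be checked before deciding whether filtering is acceptable.
    """
    kept = dropped = 0
    for header, seq in read_fasta(src):
        if len(seq) > max_len:
            dropped += 1
            continue
        dst.write(f">{header}\n")
        # Re-wrap the sequence at a fixed line width.
        for i in range(0, len(seq), width):
            dst.write(seq[i:i + width] + "\n")
        kept += 1
    return kept, dropped
```

It could be called with two open file handles, for example reading /tmp/nt.1000 and writing to a new file of your choosing. The returned counts answer Dan's question directly: if the dropped count is a large share of the total, filtering by length is probably not a workable option for this dataset.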