[CD-HIT] clustering nt database
Ryan Golhar
golharam at umdnj.edu
Sun Sep 13 23:46:15 EDT 2009
Dan Bolser wrote:
> 2009/9/14 Ryan Golhar <golharam at umdnj.edu>:
>>>> I'm using cd-hit-est because the documentation says that's the only one
>>>> that works on DNA sequences. The documentation talks about protein
>>>> sequences for the rest of the programs. Is this not the case?
>>> Right. What a bad memory I have! It's been some years since I used
>>> cd-hit, and I forgot that it is protein specific.
>>>
>>> It may be worth trying to run it anyway... I'd imagine that the k-mer
>>> analysis is still sound on DNA strings.
>>>
>>>
>> Here is what I am getting when I try to use cd-hit:
>>
>> [golharam at hydrogen cd-hit-2009-0427]$ ./cd-hit -i /tmp/nt.1000 -o /tmp/nt90 -c 0.9
>> total seq: 1000
>>
>> Warning
>> Some seqs longer than 65536, you may define LONG_SEQ
>>
>> It is not fatal, but may affect your results !!
>>
>> longest and shortest : 163353 and 21
>> Total letters: 13083790
>> Sequences have been sorted
>>
>> Fatal Error
>> in diag_test_aapn, MAX_DIAG reached
>>
>> Program halted !!
>>
>> I did define LONG_SEQ in cd-hi.h, but still get the same error. I suspect
>> the sequences are just too long.
>
> I guess this could be a problem. It's not a general solution, but IIRC
> only a tiny fraction of protein sequences are longer than 65536
> residues ... is this true with your data? Would it be hard to remove them?
>
> If you have bioperl installed it should be straightforward to do.
> However, if your dataset is a set of many large nucleotide sequences,
> I guess this isn't an option.
>
>
> Dan.
>
>
>> Ryan
>>
>
I'm running this on the nt database. The more I think about it, the less
this looks like the right approach. I'll need to give this some thought.
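
For reference, the length filter Dan suggests can be sketched without
BioPerl. This is a minimal Python sketch and not something from the thread:
the names read_fasta, drop_long and filter_file are hypothetical, the input
is assumed to be plain FASTA, and the 65536 cutoff is simply the value from
the cd-hit warning quoted above. Whether removing those sequences is enough
to avoid the MAX_DIAG error is untested.

```python
MAX_LEN = 65536  # assumed cutoff, taken from the cd-hit warning message


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_long(records, max_len=MAX_LEN):
    """Keep only records whose sequence is at most max_len letters long."""
    for header, seq in records:
        if len(seq) <= max_len:
            yield header, seq


def filter_file(in_path, out_path, max_len=MAX_LEN, width=70):
    """Copy in_path to out_path, skipping over-long sequences.

    Returns (kept, dropped) record counts.
    """
    kept = total = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in read_fasta(src):
            total += 1
            if len(seq) > max_len:
                continue
            kept += 1
            dst.write(header + "\n")
            # Re-wrap the sequence at a fixed line width.
            for i in range(0, len(seq), width):
                dst.write(seq[i:i + width] + "\n")
    return kept, total - kept
```

Something like filter_file("/tmp/nt.1000", "/tmp/nt.1000.short") would then
produce an input for cd-hit with the long records removed. The file is read
one record at a time, so only the longest single sequence has to fit in
memory.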