[CD-HIT] clustering nt database

Ryan Golhar golharam at umdnj.edu
Sun Sep 13 23:32:16 EDT 2009


>> I'm using cd-hit-est because the documentation says that's the only one that
>> works on DNA sequences.  The documentation talks about protein sequences for
>> the rest of the programs.  Is this not the case?
> 
> Right. What a bad memory I have! It's been some years since I used
> cd-hit, and I forgot that it is protein specific.
> 
> It may be worth trying to run it anyway... I'd imagine that the k-mer
> analysis is still sound on DNA strings.
> 
> 

Here is what I am getting when I try to use cd-hit:

[golharam@hydrogen cd-hit-2009-0427]$ ./cd-hit -i /tmp/nt.1000 -o /tmp/nt90 -c 0.9
total seq: 1000

Warning
Some seqs longer than 65536, you may define LONG_SEQ

It is not fatal, but may affect your results !!

longest and shortest : 163353 and 21
Total letters: 13083790
Sequences have been sorted

Fatal Error
in diag_test_aapn, MAX_DIAG reached

Program halted !!

I did define LONG_SEQ in cd-hi.h, but still get the same error.  I 
suspect the sequences are just too long.
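
One way to test that suspicion would be to set aside the records longer than the 65536 limit named in the warning and rerun on the remainder. The following is only a minimal Python sketch, not part of cd-hit: the function names read_fasta and split_by_length are made up for illustration, and whether removing the long records actually avoids the MAX_DIAG error is untested.

```python
from typing import Iterable, Iterator, List, Tuple

# Length limit quoted in the cd-hit warning ("Some seqs longer than 65536").
MAX_LEN = 65536

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(
    records: Iterable[Record], max_len: int = MAX_LEN
) -> Tuple[List[Record], List[Record]]:
    """Return (short, long) lists; 'long' holds records with len > max_len."""
    short: List[Record] = []
    long_: List[Record] = []
    for header, seq in records:
        if len(seq) > max_len:
            long_.append((header, seq))
        else:
            short.append((header, seq))
    return short, long_
```

For the file used above, this could be called as split_by_length(read_fasta(open("/tmp/nt.1000"))), and the short records written back out as FASTA for a second run.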

Ryan


