Hello Maria,
Thanks for posting that tax file. I wanted to check it first to see if its formatting could be causing a problem, and I think that could be the issue!
Here is what a valid format looks like (from Greengenes):
367523 k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; o__Flavobacteriales; f__Flavobacteriaceae; g__Flavobacterium; s__
187144 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__
836974 k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Cercozoa; f__; g__; s__
310669 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__
Here is how your taxonomy looks:
AJ000684|S000004347 Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Mycobacteriaceae;Mycobacterium;
EF599163|S000871589 Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Vibrio;
AY859683|S000631792 Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Mycobacteriaceae;Mycobacterium;
In your reference database, are the reads named 'AJ000684|S000004347' or just 'AJ000684'? If the full name, including the bar symbol |, is used there, then I think this looks good, and retraining is needed.
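If you would like to check the names yourself, here is a minimal Python sketch that compares the IDs in a FASTA file against the IDs in a taxonomy file. It is only an illustration, not part of QIIME: the function names are made up, and it assumes the ID is the first whitespace-separated token on each FASTA header line and on each taxonomy line. The example data at the bottom is a made-up mismatch built from one of your IDs.

```python
def fasta_ids(lines):
    """Collect sequence IDs: the first token after '>' on each header line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def taxonomy_ids(lines):
    """Collect taxonomy IDs: the first whitespace-separated token on each line."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line:
            ids.add(line.split(None, 1)[0])
    return ids


def compare_ids(fasta_lines, taxonomy_lines):
    """Return (IDs only in the FASTA, IDs only in the taxonomy), both sorted."""
    seq = fasta_ids(fasta_lines)
    tax = taxonomy_ids(taxonomy_lines)
    return sorted(seq - tax), sorted(tax - seq)


# Made-up example: the second sequence is named without the '|S...' part.
example_fasta = [
    ">AJ000684|S000004347",
    "ACGTACGT",
    ">EF599163",
    "ACGTACGT",
]
example_taxonomy = [
    "AJ000684|S000004347\tBacteria;Actinobacteria;Actinobacteria;",
    "EF599163|S000871589\tBacteria;Proteobacteria;Gammaproteobacteria;",
]

only_in_fasta, only_in_taxonomy = compare_ids(example_fasta, example_taxonomy)
print("In FASTA but not in taxonomy:", only_in_fasta)
print("In taxonomy but not in FASTA:", only_in_taxonomy)
```

To run it on real files, you could pass open file handles instead of the example lists, for example `compare_ids(open("refs.fasta"), open("taxonomy.txt"))`, where both filenames are placeholders. If both lists come back empty, the names agree.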
Searching for matching text is a hard problem in computer science. Searching for text that is almost matching is harder, and clever algorithms have been developed to make this process faster. When you try to match the DNA sequence from your OTUs to reads in a database, this is a text search problem, so we can make use of these algorithms.
Both the RDP and uclust methods make use of a process called k-mer counting as part of their algorithms. The k-mer method makes use of strings of DNA of length k; so ATG is three letters long and would be a 3mer, while ACTCGTAA is eight letters long and would be called an 8mer. How does counting 8mers or 3mers speed up searches? Take a look at how the uclust algorithm uses it:
The RDP classifier also uses a k-mer method, and these k-mers need to be counted before the search can begin. The RDP devs call this process 'retraining,' but you could also call it 'indexing.'
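To make the k-mer idea more concrete, here is a toy Python sketch of k-mer counting. This is not the actual RDP or uclust code, and the function names are my own. It only shows the general idea: count every overlapping window of length k in a sequence, then use the number of shared k-mers as a cheap hint of how similar two sequences are.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping substring of length k in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(seq_a, seq_b, k):
    """Number of k-mers two sequences have in common, counting repeats."""
    counts_a = count_kmers(seq_a, k)
    counts_b = count_kmers(seq_b, k)
    # '&' on Counters keeps the minimum count of each k-mer present in both.
    return sum((counts_a & counts_b).values())


# ATGATG contains the 3mers ATG, TGA, GAT, ATG.
print(count_kmers("ATGATG", 3))
# ACTCGTAA is exactly one 8mer long.
print(count_kmers("ACTCGTAA", 8))
# A quick similarity hint between two short sequences.
print(shared_kmers("ATGATG", "ATGCCC", 3))
```

Counting the k-mers of every reference sequence once, up front, is the slow part, which is why it makes sense to do it ahead of time and reuse the result for every search.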
Before we dive into this, I want to check with a QIIME dev about the best way to do this.
Keep in touch!
Colin