Most of the time, the contaminating
species are known ahead of time and are used to assemble the screening database(s).
I would like to use DeconSeq to remove contaminants from human WGS data derived from saliva.
I started out by downloading the Human Oral Microbiome Database (HOMD) and
testing this out on a proband first. It contains ~1,900 genomes, resulting in a 4.0 GB FASTA file, which needed to be split in order to build indexes with BWA. Now, I would like to perform a more
exhaustive search for contaminants, including those which may not be known beforehand.
What is the best way of doing this? Is there a FASTA file somewhere that contains all of the known genomes? I could then subset it by removing human and search for everything else.
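
For reference, the splitting step I mentioned for the HOMD FASTA can be sketched roughly as below. This is a minimal illustration, not the exact script I ran: the function name `split_fasta`, the `chunk_NNN.fa` naming, and the size limit are all assumptions made up for the example. It only starts a new chunk at a record header, so a chunk can overshoot the limit by the remainder of one record, and it counts characters, which equals bytes for plain ASCII FASTA.

```python
import os


def split_fasta(path, out_dir, max_chars):
    """Split a FASTA file into chunks of roughly max_chars characters each.

    A new chunk is only started at a '>' header line, so no record is
    ever split across two files. Returns the list of chunk paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    written = 0
    with open(path) as fasta:
        for line in fasta:
            # Open a new chunk at a header once the current chunk is full
            # (or when no chunk has been opened yet).
            if line.startswith(">") and (out is None or written >= max_chars):
                if out is not None:
                    out.close()
                # Hypothetical naming scheme: chunk_000.fa, chunk_001.fa, ...
                chunk_path = os.path.join(
                    out_dir, f"chunk_{len(chunk_paths):03d}.fa"
                )
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w")
                written = 0
            if out is None:
                continue  # ignore any text before the first header
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Each resulting chunk can then be indexed separately, for example with `bwa index chunk_000.fa`.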