find and remove reads with duplicate seq IDs


Stef

Nov 20, 2013, 10:18:30 AM
to qiime...@googlegroups.com
Hi,
In my dataset I have a couple of hundred samples, and I have accidentally included one (or probably more) of them twice. I therefore get an error like this:
Stderr
Traceback (most recent call last):
  File "/home/ubuntu/qiime_software/qiime-1.7.0-release/bin/pick_otus.py", line 771, in <module>
    main()
  File "/home/ubuntu/qiime_software/qiime-1.7.0-release/bin/pick_otus.py", line 601, in main
    result_path=result_path,log_path=log_path,HALT_EXEC=False)
  File "/home/ubuntu/qiime_software/qiime-1.7.0-release/lib/qiime/pick_otus.py", line 884, in __call__
    HALT_EXEC=HALT_EXEC)
  File "/home/ubuntu/qiime_software/pycogent-1.5.3-release/lib/python2.7/site-packages/cogent/app/uclust.py", line 560, in get_clusters_from_fasta_filepath
    clusters_from_uc_file(uclust_cluster['ClusterFile'])
  File "/home/ubuntu/qiime_software/pycogent-1.5.3-release/lib/python2.7/site-packages/cogent/app/uclust.py", line 291, in clusters_from_uc_file
    "Offending seq id is %s" % query_id)
cogent.app.uclust.UclustParseError: A seq id was provided as a seed, but that seq id already represents a cluster. Are there overlapping seq ids in your reference and input files or repeated seq ids in either? Offending seq id is QiimeExactMatch.1.046.02_0
Is there a command one can run in advance to check which samples occur twice, and then a way to remove one of the duplicates? I do not want to manually remove all seqs from this sample (1.046.02) in a 2 GB file just to start the analysis again, get the same error for another sample, remove that one, get the error for the next, and so on...

Thank you for your help!
Cheers,
Stef

arp

Nov 20, 2013, 4:13:59 PM
to qiime...@googlegroups.com
Hi Stef,

Unfortunately, I do not think QIIME scripts can do this for you; you will probably need to write a script to do it.  However, I'm not entirely sure what you mean when you say that you included some samples twice. Do you mean that you duplicated the sequences in the FASTA file?
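To illustrate the kind of script this would take, here is a minimal sketch (this is not a QIIME command; it assumes the seq ID is the first whitespace-delimited token after '>', e.g. 1.046.02_0, and the function name is made up):

```python
# Minimal sketch (not a QIIME script): keep the first record for each
# seq ID and drop any later records that reuse the same ID.
def dedupe_fasta(lines):
    """Yield FASTA lines, skipping records whose seq ID was already seen.

    The seq ID is taken as the first whitespace-delimited token after '>'.
    """
    seen = set()
    keep = False
    for line in lines:
        if line.startswith(">"):
            seq_id = line[1:].split()[0]
            keep = seq_id not in seen
            seen.add(seq_id)
        if keep:
            yield line

# usage sketch:
# with open("seqs.fna") as src, open("seqs_dedup.fna", "w") as dst:
#     dst.writelines(dedupe_fasta(src))
```

Because it streams line by line, memory use scales with the number of distinct IDs rather than the 2 GB file size.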

Adam


You received this message because you are subscribed to the Google Groups "Qiime Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qiime-forum...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Tony Walters

Nov 20, 2013, 7:00:58 PM
to qiime...@googlegroups.com
Stef, there is a custom script here: https://gist.github.com/walterst/7573475
that might help you (though it won't work unless the sequence labels begin with the Sample_# format).

-Tony

Stefanie Prast-Nielsen

Nov 21, 2013, 3:16:47 AM
to qiime...@googlegroups.com
Thank you Tony! The Sample_# format means I may not have any '.' in my sample ID, I guess, since the script did not remove my duplicate samples. My SampleID format is x.xxx.xx, e.g. 1.046.02, with _# appended.
How can I change this format to xxxxxx, e.g. 104602, and back after running the script, or, probably easier, change the script to accept my format? It is always the same pattern: one digit, a dot, three digits, a dot, and then two digits.
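For the "change and change back" route, a small regex sketch (it assumes the pattern is exactly one digit, dot, three digits, dot, two digits, followed by _#; the helper names are made up for illustration):

```python
import re

# Hypothetical helpers: flatten 1.046.02_7 to 104602_7 so a script that
# expects plain Sample_# labels can run, then restore the dots afterwards.
SAMPLE_RE = re.compile(r"^(\d)\.(\d{3})\.(\d{2})(_\d+)")

def strip_dots(header):
    # "1.046.02_7 rest-of-header" -> "104602_7 rest-of-header"
    return SAMPLE_RE.sub(r"\1\2\3\4", header)

def restore_dots(header):
    # "104602_7 rest-of-header" -> "1.046.02_7 rest-of-header"
    return re.sub(r"^(\d)(\d{3})(\d{2})(_\d+)", r"\1.\2.\3\4", header)
```

Applying strip_dots to every '>' line before the script and restore_dots after would round-trip the IDs, since the fixed digit counts make the flattened form unambiguous.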
Thank you again!
/Stef



Tony Walters

Nov 21, 2013, 8:41:23 AM
to qiime...@googlegroups.com
Stef, it's only going to replace the number following the _ character with a new one, so if the first few labels were:
1.2323.1_20  <remainder of header>
1.412.2_100   <remainder of header>
1.2323.1_20  <remainder of header>

The output should be
1.2323.1_0  <remainder of header>
1.412.2_1  <remainder of header>
1.2323.1_2  <remainder of header>
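That renumbering could be sketched roughly like so (an assumption-laden sketch, not the actual gist: it takes the read number to be whatever follows the last underscore, so dotted SampleIDs like 1.046.02 are handled too):

```python
# Sketch of the relabeling described above (assumption: labels look like
# SampleID_ReadNumber, with the read number after the LAST underscore).
def renumber_labels(headers, start=0):
    """Replace the number after the final '_' with a fresh unique count."""
    out = []
    for n, header in enumerate(headers, start=start):
        if " " in header:
            label, rest = header.split(" ", 1)
        else:
            label, rest = header, ""
        sample_id = label.rsplit("_", 1)[0]
        new_label = "%s_%d" % (sample_id, n)
        out.append(new_label + (" " + rest if rest else ""))
    return out
```

With the three example labels above this yields 1.2323.1_0, 1.412.2_1, 1.2323.1_2.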

The SampleIDs are specified in your mapping file, so when you run split_libraries.py or split_libraries_fastq.py, it should use the barcode it identifies to match up the SampleID and write out whichever ID is there. It will probably be easier to go back to the split_libraries step, make sure your mapping file has the SampleIDs that you want, and rerun it there to get the right IDs and unique numbers following the underscores after the IDs. If you have to call split_libraries multiple, independent times, you want to use the -n parameter to give it a new starting number each time, e.g. -n 1000000 for the first call, -n 2000000 for the second, and so on (much larger numbers if you are demultiplexing Illumina data).