process_radtags 'ambiguous barcode drops' paired-end

jthm...@ucdavis.edu

unread,

Aug 20, 2013, 7:39:52 PM8/20/13

to stacks...@googlegroups.com

Hi!

I appear to have a large portion of my reads dropped due to ambiguous barcode drops- the 96 barcodes appear to be formatted correctly and load, but my paired end process_radtags leaves me with a mere 50% of my reads. It looks like there have been other individuals who have run into this issue due to an outdated version- but I should be all up to date.

process_radtags -P -1 ./RUN1_R1_001.fastq -2 RUN1_R1_002.fastq -o /home/jmiller1/Jeffrey/Stacks/raw -b ../barcodes/barcodes_c1.txt -e sbfI -E phred64 -r -c -q

407182266 total sequences;

129847986 ambiguous barcode drops;

15105390 low quality read drops;

54071646 ambiguous RAD-Tag drops;

208157244 retained reads.

Here is what a typical individual looks like:

jmiller1@farm:~/Jeffrey/Stacks/raw$ wc -l sample_CCATTT.*

3851384 sample_CCATTT.1.fq

3851384 sample_CCATTT.2.fq

375364 sample_CCATTT.rem.1.fq

2819552 sample_CCATTT.rem.2.fq

10897684 total

Any help would be greatly appreciated!!

Thanks,

Jeff

jthm...@ucdavis.edu

unread,

Aug 22, 2013, 4:25:10 PM8/22/13

to stacks...@googlegroups.com

I ran this with a barcode distance of 3, and

407182266 total sequences;

8550 ambiguous barcode drops;

19206054 low quality read drops;

118658268 ambiguous RAD-Tag drops;

269309394 retained reads.

Which increased my reads by 50 million- but the fact that my low read quality and ambig RAD-Tag drops increased makes me think that they were in fact low quality-

Julian Catchen

unread,

Aug 23, 2013, 8:04:23 PM8/23/13

to stacks...@googlegroups.com

Hi Jeff,

What is the significance of the two files you have listed?

RUN1_R1_001.fastq

and

RUN1_R1_002.fastq

The way illumina files are normally named is that the single end is named "_R1_"
and the paired-end is named "_R2_", Each tile on the sequencer then produces one
set of files, "001", "002", etc.

You are running process_radtags on a pair of files, but are they the properly
paired files? process_radtags expects '-1' and '-2' command line options to
correspond to a set of single-end and paired-end data, respectively.

If you have a whole lane of Illumina data, which is the normal case you should
use the "-P -p" options which will cause process_radtags to read a directory of
files and match them up in pairs (based on the naming scheme I described earler).

Some other possibilities:

If your data are relatively new, they are likely to be encoded as phred33, not
phred64.

Finally, is there a consistent error in your single-end reads. Sometimes the
Illumina machine will produce an uncalled base, over and over in the same
position in almost all the reads. If something like this was occurring in the
RAD cut site area, you would see a lot of drops due to the cut site not being
intact.

Best,

julian

jthm...@ucdavis.edu

unread,

Aug 26, 2013, 3:52:53 PM8/26/13

to stacks...@googlegroups.com, jcat...@uoregon.edu

Hello Julian,

Thanks for the help. I re-adjusted the barcode distance correctly to three- but I still can not figure out why around 25% are dropping due to ambiguous RAD-Tag drops.

The files should be properly paired- I just attempted to name them according to the expected naming scheme- but it looks like I should change it. For some reason, BGI used the old phred 64 quality score for this set- we are not sure why.

I may be a little confused on the -p -P part: 001 is my single end, and 002 is my paired end reads. They are an entire lane, but when I try to use -p -P, it reads error 'You must specify either a file path (-p) or a set of paired files (-1, -2), not both.' Am I still using it incorrectly? I adjusted it as reflected in the sh script below.

My output files look like:

sample_CGAGGC.1.fq

sample_CGAGGC.2.fq

sample_CGAGGC.rem.1.fq

sample_CGAGGC.rem.2.fq

I think that this might be the correct output?

Here is what it reads out when I adjusted the names:

Using Phred+64 encoding for quality scores.

Found 1 paired input file(s).

Searching for single-end, inlined barcodes.

Loaded 96 barcodes (6bp).

Processing file 1 of 1 [RUN1_R1_001.fastq]

Reading data from:

./RUN1_R1_001.fastq and

./RUN1_R2_001.fastq

process_radtags -P -1 ./RUN1_R1_001.fastq -2 ./RUN1_R2_001.fastq -o /home/jmiller1/Jeffrey/Stacks/raw2 -b ../barcodes/barcodes_c1.txt -e sbfI -E phred64 -D -r -c -q --barcode_dist 3

I have two lanes total- and there is a similar proportion of ambiguous RAD-Tag drops- I will double check a consistent sequencing error- but it does not look like that is happening.

407182266 total sequences;

8550 ambiguous barcode drops;

19206054 low quality read drops;

118658268 ambiguous RAD-Tag drops;

269309394 retained reads.

Thanks!!

Julian Catchen

unread,

Aug 26, 2013, 8:29:10 PM8/26/13

to stacks...@googlegroups.com, jthm...@ucdavis.edu

Hi Jeff,

Some answers below.

On 8/26/13 12:52 PM, jthm...@ucdavis.edu wrote:
> Thanks for the help. I re-adjusted the barcode distance correctly to three- but
> I still can not figure out why around 25% are dropping due to ambiguous RAD-Tag
> drops.
>
> The files should be properly paired- I just attempted to name them according to
> the expected naming scheme- but it looks like I should change it. For some
> reason, BGI used the old phred 64 quality score for this set- we are not sure why.
>
> I may be a little confused on the -p -P part: 001 is my single end, and 002 is
> my paired end reads. They are an entire lane, but when I try to use -p -P, it
> reads error 'You must specify either a file path (-p) or a set of paired files
> (-1, -2), not both.' Am I still using it incorrectly? I adjusted it as reflected
> in the sh script below.

The -P option, where process_radtags will read files out of a directory relies
on the files being named consistently and it uses the _R1_ and _R2_ parts of the
file name to match up the sequencing pairs. So, you can rename your files to use
this scheme, or you can run each pair of files separately (with -1 and -2) and
then merge the resulting output files afterwards.

> My output files look like:
> sample_CGAGGC.1.fq
> sample_CGAGGC.2.fq
> sample_CGAGGC.rem.1.fq
> sample_CGAGGC.rem.2.fq
> I think that this might be the correct output?

Yes, that's correct.

>
> Here is what it reads out when I adjusted the names:
> Using Phred+64 encoding for quality scores.
> Found 1 paired input file(s).
> Searching for single-end, inlined barcodes.
> Loaded 96 barcodes (6bp).
> Processing file 1 of 1 [RUN1_R1_001.fastq]
> Reading data from:
> ./RUN1_R1_001.fastq and
> ./RUN1_R2_001.fastq
>

Looks right. You can verify it by checking that the FASTQ ID of the first read
in RUN1_R1_001.fastq matches the FASTQ ID of the first read in RUN1_R2_001.fastq
and similarly for the second pair of reads, the third and so on.

> process_radtags -P -1 ./RUN1_R1_001.fastq -2 ./RUN1_R2_001.fastq -o
> /home/jmiller1/Jeffrey/Stacks/raw2 -b ../barcodes/barcodes_c1.txt -e sbfI -E
> phred64 -D -r -c -q --barcode_dist 3
>

I can't speak to why you have errors in your RAD cut site. One approach is to
look at the distribution of errors you have in your data, this shell code will
look at the RAD cut site in a data file and tell you the types of errors seen.
It is possible that you simply have a lot of sequencing error in bases 7 through
12 (the cut site). In my own experience, I have instead seen a single base
repeatedly not sequenced in every read, which this code would demonstrate:

This command is all on one line in the terminal:

zcat lane4_NoIndex_L004_R1_001.fastq.gz |
grep -A 1 "^@HWI" |
grep -v "^@HWI" |
grep -v -- -- |
cut -c 7-12 |
sort |
uniq -c

Here is a fraction of the output on one of my data files when I run the command:
...
499 TGATTA
347 TGATTC
447 TGATTG
635 TGATTT
4475 TGCAAA
2303 TGCAAC
3848 TGCAAG
4686 TGCAAT
10267 TGCACA
4912 TGCACC
4758 TGCACG
9145 TGCACT
46692 TGCAGA
11011 TGCAGC
14344917 TGCAGG
2 TGCAGN
26290 TGCAGT
1673 TGCATA
1739 TGCATC
5582 TGCATG
2438 TGCATT
705 TGCCAA
601 TGCCAC
681 TGCCAG
657 TGCCAT
597 TGCCCA
671 TGCCCC
364 TGCCCG
...

If you had a consistent "N" in your RAD cutsite it would show up at high
prevalence to the correct SbfI cutsite (TGCAGG).

Of course, you can disable checking the RAD site with the --disable_rad_check flag.

> I have two lanes total- and there is a similar proportion of ambiguous RAD-Tag
> drops- I will double check a consistent sequencing error- but it does not look
> like that is happening.
>
> 407182266 total sequences;
> 8550 ambiguous barcode drops;
> 19206054 low quality read drops;
> 118658268 ambiguous RAD-Tag drops;
> 269309394 retained reads.
>
> Thanks!!
>

Best,

julian

Sundy Maurice

unread,

Feb 11, 2015, 8:38:21 AM2/11/15

to stacks...@googlegroups.com

Hi,

I also have the same problem, even worst that you Jeffrey Miller. I tried to run the command using the P and -p rather than the 2 files -1 and -2 and the job is processing. Where do you specify the barcode length ? is it the option t ?