Hi Jeff,
Some answers below.
On 8/26/13 12:52 PM,
jthm...@ucdavis.edu wrote:
> Thanks for the help. I re-adjusted the barcode distance correctly to three- but
> I still can not figure out why around 25% are dropping due to ambiguous RAD-Tag
> drops.
>
> The files should be properly paired- I just attempted to name them according to
> the expected naming scheme- but it looks like I should change it. For some
> reason, BGI used the old phred 64 quality score for this set- we are not sure why.
>
> I may be a little confused on the -p -P part: 001 is my single end, and 002 is
> my paired end reads. They are an entire lane, but when I try to use -p -P, it
> reads error 'You must specify either a file path (-p) or a set of paired files
> (-1, -2), not both.' Am I still using it incorrectly? I adjusted it as reflected
> in the sh script below.
The -P option, where process_radtags will read files out of a directory relies
on the files being named consistently and it uses the _R1_ and _R2_ parts of the
file name to match up the sequencing pairs. So, you can rename your files to use
this scheme, or you can run each pair of files separately (with -1 and -2) and
then merge the resulting output files afterwards.
> My output files look like:
> sample_CGAGGC.1.fq
> sample_CGAGGC.2.fq
> sample_CGAGGC.rem.1.fq
> sample_CGAGGC.rem.2.fq
> I think that this might be the correct output?
Yes, that's correct.
>
> Here is what it reads out when I adjusted the names:
> Using Phred+64 encoding for quality scores.
> Found 1 paired input file(s).
> Searching for single-end, inlined barcodes.
> Loaded 96 barcodes (6bp).
> Processing file 1 of 1 [RUN1_R1_001.fastq]
> Reading data from:
> ./RUN1_R1_001.fastq and
> ./RUN1_R2_001.fastq
>
Looks right. You can verify it by checking that the FASTQ ID of the first read
in RUN1_R1_001.fastq matches the FASTQ ID of the first read in RUN1_R2_001.fastq
and similarly for the second pair of reads, the third and so on.
> process_radtags -P -1 ./RUN1_R1_001.fastq -2 ./RUN1_R2_001.fastq -o
> /home/jmiller1/Jeffrey/Stacks/raw2 -b ../barcodes/barcodes_c1.txt -e sbfI -E
> phred64 -D -r -c -q --barcode_dist 3
>
I can't speak to why you have errors in your RAD cut site. One approach is to
look at the distribution of errors you have in your data, this shell code will
look at the RAD cut site in a data file and tell you the types of errors seen.
It is possible that you simply have a lot of sequencing error in bases 7 through
12 (the cut site). In my own experience, I have instead seen a single base
repeatedly not sequenced in every read, which this code would demonstrate:
This command is all on one line in the terminal:
zcat lane4_NoIndex_L004_R1_001.fastq.gz |
grep -A 1 "^@HWI" |
grep -v "^@HWI" |
grep -v -- -- |
cut -c 7-12 |
sort |
uniq -c
Here is a fraction of the output on one of my data files when I run the command:
...
499 TGATTA
347 TGATTC
447 TGATTG
635 TGATTT
4475 TGCAAA
2303 TGCAAC
3848 TGCAAG
4686 TGCAAT
10267 TGCACA
4912 TGCACC
4758 TGCACG
9145 TGCACT
46692 TGCAGA
11011 TGCAGC
14344917 TGCAGG
2 TGCAGN
26290 TGCAGT
1673 TGCATA
1739 TGCATC
5582 TGCATG
2438 TGCATT
705 TGCCAA
601 TGCCAC
681 TGCCAG
657 TGCCAT
597 TGCCCA
671 TGCCCC
364 TGCCCG
...
If you had a consistent "N" in your RAD cutsite it would show up at high
prevalence to the correct SbfI cutsite (TGCAGG).
Of course, you can disable checking the RAD site with the --disable_rad_check flag.
> I have two lanes total- and there is a similar proportion of ambiguous RAD-Tag
> drops- I will double check a consistent sequencing error- but it does not look
> like that is happening.
>
> 407182266 total sequences;
> 8550 ambiguous barcode drops;
> 19206054 low quality read drops;
> 118658268 ambiguous RAD-Tag drops;
> 269309394 retained reads.
>
> Thanks!!
>
Best,
julian