Population Analysis with ddRAD, Paired-End reads in single file

677 views
Skip to first unread message

joshua....@gmail.com

unread,
Sep 17, 2014, 12:15:28 PM9/17/14
to stacks...@googlegroups.com

Hi there, 

I've got stacks up and running and have run through all of the tutorials, but now I'm trying to process some real data and I'm feeling a bit out of my depth! I'm running into some issues in the denovo_map.pl step and I think it may be a fundamental issue with the structure of my data files. Sorry for the long post, but I figured it was best to start from the beginning:

I've got 48 individuals from 4 locations that were double digested using XhoI and MseI, and run in one lane with 24 barcoded individuals and two indexes. I'm interested in looking at population structure between these locations, so I'm planning to use the populations function in stacks. I received the data from the sequencing facility as .fastq.gz files (one file per individual) after they demulitplexed the samples for us. We did paired end reads, and the paired ends are provided in a single file interleaved so that they're ordered as Read 1 of Fragment 1, Read 2 of Fragment 1, Read 1 of Fragment 2, Read 2 of Fragment 2, etc. 

The first step was cleaning the data, although I did not need to demultiplex it. I used the following command:

process_radtags -p ./raw/ -o ./samples/ -i 'gzfastq' -c -q -r -E 'phred33' --disable_rad_check

I got a message back saying 'No barcodes specified, files will not be demultiplexed', which is what I wanted, and the script continued to run. I chose to --disable_rad_check  mainly because XhoI is not included as an enzyme in stacks and I didn't think it was necessary to add it manually just for a RAD site check if I wasn't going to use it to demultiplex.

**Question 1: Is it a bad policy to disable the RAD site check? Should I add XhoI manually and run process_radtags with XhoI and MseI specified as a matter of good practice?

 process_radtags seemed to work well, and I now have all of my files as .fastq with the following structure:

@2_1101_1868_1956_1

TTCGAGGATACCTTCAAAAGGTGATGCCTCAAAAAGGTGACATCCATTACTAAGGAAACCACCACCCCACCCCCCATCACCCAGGACACGCCC

+

EHHHHHJJJJJJJJJJJJJJJHHIJJIJJJJJJJJJJFFHIJJJJJJJJJIJIJIJJJJ;EHHHHHFFACDDDDDDDDDDDDDBDDDDDDBDD

@2_1101_1868_1956_1

TAATCAAGTAATACAGAAAGAGAAGGAAAAGTACTGAGAAAATGTTCAGGGTTCATTGCCCATTCAGAAATATGATGGCGGAGGGAAAGAAGCTGTTCCT

+

CCCFFFFFHFHHHJJJJJJIIJJJJJJJIJJHIJJJJJJJJJIJIIJIIJGGHIIJJJJJJIJJJJJJJJJJJJJJJJJHFDDDDBDDDDDDDDDDDDDC

Where I believe the first @2 is Read 1 and the second @2 (slightly longer) is the paired-end Read 2. So far so good (maybe?).

 

Next I was planning to build mini-contigs from the paired end reads by following the paired-end tutorial on the stacks website. However, after reading through this post I'm beginning to think that building the mini-contigs is only necessary for single-digest paired-end reads, and not for my ddRAD paired-end, as the paired ends all have a restriction cut site and therefore should stack on their own. I saw that in the tutorial there was a file for Read 1 and a separate file for Read 2 that was used to build the mini-contigs, so I'm not sure that it will work with both reads in a single file…

 

**Question 2: Do I need to separate my paired-end reads into different files? Or, as above, do I not need to build mini-contigs at all?

 

Either way, the next step would be denovo_map.pl. I tried running the following command first:

 

denovo_map.pl -m 3 -M 3 -T 15 -B manta_radtags -b 1 -t -D "Manta Paired-end RAD-Tags" -o ./stacks -O ./popmap -s ./samples/BB001.fastq -s ./samples/BB003.fastq -s ./samples/BB004.fastq -s ./samples/BB006.fastq -s ./samples/BB008.fastq -s ./samples/BB009.fastq -s ./samples/BB010.fastq -s ./samples/BB011.fastq -s ./samples/BB012.fastq -s ./samples/BB013.fastq -s ./samples/BB014.fastq -s ./samples/BB015.fastq -s ./samples/IN001.fastq -s ./samples/IN002.fastq -s ./samples/IN007.fastq -s ./samples/IN010.fastq -s ./samples/IN011.fastq -s ./samples/IN012.fastq -s ./samples/IN013.fastq -s ./samples/IN014.fastq -s ./samples/INA01.fastq -s ./samples/INA02.fastq -s ./samples/INA03.fastq -s ./samples/INA04.fastq -s ./samples/INA05.fastq -s ./samples/INA1M.fastq -s ./samples/RV003.fastq -s ./samples/RV004.fastq -s ./samples/RV005.fastq -s ./samples/RV006.fastq -s ./samples/RV007.fastq -s ./samples/RV008.fastq -s ./samples/RV009.fastq -s ./samples/RV010.fastq -s ./samples/RV011.fastq -s ./samples/RV012.fastq -s ./samples/SL075.fastq -s ./samples/SL080.fastq -s ./samples/SL081.fastq -s ./samples/SL083.fastq -s ./samples/SL192.fastq -s ./samples/SL218.fastq -s ./samples/SL276.fastq -s ./samples/SL277.fastq -s ./samples/SL326.fastq -s ./samples/SL392.fastq -s ./samples/SL393.fastq -s ./samples/SL395.fastq

 

It ran overnight, at the end of the run my terminal lit up with (HY000) at line 1: File './stacks/BB014.matches.tsv' not found (Errcode: 2 - No such file or directory), missing a ton of files including the .matches.tsv and .fst.tsv for each individual, the batch_1.markers.tsv, and the batch_1.sumstats.tsv. Furthermore, the alleles.tsv/.snps.tsv/.tags.tsv files were only present in ./stacks for about a quarter of the individuals. After reading some threads on the forum here, I think that it may have been an issue with my memory. I'm running this from a personal laptop with 8GB DDR3 memory, which is probably not sufficient.

 

**Question 3: Are these errors likely a memory issue? I also read on a few threads that similar issues could arise from having reads that are not equal length, and I'm wondering if perhaps my paired ends are not all the same length and are stuffing things up as they're in the same file at the Read 1's. However, they look like they're the same length in the cleaned .fastq files…

 

Assuming it was memory, I next tried running these in smaller batches, and skipping the population map function in denovo_map.pl with plans to come back to the stacked data with the populations function, hoping that would require less memory. I ran the following command first:

 

denovo_map.pl -m 3 -M 3 -T 15 -B manta_radtags -b 1 -t -D "Manta Paired-end RAD-Tags" -o ./stacks -s ./samples/BB001.fastq -s ./samples/BB003.fastq -s ./samples/BB004.fastq -s ./samples/BB006.fastq -s ./samples/BB008.fastq -s ./samples/BB009.fastq -s ./samples/BB010.fastq -s ./samples/BB011.fastq -s ./samples/BB012.fastq -s ./samples/BB013.fastq -s ./samples/BB014.fastq -s ./samples/BB015.fastq

 

Things seemed to go pretty well:

Found 12 sample file(s).

Identifying unique stacks; file   1 of  12 [BB001]

  /usr/local/bin/ustacks -t fastq -f ./samples/BB001.fastq -o ./stacks -i 1 -m 3 -M 3 -p 15 -d -r 2>&1

Identifying unique stacks; file  12 of  12 [BB015]

  /usr/local/bin/ustacks -t fastq -f ./samples/BB015.fastq -o ./stacks -i 12 -m 3 -M 3 -p 15 -d -r 2>&1

  Loading ustacks output to manta_radtags...done.

 

…until the next line:

 

Generating catalog...

  /usr/local/bin/cstacks -b 1 -o ./stacks -s ./stacks/BB001 -s ./stacks/BB003 -s ./stacks/BB004 -s ./stacks/BB006 -s ./stacks/BB008 -s ./stacks/BB009 -s ./stacks/BB010 -s ./stacks/BB011 -s ./stacks/BB012 -s ./stacks/BB013 -s ./stacks/BB014 -s ./stacks/BB015  -p 15 2>&1

  Importing catalog to MySQL database...ERROR 2 (HY000) at line 1: File './stacks/batch_1.catalog.tags.tsv' not found (Errcode: 2 - No such file or directory)

ERROR 2 (HY000) at line 1: File './stacks/batch_1.catalog.snps.tsv' not found (Errcode: 2 - No such file or directory)

ERROR 2 (HY000) at line 1: File './stacks/batch_1.catalog.alleles.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file   1 of  12 [BB001]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/BB001 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/BB001.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file   2 of  12 [BB003]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/BB003 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/BB003.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file   3 of  12 [BB004]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/BB004 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/BB004.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file   4 of  12 [BB006]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/BB006 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/BB006.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file   5 of  12 [BB008]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/BB008 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/BB008.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file   6 of  12 [BB009]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/BB009 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/BB009.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file   7 of  12 [BB010]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/BB010 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/BB010.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file   8 of  12 [BB011]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/BB011 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/BB011.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file   9 of  12 [BB012]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/BB012 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/BB012.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file  10 of  12 [BB013]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/BB013 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/BB013.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file  11 of  12 [BB014]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/BB014 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/BB014.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file  12 of  12 [BB015]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/BB015 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/BB015.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Calculating population-level summary statistics

/usr/local/bin/populations -b 1 -P ./stacks -s -t 15 2>&1

ERROR 2 (HY000) at line 1: File './stacks/batch_1.markers.tsv' not found (Errcode: 2 - No such file or directory)

ERROR 2 (HY000) at line 1: File './stacks/batch_1.sumstats.tsv' not found (Errcode: 2 - No such file or directory)

Indexing the database...

/usr/local/bin/index_radtags.pl -D manta_radtags -t -c 2>&1

 

Looking at my localhost/stacks manta_radtags database, the catalog had no samples in it, so there was obviously an issue loading the samples to the database, as the error messages suggest. Furthermore, only the .alleles/.snps/.tags files are present in ./stacks for this first run. I restarted my computer and ran the next 14 samples:

 

denovo_map.pl -m 3 -M 3 -T 15 -B manta_radtags -b 1 -t -D "Manta Paired-end RAD-Tags" -o ./stacks -s ./samples/IN001.fastq -s ./samples/IN002.fastq -s ./samples/IN007.fastq -s ./samples/IN010.fastq -s ./samples/IN011.fastq -s ./samples/IN012.fastq -s ./samples/IN013.fastq -s ./samples/IN014.fastq -s ./samples/INA01.fastq -s ./samples/INA02.fastq -s ./samples/INA03.fastq -s ./samples/INA04.fastq -s ./samples/INA05.fastq -s ./samples/INA1M.fastq

 

This seemed to go a bit better, although I think that I should have changed the batch number to -b 2 instead of -b 1 based on the error messages:

 

Found 14 sample file(s).

ERROR 1062 (23000) at line 1: Duplicate entry '1' for key 'PRIMARY'

Identifying unique stacks; file   1 of  14 [IN001]

  /usr/local/bin/ustacks -t fastq -f ./samples/IN001.fastq -o ./stacks -i 13 -m 3 -M 3 -p 15 -d -r 2>&1

Identifying unique stacks; file  14 of  14 [INA1M]

  /usr/local/bin/ustacks -t fastq -f ./samples/INA1M.fastq -o ./stacks -i 26 -m 3 -M 3 -p 15 -d -r 2>&1

  Loading ustacks output to manta_radtags...done.

Generating catalog...

  /usr/local/bin/cstacks -b 1 -o ./stacks -s ./stacks/IN001 -s ./stacks/IN002 -s ./stacks/IN007 -s ./stacks/IN010 -s ./stacks/IN011 -s ./stacks/IN012 -s ./stacks/IN013 -s ./stacks/IN014 -s ./stacks/INA01 -s ./stacks/INA02 -s ./stacks/INA03 -s ./stacks/INA04 -s ./stacks/INA05 -s ./stacks/INA1M  -p 15 2>&1

  Importing catalog to MySQL database...done.

Matching samples to catalog; file   1 of  14 [IN001]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/IN001 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...done.

Matching samples to catalog; file   2 of  14 [IN002]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/IN002 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...done.

Matching samples to catalog; file   3 of  14 [IN007]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/IN007 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...done.

Matching samples to catalog; file   4 of  14 [IN010]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/IN010 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...done.

Matching samples to catalog; file   5 of  14 [IN011]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/IN011 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...done.

Matching samples to catalog; file   6 of  14 [IN012]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/IN012 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...done.

Matching samples to catalog; file   7 of  14 [IN013]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/IN013 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/IN013.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file   8 of  14 [IN014]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/IN014 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...done.

Matching samples to catalog; file   9 of  14 [INA01]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/INA01 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...done.

Matching samples to catalog; file  10 of  14 [INA02]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/INA02 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...done.

Matching samples to catalog; file  11 of  14 [INA03]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/INA03 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...ERROR 2 (HY000) at line 1: File './stacks/INA03.matches.tsv' not found (Errcode: 2 - No such file or directory)

done.

Matching samples to catalog; file  12 of  14 [INA04]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/INA04 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...done.

Matching samples to catalog; file  13 of  14 [INA05]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/INA05 -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...done.

Matching samples to catalog; file  14 of  14 [INA1M]

  /usr/local/bin/sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/INA1M -o ./stacks -p 15 2>&1

  Loading sstacks output to manta_radtags...done.

Calculating population-level summary statistics

/usr/local/bin/populations -b 1 -P ./stacks -s -t 15 2>&1

ERROR 2 (HY000) at line 1: File './stacks/batch_1.markers.tsv' not found (Errcode: 2 - No such file or directory)

ERROR 2 (HY000) at line 1: File './stacks/batch_1.sumstats.tsv' not found (Errcode: 2 - No such file or directory)

Indexing the database...

/usr/local/bin/index_radtags.pl -D manta_radtags -t -c 2>&1

 

This one seemed to go a lot better, and for this round of samples I had .alleles/.snps/.tags AND .matches.tsv, except for IN013 as shown in the error message. I also have batch_1.catalog.alleles.tsv, batch_1.catalog.snps.tsv, batch_1.catalog.tags.tsv, and batch_1.populations.log, which I did not have after the first batch run. I can also now see fragments in localhost/stacks manta_radtags. However, it clearly did not go perfectly as it's still missing batch_1.markers.tsv and batch_1.sumstats.tsv.

So, in summary, I'm having issues here with denovo_map.pl. Do you think that they're purely memory issues, or is it possible that the file structure is messing things up? How would you recommend proceeding?

On a more basic note, am I approaching this analysis with the correct workflow? Is it necessary to build contigs from the paired ends, or can I just run the files with both reads through the denovo_map.pl with a population map, as the paired ends all have the same MseI cut site?

Thanks very much for your input, and sorry for the newbie questions!

 

Cheers,

Josh

Julian Catchen

unread,
Sep 22, 2014, 8:48:56 PM9/22/14
to stacks...@googlegroups.com, joshua....@gmail.com
Hi Josh,

Some answers to your questions:

1) It is a good idea to keep the RAD site check enabled (along with
rescuing cut sites). To facilitate this, we can add the XhoI enzyme to
process_radtags, there should be a new Stacks release this week. As long
as you are willing to test it, we can add it.

2) Stacks expects your single and paired-end reads to be in separate
files. I'm not sure why your sequencing facility interleaved them, but
you might ask them for the raw, unprocessed data. This wouldn't be a
major problem, since you have ddRAD data, but based on what you wrote,
the IDs from your reads are no longer unique, and this may interfere
with the software (although I have never tested this particular
scenario). If the reads were in separate files, you would have gotten
two IDs, 2_1101_1868_1956_1 and 2_1101_1868_1956_2, which keep the reads
distinct. Different read lengths would be a problem, but your ddRAD data
will cause the longer reads to all stack together and the shorter reads
to all stack together, so in the end, all the single-end loci should be
the same length and all the paired-end loci should be the same length
(but different than the single-end length).

You do not need to build mini contigs at all since you are using double
digest data, and both your single and paired-end reads are anchored by a
restriction enzyme cutsite (and will stack-up). ustacks will build each
single and paired-end locus separately.

3) 8Gb of memory is not likely to be enough to run the pipeline. Even if
you succeed in running ustacks, the populations program will eventually
need more than that to calculate statistics across all 48 individuals.

In your first run, your catalog is failing to build (cstacks is failing)
and then all of your sstacks executions will fail to match against the
non-existent catalog.

When you re-run with a new set of data and get:

ERROR 1062 (23000) at line 1: Duplicate entry '1' for key 'PRIMARY'

this is coming from MySQL. Whenever you run a partial batch and then
restart it, you need to delete and recreate the database in between.
Otherwise you are mixing data from two different analyses together and
you will get strange results in the web interface.

Best,

julian
> <https://groups.google.com/forum/#!searchin/stacks-users/paired$20end$20single$20file/stacks-users/jpbOeilADcs/GzWMRzQzPtEJ>

joshua....@gmail.com

unread,
Sep 22, 2014, 11:18:45 PM9/22/14
to stacks...@googlegroups.com, joshua....@gmail.com, jcat...@uoregon.edu, Lani Gleason
Hi Julian,

I really appreciate all of your thorough answers—they're so incredibly helpful.

1) I'm 100% happy to test the XhoI enzyme if you add it to process_radtags. I'll download the new release of Stacks and will re-process my data with the rad site check and let you know if it works. Thanks for adding it!

2) I'm not sure why the facility interleaved them, either. The single end reads are 94bp and the paired ends are 101bp, and they all seem to be consistent with these two lengths. In the raw fastq files, the reads have the following format:

@HISEQ-MFG:299:C59KAACXX:1:1101:1845:1938 1:N:0:GGCTAC
TTCGAGTCACAACTTGGGACTTTCGGGCGGGCGGGTTCCACGGCCTCTGGCCGGTCGAGGCGCGCCATCCATCCACGGTGGGAGCAATCCCGC
+
FHHHHHJJJJJJJJJJJIJJJJJJJJJJJJFDDDDBDDDEDDDBBDDDDDDDDBBBDDDDDDDDDD@B>BCDDDDDDDBDDDBDBBDDDCDDD
@HISEQ-MFG:299:C59KAACXX:1:1101:1845:1938 2:N:0:GGCTAC
TAACGAGTTGCCCAGCCAACTTTTTGGCGCGGTTAGGGGCATAATTAGTGTTGTCCCCCTTGGCGGTAAAGTTAGGCGCCTCTTCGTTAACCGGCGCCGC
+
CCCFFFFFHHHHHJIGHIJJJJIJJJJJJJJJJJJJJIJIJJJGJJJJEHHHGHFFFFDDEEDDDD;BBCDDDEDDDDDDDDDDDDDDBDCC@DBDBDD>

which gets converted to the following after process_radtags:

@1_1101_1845_1938_1
TTCGAGTCACAACTTGGGACTTTCGGGCGGGCGGGTTCCACGGCCTCTGGCCGGTCGAGGCGCGCCATCCATCCACGGTGGGAGCAATCCCGC
+
FHHHHHJJJJJJJJJJJIJJJJJJJJJJJJFDDDDBDDDEDDDBBDDDDDDDDBBBDDDDDDDDDD@B>BCDDDDDDDBDDDBDBBDDDCDDD
@1_1101_1845_1938_1
TAACGAGTTGCCCAGCCAACTTTTTGGCGCGGTTAGGGGCATAATTAGTGTTGTCCCCCTTGGCGGTAAAGTTAGGCGCCTCTTCGTTAACCGGCGCCGC
+
CCCFFFFFHHHHHJIGHIJJJJIJJJJJJJJJJJJJJIJIJJJGJJJJEHHHGHFFFFDDEEDDDD;BBCDDDEDDDDDDDDDDDDDDBDCC@DBDBDD>

So it looks like they have unique ID's in the raw files, but after processing them they get lumped into a single ID, as the 1:N:0 or 2:N:0 is not contiguous with the ID number. First, do you think that running process_radtags with the rad site check and both enzymes might fix this, as it will recognize them as having the same ID with two different restriction enzymes? Wishful thinking, I'm guessing. Second, since they are different lengths and have different rad cut sites, perhaps it doesn't matter that they have the same ID, as they'll stack separately anyway? Will denovo_map treat them basically as just a second set of single-end data in the same file (and with replicate IDs), but with a different enzyme cut site?

3) I'm going to try running this on a server with 64gb of memory, and hopefully that will do the trick. Thanks for confirming it's a memory issue and explaining the batch error. 

Thanks again for all of your input and assistance.

Cheers,
Josh

Julian Catchen

unread,
Sep 23, 2014, 1:43:33 AM9/23/14
to stacks...@googlegroups.com, joshua....@gmail.com, Lani Gleason
Hi Josh,

You will need to de-interleave the file to use it with process_radtags. Each read needs a unique ID for the software to function properly. In principle we could add support for interleaved files, but the standard Illumina software distributes them as separate files (unless this has recently changed?).

You can do this easily using some shell. To get the second set of reads:

cat myfile.fq | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' > myfile.2.fq

And the first set of reads:

cat myfile.fq | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' > myfile.1.fq

Or, if they are gzipped:

zcat myfile.fq | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip > myfile.2.fq.gz

Then run this through process_radtags.

I will add the XhoI enzyme in the next release.

Best,

julian

joshua....@gmail.com wrote:
--
Stacks website: http://creskolab.uoregon.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

joshua....@gmail.com

unread,
Sep 23, 2014, 11:26:50 AM9/23/14
to stacks...@googlegroups.com, joshua....@gmail.com, lani.g...@gmail.com, jcat...@uoregon.edu
Okay great, thanks for the code—that saved me many hours of figuring out how to de-interleave them on my own! 

So, to recap:

1) Create separate files for the single and paired end reads
2) Run process_radtags on the single and paired ends separately with rad site check enabled
3) Concatenate the two read files into a single fastq file for each individual
4) Run the individuals through denovo_map.pl with single and paired ends in the same file and a population map specified. 

Thanks again for your help.

Cheers,
Josh
Reply all
Reply to author
Forward
0 new messages