Running Demultiplexed Data

154 views
Skip to first unread message

leanne....@gmail.com

unread,
Sep 30, 2015, 1:02:33 AM9/30/15
to AftrRAD
Hi there!

I've just got a couple of questions regarding using demultiplexed data.
In AftrRAD.pl the restriction enzyme parameter should be set to 0, right? If not how should you handle double digestion, just use one of the enzymes?
Also one question regarding barcodes - for the demultiplexed data I'm not creating a barcode file for AftrRAD.pl to use. But it seems to create one of it's own - and with barcodes that don't match those of my samples. Why does it do this? And does this get used further down the pipeline and possibly mess things up?

Thanks
Leanne

Mike Sovic

unread,
Sep 30, 2015, 10:12:47 AM9/30/15
to AftrRAD

Hi Leanne,

 

In terms of the restriction enzyme, generally, double digest protocols are designed such that all sequence reads associated with one sequencing direction will be associated with one of the two restriction enzymes.   So, even if you're doing double digest, I think you'll likely still have a RE site in your sequences, and yes, you'd need to identify the appropriate one. 

 

If you're unsure which one to use, the easiest way to check might be to view your fastq file (should probably do this with a command like 'more' or 'less' in a Terminal window - likely too big for programs to open the whole file), and look for a part of the sequence that is repeated in every read (ignoring sequencing error).  If you have demultiplexed data, and no inline barcodes (see below), this repeated section should be at the beginning of each read.  Note that this will probably not make up the entire recognition sequence for the enzyme, and the portion that is actually in the reads is what you want to feed in to AftrRAD for the 're' parameter.

 

In terms of your demultiplex question, I might need a bit of clarification where you say the barcodes it creates don't match your samples, but the demultiplex option might be a little confusing in general, so I'll try to explain it here a little better than it is in the current manual.  If this doesn’t answer your question, then let me know.

 

First, there are two main ways to label each individual sample when prepping libraries.  One option is to construct your adaptors such that a “barcode” unique to each sample becomes the first X bases (usually 5-7, but could be more or less) of each sequence read for that sample.  In this case, if you do a single-read 50 bp Illumina run with 6 bp barcodes, then you sacrifice 6 of the 50 bases as barcode.  I'll refer to this as inline-barcodes, or just "barcodes" for short.  An alternative is to put an "index" sequence more in the middle of the adaptor, distant from where the adaptor actually attaches to the fragment.  This is the most common method in Illumina sequencing, especially when you get outside of RADseq (though lots of RADseq folks do this too).  For clarity, I'll refer to these as "indexes" as opposed to "barcodes", though these two terms are often interchanged.  In the case of indexes, the sequencer has to be set for an “indexed run”, and in the case of a SR-50 index run, it would sequence the first 50 bases of your fragments as normal, but then start a second, separate sequencing reaction with a different sequencing primer that is oriented in the opposite direction as the first, sequencing just the index position.  It will use these index reads to sort the sequences out into different fastq files (demultiplexing).  This is unlike the inline-barcode approach, where the sequencer doesn’t actually interpret the barcodes, but just dumps all the sequences into the same fastq file – you have to sort them out later. 

 

When we first put AftrRAD together, we wrote it for our own data (yeah, maybe a little selfish there), which at the time only contained in-line barcodes (everything was in one fastq file).  This is where the Barcode file comes in to play, as it tells AftrRAD how to sort the reads out and assign them to individual samples based on the inline barcodes.  The way it is written, AftrRAD needs each sequence read to start with an inline barcode, but datasets constructed with the index approach don't have these, and instead, their reads are demultiplexed into individual files/samples by the sequencer.  The easiest way for me to deal with this was to generate a unique, random barcode sequence for each of the demultiplexed files, and use these to create inline barcodes by adding them to the reads in the appropriate fastq file.  Then, we just concatenate all of the deumltiplexed files together (with all reads now identified with an inline barcode) into a single file that we name “AllSamples.txt” (so essentially, we’re going backwards by undoing the demultiplexing that the sequencer did).  This puts the data in the format that AftrRAD expects.  From here, AftrRAD re-demultiplexes the samples as part of a normal run, doing it based on the barcodes that were added.

 

So, in terms of your specific question, the barcodes in the Barcode file that is created when the 'dplexedData' flag is used are randomly generated barcodes, and so will not match the index sequences contained in your adaptors.  This is OK, and you are correct that if you're using the 'dplexedData' flag, you shouldn't be entering any barcodes at the start of the run.  Hopefully, this is what you were asking.  Alternatively, if when using this 'dplexedData' flag, sequence reads from one original demultiplexed file had gotten the same inline barcode as sequence reads from a different demultiplexed file, then this would certainly be a problem.  I’m not aware of any problems like this, but if you think this may be occurring, please let me know and we’ll look in to it more.


Sorry about the really long response, but hopefully that will clarify something for someone!



                     Mike

leanne....@gmail.com

unread,
Oct 1, 2015, 1:35:27 AM10/1/15
to AftrRAD
Thanks Mike! 
Very helpful response! I now understand the barcoding and as you explained, the data I have uses indexes rather than inline barcodes. I also found which enzyme was used for sequencing (sorry I'm playing catch up as I didn't do the lab work and have just been handed the files). But more often than not this re sequence is somewhere in the middle or end of the reads, rather than the start. Are these reads just sequencing error or is there something else that could be going on?

Mike Sovic

unread,
Oct 1, 2015, 11:24:02 AM10/1/15
to AftrRAD
Hi Leanne,

Great - glad it was helpful.  I would definitely double check things in terms of your RE sequence.  Maybe I'm missing something here, and someone please correct me if I'm wrong, but with the exception of adding in-line barcodes to your adaptors, I can't think of a scenario where the RE site wouldn't be the beginning of the sequence.  The beauty of RADseq is that these RE sites essentially "anchor" where the sequencing starts, so that homologous loci can be sampled efficiently from multiple samples.  I don't think they should be showing up at different places throughout the reads.  

fau...@gmail.com

unread,
Dec 8, 2015, 7:32:02 PM12/8/15
to AftrRAD
Hi Mike,

I've finally had time to get back to this! The original data I was working with was preprocessed with Trimmomatic, but it seemed to do a very poor job resulting in reads with the RE still present throughout (as I explained earlier). So I've now cleaned up the data using cutadapt and RE sequences are no longer in the reads.
I've then tried running AftrRAD again with the following...

Arguments entered are...

Help 0

re 0

numIndels 3

Phred 33

minQual 20

DataPath Data/

MaxH 90

BarcodePath Barcodes/

P2 noP2

stringLength 15

minIden 90

minParalog 5

dplexedData 1

minDepth 5


But the run and the whole computer freezes at this point...

......

Filtering sequence 366000000.

Use of uninitialized value $TotalBarcodeNonMatches in concatenation (.) or string at AftrRAD.pl line 957.

Use of uninitialized value $TotalBarcodeNonMatches in concatenation (.) or string at AftrRAD.pl line 958.


Demultiplexing samples for data file AllSamples.txt.



No specific error message - I just come back to the computer and it is frozen. I thought it might be a space issue but the external HD I'm using still has 1TB available.


Any other ideas?


Thanks

Leanne


Mike Sovic

unread,
Dec 9, 2015, 11:07:09 AM12/9/15
to AftrRAD
Hi Leanne,

Yeah, first, just to clarify, those "initialized value" warnings are a bug associated with running demultiplexed data, but are not causing any problems - will fix in upcoming version.

The first thing to do is to take a look at the raw data going in to AftrRAD and see if you have sequences of varying lengths.  If these data have been run through Trimmomatic, then I'm guessing the sequences are different lengths.  This is not a good idea with AftrRAD, and is probably why it's getting hung up at that step (it sees lots more unique sequences than it should, because two reads with the same sequence, but different lengths, are considered different from each other).  The number of unique reads is an important factor in the time it takes to complete the step you're getting hung up at.  The upcoming version will have some updates that reduce the time for this step, esp. for large datasets, but even with that, there are other reasons that having sequences of different lengths is bad.

If the reads are different lengths, my suggestion would be to go back to the original fastq files and try using these (not edited with Trimmomatic or other programs).  If read quality is especially bad toward the end of most/all of the reads, then I'd suggest trimming them all down to the same length prior to running.  If you want to do this, let me know - we have a simple script that I'm happy to make available for this purpose.

If this isn't the case, let us know, and we'll try to figure out what else might be happening.

           Mike 

fau...@gmail.com

unread,
Dec 13, 2015, 7:03:42 PM12/13/15
to AftrRAD

Hi Mike,


Apologies if you're getting this message to you inbox several times.


Thanks. I don't think it should be the read lengths causing a problem. I realised this could be an issue when I first ran AftrRAD as the data I have is from two separate sequencing runs. So at first I trimmed everything back to 49bp then ran through cutadapt with two steps - trimming adaptors and then removing reads less than 49bp.


Leanne

fau...@gmail.com

unread,
Dec 15, 2015, 8:19:11 PM12/15/15
to AftrRAD
Hi Mike,

Perhaps it's an issue with not specifying an RE? To test this I ran it last night with the RE seq and it seemed to work. (but then I got an error at the Genotype stage...) But I'm not worried about that just yet. In setting the RE and checking the report in the RunInfo folder I found that AftrRAD found many RE sequences even after the cutadapt cleanup I had done. So I'm going back a step! :/  
I remain concerned something didn't quite go right in the lab. As I can't attach files on this group forum I'll forward another email to your inbox directly. If you have time to check it over I'd appreciate any ideas you have.

Thanks
Leanne

fau...@gmail.com

unread,
Dec 16, 2015, 1:37:46 AM12/16/15
to AftrRAD
Hi again,

By including the RE parameter in AftrRAD I think it is doing a better job of removing all reads containing the RE/Adaptor sequences than cutadapt.

But then I get this error when running the Genotype.pl

Currently genotyping sample...SscRAD1_201_trimmed49.fq sh: R: command not found

No such file or directory at Genotype.pl line 168.

Mike Sovic

unread,
Dec 18, 2015, 12:53:41 PM12/18/15
to AftrRAD
Hi Leanne,

Thanks for sending those files by e-mail - they were helpful.  I don't completely understand how the protocol works (maybe just from not spending enough time trying to work through it), but I think I get what the goal is in terms of incorporating the RE site into the adaptor (I'm assuming just to avoid "wasting" sequencing effort on the cut sites?). 

Anyway, assuming it's generating good data, and that no part of the restriction site is expected in the sequences, my thoughts would have been…

1.)  As you originally suggested, the RE flag should be set to '0' (no restriction enzyme on the R1 end).

2.)  I guess you could set the P2 flag to something like 'GCTCTTCCG', as this appears to be the adaptor you're searching for using cutadapt.  I'd just do it this way and not use cutadapt.  However, it looks to me like this sequence is actually part of the Illumina R1 sequencing primer, so I'm not sure how it would be in there if this is the sequencing primer that was used.  If there's any adaptor in there, I would expect it to start with something like 'TTAGAGATCGG…', if I'm interpreting things correctly from the protocol you sent.  I guess this would go back to the question of the quality of the data that were generated in the first place.  

Sorry - not sure that any of that is going to be of much help.  If you think the most recent run you did is OK, I'd suggest the following…

Go to the folder "TempFiles/RawReadCountFiles".  In here, there should be two files for each sample in the dataset.  One of these will have "NonParalogous" in the file name.  Open any of these files and just scroll down through them.  These contain all of the polymorphic loci identified, and the number of reads mapping to each of the alleles for that sample.  Make sure this is the case for each of your samples.  If not, there's definitely still a problem with the AftrRAD.pl run that needs addressed.

      Mike

fau...@gmail.com

unread,
Feb 18, 2016, 7:06:51 PM2/18/16
to AftrRAD
Hi Mike,

After attempting (and failing) to run the data set on a couple of different 4GB memory machines I now have it running on a 64GB memory linux. It has been going ok the last 2 days but this morning I have a terminal screen full of this error message:

Use of uninitialized value $Middle7C[7] in join or string at AftrRAD.pl line 1667, <VARIANCEOUT> line 100938
Use of uninitialized value $Middle7C[8] in join or string at AftrRAD.pl line 1667, <VARIANCEOUT> line 100938
Use of uninitialized value $Middle7C[9] in join or string at AftrRAD.pl line 1667, <VARIANCEOUT> line 100938
Use of uninitialized value $Middle7C[10] in join or string at AftrRAD.pl line 1667, <VARIANCEOUT> line 100938

It still seem to be running but just continuously printing out these messages.
Any ideas?

Leanne

Mike Sovic

unread,
Feb 19, 2016, 7:19:24 AM2/19/16
to AftrRAD
Hi Leanne,

Well, my first thought is that there are sequences of different lengths in the dataset.  See if you can open the file 'TempFiles/ErrorReadTest/ErrorTestOut.txt', scroll through it, and make sure that everything in there is the same length.  Specifically, based on the error message you sent, check line 100938 of this file.  If there are reads of different lengths, then I'm sure this is the issue.  If not, then I'll have to think about this one a bit more.

             Mike
Reply all
Reply to author
Forward
0 new messages