Demultiplexed data containing four barcodes

652 views
Skip to first unread message

olga kozhar

unread,
Mar 27, 2020, 7:57:42 PM3/27/20
to Stacks
Hello,

I am new to Stacks and RADseq data, and I am trying to figure out if I can use Stacks for the project I just inherited. The library was prepared by someone else, and sequence has been done before I joined the team. Here is what I have been given:

I have ddRADseq data for a population of 120 individuals. The data have been already demultiplexed, but it still contains barcodes. We looked in one of the demultiplexed files and discovered barcodes there. I also have Excel file containing barcodes for each individual, however, the issue that I am experiencing now is that each individual is assigned four barcodes instead of two as I expected.

Here is the example:

Isolate ID Cla -internal barcode - Bam - internal barcode i5 -barcode i7 - barcode
AS-1.1-PN GATACCATCGG GCATCACGATCC TATGACCG GGTGTCTT
AS-1.2-PN AGCGTTGATCGG ATGCTGTCGATCC AGCCAACT AAGAAGGC
AS-12.1-PN CTGCAACTATCGG ATGCTGTCGATCC TGGATGGT GCTGGATT
AS-12.2-PN TCATGGTCAATCGG GCATCACGATCC TTCGAAGC TAACGAGG

Since it was ddRADseq I realize I have barcodes for each of the enzymes (Cla and Bam), but I also have index barcodes i5 and i7.
It doesn't seem I can use all four barcodes in process_radtags (not surprisingly, I am getting an error that my barcodes file has too many columns). I realize I may be missing some background information, but I went ahead and played with a small dataset. I have tried few different scenarios by pairing barcodes in different runs, e.g., only Cla and Bam, only Cla and i5 etc. As a result I get either few or zero reads retained, due to either barcode not found drops or RAD cutsite not found drops. I understand that the program cannot identify barcodes in reads and, therefore, drops them.

Given this amount of information that I have so far, could someone please advice me if it is possible at all to use Stacks for the data I have got (demultiplexed but with barcodes), and with four barcodes listed for each individual? Any suggestions that could navigate me into the right direction would be very helpful.

Thank you very much,
Olga

Julian Catchen

unread,
Mar 29, 2020, 3:21:43 PM3/29/20
to stacks...@googlegroups.com, olga kozhar
Hi Olga,

When you say your data have already been demultiplexed, do you mean that
the i5/i7 barcodes were demultiplexed, but the inline barcodes are still
present? Typically, the Illumina software will demulitplex its own
barcodes (the i5/i7 ones here). If this is the case, then you simply
ignore the i5/i7 barcodes and just specify the internal, inline
barcodes. Then run process_radtags on each pair of i5/i7 demultiplexed
files.

julian

olga kozhar wrote on 3/27/20 6:57 PM:

olga kozhar

unread,
Mar 30, 2020, 6:51:49 PM3/30/20
to Stacks
Hi Julian,

thank you for the response. I think you are right regarding demultiplexing i5/i7 barcodes only. Also, I was able to retain more information about library preparation. RAD3 method was used for this project (the third enzyme was used to cut dimers). In Adapterama III paper where they describe the method they talk about using Stacks for data cleaning and processing. So I just assumed that my data should work with software. First, I want to ask the question: Can the 3RAD be the issue here?

Three enzymes that were used: BamHI, mspI, and ClaI. In the Adapterama III paper they provide the table with enzymes and say that in this case Cla enzyme will be the third enzyme (for cutting dimers).

I looked into sequences of two files (that correspond to two reads for the same sample) and barcodes present in the reads. The cutting sites match mspI and bamHI.

I ran the following script for the pair of reads that I checked sequences of first:

process_radtags -1 ./AS-1_2-PN_1.fastq.gz -2 ./AS-1_2-PN_2.fastq.gz -o ./cleaned/ -b ./barcodes_no_cutting_site_AS.txt --renz_1 mspI --renz_2 bamHI --inline-inline -r -c -q -y fastq --adapter_1 TACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2


As you can see in the name of barcodes file I deleted sequences for the cutting sites from them, meaning that instead of mspI barcode AGCGTTGATCGG I used AGCGTTGAT. And instead of Bam barcode ATGCTGTCGATCC I used ATGCTGTC (I followed your suggestions from the thread https://groups.google.com/forum/#!topic/stacks-users/Sa8X4aArDHA).

As a result of this analysis, I received the following statistics:

1375804 total sequences
  11222 reads contained adapter sequence (0.8%)
 702566 barcode not found drops (51.1%)
  53098 low quality read drops (3.9%)
  16553 RAD cutsite not found drops (1.2%)
 592365 retained reads (43.1%)

So, 43.1% of reads retained. In 51.1% data barcode was not found.That's in half of the reads.

I tried to run analysis with only one restriction enzyme: first, only mspI:

1375804 total sequences
  11470 reads contained adapter sequence (0.8%)
 690960 barcode not found drops (50.2%)
  60078 low quality read drops (4.4%)
   2058 RAD cutsite not found drops (0.1%)
 611238 retained reads (44.4%)

Very similar statistics to the one when both enzymes were listed. In half of cases barcodes not found.

Second, only with Bam:

1375804 total sequences
      0 reads contained adapter sequence (0.0%)
1375784 barcode not found drops (100.0%)
      3 low quality read drops (0.0%)
      5 RAD cutsite not found drops (0.0%)
     12 retained reads (0.0%)


As a result, 0% retained reads. Based on this I want to say that Stacks cannot detect Bam barcode. Is my conclusion correct? In the fasta file this barcode is present, I checked the sequence, it is the same as I use in barcodes file.

I feel stuck at this point. Maybe I am missing something that is obvious for a more skilled eye, but I do not understand why in 50% data the barcode is not found.

Do you by chance have any suggestions what may be causing this issue and how should I proceed from here?

Alice Mouton

unread,
Nov 25, 2020, 9:09:15 AM11/25/20
to Stacks
Hi Olga,
did you resolve your problem? I am using 3RAD as well with Msp1,Cla,BamH1 and I run into similar issues. Only for ClaD and ClaH, all samples are discarded. The retained reads are only saved into my rem2.fq so I can't do any downstream analyses
As you can see almost half of them have Low Quality reads. I would not be concerned if it was happening in only one sample but here I have this with all samples with ClaD and ClaH and my colleague who did his libraries independently than me has exactly the same results too. When I grep the barcode for ClaD and ClaH, I found it in almost 95% of my reads. I'd like to know how you moved forward. 
Thanks!


Total Sequences 2719228
Barcode Not Found       74080
Low Quality     1162122
RAD Cutsite Not Found   602331
Retained Reads  880695

Barcode Filename                                                      Total     NoRadTag        LowQuality      Retained
GGTCTACGTA-GTATGCAC     MLU_040_LA      2645148    602331               1162122          880695


Julian Catchen

unread,
Nov 29, 2020, 3:11:16 PM11/29/20
to stacks...@googlegroups.com, Alice Mouton
Hi Alice,

As you can see from the process_radtags output, barcodes are found in
almost all of your samples (only 74k reads were discarded for no
barcode, or a barcode with too many sequecning errors). As you mention,
most of your reads are discarded for quality, which can be one of two
things (depending on how you ran the program). 1) Either there is one or
more uncalled bases in your reads (Ns), or your phred scores are low. I
would bet on the former problem. How many of your raw reads have Ns in
them? This can sometimes happen systematically on the flow cell if there
is a problem during sequencing. You can choose to keep these reads by
omitting the -c flag. If it is general low phred scores, you can see
this by looking at the plots from a program like fastqc.

Best,

julian

Alice Mouton wrote on 11/25/20 8:09 AM:

Alice Mouton

unread,
Dec 11, 2020, 9:17:04 AM12/11/20
to Stacks
Hi Julian,
Many thanks for your answer.  I ran fastqc on the original sample and then on sample that was trimmed by trimgalor (MLU_val1). When I run processradtag omitting the -c it still somehow removes most of the read. I see that the beginning of my fastq looks weird so I wonder if I should maybe cut the first ten bases and last five from my fastq files with fastq? then running through process radtag to do the filtering anyway without any info on the enzymes used? just with -c -q?  however then I will lose the barcode. 
 Is it something you would recommend? 
Many thanks!
Best regards
Alice

process_radtags -1 ../Input_Fastq/MLU040LA_R1_001_val_1.fq.gz -2 ../Input_Fastq/MLU040LA_R2_001_val_2.fq.gz -b ./barcode/mi55.txt --inline_inline -r -t 130 --renz_1 MspI --renz_2 BamHI -o $SCRATCH/3RAD/Output_processradtag/ -i gzfastq

File    Retained Reads  Low Quality     Barcode Not Found       RAD cutsite Not Found   Total
MLU040LA_R1_001_val_1.fq.gz     942087  744543  74080   958518  2719228

Total Sequences 2719228
Barcode Not Found       74080
Low Quality     744543
RAD Cutsite Not Found   958518
Retained Reads  942087

Barcode Filename        Total   NoRadTag        LowQuality      Retained
GGTCTACGTA-GTATGCAC     MLU_040_LA      2645148 958518  744543  942087
MLU040LA_R1_001_fastqc.html
MLU040LA_R1_001_val_1_fastqc.html

Alice Mouton

unread,
Mar 26, 2021, 10:18:32 AM3/26/21
to Stacks
HI Julian, I know I am getting back months later. I have been working on my successful reads but I need to do new libraries on new samples and I want to understand why I have so much drops in my sequences.  yes you were right about the N. Some of my reads have one N somewhere at the beginning. It s likely the reason behind my reads dropping. I would like to keep these reads now that I see that this is something at the beginning of my reads. I just don't understand how to remove flags. Even when I write the single command omitting -c and -q, it still removes all my reads. In the manual these flags are written in bold so are there always working even when they are not specified in the command line? 
The University cluster has Stacks2.41 installed.

Best regards
Alice Mouton

On Sunday, November 29, 2020 at 9:11:16 PM UTC+1 jcatchen wrote:

Julian Catchen

unread,
Mar 29, 2021, 10:56:37 AM3/29/21
to stacks...@googlegroups.com, Alice Mouton
Hi Alice,

If you do not specify -c or -q, then they will not be activated. If you
omit these options, you are not cleaning the reads or even examining the
phred scores, just demultiplexing and checking the cutsite (and these
functions can also be disabled). You will need to look at the log file,
which will tell you why reads are being dropped (bad barcode or bad
cutsite if you are not specifying -c or -q).

Best,

julian

Alice Mouton wrote on 3/26/21 9:18 AM:

Alice Mouton

unread,
Mar 29, 2021, 2:58:46 PM3/29/21
to Julian Catchen, stacks...@googlegroups.com
Dear Julian, 
Many thanks for your reply and your patience. I resolved the problem! I have been bothering so many people and I finally found why I had half of my reads dropped. I have to share my findings for the community, I don't know how I never thought about it and to be honest this is like processradtag beginner 101...:)
I used 3RAD for my libraries (Cla/msp1/Bamh1). When I was running process radtag on my libraries done with ClaD and ClaH, I had only 44% of reads recovered. I thought I had problems with my libraries, the primers, the sequencing..I never thought about the length of the barcode. ClaD and ClaH have 11 nucleotides and when I was setting my -t at 140 for my 150 bp reads, of course it was dropping half my reads (given there were 139 length..)..even when I was omitting -c and -q... . My other barcodes have between 8 and 10 so my command with process radtag and -t 140 worked well on the other samples..
Now I did the test with -t 135 (and with -c and -q ) and I have 94% read recovery. I just ran on two samples though but I am pretty sure the failed ones will now pass successfully the next step. 
If anyone has some trouble with half the reads disappearing during process radtags, please check also the length of your inline barcodes...
Many thanks again Julian,
Best regards
Alice   
--

Alice Mouton  Ph.D. 
FNRS Postdoctoral Researcher 
InBios - Conservation genetics lab
University of Liege 

Office Phone: (0032)43662130 

Note: You may get emails from me outside of normal working hours. Please do not feel any pressure to respond outside of your own working pattern.


Reply all
Reply to author
Forward
0 new messages