Demultiplexed data containing four barcodes

olga kozhar

unread,

Mar 27, 2020, 7:57:42 PM3/27/20

to Stacks

Hello,

I am new to Stacks and RADseq data, and I am trying to figure out if I can use Stacks for the project I just inherited. The library was prepared by someone else, and sequence has been done before I joined the team. Here is what I have been given:

I have ddRADseq data for a population of 120 individuals. The data have been already demultiplexed, but it still contains barcodes. We looked in one of the demultiplexed files and discovered barcodes there. I also have Excel file containing barcodes for each individual, however, the issue that I am experiencing now is that each individual is assigned four barcodes instead of two as I expected.

Here is the example:

Isolate ID	Cla -internal barcode -	Bam - internal barcode	i5 -barcode	i7 - barcode
AS-1.1-PN	GATACCATCGG	GCATCACGATCC	TATGACCG	GGTGTCTT
AS-1.2-PN	AGCGTTGATCGG	ATGCTGTCGATCC	AGCCAACT	AAGAAGGC
AS-12.1-PN	CTGCAACTATCGG	ATGCTGTCGATCC	TGGATGGT	GCTGGATT
AS-12.2-PN	TCATGGTCAATCGG	GCATCACGATCC	TTCGAAGC	TAACGAGG

Since it was ddRADseq I realize I have barcodes for each of the enzymes (Cla and Bam), but I also have index barcodes i5 and i7.

It doesn't seem I can use all four barcodes in process_radtags (not surprisingly, I am getting an error that my barcodes file has too many columns). I realize I may be missing some background information, but I went ahead and played with a small dataset. I have tried few different scenarios by pairing barcodes in different runs, e.g., only Cla and Bam, only Cla and i5 etc. As a result I get either few or zero reads retained, due to either barcode not found drops or RAD cutsite not found drops. I understand that the program cannot identify barcodes in reads and, therefore, drops them.

Given this amount of information that I have so far, could someone please advice me if it is possible at all to use Stacks for the data I have got (demultiplexed but with barcodes), and with four barcodes listed for each individual? Any suggestions that could navigate me into the right direction would be very helpful.

Thank you very much,

Olga

Julian Catchen

unread,

Mar 29, 2020, 3:21:43 PM3/29/20

to stacks...@googlegroups.com, olga kozhar

Hi Olga,

When you say your data have already been demultiplexed, do you mean that
the i5/i7 barcodes were demultiplexed, but the inline barcodes are still
present? Typically, the Illumina software will demulitplex its own
barcodes (the i5/i7 ones here). If this is the case, then you simply
ignore the i5/i7 barcodes and just specify the internal, inline
barcodes. Then run process_radtags on each pair of i5/i7 demultiplexed
files.

julian

olga kozhar wrote on 3/27/20 6:57 PM:

olga kozhar

unread,

Mar 30, 2020, 6:51:49 PM3/30/20

to Stacks

Hi Julian,

thank you for the response. I think you are right regarding demultiplexing i5/i7 barcodes only. Also, I was able to retain more information about library preparation. RAD3 method was used for this project (the third enzyme was used to cut dimers). In Adapterama III paper where they describe the method they talk about using Stacks for data cleaning and processing. So I just assumed that my data should work with software. First, I want to ask the question: Can the 3RAD be the issue here?

Three enzymes that were used: BamHI, mspI, and ClaI. In the Adapterama III paper they provide the table with enzymes and say that in this case Cla enzyme will be the third enzyme (for cutting dimers).

I looked into sequences of two files (that correspond to two reads for the same sample) and barcodes present in the reads. The cutting sites match mspI and bamHI.

I ran the following script for the pair of reads that I checked sequences of first:

process_radtags -1 ./AS-1_2-PN_1.fastq.gz -2 ./AS-1_2-PN_2.fastq.gz -o ./cleaned/ -b ./barcodes_no_cutting_site_AS.txt --renz_1 mspI --renz_2 bamHI --inline-inline -r -c -q -y fastq --adapter_1 TACACTCTTTCCCTACACGACGCTCTTCCGATCT --adapter_2 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT --adapter_mm 2

As you can see in the name of barcodes file I deleted sequences for the cutting sites from them, meaning that instead of mspI barcode AGCGTTGATCGG I used AGCGTTGAT. And instead of Bam barcode ATGCTGTCGATCC I used ATGCTGTC (I followed your suggestions from the thread https://groups.google.com/forum/#!topic/stacks-users/Sa8X4aArDHA).

As a result of this analysis, I received the following statistics:

1375804 total sequences
11222 reads contained adapter sequence (0.8%)
702566 barcode not found drops (51.1%)
53098 low quality read drops (3.9%)
16553 RAD cutsite not found drops (1.2%)
592365 retained reads (43.1%)

So, 43.1% of reads retained. In 51.1% data barcode was not found.That's in half of the reads.

I tried to run analysis with only one restriction enzyme: first, only mspI:

1375804 total sequences
11470 reads contained adapter sequence (0.8%)
690960 barcode not found drops (50.2%)
60078 low quality read drops (4.4%)
2058 RAD cutsite not found drops (0.1%)
611238 retained reads (44.4%)

Very similar statistics to the one when both enzymes were listed. In half of cases barcodes not found.

Second, only with Bam:

1375804 total sequences
      0 reads contained adapter sequence (0.0%)
1375784 barcode not found drops (100.0%)
      3 low quality read drops (0.0%)
      5 RAD cutsite not found drops (0.0%)
     12 retained reads (0.0%)

As a result, 0% retained reads. Based on this I want to say that Stacks cannot detect Bam barcode. Is my conclusion correct? In the fasta file this barcode is present, I checked the sequence, it is the same as I use in barcodes file.

I feel stuck at this point. Maybe I am missing something that is obvious for a more skilled eye, but I do not understand why in 50% data the barcode is not found.

Do you by chance have any suggestions what may be causing this issue and how should I proceed from here?

Alice Mouton

unread,

Nov 25, 2020, 9:09:15 AM11/25/20

to Stacks

Hi Olga,

did you resolve your problem? I am using 3RAD as well with Msp1,Cla,BamH1 and I run into similar issues. Only for ClaD and ClaH, all samples are discarded. The retained reads are only saved into my rem2.fq so I can't do any downstream analyses

As you can see almost half of them have Low Quality reads. I would not be concerned if it was happening in only one sample but here I have this with all samples with ClaD and ClaH and my colleague who did his libraries independently than me has exactly the same results too. When I grep the barcode for ClaD and ClaH, I found it in almost 95% of my reads. I'd like to know how you moved forward.

Thanks!

Total Sequences 2719228

Barcode Not Found 74080

Low Quality 1162122

RAD Cutsite Not Found 602331

Retained Reads 880695

Barcode Filename Total NoRadTag LowQuality Retained

GGTCTACGTA-GTATGCAC MLU_040_LA 2645148 602331 1162122 880695

Julian Catchen

unread,

Nov 29, 2020, 3:11:16 PM11/29/20

to stacks...@googlegroups.com, Alice Mouton

Hi Alice,

As you can see from the process_radtags output, barcodes are found in
almost all of your samples (only 74k reads were discarded for no
barcode, or a barcode with too many sequecning errors). As you mention,
most of your reads are discarded for quality, which can be one of two
things (depending on how you ran the program). 1) Either there is one or
more uncalled bases in your reads (Ns), or your phred scores are low. I
would bet on the former problem. How many of your raw reads have Ns in
them? This can sometimes happen systematically on the flow cell if there
is a problem during sequencing. You can choose to keep these reads by
omitting the -c flag. If it is general low phred scores, you can see
this by looking at the plots from a program like fastqc.

Best,

julian

Alice Mouton wrote on 11/25/20 8:09 AM:

Alice Mouton

unread,

Dec 11, 2020, 9:17:04 AM12/11/20

to Stacks

Hi Julian,

Many thanks for your answer. I ran fastqc on the original sample and then on sample that was trimmed by trimgalor (MLU_val1). When I run processradtag omitting the -c it still somehow removes most of the read. I see that the beginning of my fastq looks weird so I wonder if I should maybe cut the first ten bases and last five from my fastq files with fastq? then running through process radtag to do the filtering anyway without any info on the enzymes used? just with -c -q? however then I will lose the barcode.

Is it something you would recommend?

Many thanks!

Best regards

Alice

process_radtags -1 ../Input_Fastq/MLU040LA_R1_001_val_1.fq.gz -2 ../Input_Fastq/MLU040LA_R2_001_val_2.fq.gz -b ./barcode/mi55.txt --inline_inline -r -t 130 --renz_1 MspI --renz_2 BamHI -o $SCRATCH/3RAD/Output_processradtag/ -i gzfastq

File Retained Reads Low Quality Barcode Not Found RAD cutsite Not Found Total

MLU040LA_R1_001_val_1.fq.gz 942087 744543 74080 958518 2719228

Total Sequences 2719228

Barcode Not Found 74080

Low Quality 744543

RAD Cutsite Not Found 958518

Retained Reads 942087

Barcode Filename Total NoRadTag LowQuality Retained

GGTCTACGTA-GTATGCAC MLU_040_LA 2645148 958518 744543 942087

MLU040LA_R1_001_fastqc.html

MLU040LA_R1_001_val_1_fastqc.html

Alice Mouton

unread,

Mar 26, 2021, 10:18:32 AM3/26/21

to Stacks

HI Julian, I know I am getting back months later. I have been working on my successful reads but I need to do new libraries on new samples and I want to understand why I have so much drops in my sequences. yes you were right about the N. Some of my reads have one N somewhere at the beginning. It s likely the reason behind my reads dropping. I would like to keep these reads now that I see that this is something at the beginning of my reads. I just don't understand how to remove flags. Even when I write the single command omitting -c and -q, it still removes all my reads. In the manual these flags are written in bold so are there always working even when they are not specified in the command line?

The University cluster has Stacks2.41 installed.

Best regards

Alice Mouton

On Sunday, November 29, 2020 at 9:11:16 PM UTC+1 jcatchen wrote:

Julian Catchen

unread,

Mar 29, 2021, 10:56:37 AM3/29/21

to stacks...@googlegroups.com, Alice Mouton

Hi Alice,

If you do not specify -c or -q, then they will not be activated. If you
omit these options, you are not cleaning the reads or even examining the
phred scores, just demultiplexing and checking the cutsite (and these
functions can also be disabled). You will need to look at the log file,
which will tell you why reads are being dropped (bad barcode or bad
cutsite if you are not specifying -c or -q).

Best,

julian

Alice Mouton wrote on 3/26/21 9:18 AM:

Alice Mouton

unread,

Mar 29, 2021, 2:58:46 PM3/29/21

to Julian Catchen, stacks...@googlegroups.com

Dear Julian,

Many thanks for your reply and your patience. I resolved the problem! I have been bothering so many people and I finally found why I had half of my reads dropped. I have to share my findings for the community, I don't know how I never thought about it and to be honest this is like processradtag beginner 101...:)

I used 3RAD for my libraries (Cla/msp1/Bamh1). When I was running process radtag on my libraries done with ClaD and ClaH, I had only 44% of reads recovered. I thought I had problems with my libraries, the primers, the sequencing..I never thought about the length of the barcode. ClaD and ClaH have 11 nucleotides and when I was setting my -t at 140 for my 150 bp reads, of course it was dropping half my reads (given there were 139 length..)..even when I was omitting -c and -q... . My other barcodes have between 8 and 10 so my command with process radtag and -t 140 worked well on the other samples..

Now I did the test with -t 135 (and with -c and -q ) and I have 94% read recovery. I just ran on two samples though but I am pretty sure the failed ones will now pass successfully the next step.

If anyone has some trouble with half the reads disappearing during process radtags, please check also the length of your inline barcodes...

Many thanks again Julian,

Best regards

Alice

--

Alice Mouton Ph.D.

FNRS Postdoctoral Researcher

InBios - Conservation genetics lab

University of Liege

Office Phone: (0032)43662130

Note: You may get emails from me outside of normal working hours. Please do not feel any pressure to respond outside of your own working pattern.

https://www.researchgate.net/profile/Mouton_Alice

Reply all

Reply to author

Forward