N insertion in fastq files + demultiplex issues

568 views
Skip to first unread message

PAIGE KULZER

unread,
Mar 7, 2021, 6:14:50 PM3/7/21
to Stacks
Hello! I could use some help demultiplexing our ddRAD data with process_radtags. We have Hi-Seq 4000 paired end data back from the sequencing facility, but they did not demultiplex for us so we have "Undetermined...fastq.gz" files with an "N" proceeding each barcode in the headers as well as the sequences themselves. I will paste an example below:

@K00337:359:HGJV5BBXY:7:1101:1499:1314 2:N:0:NATCCATG+NATTCATG
NCGCATCATGAACCATTACCGTTCAAAATTCCAGAGAGACTATAATACCTGTGATATGTAGGATTACTGAGATAAATTAATGATCCAATAGCCTGTATGTTTAAACTAGATCTTTGTTAGTATTACATAGAGCTATGGGTTGTAATTTTTC
+
#A<FF<FFJJJAFJJJ<FFJJJJJJFAJJJJJJJJJJAFJAJJJAJFFJJJ<A7JJJJJJJJ<FJ-AFFFJAFJFJJFJAJJ-AAFAFJFJJJF7F-7F<JJF-7-7FFFA<-<AFFAF7F<FAJJJ----<--7A<--A-<JA-F--<--
@K00337:359:HGJV5BBXY:7:1101:1681:1314 2:N:0:NGCCATCT+NACCGAGC
NTGCTGTCATGCTCTGATATCAGGCGGCTGTGGTCACACATCTCCTCTCGCTGTGGCCGAACCAGAAGCAGATATGAATGCAGGCTGCCTAAATTCTTCCTACTGCACTCCTTTCGGAGATTGCTGATCGTATTGTACTGCCCCCAGAACC
+
#A<<F--7FJJFFJFJ-F-7FJJJ-<FFFAA-<J<JJJJJ-F-FJF<JF7JA<<-<J---7FA-<-A7AF<-AJJ---<-<-77AF--7FF--7-----777A-7-7-77<-7-7--7-7-AF7----77--A-------A)-)))))---
@K00337:359:HGJV5BBXY:7:1101:1824:1314 2:N:0:NCCTATCA+NCTACGCC
NACATGTGGCAAGAAAGGAGGAAAAAAAGAGAGGAGGAGGAGCCAGGCTATTTTTAGCAATCAGATCTTATGGAAACTAATACTGAGAAGTCACTCGTTACGATGGCGGGGAGTCTGCAATTAGCTCGCCCCACGCTCGTCCAGGCTTCTG

We are hoping to demultiplex first with only the I7 index (our i5 index consisted of N's so that we could determine PCR duplicates and therefore we do not have specified sequences for the i5 index). Then we will demultiplex once more using our inline barcodes. We cannot find a way to tell Stacks to only look at one of the barcodes in the header while ignoring the other for this first round. We have tried specifying: --index_null and --null_index to be able to only look at the i7 barcode. 

Also, we are currently losing 100% of our reads to ambiguous barcode drops (which we assume relates to this N insertion), even though we are specifying that we will allow for at least one mismatch. Does Stacks allow for an N to be at the start of the barcodes and/or the sequences? Do we need to trim our fastq files or alter our barcode files to reflect this added N?


Julian Catchen

unread,
Mar 9, 2021, 1:13:36 PM3/9/21
to 'PAIGE KULZER' via Stacks
Hi,

In theory, you should be able to specify --null_index to only read the
i7 barcode. I have never actually had a dataset like this to test, but
the code does allow for it. When you say you have an 'N' inserted in the
i7 barcode, what do you mean exactly? Does every i7 barcode in the files
start with N? How are you specifying your barcodes in the barcode file?
(If every barcode starts with an N, and you want to allow for a
mismatch, you need to substitute some base for N in your barcode file,
so the barcodes are the same length and the latter bases line up properly).


'PAIGE KULZER' via Stacks wrote on 3/7/21 5:14 PM:

PAIGE KULZER

unread,
Mar 9, 2021, 4:05:08 PM3/9/21
to Stacks
Thank you for these questions, I could have been clearer. Our barcodes were not created with an N anywhere in them, but now that we have our data back they are all starting with one. So yes, every barcode in these files now start with an N (for reasons that are still unclear to me). This N was not literally an insertion, but rather a substitution of the first base pair of an 8 bp barcode, so our barcode file has 8 bp barcodes and the index also has 8 bp barcodes that start with an N. In other words, the last 7 bp of our i7 barcodes are still intact in both files. This is why we believed that allowing for one mismatch during demultiplexing would let us overlook the ambiguous barcode whilst rescuing as much data as possible. Please let me know if I can clarify any further! I will append a few lines from our barcodes file at the bottom of my response for you to reference.

We did try running process-radtags with the --null_index flag and got the following error message: "You provided single-end barcodes but did not specify a single-end barcode type." I'm guessing that this is happening because Stacks sees two barcodes in our sequence headers.

##barcodes.txt
GATCCATG BDOW2A
GCCTATCA BDOW2B
AACAACCG BDOW2C

Julian Catchen

unread,
Mar 9, 2021, 4:53:32 PM3/9/21
to 'PAIGE KULZER' via Stacks
If all my barcodes came back with an 'N' in the first position, I would
consider that a sequencing failure and ask the core for another library
construction and sequencing lane or my money back.

That said, if you can provide me a fragment of one of your raw files,
with say a few thousand FASTQ records, via Dropbox or similar, I can see
why the code is giving you that error message.

julian

'PAIGE KULZER' via Stacks wrote on 3/9/21 3:05 PM:

PAIGE KULZER

unread,
Mar 9, 2021, 5:13:34 PM3/9/21
to Stacks
We will definitely keep that in mind, we may need to go back for resequencing soon. 

I can send you some files via dropbox before the end of the day today, thank you so much! We are so appreciative of all of your help thus far.

All the best,
Paige

Julian Catchen

unread,
Mar 16, 2021, 4:58:44 PM3/16/21
to 'PAIGE KULZER' via Stacks
Hi Paige,

I have fixed process_radtags to work on --null_index barcodes and tested
it against your files. Please try it out and let me know if it works for
you, before I release the code:

http://catchenlab.life.illinois.edu/stacks/source/stacks-2.56.tar.gz

Best,

julian

'PAIGE KULZER' via Stacks wrote on 3/9/21 4:13 PM:

Sara Laura Šarančić

unread,
Mar 29, 2023, 3:00:43 AM3/29/23
to Stacks
Hello,
I am having a similar problem going on. The sequencing service did not demultiplex for us so we also have "Undetermined...fastq.gz" files with an "N" proceeding each barcode in the headers as well as the sequences themselves. I will paste an example below:

@MG01HX05:998:HJKYFCCX2:1:1101:29589:1485 1:N:0:NAATTCGT+NAATCTTA
NGCAACTCTAATAAATTTGAAACCTCCAAGTAGTTCCAAAAATTACCAAAAACCCGACCCATTAGCCTAATAAACAACTAAACTATGATGTACTACTTAAAGGAATTAACCACTTAATATCCCACTAAGTTGCCCCTTTGTAAATAACCAC
+
#A-FFJFJJJJJJJJJJJJJFFFJ7F<FJF-<77<-7FJ77--<F<FJJFFFF<<-J----7<F77-7--FFFJ-7<F-AAJ--F--7-<-7A<7-777-<77----7A-FJF<-<<77-<A--<77<--777-77-7---7-F7<----7
@MG01HX05:998:HJKYFCCX2:1:1101:29609:1485 1:N:0:NAATTCGT+NAATCTTA
NGCAACTCTAATTAGGGTTGACTAAGCTAACTACTCCATCATATAATCAAACCCATATAATTAACCGTAAGCTTCTATCCCTCATCCCCACCCAGCCCCAGGCCAAAACCCCCGAAACCCAACTCCACTCTGACCAACTCAGCCTCAGCAC
+
#AAFFJFJFJJA<FAJA7JFAF7-<-<AFJJ-FFF--7--<---<FA-<J-7<---F-A--<FA<<--FFFA----7<--F-7--7--77A77<-7-7----7-7<7FF7F-7--FAAFAA77-7A777<---------7---<-<7-)-)
@MG01HX05:998:HJKYFCCX2:1:1101:29833:1485 1:N:0:NCTCGCGC+NAGGACGT
NGGAACTCTAATTAAATTAAACGTGACAGGCGAAAGGCAATGACATGAACATATACGCGTACCGTTAACTATATTTAAACATCGCAGTAAAAATGCAAGAGCAGTTAAGGATGCAGAGTAAAGAAGATCGGAAGAGCACACGTCGGAACTC
+
#AAFFJJJJJJJFJJJ<AJJJJJAAJJJAJ7FJJJ77<JJF<FFFFFF<7J-F--<<FA--7<AJFA<-<FAJJ-<AJJJJ7-77F---<AFFFFFJF-7-AJA--7A--<--7--7AJFFJ7AAAJ7-A--A7--7--AA--77<7F<7-

We are also having problems with demultiplexing. Is it possible to fix the N base problem by using --null_index  option? If it's not too much to ask, I would like to send you fragments of my raw data so you can get a better insight into the problem.

All the best,
Sara Laura

Sara Laura Šarančić

unread,
Mar 29, 2023, 6:44:17 AM3/29/23
to Stacks
Follow up question - does this problem have any correlation with the fact that the sequencing provider did not use phiX? Would the use of phiX have made any difference in the quality of the results?

Thanks in advance
Sara Laura

Catchen, Julian

unread,
Mar 29, 2023, 9:38:19 AM3/29/23
to stacks...@googlegroups.com

Hi Sara,

 

The lack of phiX in the sequencing lane can cause the Illumina machine to mistake the inline barcode (or restriction enzyme cutsite remnant) for overlapping clusters of molecules on the flow cell (not sure if this is still the case for patterned flow cells on newer Illumina machines). However, for the i5 and i7 indexed barcodes, these are done by separate sequencing runs on the machine to specifically read the regions of the molecule that contain these barcode sequences, so phiX should not be a factor there.

 

I would go back to your sequencing provide and ask them what went wrong with the index barcodes (not sure how many of them have “N”s in them). For the N at the start of the sequence, if you enable correction, process_radtags will fix one incorrect nucleotide per cut site remnant per read, so it may be able to correct many of those.

 

Best,

 

julian

Sara Laura Šarančić

unread,
Mar 29, 2023, 10:24:21 AM3/29/23
to Stacks
Thank you for answering Julian!

I will definitely contact the service provide about this matter.

I tried demultiplexing with this command:

process_radtags -i gzfastq -1 ~/Sara_raw/Undetermined_S0_L001_R1_001.fastq\ \(2\).gz -2 ~/Sara_raw/Undetermined_S0_L001_R2_001.fastq\ \(2\).gz -o ./ -b ~/Sara_raw/barcodes_1st_run.txt -c -q -r --null_index --renz-1 aseI --renz-2 nsiI

This is what I got back:

Processing paired-end data.
Using Phred+33 encoding for quality scores.
Found 1 paired input file(s).
Searching for single, index barcode.
Loaded 96 barcodes (8bp).
Will attempt to recover barcodes with at most 1 / 1 mismatches.
Processing file 1 of 1 [Undetermined_S0_L001_R1_001.fastq (2).gz]
  Reading data from:
  /home/sara/Sara_raw/Undetermined_S0_L001_R1_001.fastq (2).gz and
  /home/sara/Sara_raw/Undetermined_S0_L001_R2_001.fastq (2).gz
  Processing pairs of RAD-Tags...1M...2M...3M...4M...5M...6M...7M...8M...
  17130308 total reads; -17116252 ambiguous barcodes; -13855 ambiguous RAD-Tags; +14125 recovered; -0 low quality reads; 201 retained reads.
Closing files, flushing buffers...done.

17130308 total sequences
17116252 barcode not found drops (99.9%)
       0 low quality read drops (0.0%)
   13855 RAD cutsite not found drops (0.1%)
     201 retained reads (0.0%)

Details logged: './process_radtags.Sara_raw.log'
For a summary, execute one or more:
  stacks-dist-extract ./process_radtags.Sara_raw.log total_raw_read_counts
  stacks-dist-extract --pretty ./process_radtags.Sara_raw.log per_barcode_raw_read_counts
process_radtags is done.

Did I use the correct barcode option? Just for the reference, the first few lines in my 1st run barcode file look like this (they are not revcomp):

ATTACTCG AGGCTATA 1-1_47-3
TCCGGAGA AGGCTATA 1-5_47-5
CGCTCATT AGGCTATA 1-10_47-6
GAGATTCC AGGCTATA 1-15_49-2
ATTCAGAA AGGCTATA 21-1_49-3
GAATTCGT AGGCTATA 21-2_49-5
CTGAAGCT AGGCTATA 21-5_49-6
TAATGCGC AGGCTATA 21-8_15-1
CGGCTATG AGGCTATA 2-1_15-3
TCCGCGAA AGGCTATA 2-5_15-6
TCTCGCGC AGGCTATA 2-10_15-8
AGCGATAG AGGCTATA 2-20_28-2
ATTACTCG GCCTCTAT 20-4_28-4
TCCGGAGA GCCTCTAT 20-6_28-7
CGCTCATT GCCTCTAT 20-8_28-8
GAGATTCC GCCTCTAT 20-10_28-9
ATTCAGAA GCCTCTAT 40-3_24-3
GAATTCGT GCCTCTAT 40-5_24-4
CTGAAGCT GCCTCTAT 40-6_24-9
TAATGCGC GCCTCTAT 40-7_24-10
CGGCTATG GCCTCTAT 40-8_27-1
TCCGCGAA GCCTCTAT 6-2_1-1-K2
TCTCGCGC GCCTCTAT 6-10_13-5-K2
AGCGATAG GCCTCTAT 6-15_16-1-K2
ATTACTCG AGGATAGG 6-20
TCCGGAGA AGGATAGG 7-1
CGCTCATT AGGATAGG 7-10
GAGATTCC AGGATAGG 7-14
ATTCAGAA AGGATAGG 7-18
GAATTCGT AGGATAGG 3A-1
CTGAAGCT AGGATAGG 3A-5
TAATGCGC AGGATAGG 3A-10
CGGCTATG AGGATAGG 3B-1
TCCGCGAA AGGATAGG 3B-5
TCTCGCGC AGGATAGG 3B-9
AGCGATAG AGGATAGG 4-1
ATTACTCG TCAGAGCC 4-5
...

Should first and second columns be in original shape as I got them from the service or should I do revcomps of i7 and i5?

In the second part of demultiplexing I have to demultiplex the first 24 out of 96 samples again, and here is an example of a barcode file for the second round (they are revcomps):
name of the file is: barcodes_1-1_47-3
GAGTT 1-1
GTAGG 47-3

Is there anything else I can do or try to successfully demultifplex my data?

Any help would be greatly appreciated!

Best,
Sara

Catchen, Julian

unread,
Mar 29, 2023, 10:29:06 AM3/29/23
to stacks...@googlegroups.com

Based on your FASTQ snippet from your earlier message, I think you have --index-index barcodes (both the i5 and i7 in the FASTQ header), but I don’t know what RAD protocol you used. The --null-index option would be just an i7 barcode.

Sara Laura Šarančić

unread,
Mar 29, 2023, 10:48:55 AM3/29/23
to Stacks
I also tried to do it with --index-index barcode and got this back:

Processing paired-end data.
Using Phred+33 encoding for quality scores.
Found 1 paired input file(s).
Searching for single and paired-end, indexed barcodes.
Loaded 96 barcodes (8bp / 8bp).

Will attempt to recover barcodes with at most 1 / 1 mismatches.
Processing file 1 of 1 [Undetermined_S0_L001_R1_001.fastq (2).gz]
  Reading data from:
  /home/sara/Sara_raw/Undetermined_S0_L001_R1_001.fastq (2).gz and
  /home/sara/Sara_raw/Undetermined_S0_L001_R2_001.fastq (2).gz
  Processing pairs of RAD-Tags...1M...2M...3M...4M...5M...6M...7M...8M...
  17130308 total reads; -17055010 ambiguous barcodes; -74289 ambiguous RAD-Tags; +75221 recovered; -0 low quality reads; 1009 retained reads.

Closing files, flushing buffers...done.

17130308 total sequences
17055010 barcode not found drops (99.6%)

       0 low quality read drops (0.0%)
   74289 RAD cutsite not found drops (0.4%)
    1009 retained reads (0.0%)


Details logged: './process_radtags.Sara_raw.log'
For a summary, execute one or more:
  stacks-dist-extract ./process_radtags.Sara_raw.log total_raw_read_counts
  stacks-dist-extract --pretty ./process_radtags.Sara_raw.log per_barcode_raw_read_counts
process_radtags is done.

The RAD protocol we used is GBS (Genotyping By Sequencing).

Thank you for your help!

Sara Laura Šarančić

unread,
Mar 30, 2023, 2:53:47 AM3/30/23
to Stacks
Just a quick update.
This is the answer I got back from the sequencing provide:

"Hello,
Thank you for reaching out, I spoke with the team. The N's are from the
run, this has no cause and is due to low qualify of the run at that
position. It is normal when sequencing, we can not guarantee every
position in every cycle. Please let me know if you have any questions."

To be more precise, we used:

1) two-enzymes digestion,
2) double adapters ligation, each specific for each enzyme cutting site, one containing the i5, other the i7 index,
3) PCR nick repair, and
4) final PCR amplification, with few clean-ups along the way)

To sum up: double digestion - double adapters ligation - nick repair - final PCR, with few clean-ups along the way.

Thank you for your help!
Reply all
Reply to author
Forward
0 new messages