process_radtags RAD-seq data returning 57.5% "barcode not found" drops due to i5 barcode

90 views
Skip to first unread message

Liz Boggs

unread,
May 9, 2025, 4:13:52 PMMay 9
to Stacks
Hello all!

I searched previous conversations for this issue and found similar threads, but none with my exact flavor of problem. I have tried many iterations of testing and still have not found success, so here I am!

I have 389 individuals across 5 different libraries, which were sent to a sequencing facility to be pooled there and then sequenced in one lane of a NovaSeq S4 run. Here is what my raw data files look like (specific IDs will be obfuscated in the following bits of data and code in <> format):

  • <service_ID>_S0_L003_R1_001.fastq.gz
  • <service_ID>_S0_L003_R2_001.fastq.gz
  • <service_ID>_S0_L003_I1_001.fastq.gz
  • <service_ID>_S0_L003_I2_001.fastq.gz
So I have all 5 of my libraries in my R1 and R2 read files. They are NOT demultiplexed to the plate/library level at all. Given that, I need to have a barcode file that can pull reads out based on both the individual 10bp barcode, which is inline in the reads, the and 6bp library barcode in the header. Here is a peek at the head of my R1 file:

@A01335:621:H5WGLDSXF:3:1101:1036:1000 1:N:0:GCCAAT+AGATCT
GGACAGCAGATGCAGGCATGCGCCATGGAGATTAGCTACACCGCAGTGGCGTGGAACACATAACAGGCTCATTAGGTAGCATATTCTCCAGGAGGCATGCGCCTATGGAGAATAGCCACACCGCAGTGGCGTGGAACACATAACAGGCTCACTAGGTAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFF:FFF,FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFF,FFFFFF:FFFFFFFFFFFFFF:,FFFFFFFFFFFF:FF::,
@A01335:621:H5WGLDSXF:3:1101:1072:1000 1:N:0:CGATGT+AGATCT
CGGAGGAGCAGGTGTTGATGGGGCCGTTCTTCATCTGCTCCAGAGTGTCCAGCACCACGTCGGGAGCCGGCCGGCAGCCCTGCCGCTCCACCTGCCTCAACTCGCCCACAGCCACCGTCACCGTCACCTCGAAGTCCTGCATGTCTATCCCAGATCGGA
+
F:FFFFF:F:FFFF,::F:FF:,F,FFF::F:FF:F,F,::FFFF,,F,FF,FFFFF:,FFFFFF,FF:FF:,:FF,FFF,::F,F,:F:,FF:FF,:FFF,,,,FF,,F,:::::,,FFFF,,:,:F::,F,:,F,,,F:,:F:,F,,FFFFFF,,,,
@A01335:621:H5WGLDSXF:3:1101:1090:1000 1:N:0:ATCACG+AGATCT
CCACCACACAGCCGCCCTCCAGCTCACCACGGCCCTCAGCCTCTTCACGAGCTGCAACCACCACACAGCTGCCCTCCACCTCACCACGGCCCTCAGCCTCTTTGCGAGCTGCCCCCTCCACACAGCTGCTCTCCACCTCACCACGGCCCTCAGCCTCTT

So the first 6bp barcode in the index+index part of the header is my library barcode. The second in that pair is the i5 Illumina index that is the same for all reads. Here is a peek at the barcode file I have been using - again, 389 individuals in total:

GGTTCACGCA      CGATGT  BR50068
GGACACGAGA      CGATGT  QB50096
GGAAGAGATC      CGATGT  QB50102
GGAAGGACAC      CGATGT  QB50110
GGAATCCGTC      CGATGT  QB50118

First column is individual barcode, second is library, third is sample ID. Tabs in between each. The library barcode obviously changes with each library.

I attached the script I used for `process_radtags` to this message; the important factor is that the "most successful" script, ie. the one that returned the highest % reads retained, used the `--inline_index' flag, as did my colleagues who had very similar RAD data.

I used the flag `--inline_index` since it seemed like the right one for my scenario, and additionally, using other flags (such as `--index_index` and '--index_inline') just to try them out got my script kicked immediately or ran for a while and retained 0% reads. That being said, I heavily believe this is a barcode issue, and the below is why.

The "successful" job ran for 63.5 hours and spat out the correct amount of files (4 for each ind: two of the reads, two of the remainders); however, here is the slurm.out file for this job, detailing the reads kept and discarded:

```
  6228064294 total reads; -0 failed Illumina reads; -3580073710 ambiguous barcodes; -28122894 ambiguous RAD-Tags; +83758324 recovered; -5554816 low quality reads; 2563948807 retained reads.
    50364067 reads with adapter sequence.
Closing files, flushing buffers...done.

6228064294 total sequences
1494217429 reads were transposed [based on the BestRAD flag] (24.0%)
         0 failed Illumina filtered reads (0.0%)
  50364067 reads contained adapter sequence (0.8%)
3580073710 barcode not found drops (57.5%)
   5554816 low quality read drops (0.1%)
  28122894 RAD cutsite not found drops (0.5%)
2563948807 retained reads (41.2%)
```
So it's dropping a lot of the reads. Here is what is returned for looking at barcodes not found:

```
BEGIN barcodes_not_recorded
# A list and count of barcodes found that were not specified in the barcodes input file.
Barcode Total
GGGAGCTGAA-AGATCT       104988366
GGAAACATCG-AGATCT       88023132
GGCCGTGAGA-AGATCT       55435788
GGGACAGTGC-AGATCT       52357626
GGCGAACTTA-AGATCT       51776254
GGACCACTGT-AGATCT       50442810
GGGATAGACA-AGATCT       50208624
GGCATCAAGT-AGATCT       46791976
GGAACGCTTA-AGATCT       44114520
GGATCCTGTA-AGATCT       44070132
GGTCCGTCTA-AGATCT       42453748
GGAGAGTCAA-AGATCT       40700258
GGCCTCCTGA-AGATCT       40518226
GGCGCATACA-AGATCT       40145892
GGTGGTGGTA-AGATCT       38694062
GGCAAGACTA-AGATCT       38661102
GGTGAAGAGA-AGATCT       37881860
GGAATCCGTC-AGATCT       37656988
--
END barcodes_not_recorded
```

So there we have it - it's pairing individual barcodes with the i5 Illumina index, then it subsequently isn't able to match them to any of the individual samples that it already correctly identified using the individual and library barcodes (because I do still end up with 389 demultiplexed individuals in my output files - they're just all lower in read count because of the above issue).

That should cover all of my immediate bases here. I have already tried many of the other index flags - most of which kick my script immediately - and I also tried a different barcode file wherein I appended "+AGATCT" to every library barcode to see if it would pull them based on that, and that also failed as it couldn't be properly recognized by process_radtags.

So, I'm not sure what to try next. How do I get it to properly ignore that AGATCT barcode without having to permanently edit my raw files? I have been trying to fix this for weeks! :(

Happy to provide any other info where needed! Thanks in advance!

process_radtags_generic_inlineindex.txt

Rafał Wóycicki

unread,
May 10, 2025, 1:51:41 AMMay 10
to Stacks
Hi Liz,

When you use --inline_index script understands it, as you would have both inline barcode on R1 read and index barcode i5 from second read.
As you can see the script does not you to specify the i5 sequence as it finds it in the header.
There is no option to demultiplex reads having both inline and index barcode on R1 reads.
As your index barcode is per library/plate, I would first demultiplex it with some other home made script which reads the header.
Then I would use --inline_null on library demultiplexed sets of reads with the barcode file having just the inline barcode.

Other way would be to exchange the i7 index with i5 index in the header and go with your solution.

Hope that it helps.

Best,
Rafał

Rafał Wóycicki

unread,
May 10, 2025, 10:50:10 AMMay 10
to Stacks
Hi Liz,

Maybe this script could help in demultiplexing by header:
Best,
Rafał

Tomas Hrbek

unread,
May 10, 2025, 11:39:06 AMMay 10
to Liz Boggs, stacks...@googlegroups.com
Hi Liz,

You can use seqkit to demux fasta files where index information is in
the header.

seqkit grep -r -n -p $INDEX $INFILE -o $OUTFILE

In your case your:
INDEX=".GCCAAT\+.AGATCT"
INFILE="<service_ID>_S0_L003_R1_001.fastq.gz"
OUTFILE="library_x_R1_001.fastq.gz"

You will need to do this for all five libraries and both R1 and R2
reads, so best to write a loop.

Once you have your library_x_R1_001.fastq.gz and
library_x_R2_001.fastq.gz, you can extract the inline barcodes using
cutadapt, outputting those reads that have the inline barcodes to a file.

cutadapt -e 1 --no-indels -g $BARCODE -o library_x_R1_001.fastq.gz
-p library_x_R2_001.fastq.gz $SAMPLE_R1_001.fastq.gz $SAMPLE_R2_001.fastq.gz

where:
BARCODE="GGTTCACGCA"

This will look for the BARCODE from the 5' to 3' direction in the R1
read. If your BARCODE is exactly at the start of the R1 read, I suggest
your anchor it (-g $^BARCODE) so it will only consider those R1 reads
that start with BARCODE. If your inline barcodes are on the R2 read, you
will need to use the -G flag instead.

Again, it will be best to write this as a loop, and feed a list of
samples and inline barcodes to the loop.

I know this is not a STACKS based solution, but it is a general solution
that works for any type of FASTQ data with inline barcodes and index
information in the header.

All the best,

Tomas


On 5/9/25 16:09, Liz Boggs wrote:
> Hello all!
>
> I searched previous conversations for this issue and found similar
> threads, but none with my exact flavor of problem. I have tried many
> iterations of testing and still have not found success, so here I am!
>
> I have 389 individuals across 5 different libraries, which were sent
> to a sequencing facility to be pooled there and then sequenced in one
> lane of a NovaSeq S4 run. Here is what my raw data files look like
> (specific IDs will be obfuscated in the following bits of data and
> code in <> format):
>
> * <service_ID>_S0_L003_R1_001.fastq.gz
> * <service_ID>_S0_L003_R2_001.fastq.gz
> * <service_ID>_S0_L003_I1_001.fastq.gz
> * <service_ID>_S0_L003_I2_001.fastq.gz
>
> So I have all 5 of my libraries in my R1 and R2 read files. /They are
> NOT demultiplexed to the plate/library level at all./ Given that, I
> need to have a barcode file that can pull reads out based on both the
> /individual/ 10bp barcode, which is inline in the reads, the and 6bp
> *library* barcode in the header. Here is a peek at the head of my R1 file:
>
> @A01335:621:H5WGLDSXF:3:1101:1036:1000 1:N:0:GCCAAT+AGATCT
> GGACAGCAGATGCAGGCATGCGCCATGGAGATTAGCTACACCGCAGTGGCGTGGAACACATAACAGGCTCATTAGGTAGCATATTCTCCAGGAGGCATGCGCCTATGGAGAATAGCCACACCGCAGTGGCGTGGAACACATAACAGGCTCACTAGGTAT
> +
> FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFF:FFF,FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFF,FFFFFF:FFFFFFFFFFFFFF:,FFFFFFFFFFFF:FF::,
> @A01335:621:H5WGLDSXF:3:1101:1072:1000 1:N:0:CGATGT+AGATCT
> CGGAGGAGCAGGTGTTGATGGGGCCGTTCTTCATCTGCTCCAGAGTGTCCAGCACCACGTCGGGAGCCGGCCGGCAGCCCTGCCGCTCCACCTGCCTCAACTCGCCCACAGCCACCGTCACCGTCACCTCGAAGTCCTGCATGTCTATCCCAGATCGGA
> +
> F:FFFFF:F:FFFF,::F:FF:,F,FFF::F:FF:F,F,::FFFF,,F,FF,FFFFF:,FFFFFF,FF:FF:,:FF,FFF,::F,F,:F:,FF:FF,:FFF,,,,FF,,F,:::::,,FFFF,,:,:F::,F,:,F,,,F:,:F:,F,,FFFFFF,,,,
> @A01335:621:H5WGLDSXF:3:1101:1090:1000 1:N:0:ATCACG+AGATCT
> CCACCACACAGCCGCCCTCCAGCTCACCACGGCCCTCAGCCTCTTCACGAGCTGCAACCACCACACAGCTGCCCTCCACCTCACCACGGCCCTCAGCCTCTTTGCGAGCTGCCCCCTCCACACAGCTGCTCTCCACCTCACCACGGCCCTCAGCCTCTT
>
> So the first 6bp barcode in the index+index part of the header is my
> *library* barcode. The second in that pair is the i5 Illumina index
> of the individual samples that it already /correctly/ identified using
> the individual and library barcodes (because I /do/ still end up with
> 389 demultiplexed individuals in my output files - they're just all
> lower in read count because of the above issue).
>
> That should cover all of my immediate bases here. I have already tried
> many of the other index flags - most of which kick my script
> immediately - and I also tried a different barcode file wherein I
> appended "+AGATCT" to every library barcode to see if it would pull
> them based on that, and that also failed as it couldn't be properly
> recognized by process_radtags.
>
> So, I'm not sure what to try next. How do I get it to properly ignore
> that AGATCT barcode without having to permanently edit my raw files? I
> have been trying to fix this for weeks! :(
>
> Happy to provide any other info where needed! Thanks in advance!
>
> --
> Stacks website: http://catchenlab.life.illinois.edu/stacks/
> ---
> You received this message because you are subscribed to the Google
> Groups "Stacks" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to stacks-users...@googlegroups.com.
> To view this discussion visit
> https://groups.google.com/d/msgid/stacks-users/79dc8cdc-25a6-46b2-98de-5827a849c44fn%40googlegroups.com
> <https://groups.google.com/d/msgid/stacks-users/79dc8cdc-25a6-46b2-98de-5827a849c44fn%40googlegroups.com?utm_medium=email&utm_source=footer>.


Catchen, Julian

unread,
May 12, 2025, 5:37:48 PMMay 12
to stacks...@googlegroups.com

Hi Liz,

 

If I understand your setup, you have three barcodes: two index and an inline. If that is correct, you should be able to run process_radtags in two rounds, first with --index-index and specifying the two 6bp index barcodes, just demultiplexing the files, no quality checks and no rad cutsite check. Then, for each pair of output files, you should run process_radtags again, this time specifying each pair of input files using the -1 and -2 options along with the longer, inline barcode (--inline-null) that are related to the original pair of index barcodes. In this second run, you can turn on quality checks, rad cutsite and adapter filtering.

 

Typically that first round of demultiplexing is done by the Illumina software so we I don’t think I have seen a three barcode combination before, so I haven’t tried it myself.

 

Let us know if that works.

 

Best,

 

Julian

 

--

Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.

Liz Boggs

unread,
Jul 8, 2025, 6:33:47 PMJul 8
to Stacks
Hi all!

Apologies for the super huge delay in giving an update on this issue - got caught up with work and it totally slipped my mind. The problem I was having was actually an indexing issue from the very start - my data was treated as unique dual index when the sequencing facility processed it (it was single index), so it picked up that Illumina adapter and threw it in the header with the 6bp library barcode, which is why I ran up against issues with processing. Once it was properly re-run as single index, it was able to be split into proper R1/R2 files per each library and then demuxed via process_radtags like normal using the --inline-index flag to pull the 10bp inline individual barcode. Retained about 80% of reads per ind, so I ran with that data and have been working with it since then!

That being said, I'm sure the methods suggested here would've worked - I doubt this is a common (or even rare) issue given the source of error in my data, but anything that strips or ignores that "+AGATCT" in the header would've also likely rendered the same dataset.

I appreciate the helpful responses, so thank you Rafał, Tomas, and Julian!! Even though my problem ended up being a very simple fix, the initial troubleshooting and suggestions here helped me understand the process to a deeper level :)

Best,
Liz
Reply all
Reply to author
Forward
0 new messages