Are the cutsites present in my samples?

Laís Aline Grossel

Apr 16, 2025, 6:07:26 PM

to Stacks

Hi everyone,

I used 3RAD, with the enzymes claI, ecoRI and mspI, to sequence 2500 samples (double-stranded) on Illumina. The company sent me the files already demultiplexed, so I didn't have to remove barcodes. I can still see them in the headers of the files, but not in the sequences, so I don't think it is an issue, right?

IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9III--999-----9---9-9--------9-9--9I9--99I999999-999I9999-99999IIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@LH00401:259:22HKMYLT4:4:1101:29048:1140 1:N:0:ACTGTGTC+AGGATAGG
AACTCGTCATCGATGCCTGTACATCGTGCTTCTCTGCCAAACAGCGGTGTAAGATACAGCCTACCCGTAATCGGAAGGTGTTAAGGGGGGGCTTATTTTCGGAATTACAACGATCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACA

My question is specifically about the cutsites. When I run process_radtags, I have around 0.1% of good reads, while almost everything is in "RAD cutsite not found drops".

My code is:

process_radtags -P -p files_1 -o radtag_1 -c -q -r --renz_1 claI --renz_2 ecoRI

I found another conversation here in which the person specified an enzyme in the command, but their reads had a different sequence of bases at the beginning. I then checked a few of my sequences and realized that the beginning is the same across the reads of a sample, but differs between samples. Here is an example, with one read per sample (I highlighted the sequences that match my enzymes claI or ecoRI):

Sample 1:

GATCGTTGGAATTCACCACCCACAAATAGAGTGGCGTATGCATCTTTTAACGTAGCACGATTGTGTATATGATTACAAATCATAATTGGTGATGCCGAGCTGTGAATGTTGTATAATTGCGTAGTCTTGGCAGCTGGGGTAGAAAGATCTTCGGCCCTTAGCTGACAGCTCTAGGGTTCGAGCGCCGCATACAGAAATACATGAATGTTGGCACACTGGAATGTCACACTGGCATGGCCAGGAAACGCAT

Sample 2: GGTCTACGTATCGATATTTTATTTTTGCTCAAATATAAGACGAGGGTACGATTTTGCATGTTGAATTGATGAAAAAAGGGTAGTCTTATATTTAAGCTAGTACTGTAAGCTGGATTACTAATATCACAGACCTACGACTAAAAACCTGACTTAATTTCTATGATTGATTCTTATATATTTTTAGGCGGCGAATTAAGTGTAGCTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGGATGTAGATCT

Sample 3: AGCTACACTTAATTCGCCGCCTAAAAATATATAAGAATCAATCATAGAAATTAAGTCAGGTTTTTAGTCGTAGGTCTGTGATATTAGTAATCCAGCTTACAGTACTAGCTTAAATATAAGACTACCCTTTTTTCATCAATTCAACATGCAAAATCGTACCCTCGTCTTATATTTGAGCAAAAATAAAATATCGATACGTAGACCAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTTCAGAGCCGTGTG

I can still find the cutsites, but in the middle of the reads, or some bases away from the beginning. Is that how I should see the cutsites, or should they be exactly at the beginning of the sequences? If they should, could this pattern be a sign that the company also removed the cutsites? (I read that, depending on the company's protocol, they could remove barcodes + cutsites.) I don't have any information about what was removed, only that the files were already demultiplexed.

I would appreciate any help, because I am very worried about these results.
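One way to make this check less manual is to report how far from the 5' end the first cutsite remnant sits in each read. The sketch below is only an illustration: it assumes EcoRI cuts G|AATTC and ClaI cuts AT|CGAT, so the bases expected on the read are AATTC and CGAT, and it uses just the first 20 bases of the three example reads quoted above. The names `REMNANTS`, `READS` and `first_remnant` are made up for this example and are not part of Stacks.

```python
# Bases left on the read after digestion, assuming EcoRI cuts G|AATTC
# and ClaI cuts AT|CGAT (assumption stated in the text above).
REMNANTS = {"ecoRI": "AATTC", "claI": "CGAT"}

# First 20 bases of the three example reads from the post.
READS = {
    "sample 1": "GATCGTTGGAATTCACCACC",
    "sample 2": "GGTCTACGTATCGATATTTT",
    "sample 3": "AGCTACACTTAATTCGCCGC",
}


def first_remnant(read):
    """Return (enzyme, offset) for the remnant closest to the 5' end, or None."""
    hits = [(read.find(seq), name) for name, seq in REMNANTS.items() if seq in read]
    if not hits:
        return None
    offset, name = min(hits)  # smallest offset wins
    return name, offset


offsets = {sample: first_remnant(read) for sample, read in READS.items()}
for sample, hit in offsets.items():
    print(sample, hit)
```

If the reads began directly at the cutsite, every offset would be 0. A constant non-zero offset within a sample suggests extra bases in front of the cutsite rather than a missing cutsite.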

And I apologize in advance if my question is dumb, since this is my first experience working with molecular data.

Thank you!

All the best,

Laís

Angel Rivera-Colón

Apr 17, 2025, 9:16:01 PM

to Stacks

Hi Laís,

A few notes on this. First, whenever possible, I would try to request access to the unprocessed data (i.e., without demultiplexing) from the sequencing facility. Given that there are some possible issues with cutsites/barcodes, we want to make sure that we can account for all the steps in the analysis. As you'll see below, there might be something going on with this step that is likely impacting the analysis.

Secondly, since you see the same sequence at the beginning of the reads for a sample (but different between samples), I would think that this is an in-line barcode that is still present in the data. Depending on the exact library construction, you might have both index and in-line barcodes present (3RAD supports a combination of both, if I am not mistaken). Going back to my first point, depending on how the sequencing facility processed the data, both might have been accounted for properly. Regardless, my guess is that these are in-line barcodes that still need to be processed by adding a barcode file to process_radtags.
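A quick way to test the in-line barcode idea is to tally the bases that precede the first cutsite remnant across many reads of one sample. The following is a rough sketch under stated assumptions: the remnants AATTC and CGAT are assumed from the enzymes mentioned in this thread, and `toy_reads` is invented for illustration (only the shared start is taken from the sample 1 read above; the tails are made up). In practice the reads would come from one sample's FASTQ file.

```python
from collections import Counter


def leading_bases(read, remnants=("AATTC", "CGAT")):
    """Return the bases before the earliest remnant, or None if none is found."""
    positions = [read.find(r) for r in remnants if r in read]
    return read[:min(positions)] if positions else None


# Hypothetical reads from a single sample. The shared start matches the
# sample 1 read quoted in the thread; everything after it is invented.
toy_reads = [
    "GATCGTTGGAATTCACCACCCACA",
    "GATCGTTGGAATTCTTGACGGATA",
    "GATCGTTGGAATTCGGCTAACTGA",
    "TTTTTTTTTTTTTTTTTTTTTTTT",  # no remnant at all
]

counts = Counter(leading_bases(r) for r in toy_reads)
for prefix, n in counts.most_common():
    print(prefix, n)
```

If one prefix dominates within a sample and changes from sample to sample, that is consistent with an in-line barcode still being present. The true barcode sequences and lengths should come from the library preparation records, not from this tally.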

Regarding the 3RAD indexes/barcodes, this protocol from the Faircloth lab (the developers of the 3RAD protocol) mentions that some manual modifications have to be made to the indexes/barcodes, depending on the bases used (https://protocols.faircloth-lab.org/en/latest/protocols-computer/analysis/analysis-three-rad.html#steps). The example there is for different enzymes, but you might encounter a similar issue depending on the exact protocol.

Lastly, regarding enzymes. The 3RAD protocol uses a combination of 3 restriction enzymes, but only 2 of them appear in the actual reads (the 3rd one takes care of adapter dimers). Double check which 3 enzymes were used and which 2 are expected to appear in the sequencing reads. In the exact case you point out above, the immediate issue is likely due to those putative barcodes that appear before the enzyme; however, you might still encounter a problem with the enzymes downstream if they are specified incorrectly in the software. From the set of sequences you highlight above, I see the G|AATTC that should correspond to ecoRI. The AT|CGAT also matches claI; however, I wouldn't expect both to be present at the 5' end of the read.
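To illustrate why leading bases alone can make a cutsite check fail even when the enzyme is specified correctly, here is a minimal sketch. It is a simplified stand-in for a 5' cutsite test, not how process_radtags is implemented; `remnant`, `starts_with_cutsite` and the `skip` parameter are hypothetical names for this example, and the read is the first 20 bases of the sample 1 read above.

```python
def remnant(cut_notation):
    """'G|AATTC' -> 'AATTC': the bases to the right of the cut position."""
    return cut_notation.split("|", 1)[1]


def starts_with_cutsite(read, cut_notation, skip=0):
    """Simplified 5' check: after skipping `skip` bases, does the read begin with the remnant?"""
    return read[skip:].startswith(remnant(cut_notation))


read = "GATCGTTGGAATTCACCACC"  # first 20 bases of the sample 1 read

as_is = starts_with_cutsite(read, "G|AATTC")             # leading bases still present
trimmed = starts_with_cutsite(read, "G|AATTC", skip=9)   # 9 leading bases skipped
wrong_enzyme = starts_with_cutsite(read, "AT|CGAT", skip=9)

print(as_is, trimmed, wrong_enzyme)
```

The same read fails the check as-is and passes once the leading bases are skipped, while the other enzyme's remnant does not match at that position. This matches the expectation that a given read starts with only one of the two enzymes.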

Hope this information is helpful.

Thanks,

Angel
