Issues with GBSSeqToTagDBPlugin and some old seq data?

Daniel Pap

Feb 3, 2025, 3:35:27 PM

to TASSEL - Trait Analysis by Association, Evolution and Linkage

I am trying to run some 10-year-old GBS Illumina FASTQ files. This data set was previously analyzed with UNEAK (many years ago), and now we would like to use it with genome sequence data.

At the GBSSeqToTagDBPlugin step, the log shows:

#removeTagsWithoutReplication. Current tag number: 580156
#Finished removeTagsWithoutReplication. tagsRemoved = 579278. Current tag number: 878
#Map Tags:878 Memory:131,402 TotalDepth:19,039 AvgDepthPerTag:21
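For anyone reading along, the three log lines above can be checked arithmetically. The sketch below is plain Python, not part of TASSEL; it only parses the numbers as pasted and the variable names are made up for illustration. It shows that roughly 99.8% of the tags were dropped by removeTagsWithoutReplication, and that the reported AvgDepthPerTag matches TotalDepth divided by the surviving tag count, rounded down.

```python
import re

# Log lines copied verbatim from the post above.
log = """\
#removeTagsWithoutReplication. Current tag number: 580156
#Finished removeTagsWithoutReplication. tagsRemoved = 579278. Current tag number: 878
#Map Tags:878 Memory:131,402 TotalDepth:19,039 AvgDepthPerTag:21
"""

def to_int(text):
    """Convert a number that may contain thousands separators, e.g. '19,039'."""
    return int(text.replace(",", ""))

def grab(pattern):
    """Return the first captured number for a pattern found in the log."""
    return to_int(re.search(pattern, log, re.MULTILINE).group(1))

tags_before = grab(r"^#removeTagsWithoutReplication\. Current tag number: ([\d,]+)")
tags_removed = grab(r"tagsRemoved = ([\d,]+)")
tags_kept = grab(r"^#Map Tags:([\d,]+)")
total_depth = grab(r"TotalDepth:([\d,]+)")

removed_pct = 100.0 * tags_removed / tags_before
print(f"tags before filter : {tags_before}")
print(f"tags removed       : {tags_removed} ({removed_pct:.2f}%)")
print(f"tags kept          : {tags_kept}")
print(f"depth per kept tag : {total_depth // tags_kept}")
```

The counts are internally consistent, so the log itself is not garbled; almost every distinct tag simply failed the replication filter.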

I tried adjusting the kmer size and even the cutadapt parameters with no success, and I am wondering whether there is an issue with the adapters. I have run out of ideas.

Matthew Peterson

Feb 4, 2025, 12:15:43 AM

to TASSEL - Trait Analysis by Association, Evolution and Linkage

Daniel,

Hopefully someone on the list can suggest a modern way to generate GBS library metrics from a FASTQ file.

As a "sanity check," you can generate GBS library metrics, independent of the TASSEL v3 pipeline, using:

buildtassel3/stats/countbarcodes.pl at main · petersm3/buildtassel3

Feel free to contact me off list if you have any questions about running the script.

Thank you,
Matthew
---
grep -A4 "^# Purpose" countbarcodes.pl
# Purpose : From Illumina GBS lane(s) count the number of occurrences of:
# - barcodes with restriction sites, any length, with Ns
# - barcodes with restriction sites, remove the barcode, truncate to 64
# bases and then filter out reads with Ns (aka TASSEL 3 processing)
# Produce a CSV with the resulting values to stdout
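The two counts named in the Purpose text can be sketched in Python as follows. This is only an illustration of the described logic, not a port of countbarcodes.pl: the function names are hypothetical, and the cut-site remnant "TGCAG" used in the example is an assumption for a PstI-cut read that should be checked against your own enzyme.

```python
from collections import Counter

MAX_TAG_LEN = 64  # truncation length stated in the Purpose text

def classify_read(seq, barcodes, cut_site):
    """Return (barcode, tag) for one read.

    barcode is None when the read does not start with barcode + cut site.
    tag is the read with the barcode removed, truncated to MAX_TAG_LEN;
    it is None when the truncated sequence contains an N.
    """
    # Try longer barcodes first so a short barcode that happens to be
    # a prefix of a longer one does not win by accident.
    for barcode in sorted(barcodes, key=len, reverse=True):
        if seq.startswith(barcode + cut_site):
            tag = seq[len(barcode):len(barcode) + MAX_TAG_LEN]
            return barcode, (None if "N" in tag else tag)
    return None, None

def count_reads(seqs, barcodes, cut_site):
    """Count, per barcode, reads with a cut site and reads kept after N filtering."""
    with_site = Counter()  # barcode + restriction site, any length, Ns allowed
    kept = Counter()       # barcode removed, truncated, no Ns
    for seq in seqs:
        barcode, tag = classify_read(seq, barcodes, cut_site)
        if barcode is None:
            continue
        with_site[barcode] += 1
        if tag is not None:
            kept[barcode] += 1
    return with_site, kept

# Toy reads; "TGCAG" is an assumed PstI remnant, used for illustration only.
reads = [
    "TGACGCCATGCAGAAAA",  # barcode TGACGCCA, clean
    "CAGATATGCAGNACGT",   # barcode CAGATA, contains an N
    "CAGATATGCAGACGT",    # barcode CAGATA, clean
    "GGGGGGGG",           # no barcode
]
with_site, kept = count_reads(reads, ["TGACGCCA", "CAGATA"], "TGCAG")
print(dict(with_site), dict(kept))
```

If the first count is already tiny compared with the total number of reads, the barcodes or enzyme in the key file likely do not match the library, which would be worth ruling out before tuning kmer size.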

./countbarcodes.pl

Usage: ./countbarcodes.pl -i fastq_input_dir -k key_file -e enzyme_file > countbarcodes.csv

-i, --input
Directory containing the gzipped FASTQ file(s) to be filtered
TASSEL format filenames, e.g., code_FLOWCELL_s_LANE_fastq.txt.gz
Important filename identifiers: _FLOWCELL_, _LANE_, fastq, gz
FLOWCELL and LANE will match columns 1 and 2 in the TASSEL key file
-k, --key
TASSEL key file (CSV or TSV format accepted)
-e, --enzyme
Text file containing the enzyme used to cut the GBS lane
See (TASSEL 3) TasselPipelineGBS.pdf page 8 for a list of enzymes
-h, --help
This usage information.
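A mismatch between the FASTQ filename and the key file is an easy thing to miss with old data, so here is a small Python sketch that pulls FLOWCELL and LANE out of a name following the code_FLOWCELL_s_LANE_fastq.txt.gz pattern described above. The regular expression is my own strict reading of that pattern; the actual script may accept more variations.

```python
import re

# Strict reading of: code_FLOWCELL_s_LANE_fastq.txt.gz
FASTQ_NAME = re.compile(
    r"^(?P<code>.+?)_(?P<flowcell>[^_]+)_s_(?P<lane>[^_]+)_fastq\.txt\.gz$"
)

def parse_fastq_name(filename):
    """Return (flowcell, lane) as strings, or None if the name does not fit."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    return match.group("flowcell"), match.group("lane")

print(parse_fastq_name("code_AAF327YM5_s_1_fastq.txt.gz"))
print(parse_fastq_name("sample_R1.fastq.gz"))
```

The returned pair can then be compared against columns 1 and 2 of the key file.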

Example of file naming conventions, expected by the script:

cat enzyme.txt
PstI-MspI

head -n5 key.csv
Flowcell,Lane,Barcode,Sample,PlateName,Row,Column,LibraryPrepID,Comments
AAF327YM5,1,TGACGCCA,A1,Plate1,A,1,,A1
AAF327YM5,1,CAGATA,B1,Plate1,B,1,,B1
AAF327YM5,1,GAAGTG,C1,Plate1,C,1,,C1
AAF327YM5,1,TAGCGGAT,D1,Plate1,D,1,,D1
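To double-check a key file before running anything, the barcodes can be grouped per flowcell and lane with a few lines of Python. This is a minimal sketch assuming the column names shown in the header above and a comma or tab delimiter; the function name is hypothetical and it does no validation beyond that.

```python
import csv
import io
from collections import defaultdict

def barcodes_by_lane(handle):
    """Map (flowcell, lane) -> {barcode: sample} from a CSV or TSV key file."""
    header = handle.readline()
    # Pick the delimiter from the header line: tab for TSV, comma otherwise.
    delimiter = "\t" if "\t" in header else ","
    columns = header.rstrip("\r\n").split(delimiter)
    reader = csv.DictReader(handle, fieldnames=columns, delimiter=delimiter)
    lanes = defaultdict(dict)
    for row in reader:
        lanes[(row["Flowcell"], row["Lane"])][row["Barcode"]] = row["Sample"]
    return dict(lanes)

# The first rows of key.csv from the example above.
key_text = """\
Flowcell,Lane,Barcode,Sample,PlateName,Row,Column,LibraryPrepID,Comments
AAF327YM5,1,TGACGCCA,A1,Plate1,A,1,,A1
AAF327YM5,1,CAGATA,B1,Plate1,B,1,,B1
AAF327YM5,1,GAAGTG,C1,Plate1,C,1,,C1
AAF327YM5,1,TAGCGGAT,D1,Plate1,D,1,,D1
"""
lanes = barcodes_by_lane(io.StringIO(key_text))
for (flowcell, lane), barcodes in lanes.items():
    print(flowcell, lane, sorted(barcodes, key=len))
```

For a real file, pass an open file handle instead of the io.StringIO wrapper.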