
Issues with GBSSeqToTagDBPlugin and some old seq data?


Daniel Pap

Feb 3, 2025, 3:35:27 PM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
I am trying to run some 10-year-old GBS Illumina FASTQ files. This data set was previously analyzed with UNEAK (many years ago), and now we would like to use it with genome sequence data.

At the GBSSeqToTagDBPlugin step, nearly all tags are removed:
#removeTagsWithoutReplication. Current tag number: 580156
#Finished removeTagsWithoutReplication.  tagsRemoved = 579278. Current tag number: 878
#Map Tags:878  Memory:131,402  TotalDepth:19,039  AvgDepthPerTag:21

I tried adjusting the kmer size and even the cutadapt parameters with no success, and I am wondering if there is an issue with the adapters. I have run out of ideas.
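
For scale, the numbers in the log above can be checked with a few lines of arithmetic. This is only a sketch using the values the plugin printed; the use of integer division for the average depth is an assumption that happens to match the reported 21.

```python
# Values copied from the GBSSeqToTagDBPlugin log lines above.
tags_before = 580156
tags_removed = 579278
total_depth = 19039

# Tags that survive removeTagsWithoutReplication.
tags_after = tags_before - tags_removed

# Share of tags dropped because they were not replicated.
fraction_removed = tags_removed / tags_before

# Integer division is assumed here; it reproduces AvgDepthPerTag:21.
avg_depth = total_depth // tags_after

print(f"tags kept: {tags_after}")
print(f"fraction removed: {fraction_removed:.4f}")
print(f"average depth per tag: {avg_depth}")
```

In other words, well over 99% of the tags are discarded at this step, which is the symptom being asked about.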


Matthew Peterson

Feb 4, 2025, 12:15:43 AM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Daniel,

Hopefully someone on the list can suggest a modern way to generate GBS library metrics from a FASTQ file.

As a "sanity check," you can generate GBS library metrics, independent of the TASSEL v3 pipeline, using:

buildtassel3/stats/countbarcodes.pl at main · petersm3/buildtassel3

Feel free to contact me off list if you have any questions about running the script.

Thank you,
Matthew
---
grep -A4 "^# Purpose" countbarcodes.pl
# Purpose : From Illumina GBS lane(s) count the number of occurrences of:
# - barcodes with restriction sites, any length, with Ns
# - barcodes with restriction sites, remove the barcode, truncate to 64
#   bases and then filter out reads with Ns (aka TASSEL 3 processing)
# Produce a CSV with the resulting values to stdout
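
The two counts named in the Purpose comment can be illustrated roughly as follows. This is not the Perl script itself, only a minimal Python sketch. It assumes each read starts with the barcode immediately followed by the cut-site remnant (TGCAG is assumed here for PstI), and the function name `count_barcoded_reads` is hypothetical.

```python
from collections import Counter

def count_barcoded_reads(reads, barcodes, remnant="TGCAG", tag_len=64):
    """Return (any_length_counts, filtered_counts), both keyed by barcode.

    any_length_counts: reads starting with barcode + remnant, Ns allowed.
    filtered_counts:   the same reads after removing the barcode, truncating
                       to tag_len bases, and dropping reads that contain N.
    """
    any_length = Counter()
    filtered = Counter()
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    ordered = sorted(barcodes, key=len, reverse=True)
    for read in reads:
        for bc in ordered:
            if read.startswith(bc + remnant):
                any_length[bc] += 1
                # Remove the barcode, then keep at most tag_len bases.
                tag = read[len(bc):len(bc) + tag_len]
                if "N" not in tag:
                    filtered[bc] += 1
                break  # each read is assigned to at most one barcode
    return any_length, filtered

# Toy reads (made up for illustration, far shorter than real GBS reads).
toy_reads = [
    "CAGATATGCAGACGT",   # barcode CAGATA + TGCAG, no N
    "CAGATATGCAGANGT",   # barcode CAGATA + TGCAG, contains N
    "GAAGTGTGCAGTTTT",   # barcode GAAGTG + TGCAG, no N
    "TTTTTTTTTTTTTTT",   # no barcode match
]
any_counts, kept_counts = count_barcoded_reads(toy_reads, ["CAGATA", "GAAGTG"])
print(dict(any_counts))
print(dict(kept_counts))
```

A large gap between the two counts for a lane, or very low counts in both, would point at the barcode or cut-site structure of the reads rather than at the later TASSEL steps.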

./countbarcodes.pl

Usage: ./countbarcodes.pl -i fastq_input_dir -k key_file -e enzyme_file > countbarcodes.csv

-i, --input
       Directory containing the gzipped FASTQ file(s) to be filtered
       TASSEL format filenames, e.g., code_FLOWCELL_s_LANE_fastq.txt.gz
       Important filename identifiers: _FLOWCELL_, _LANE_, fastq, gz
       FLOWCELL and LANE will match columns 1 and 2 in the TASSEL key file
-k, --key
       TASSEL key file (CSV or TSV format accepted)
-e, --enzyme
       Text file containing the enzyme used to cut the GBS lane
       See (TASSEL 3) TasselPipelineGBS.pdf page 8 for a list of enzymes
-h, --help
       This usage information.

--
Example of the file naming conventions expected by the script:

cat enzyme.txt
PstI-MspI

head -n5 key.csv
Flowcell,Lane,Barcode,Sample,PlateName,Row,Column,LibraryPrepID,Comments
AAF327YM5,1,TGACGCCA,A1,Plate1,A,1,,A1
AAF327YM5,1,CAGATA,B1,Plate1,B,1,,B1
AAF327YM5,1,GAAGTG,C1,Plate1,C,1,,C1
AAF327YM5,1,TAGCGGAT,D1,Plate1,D,1,,D1

FASTQ filename: ALL_AAF327YM5_s_1_fastq.txt.gz
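
The naming convention above ties each FASTQ file to rows in the key file through the flowcell and lane. A minimal sketch of that lookup follows, assuming the `code_FLOWCELL_s_LANE_fastq.txt.gz` pattern from the usage text. The regular expression and the helper names `parse_fastq_name` and `barcodes_for_file` are illustrative and not taken from the script.

```python
import csv
import io
import re

# Pattern assumed from the usage text: code_FLOWCELL_s_LANE_fastq.txt.gz
FASTQ_NAME = re.compile(
    r"^(?P<code>[^_]+)_(?P<flowcell>[^_]+)_s_(?P<lane>\d+)_fastq\.txt\.gz$"
)

def parse_fastq_name(filename):
    """Extract (flowcell, lane) from a TASSEL-style FASTQ filename."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected FASTQ filename: {filename}")
    return match.group("flowcell"), match.group("lane")

def barcodes_for_file(filename, key_text):
    """Map barcode -> sample for key rows matching the file's flowcell and lane."""
    flowcell, lane = parse_fastq_name(filename)
    rows = csv.DictReader(io.StringIO(key_text))
    return {
        row["Barcode"]: row["Sample"]
        for row in rows
        if row["Flowcell"] == flowcell and row["Lane"] == lane
    }

# Key rows copied from the example key.csv above.
key_text = "\n".join([
    "Flowcell,Lane,Barcode,Sample,PlateName,Row,Column,LibraryPrepID,Comments",
    "AAF327YM5,1,TGACGCCA,A1,Plate1,A,1,,A1",
    "AAF327YM5,1,CAGATA,B1,Plate1,B,1,,B1",
    "AAF327YM5,1,GAAGTG,C1,Plate1,C,1,,C1",
    "AAF327YM5,1,TAGCGGAT,D1,Plate1,D,1,,D1",
])

flowcell, lane = parse_fastq_name("ALL_AAF327YM5_s_1_fastq.txt.gz")
samples = barcodes_for_file("ALL_AAF327YM5_s_1_fastq.txt.gz", key_text)
print(flowcell, lane)
print(samples)
```

If the flowcell or lane in the filename does not match columns 1 and 2 of the key file, the lookup returns no barcodes, which is worth ruling out with old data that may have been renamed.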