GBSSeqToTagDBPlugin:processData problem in the key file


Celia VI

Feb 20, 2024, 10:15:47 AM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Dear all,

I am new to TASSEL and I am having problems with this step.

This is what I am trying to run:

time /opt/tassel5-TEGenzymes/run_pipeline.pl -Xms1G -Xmx12G -fork1 -GBSSeqToTagDBPlugin -e TEGEcoT22I -i ./sequences -db ./40555-2.db -k ./keyfile/40555-2_SNPCalling_keyfile.txt -kmerLength 120 -minKmerL 20 -mnQS 20 -mxKmerNum 100000000 -batchSize 31 -endPlugin -runfork1 2>&1 | tee -a ./40555-2_GBSV2_Pipeline"$DATE".log

GBSSeqToTagDBPlugin:processData - found NO files represented in key file.
Please verify your file names are formatted correctly and that your key file contains the required headers.

Attached is the keyfile that I have. I don't have a LibraryPrepID and I do not know where to find it.

These are my samples:

40555-2_1_B1_R1_fastq
40555-2_1_B12_R1_fastq
40555-2_1_B10_R1_fastq
...


I have read all the conversations regarding this problem and I cannot make it work. Any help would be really appreciated.

Thank you in advance.

All the best,
Celia.
40555-2_SNPCalling_keyfile.xlsx

Lynn Carol Johnson

Feb 20, 2024, 11:13:09 AM
to tas...@googlegroups.com

You have used the PlateName rather than the Flowcell name in your fastq file names. GBSv2 expects the files to be named in one of the following ways:

flowcell_lane_fastq

flowcell_s_lane_fastq

code_flowcell_s_lane_fastq

In your case, I would expect the files to be named as below (example for your first file):

HVW7GDSX7_1_fastq in the simplest version. Or try changing it to AACT_HVW7GDSX7_B1_1_fastq to keep the 4 underscores.

Do the same for the other files.
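
The renaming can be scripted. Below is a minimal sketch, assuming the original names follow a <plate>_<lane>_<well>_<read>_fastq layout (inferred from the three sample names in the question, not confirmed) and producing the flowcell_well_lane_fastq form that is reported as working later in this thread. The helper name new_name is hypothetical, and the loop only prints the mv commands as a dry run.

```shell
# Hypothetical helper, not part of TASSEL.
# Assumes old names look like <plate>_<lane>_<well>_<read>_fastq[.gz].
new_name() {
  flowcell=$1
  old=$2
  case $old in
    *.gz) ext=.gz; base=${old%.gz} ;;   # remember an optional .gz
    *)    ext=;    base=$old ;;
  esac
  rest=${base#*_}     # drop the <plate> field
  lane=${rest%%_*}    # second field, assumed to be the lane
  rest=${rest#*_}
  well=${rest%%_*}    # third field, assumed to be the well
  printf '%s_%s_%s_fastq%s\n' "$flowcell" "$well" "$lane" "$ext"
}

# Dry run: print the mv commands instead of executing them.
for f in 40555-2_1_B1_R1_fastq 40555-2_1_B12_R1_fastq 40555-2_1_B10_R1_fastq; do
  echo mv -- "$f" "$(new_name HVW7GDSX7 "$f")"
done
# first line printed: mv -- 40555-2_1_B1_R1_fastq HVW7GDSX7_B1_1_fastq
```

Remove the echo only after checking that every printed target name is what the key file expects.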

 


Celia VI

Feb 21, 2024, 4:51:48 AM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Thank you for your email.

I have changed the names to HVW7GDSX7_B10_1_fastq.gz, HVW7GDSX7_C11_1_fastq.gz, HVW7GDSX7_D12_1_fastq.gz, HVW7GDSX7_E1_1_fastq.gz, HVW7GDSX7_F2_1_fastq.gz, HVW7GDSX7_G3_1_fastq.gz and so on, but I keep getting the same error.

time /opt/tassel5-TEGenzymes/run_pipeline.pl -Xms1G -Xmx12G -fork1 -GBSSeqToTagDBPlugin -e TEGEcoT22I -i ./sequences/* -db ./40555-2.db -k ./keyfile/40555-2_SNPCalling_keyfile.txt -kmerLength 120 -minKmerL 20 -mnQS 20 -mxKmerNum 100000000 -batchSize 31 -endPlugin -runfork1 2>&1 | tee -a ./40555-2_GBSV2_Pipeline"$DATE".log ^C
[xxxxxxx@omicserver seq_modified]$ time /opt/tassel5-TEGenzymes/run_pipeline.pl -Xms1G -Xmx12G -fork1 -GBSSeqToTagDBPlugin -e TEGEcoT22I -i ./sequences -db ./40555-2.db -k ./keyfile/40555-2_SNPCalling_keyfile.txt -kmerLength 120 -minKmerL 20 -mnQS 20 -mxKmerNum 100000000 -batchSize 31 -endPlugin -runfork1 2>&1 | tee -a ./40555-2_GBSV2_Pipeline"$DATE".log
/opt/tassel5-TEGenzymes/lib/commons-codec-1.10.jar:/opt/tassel5-TEGenzymes/lib/jfreesvg-3.2.jar:/opt/tassel5-TEGenzymes/lib/biojava-alignment-4.0.0.jar:/opt/tassel5-TEGenzymes/lib/jcommon-1.0.23.jar:/opt/tassel5-TEGenzymes/lib/jfreechart-1.0.19.jar:/opt/tassel5-TEGenzymes/lib/htsjdk-2.14.0.jar:/opt/tassel5-TEGenzymes/lib/javax.json-1.0.4.jar:/opt/tassel5-TEGenzymes/lib/postgresql-9.4-1201.jdbc41.jar:/opt/tassel5-TEGenzymes/lib/json-simple-1.1.1.jar:/opt/tassel5-TEGenzymes/lib/jfxrt.jar:/opt/tassel5-TEGenzymes/lib/log4j-1.2.13.jar:/opt/tassel5-TEGenzymes/lib/guava-22.0.jar:/opt/tassel5-TEGenzymes/lib/mail-1.4.jar:/opt/tassel5-TEGenzymes/lib/avro-1.8.1.jar:/opt/tassel5-TEGenzymes/lib/colt-1.2.0.jar:/opt/tassel5-TEGenzymes/lib/ahocorasick-0.2.4.jar:/opt/tassel5-TEGenzymes/lib/snappy-java-1.1.1.6.jar:/opt/tassel5-TEGenzymes/lib/slf4j-simple-1.7.10.jar:/opt/tassel5-TEGenzymes/lib/junit-4.10.jar:/opt/tassel5-TEGenzymes/lib/ejml-0.23.jar:/opt/tassel5-TEGenzymes/lib/biojava-phylo-4.0.0.jar:/opt/tassel5-TEGenzymes/lib/biojava-core-4.0.0.jar:/opt/tassel5-TEGenzymes/lib/commons-math3-3.4.1.jar:/opt/tassel5-TEGenzymes/lib/itextpdf-5.1.0.jar:/opt/tassel5-TEGenzymes/lib/sqlite-jdbc-3.8.5-pre1.jar:/opt/tassel5-TEGenzymes/lib/je-6.0.11.jar:/opt/tassel5-TEGenzymes/lib/slf4j-api-1.7.10.jar:/opt/tassel5-TEGenzymes/lib/trove-3.0.3.jar:/opt/tassel5-TEGenzymes/lib/jhdf5-14.12.5.jar:/opt/tassel5-TEGenzymes/lib/forester-1.038.jar:/opt/tassel5-TEGenzymes/dist/sTASSEL.jar
Memory Settings: -Xms1G -Xmx12G
Tassel Pipeline Arguments: -fork1 -GBSSeqToTagDBPlugin -e TEGEcoT22I -i ./sequences -db ./40555-2.db -k ./keyfile/40555-2_SNPCalling_keyfile.txt -kmerLength 120 -minKmerL 20 -mnQS 20 -mxKmerNum 100000000 -batchSize 31 -endPlugin -runfork1
[main] INFO net.maizegenetics.tassel.TasselLogging - Tassel Version: 5.2.44  Date: May 17, 2018
[main] INFO net.maizegenetics.tassel.TasselLogging - Max Available Memory Reported by JVM: 12288 MB
[main] INFO net.maizegenetics.tassel.TasselLogging - Java Version: 11.0.22
[main] INFO net.maizegenetics.tassel.TasselLogging - OS: Linux
[main] INFO net.maizegenetics.tassel.TasselLogging - Number of Processors: 64
[main] INFO net.maizegenetics.pipeline.TasselPipeline - Tassel Pipeline Arguments: [-fork1, -GBSSeqToTagDBPlugin, -e, TEGEcoT22I, -i, ./sequences, -db, ./40555-2.db, -k, ./keyfile/40555-2_SNPCalling_keyfile.txt, -kmerLength, 120, -minKmerL, 20, -mnQS, 20, -mxKmerNum, 100000000, -batchSize, 31, -endPlugin, -runfork1]
net.maizegenetics.analysis.gbs.v2.GBSSeqToTagDBPlugin
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.analysis.gbs.v2.GBSSeqToTagDBPlugin: time: Feb 21, 2024 10:49:29
Enzyme: TEGEcoT22I
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin -
GBSSeqToTagDBPlugin Parameters
i: ./sequences
k: ./keyfile/40555-2_SNPCalling_keyfile.txt
e: TEGEcoT22I
kmerLength: 120
minKmerL: 20
c: 10
db: ./40555-2.db
mnQS: 20
mxKmerNum: 100000000
batchSize: 31
deleteOldData: true


GBSSeqToTagDBPlugin:processData - found NO files represented in key file.
Please verify your file names are formatted correctly and that your key file contains the required headers.
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Finished net.maizegenetics.analysis.gbs.v2.GBSSeqToTagDBPlugin: time: Feb 21, 2024 10:49:29
[pool-1-thread-1] INFO net.maizegenetics.pipeline.TasselPipeline - net.maizegenetics.analysis.gbs.v2.GBSSeqToTagDBPlugin: time: Feb 21, 2024 10:49:29: progress: 100%
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - net.maizegenetics.analysis.gbs.v2.GBSSeqToTagDBPlugin  Citation: Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. (2007) TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633-2635.

real    0m0.694s
user    0m1.327s
sys     0m0.130s

Thank you so much in advance!
40555-2_SNPCalling_keyfile.txt

Lynn Carol Johnson

Feb 21, 2024, 10:55:25 AM
to tas...@googlegroups.com

I loaded your key file and created a couple fake fastq files using the latest names you have.  The code was able to find and process the files.

 

In your command, are you including the “*” when you give it the sequences folder?  You want this to be just the folder name with a “/” appended, e.g.
  -i /my/path/to/fastqFiles/

I don’t see anything else wrong.
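
To see why the “*” matters: the shell expands it before TASSEL starts, so -i would receive only the first file name and the remaining names would arrive as extra arguments. A small, self-contained demonstration using a temporary directory (count_args and the file names are made up for illustration):

```shell
# Hypothetical demo, unrelated to TASSEL itself: count the arguments
# a program receives with and without a trailing glob.
count_args() { echo $#; }

tmp=$(mktemp -d)
touch "$tmp/a_1_fastq.gz" "$tmp/b_1_fastq.gz"

count_args "$tmp"      # prints 1: the directory itself
count_args "$tmp"/*    # prints 2: one argument per file in the directory

rm -r "$tmp"
```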

 

Celia VI

Feb 23, 2024, 4:02:42 AM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Dear Lynn,

Thank you so much for your help. After changing the names it did work. Now I have another problem; I will read the manual and the questions in the group to solve it.

Current number: 0. Max kmer number: 100000000
0.0 of max tag number
WARNING: Current tagcntmap size is 0 after processing batch 4

All the best,
Celia.

gsi...@ucdavis.edu

Mar 7, 2024, 5:31:02 PM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Hi All,

I have a similar problem and have tried every possible combination for the Flowcell name, but I can't get GBSSeqToTagDBPlugin to run because it doesn't recognize the naming convention in the key file. Could someone please advise me on what I should enter in the Flowcell column?

Attached is a clip from the key text file. There is only one flowcell name that I found in the fastq files. The fastq file names are as follows:
Plate-1_R1_001.fastq.gz
Plate-1_R2_001.fastq.gz
Plate-2_R1_001.fastq.gz
Plate-2_R2_001.fastq.gz

I have tried renaming the files in the form flowcell_lane_filename_fastq.gz, but no combination seems to work.

Thanks for any advice.
Gina

GBSSeqToTagDBPlugin:processData - found NO files represented in key file.


Screenshot 2024-03-07 at 5.24.10 PM.png

Alden Perkins

Mar 7, 2024, 5:54:34 PM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
I see two things to consider here:

1. The files should be named according to one of the supported conventions:
  • flowcell_lane_fastq.txt.gz
  • flowcell_s_lane_fastq.txt.gz
  • code_flowcell_s_lane_fastq.txt.gz
This means that if there are two underscores in your file names, TASSEL expects the text before the first underscore to match an entry in the Flowcell column of your key file, and the text between the first and second underscores to match an entry in the Lane column. Therefore, you could rename your first file GW1810121513_1_fastq.txt.gz, change the first entry in your Flowcell column to just GW1810121513, and keep the first entry in your Lane column as 1. If that doesn't work, make sure you are giving the input directory correctly:

-i <Input Directory>: Input directory containing FASTQ files in text or gzipped text. NOTE: Directory will be searched recursively and should be written WITHOUT a slash after its name. (REQUIRED)

2. It looks like you're using paired end reads, and you have R1 and R2 in the file names. TASSEL does not officially support paired end reads, so I would suggest using only the R1 files (and renaming them as above or using one of the other supported naming conventions).
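
The matching rule in point 1 can be sketched as a small shell function. This is an illustration of the description above, not TASSEL's own parsing code: parse_fastq_name is a hypothetical helper, and the handling of the three- and four-underscore forms is an assumption read off the convention names.

```shell
# Hypothetical helper, not part of TASSEL: report which Flowcell and Lane
# values a FASTQ file name would need to match in the key file.
# The body runs in a subshell so the IFS and set -f changes stay local.
parse_fastq_name() (
  name=${1##*/}      # drop any leading directory
  IFS=_
  set -f             # no globbing while splitting on underscores
  set -- $name
  case $# in
    3) echo "flowcell=$1 lane=$2" ;;   # flowcell_lane_fastq
    4) echo "flowcell=$1 lane=$3" ;;   # flowcell_s_lane_fastq
    5) echo "flowcell=$2 lane=$4" ;;   # code_flowcell_s_lane_fastq
    *) echo "unsupported: $name"; exit 1 ;;   # exits the subshell only
  esac
)

parse_fastq_name GW1810121513_1_fastq.txt.gz   # flowcell=GW1810121513 lane=1
parse_fastq_name Plate-1_R1_001.fastq.gz       # flowcell=Plate-1 lane=R1
```

Under these assumptions, a name such as Plate-1_R1_001.fastq.gz would be read as flowcell "Plate-1" and lane "R1", which would not match a Flowcell entry of GW1810121513 and a Lane entry of 1.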

gsi...@ucdavis.edu

Mar 8, 2024, 3:54:58 PM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Hi Alden,

Thanks for your reply with suggestions. However I still get an error.  :(...I believe that the input directory is correct. Can you please have a look? Thanks for your help :) - Gina

Here are the names of the Read 1 files only, which now match the flowcell and lane:
GW1810121513_1_fastq.text.gz
GW1810121513_2_fastq.text.gz

Attached is a clip from the key file showing the flowcell name, GW1810121513, and the lanes, 1 or 2.

Here is a clip from the .sh file

## Set variables

# Working directory
WD=/Users/ginas/gbs/15_84_15_55

# Name of input directory
INPUT=$WD/input
# Name of input directory with FASTQ
FASTQDIR=$INPUT/fastq_files
# Name of database
DBNAME=$WD/database/cranberry_gbs_discovery.db
# Name of keyfile
KEY=$INPUT/cranberry-15.01_key.txt
# KEY=$INPUT/cranberry_gbs_unique_keys_resolved_duplicates.txt

# Name of tag fasta
TAGFASTA=$WD/tags/gbs_tags_for_alignment.fa.gz
# Name of unzipped tag fasta
TAGFASTAUZ=$WD/tags/gbs_tags_for_alignment.fa

# Name of output sam file
SAMOUT=$WD/alignment/gbs_tags_aligned_BenLearv1.sam
# Basename of reference index
REFIND=$WD/genome/BenLear_v1_bt2_index/VmBenLear

# FASTA reference genome
REF=$WD/genome/BenLear/Vmacrocarpon_BenLear_v1.fasta

# Output stat file
STATSOUTFILE=$WD/stats/snpStats.txt

# VCF Output file
OUTFILE=$WD/snps/cranberryGBS_production_snps_allUniqueKeys.vcf

## GBSSeqToTagDBPlugin
# Execute the plugin
./run_pipeline.pl -Xms1G -Xmx48G -fork1 -GBSSeqToTagDBPlugin \
-c 10 \
-db $DBNAME \
-i $INPUT \
-k $KEY \
-e PstI-MspI \
-kmerLength 64 \
-minKmerL 20 \
-mnQS 20 \
-mxKmerNum 100000000 \
-endPlugin -runfork1




This was a clip from the script. 

ginas@ufarmadm 15_84_15_55 % bash 02_run_tassel5_pipeline_allUniqueKeys.sh

./lib/biojava-genome-6.0.4.jar:./lib/htsjdk-2.24.1.jar:./lib/protobuf-kotlin-3.23.0.jar:./lib/jhdf5-14.12.5.jar:./lib/kotlin-stdlib-jdk7-1.6.10.jar:./lib/snappy-java-1.1.8.4.jar:./lib/ini4j-0.5.4.jar:./lib/scala-library-2.10.1.jar:./lib/javax.json-1.0.4.jar:./lib/biojava-alignment-6.0.4.jar:./lib/junit-4.10.jar:./lib/gs-ui-1.3.jar:./lib/commons-io-2.11.0.jar:./lib/guava-22.0.jar:./lib/sshj-0.32.0.jar:./lib/kotlin-stdlib-jdk8-1.6.10.jar:./lib/ahocorasick-0.2.4.jar:./lib/kotlin-stdlib-1.6.10.jar:./lib/jfreechart-1.0.19.jar:./lib/forester-1.039.jar:./lib/postgresql-42.6.0.jar:./lib/jackson-core-2.13.2.jar:./lib/kotlin-reflect-1.6.10.jar:./lib/colt-1.2.0.jar:./lib/jackson-databind-2.13.2.2.jar:./lib/biojava-core-6.0.4.jar:./lib/jackson-module-kotlin-2.13.2.jar:./lib/json-simple-1.1.1.jar:./lib/commons-math3-3.4.1.jar:./lib/ejml-core-0.41.jar:./lib/mail-1.4.jar:./lib/kotlinx-coroutines-core-jvm-1.6.0.jar:./lib/commons-codec-1.10.jar:./lib/log4j-api-2.21.1.jar:./lib/protobuf-java-3.23.0.jar:./lib/jackson-annotations-2.13.2.jar:./lib/jfreesvg-3.2.jar:./lib/itextpdf-5.1.0.jar:./lib/ejml-ddense-0.41.jar:./lib/slf4j-simple-1.7.10.jar:./lib/protobuf-java-util-3.23.0.jar:./lib/gs-core-1.3.jar:./lib/jcommon-1.0.23.jar:./lib/log4j-core-2.21.1.jar:./lib/sqlite-jdbc-3.39.2.1.jar:./lib/biojava-phylo-4.2.12.jar:./lib/error_prone_annotations-2.19.1.jar:./lib/fastutil-8.2.2.jar:./lib/slf4j-api-1.7.10.jar:./lib/phg.jar:./lib/trove-3.0.3.jar:./sTASSEL.jar

Memory Settings: -Xms1G -Xmx48G
Tassel Pipeline Arguments: -fork1 -GBSSeqToTagDBPlugin -c 10 -db /Users/ginasideli/gbs/15_84_15_55/database/cranberry_gbs_discovery.db -i /Users/ginas/gbs/15_84_15_55/input -k /Users/ginas/gbs/15_84_15_55/input/cranberry-15.01_key.txt -e PstI-MspI -kmerLength 64 -minKmerL 20 -mnQS 20 -mxKmerNum 100000000 -endPlugin -runfork1
[main] INFO net.maizegenetics.tassel.TasselLogging - Tassel Version: 5.2.93  Date: December 21, 2023
[main] INFO net.maizegenetics.tassel.TasselLogging - Max Available Memory Reported by JVM: 43691 MB
[main] INFO net.maizegenetics.tassel.TasselLogging - Java Version: 1.8.0_401
[main] INFO net.maizegenetics.tassel.TasselLogging - OS: Mac OS X
[main] INFO net.maizegenetics.tassel.TasselLogging - Number of Processors: 24
[main] INFO net.maizegenetics.tassel.TasselLogging - Tassel Citation: Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. (2007) TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633-2635.
[main] INFO net.maizegenetics.tassel.TasselLogging - 
[main] INFO net.maizegenetics.tassel.TasselLogging - Tassel Using Library: Practical Haplotype Graph (PHG): Version: 1.9 Date: December 21, 2023
[main] INFO net.maizegenetics.tassel.TasselLogging - PHG Citation: Bradbury PJ, Casstevens T, Jensen SE, Johnson LC, Miller ZR, Monier B, Romay MC, Song B, Buckler ES. The Practical Haplotype Graph, a platform for storing and using pangenomes for imputation. Bioinformatics. 2022 Aug 2;38(15):3698-3702. doi: 10.1093/bioinformatics/btac410. PMID: 35748708; PMCID: PMC9344836.
[main] INFO net.maizegenetics.pipeline.TasselPipeline - Tassel Pipeline Arguments: [-fork1, -GBSSeqToTagDBPlugin, -c, 10, -db, /Users/ginas/gbs/15_84_15_55/database/cranberry_gbs_discovery.db, -i, /Users/ginas/gbs/15_84_15_55/input, -k, /Users/ginas/gbs/15_84_15_55/input/cranberry-15.01_key.txt, -e, PstI-MspI, -kmerLength, 64, -minKmerL, 20, -mnQS, 20, -mxKmerNum, 100000000, -endPlugin, -runfork1]
net.maizegenetics.analysis.gbs.v2.GBSSeqToTagDBPlugin
[pool-2-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.analysis.gbs.v2.GBSSeqToTagDBPlugin: time: Mar 8, 2024 13:10:10
[pool-2-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - 
GBSSeqToTagDBPlugin Parameters
i: /Users/ginas/gbs/15_84_15_55/input
k: /Users/ginas/gbs/15_84_15_55/input/cranberry-15.01_key.txt
e: PstI-MspI
kmerLength: 64
minKmerL: 20
c: 10
db: /Users/ginas/gbs/15_84_15_55/database/cranberry_gbs_discovery.db
mnQS: 20
mxKmerNum: 100000000
batchSize: 8
deleteOldData: true

GBSSeqToTagDBPlugin:processData - found NO files represented in key file.
Please verify your file names are formatted correctly and that your key file contains the required headers.


Screenshot 2024-03-08 at 3.48.57 PM.png

Alden Perkins

Mar 8, 2024, 4:50:35 PM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
I wonder if there is some strange formatting in your key file. Are you able to share the text file itself? 

gsi...@ucdavis.edu

Mar 8, 2024, 5:05:41 PM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Please see attached key file.
cranberry-15.01_key.txt

Alden Perkins

Mar 8, 2024, 5:22:44 PM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
That's strange! I was able to run GBSSeqToTagDBPlugin using your key file with no problem, so I would check the file names and paths again. One thing I noticed is that you wrote the file name as 'GW1810121513_1_fastq.text.gz' above, but that should give an error because the name needs to end with a recognized suffix such as 'fastq.txt.gz'.

To test the key file, I named a fastq file 'GW1810121513_1_fastq.txt.gz', put it in a folder called 'files', and it worked. I'll attach screenshots in case you notice anything that's different.
tassel1.png
tassel2.png
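
A quick way to catch this kind of slip is to check each name's suffix before running the plugin. A minimal sketch, assuming only the two endings that are reported to work in this thread (_fastq.gz and _fastq.txt.gz); TASSEL may accept others, and has_known_suffix is a hypothetical helper, not a TASSEL command.

```shell
# Hypothetical check, not part of TASSEL. The suffix list is an assumption
# taken from names reported as working in this thread; it may be incomplete.
has_known_suffix() {
  case $1 in
    *_fastq.gz|*_fastq.txt.gz) return 0 ;;
    *) return 1 ;;
  esac
}

for f in GW1810121513_1_fastq.text.gz GW1810121513_1_fastq.txt.gz; do
  if has_known_suffix "$f"; then
    echo "ok       $f"
  else
    echo "suspect  $f"
  fi
done
```

With the two names above, the loop flags the ".text.gz" file as suspect and the ".txt.gz" file as ok.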

gsi...@ucdavis.edu

Mar 11, 2024, 2:23:57 PM
to TASSEL - Trait Analysis by Association, Evolution and Linkage
Hi Alden,

The issue was the fastq file name having "text" instead of "txt". Good eye, because I didn't even notice I had typed the word out!

Gina
