Incompatible read groups in gstacks 2.68

sedg...@gmail.com

May 9, 2025, 5:24:40 PM
to Stacks
Hi,

I'm working with three sets of samples that I'm trying to combine in a single analysis. There are 384 samples in total, and I've selected 81 for inclusion in the catalog. In several attempts at gstacks, it appears that samples that aren't in the catalog are all labelled with ID 82:

```
Reading BAM headers...
Error: Incompatible read groups (ID:82, SM:BWA13) and (ID:82, SM:BWA1).
Error: In BAM files './stacks25/BWA13.matches.bam' and './stacks25/BWA1.matches.bam'.
```

I wonder if this is a regression of the bug that was fixed in version 2.67, or whether I've done something wrong. The only thing I can think of is that I used process_radtags 2.67 output with the rest of the pipeline running 2.68; could that be my mistake?

I'll paste the steps I've done below. I'd really like to avoid using all 384 samples in the catalog, as this is taking a lot of time, and will keep growing as I add new samples each year.

Thanks,

Tyler

I used v2.67 of process_radtags with the following arguments:
```
process_radtags -1 <SAMPLE>.R1.fastq.gz \
                -2 <SAMPLE>.R2.fastq.gz \
                -b <BARCODES> -o <OUTDIR> \
                --renz-1 nsiI --renz-2 mspI --inline-null -c -q -r --threads 16
```

From here forward, I used v2.68:

```
ustacks -t gzfastq -f ${READS_DIR}/${SAMPLE}.1.fq.gz \
        -o ${OUT_DIR} -m 3 --name ${SAMPLE} -M $M -p ${THREADS}
```

cat25Popmap.tsv contains 81 samples:

```
cstacks -n 4 -P stacks25/ -M cat25Popmap.tsv -p 48
```
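
For reference, the popmap is the standard Stacks format: one sample per line, with a tab-separated sample prefix and population name (the names below are just illustrative):

```
# sample prefix <TAB> population
BWA1	pop1
BWA13	pop1
BWA27	pop2
```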

```
sstacks -c ${DIR} -s ${DIR}/${SAMPLE} -p $THREADS
```

```
tsv2bam -P stacks25 -s $SAMPLE -R ./prort/all_files/ -t $THREADS
```

```
gstacks -P ./${DIR}/ -M gs25Popmap.tsv -t 32
```

Catchen, Julian

May 12, 2025, 5:44:17 PM
to stacks...@googlegroups.com

Hi Tyler,

Using an older version of process_radtags should not be related to this problem. When you run sstacks, I think you need to specify a popmap containing all of the samples, so that sstacks runs only once, looping itself over every sample supplied in the popmap (see the sketch below). I think you are running sstacks repeatedly in a shell loop? If all of your samples were in the catalog this would be fine, but since you have a subset, each sstacks run checks the catalog and assigns IDs incrementing from the last one in the catalog; running sstacks on its own multiple times would therefore explain why you see the same ID repeated.  The bug fixed in 2.67 would have misnumbered two of the samples, but all the rest would increment as expected.
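
For example, something along these lines, where allPopmap.tsv is a placeholder name for a popmap listing all of your samples, one per line:

```
# A single sstacks invocation over every sample, so the sample IDs
# are assigned consistently within one run (allPopmap.tsv is a
# hypothetical popmap covering all 384 samples).
sstacks -P stacks25/ -M allPopmap.tsv -p $THREADS
```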

Best,

Julian

sedg...@gmail.com

Sep 18, 2025, 10:59:14 AM
to Stacks
Thanks Julian,

Sorry for the slow response. I missed your reply in the spring, and I'm only getting back to bioinformatics now after field work.

Is there any way to assign IDs manually?

The issue is that I'm working on a multi-year sampling project. In the first year I was unsure how much sequencing depth I needed for my samples (octoploid plants) and ended up with a lot of reads. Subsequent testing shows I can cut the depth by 75% or more, but I still need to include the first 96 samples, as I add more each year.

Breaking the analysis up into hundreds of individual array jobs (roughly as in the sketch below) reduces the total run time from months to about a week. Constructing the catalog is one of the more time-consuming steps. Limiting it to a selection of 80 samples speeds that up and means I don't need to rebuild the catalog each year as I add samples. The catalog takes about 4 days to build for my 80 samples, and the time grows quickly with more: from 10 days to several weeks for 96. I currently have 384 samples, with another 200 collected this year.
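
For context, the per-sample steps are structured roughly like this (a minimal sketch assuming SLURM; samples.txt is a hypothetical one-name-per-line list of samples):

```
#!/bin/bash
#SBATCH --array=1-384
# Each array task handles one sample, picked by its task index from
# the sample list, then runs the per-sample pipeline step.
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
ustacks -t gzfastq -f ${READS_DIR}/${SAMPLE}.1.fq.gz \
        -o ${OUT_DIR} -m 3 --name ${SAMPLE} -M $M -p ${THREADS}
```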

All of which is to say that going back to a full catalog would add months to my workflow every year.

sstacks is the other challenge. Running it on each sample individually takes a few days; running all of the samples together in a single job would take a month.

What would you suggest in this situation? Is there any point in the workflow where I could manually assign IDs to samples? Or maybe I need to subsample those initial 96 samples down to a reasonable size? The biggest sample has 2.0e8 retained reads, and from my experiments I could drop that to 5.0e7 (or perhaps less) with no loss of resolution (distinguishing among clones being one of the main requirements). It would be straightforward to select the first 5.0e7 reads from each file after process_radtags, along the lines of the sketch below, if that makes sense.
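
For instance, something like this (a minimal sketch, assuming gzipped FASTQ with four lines per record; file names are placeholders):

```
# Keep the first 50 million reads from each mate file; taking the
# same count from the .1 and .2 files keeps the pairs in sync.
N_READS=50000000
for MATE in 1 2; do
    zcat ${SAMPLE}.${MATE}.fq.gz | head -n $((4 * N_READS)) \
        | gzip > subsampled/${SAMPLE}.${MATE}.fq.gz
done
```

Random subsampling (e.g. seqtk sample with the same seed on both mate files) might be safer than taking the first reads if there is any positional bias in the files.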

Thanks again for your help,

Tyler