Incompatible read groups in gstacks 2.68

sedg...@gmail.com

May 9, 2025, 5:24:40 PM
to Stacks
Hi,

I'm working with three sets of samples that I'm trying to combine in a single analysis. There are 384 samples in total, and I've selected 81 for inclusion in the catalog. In several attempts at gstacks, it appears that samples that aren't in the catalog are all labelled with ID 82:

```
Reading BAM headers...
Error: Incompatible read groups (ID:82, SM:BWA13) and (ID:82, SM:BWA1).
Error: In BAM files './stacks25/BWA13.matches.bam' and './stacks25/BWA1.matches.bam'.
```

I wonder if this is a regression of the bug that was fixed in version 2.67, or whether I've done something wrong. The only thing I can think of is that I used process_radtags 2.67 output with the rest of the pipeline running 2.68; could that be my mistake?

I'll paste the steps I've done below. I'd really like to avoid using all 384 samples in the catalog, as this is taking a lot of time, and will keep growing as I add new samples each year.

Thanks,

Tyler

I used v2.67 of process_radtags with the following arguments:
```
process_radtags -1 <SAMPLE>.R1.fastq.gz \
                -2 <SAMPLE>.R2.fastq.gz \
                -b <BARCODES> -o <OUTDIR> \
                --renz-1 nsiI --renz-2 mspI --inline-null -c -q -r --threads 16
```

From here forward, I used v2.68:

```
ustacks -t gzfastq -f ${READS_DIR}/${SAMPLE}.1.fq.gz \
        -o ${OUT_DIR} -m 3 --name ${SAMPLE} -M $M -p ${THREADS}
```

cat25Popmap.tsv contains 81 samples:

```
cstacks -n 4 -P stacks25/ -M cat25Popmap.tsv -p 48
```
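
For reference, the popmap is the standard Stacks format: one sample per line, with a tab-separated sample prefix and population name (the names below are just illustrative):

```
# sample prefix <TAB> population
BWA1	pop1
BWA13	pop1
BWA27	pop2
```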

```
sstacks -c ${DIR} -s ${DIR}/${SAMPLE} -p $THREADS
```

```
tsv2bam -P stacks25 -s $SAMPLE -R ./prort/all_files/ -t $THREADS
```

```
gstacks -P ./${DIR}/ -M gs25Popmap.tsv -t 32
```

Catchen, Julian

May 12, 2025, 5:44:17 PM
to stacks...@googlegroups.com

Hi Tyler,

Using an older version of process_radtags should not be related to this problem. When you run sstacks, I think you need to specify a popmap containing all of the samples, so that sstacks runs only once, looping itself over every sample supplied in the popmap (see the sketch below). I think you are running sstacks repeatedly in a shell loop? If all of your samples were in the catalog this would be fine, but since you have a subset, each sstacks run checks the catalog and assigns IDs incrementing from the last one in the catalog; running sstacks on its own multiple times would therefore explain why you see the same ID repeated.  The bug fixed in 2.67 would have misnumbered two of the samples, but all the rest would increment as expected.
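
For example, something along these lines, where allPopmap.tsv is a placeholder name for a popmap listing all of your samples, one per line:

```
# A single sstacks invocation over every sample, so the sample IDs
# are assigned consistently within one run (allPopmap.tsv is a
# hypothetical popmap covering all 384 samples).
sstacks -P stacks25/ -M allPopmap.tsv -p $THREADS
```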

Best,

Julian

sedg...@gmail.com

Sep 18, 2025, 10:59:14 AM
to Stacks
Thanks Julian,

Sorry for the slow response. I missed your reply in the spring, and I'm only getting back to bioinformatics now after field work.

Is there any way to assign IDs manually?

The issue is that I'm working on a multi-year sampling project. In the first year I was unsure how much sequencing depth I needed for my samples (octoploid plants) and ended up with a lot of reads. Subsequent testing shows I can cut the depth by 75% or more, but I still need to include the first 96 samples, as I add more each year.

Breaking the analysis up into hundreds of individual array jobs (roughly as in the sketch below) reduces the total run time from months to about a week. Constructing the catalog is one of the more time-consuming steps. Limiting it to a selection of 80 samples speeds that up and means I don't need to rebuild the catalog each year as I add samples. The catalog takes about 4 days to build for my 80 samples, and the time grows quickly with more: from 10 days to several weeks for 96. I currently have 384 samples, with another 200 collected this year.
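
For context, the per-sample steps are structured roughly like this (a minimal sketch assuming SLURM; samples.txt is a hypothetical one-name-per-line list of samples):

```
#!/bin/bash
#SBATCH --array=1-384
# Each array task handles one sample, picked by its task index from
# the sample list, then runs the per-sample pipeline step.
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
ustacks -t gzfastq -f ${READS_DIR}/${SAMPLE}.1.fq.gz \
        -o ${OUT_DIR} -m 3 --name ${SAMPLE} -M $M -p ${THREADS}
```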

All of which is to say that going back to a full catalog would add months to my workflow every year.

sstacks is the other challenge. Running it on each sample individually takes a few days; running all of the samples together in a single job would take a month.

What would you suggest in this situation? Is there any point in the workflow where I could manually assign IDs to samples? Or maybe I need to subsample those initial 96 samples down to a reasonable size? The biggest sample has 2.0e8 retained reads, and from my experiments I could drop that to 5.0e7 (or perhaps less) with no loss of resolution (distinguishing among clones being one of the main requirements). It would be straightforward to select the first 5.0e7 reads from each file after process_radtags, along the lines of the sketch below, if that makes sense.
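
For instance, something like this (a minimal sketch, assuming gzipped FASTQ with four lines per record; file names are placeholders):

```
# Keep the first 50 million reads from each mate file; taking the
# same count from the .1 and .2 files keeps the pairs in sync.
N_READS=50000000
for MATE in 1 2; do
    zcat ${SAMPLE}.${MATE}.fq.gz | head -n $((4 * N_READS)) \
        | gzip > subsampled/${SAMPLE}.${MATE}.fq.gz
done
```

Random subsampling (e.g. seqtk sample with the same seed on both mate files) might be safer than taking the first reads if there is any positional bias in the files.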

Thanks again for your help,

Tyler