Dear Dr. Catchen and group members,
I am new to bioinformatics and would greatly appreciate your guidance on analyzing my ddRAD dataset using Stacks 2.68.
Here is a summary of the tests I have run so far for the reference genome alignment, after running process_radtags on paired-end data with the inline_null barcode option.
First attempt: I concatenated all four files per sample (.1, .2, rem.1, rem.2) into a single FASTQ. The alignment worked well (~99.9% of reads mapped), but all reads were treated as single-end by BWA, without preserving any pairing information.
Second attempt: I concatenated .1 with rem.1 and .2 with rem.2, keeping R1 and R2 separate. The resulting BAM files were paired-end and still had over 90% of reads mapped, but only about 60% of the reads were properly paired, which may indicate that the rem files are not synchronized between forward and reverse reads.
Third attempt: I used only the .1 and .2 files, without concatenating the rem files. The percentage of properly paired reads remained similar overall, although it varied from sample to sample and reached 80% in some cases.
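To test whether the concatenated files from the second attempt are really out of sync, a check along these lines could be run before alignment. This is only a minimal Python sketch: the function names are hypothetical, and it assumes standard four-line FASTQ records whose read names match between R1 and R2 apart from an optional /1 or /2 suffix.

```python
import gzip
from itertools import zip_longest


def read_names(path):
    """Yield the read name of each 4-line FASTQ record, minus any /1 or /2 suffix."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # header line of each record
                name = line.strip().split()[0]
                if name.endswith("/1") or name.endswith("/2"):
                    name = name[:-2]
                yield name


def count_unsynchronized(r1_path, r2_path):
    """Return (records compared, positions where R1 and R2 names differ).

    A record present in only one file counts as a mismatch.
    """
    total = 0
    mismatches = 0
    for n1, n2 in zip_longest(read_names(r1_path), read_names(r2_path)):
        total += 1
        if n1 != n2:
            mismatches += 1
    return total, mismatches
```

For example, calling count_unsynchronized("sample.1.fq.gz", "sample.2.fq.gz") (file names here are placeholders) on the .1/.2 files and again on the concatenated files would show whether adding the rem reads shifts the records out of register; a nonzero mismatch count would support that idea.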
Based on these results, I would like to ask your opinion on a few points:
Should I still concatenate all four files into one? It seems to me that the latest protocols skip this step and handle the two files, R1 and R2, separately.
Should rem files be included in the creation of loci in Stacks 2 (v2.68), or is it preferable to work only with .1 and .2 files, ensuring correct pairing even if it means discarding rem reads?
Finally, should we be concerned about the relatively low percentage of properly paired reads (60–80%), or is this acceptable to proceed with downstream analyses in Stacks?
I'd be very grateful for any advice you can provide. I just want to make sure I'm processing my paired-end data correctly before proceeding with the full dataset.
Thank you so much for your time and for your continued support of the Stacks community.
Best regards,
Lucrezia