Question about read sizes

Tatiana Barroca

Mar 11, 2016, 9:07:01 AM

to AftrRAD

Dear Mike,

I ran the AftrRAD.pl and Genotype.pl scripts and realized that they recognize only a few reads per sample, so I can’t proceed with the other analyses.

I read in the tutorial that reads should have the same size. I have reads with different sizes although most of them have the same size (98 bp).

I used pyRAD and it demultiplexed my samples.

Do you think that the difference in length among the reads is the problem? Or am I missing something obvious?

Best,

Tatiana

Mike Sovic

Mar 11, 2016, 11:47:17 AM

to AftrRAD

Hello Tatiana,

You are correct that you should not include reads of different lengths in AftrRAD (at least as of version 5.0). There are a couple of reasons for this…

1.) For a locus that has no variability among all of your samples (a monomorphic locus), the program will likely recognize it as a polymorphic locus - probably with indels at the end. An example with hypothetical read counts for each of the "alleles" for one of your samples...

ATTAGATCAGATAAAAACCCCAG 20

ATTAGATCAGATAAAAACCC - - - 15

These two reads will be recognized as different alleles that differ by an insertion/deletion, when the only difference is the read length, and the locus is truly monomorphic.

2.) For a truly polymorphic locus, you'll end up with two versions of each true allele (notice the A->T SNP at position 14):

ATTAGATCAGATAAAAACCCCAG 10

ATTAGATCAGATAAAAACCC - - - 8

ATTAGATCAGATATAAACCCCAG 15

ATTAGATCAGATATAAACCC - - - 10

The problem here is that the locus now appears to have four alleles rather than two, so all truly polymorphic loci will be flagged as paralogous and removed from the dataset.
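To make the two cases above concrete, here is a minimal Python sketch. It is not AftrRAD code: the function name `distinct_alleles` is made up for illustration, and plain exact-string matching is only a stand-in for the way length variants end up being treated as separate alleles. It counts the distinct sequences among the four reads from the polymorphic example, first as they are and then after trimming every read to the shortest length.

```python
from collections import Counter

def distinct_alleles(read_counts, trim_to=None):
    """Merge read counts by sequence, optionally trimming each read first."""
    merged = Counter()
    for seq, count in read_counts.items():
        key = seq[:trim_to] if trim_to is not None else seq
        merged[key] += count
    return dict(merged)

# Reads and hypothetical counts from the polymorphic example above.
reads = {
    "ATTAGATCAGATAAAAACCCCAG": 10,
    "ATTAGATCAGATAAAAACCC": 8,
    "ATTAGATCAGATATAAACCCCAG": 15,
    "ATTAGATCAGATATAAACCC": 10,
}

shortest = min(len(seq) for seq in reads)
untrimmed = distinct_alleles(reads)
trimmed = distinct_alleles(reads, trim_to=shortest)

print(len(untrimmed), "apparent alleles before trimming")
print(len(trimmed), "alleles after trimming to", shortest, "bp")
for seq, count in trimmed.items():
    print(seq, count)
```

Before trimming there are four apparent alleles. After trimming to 20 bp they collapse into the two true alleles, with the counts of the long and short versions added together.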

As I'm writing this, I think I may have just thought of a way to deal with this issue (though it probably won't be a quick fix), so I may look into that for future versions. For now, my suggestion is to include only reads of the same length. One option for doing this would be to trim all the reads back to the length of the shortest ones you have. We have done this with a couple of our datasets, and have a simple Perl script to do it - if you want that script, just let me know and I'll pass it along.
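For anyone reading along, the trimming idea can be sketched in a few lines of Python. This is not the Perl script mentioned above, just an illustration under some assumptions: the input is uncompressed FASTQ with exactly four lines per record, and the function names (`read_fastq`, `shortest_read_length`, `trim_fastq`) are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return  # end of file (or a blank line) ends the iteration
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        yield header, seq, plus, qual

def shortest_read_length(handle):
    """Return the length of the shortest read (raises ValueError if empty)."""
    return min(len(seq) for _, seq, _, _ in read_fastq(handle))

def trim_fastq(in_handle, out_handle, length):
    """Write each record trimmed to `length` bases; return the number written."""
    written = 0
    for header, seq, plus, qual in read_fastq(in_handle):
        if len(seq) < length:
            continue  # too short to trim to the target length, so skip it
        out_handle.write(f"{header}\n{seq[:length]}\n{plus}\n{qual[:length]}\n")
        written += 1
    return written

# Small in-memory demonstration with two reads of different lengths.
demo = io.StringIO(
    f"@read1\nATTAGATCAGATAAAAACCCCAG\n+\n{'I' * 23}\n"
    f"@read2\nATTAGATCAGATAAAAACCC\n+\n{'I' * 20}\n"
)
target = shortest_read_length(demo)  # first pass: find the shortest read
demo.seek(0)                         # rewind for the second pass
trimmed = io.StringIO()
n_written = trim_fastq(demo, trimmed, target)
print(target, n_written)
print(trimmed.getvalue(), end="")
```

With real data, the `io.StringIO` objects would be replaced by files opened with `open(...)`. The `length` argument does not have to be the minimum: passing a larger value keeps more of each read and drops the reads that are shorter than it, which may or may not be preferable depending on how many short reads there are.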

So, I would guess that the unequal read lengths are a problem. However, I can't really say whether they are the only problem in your run - I would need more information for that. For example, were there any specific error messages printed during the run? Alternatively, maybe you could send along a couple of example output files - ones that are often helpful as a starting point are 'Output/RunInfo/ReportX.txt' and 'TempFiles/RawReadCountFiles/RawReadCounts_NonParalogousX.txt'.

Also, did you run your data all the way through PyRAD, or just demultiplex in PyRAD? I can't remember if that program can handle reads of different lengths or not.