AftrRAD.pl output getting stuck


Stefanie Farrington

Nov 15, 2017, 3:22:15 PM
to AftrRAD
Hello,
I have recently been having issues running AftrRAD on a large dataset. After multiple tests, it seems to be getting stuck in the same spot after running for a few days:

Rapidly searching unique sequence number 86000 of 700355 for potentially allelic read pairs.

Rapidly searching unique sequence number 8700


Here is my command:

perl AftrRAD.pl re-0 P2-noP2 dplexedData-1


I am using demultiplexed data that has been trimmed to 70nt. I have allocated 240G of memory and 28 days for the run, so it shouldn't be timing out or running out of memory. I have been able to successfully complete the tutorial in the past.


I would appreciate any input or advice!

Thank you,

Stefanie

Mike Sovic

Nov 16, 2017, 11:20:46 PM
to AftrRAD
Hi Stefanie,

I actually got a similar message from someone else by e-mail not long ago.  I've made a few tweaks to the attached script - give it a try to see if it helps at all and let me know how it goes.  Since making the changes, I've only tested it out with the single-processor version (maxProcesses-1), so if you're running in parallel, it might be a good idea to first run it through the example data to make sure I didn't introduce any problems with that part of the script.  As long as that looks OK, give it a try with your full dataset and if you still have the same problem, let me know and we'll keep trying.

             Mike
Attachment: AftrRAD_v5.0.1.pl

Stefanie Farrington

Nov 27, 2017, 5:46:25 PM
to AftrRAD
Hi Mike,
Thanks for your reply! I was eventually able to get everything to run, but the run time for 7 samples was 11 days, so I am hoping to reduce the time needed for the other 80+ samples.

I tried running the script you sent, and AftrRAD ran just fine. However, I received this error message when running Genotype.pl with the example data:

Currently genotyping sample...OH_KLDR1006

Error in if (NumTrials < 3) { : missing value where TRUE/FALSE needed

In addition: Warning messages:

1: NAs introduced by coercion 

2: NAs introduced by coercion 

Execution halted

No such file or directory at Genotype.pl line 461.


Thank you again for your help, and any other suggestions for reducing run times are greatly appreciated!

Best,

Stefanie

Mike Sovic

Nov 29, 2017, 10:45:18 PM
to AftrRAD
Hi Stefanie,

Glad you got it to run.  In terms of the run time, I do have some ideas of how to speed it up, but just haven't had time to sit down and make that happen.  Eleven days is a fairly long run, but the good news is that adding additional individuals shouldn't increase the run time very dramatically beyond that, as the run time is determined primarily by the number of loci (unique reads) in the dataset (this assumes the individuals you add don't introduce a lot of new loci to the analysis).

I haven't yet been able to recreate that error message.  I'll try again tomorrow.  Can you check the TempFiles directory and see if there is a file named ForBinomialTestOH_KLDR1006.txt?  If so, open that file (or similar files associated with the other samples in the example dataset) and see if they contain data.  I'm kind of guessing they might not, and if not, that would suggest there actually was a problem with the AftrRAD run - maybe double check the console output and make sure there weren't any warnings/errors printed.
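For anyone wanting to run that check across every sample at once, here is a minimal sketch. It is only an illustration: the `TempFiles` directory and the `ForBinomialTest<sample>.txt` naming come from this thread, while the function name `summarize_binomial_inputs` is made up for the example, and treating a file with no non-blank lines as "empty" is an assumption.

```python
from pathlib import Path


def summarize_binomial_inputs(temp_dir="TempFiles"):
    """Return {file name: number of non-blank lines} for each ForBinomialTest*.txt file.

    Assumption: a file with zero non-blank lines means the upstream step wrote no data.
    """
    summary = {}
    for path in sorted(Path(temp_dir).glob("ForBinomialTest*.txt")):
        with path.open() as handle:
            summary[path.name] = sum(1 for line in handle if line.strip())
    return summary


if __name__ == "__main__":
    # Run from the directory that contains TempFiles.
    for name, n_lines in summarize_binomial_inputs().items():
        status = "EMPTY" if n_lines == 0 else f"{n_lines} lines"
        print(f"{name}: {status}")
```

Any file reported as EMPTY would point back to a problem in the AftrRAD run rather than in Genotype.pl.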

               Mike

Stefanie Farrington

Nov 30, 2017, 12:26:22 PM
to AftrRAD
Hi Mike,
Thank you for your responses!
I looked at TempFiles/ForBinomialTestOH_KLDR1006.txt and it did have data. However, the output file did contain several warnings/errors (I included some examples below).  For now we are running the full data set with the original script, and hoping the run time won't be too bad.

Let me know if you need any more information, thank you again for all of your help!
Best,

Stefanie 



Scoring each pairwise alignment.  Alignments exceeding 90 % similarity will be retained.

Parsing aligned sequences into candidate loci.

sh: mafft: command not found

 

Finished writing the mafft output.

Use of uninitialized value in addition (+) at AftrRAD.pl line 3965, <READFILE> line 2.

Use of uninitialized value in addition (+) at AftrRAD.pl line 3965, <READFILE> line 2.

Use of uninitialized value $SecondCount in numeric ge (>=) at AftrRAD.pl line 3974, <READFILE> line 2.

Use of uninitialized value $ThirdCount in numeric ge (>=) at AftrRAD.pl line 3991, <READFILE> line 2.

 

Use of uninitialized value $TempSeqArray[1] in concatenation (.) or string at AftrRAD.pl line 4444, <READFILE> line 144.

Use of uninitialized value in concatenation (.) or string at AftrRAD.pl line 4444, <READFILE> line 144.

Identifying and removing paralogous loci.

Mike Sovic

Nov 30, 2017, 1:05:40 PM
to AftrRAD
Hi Stefanie,

Based on that output, it kind of looks like the mafft aligner isn't installed/working properly.  To check, you should be able to just type 'mafft' (without the quotes) at the command line and the program should try to run (it will ask for an input file).  If that doesn't work, you need to check to make sure that mafft is installed properly and is in your path.  Check this and get back with me if you still have problems.
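The same check can be scripted, which is handy if the analysis runs as a scheduled job. The sketch below uses Python's standard `shutil.which`; the helper name `tool_on_path` is just for illustration.

```python
import shutil


def tool_on_path(name):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    location = tool_on_path("mafft")
    if location is None:
        print("mafft was not found on PATH in this environment.")
    else:
        print(f"mafft found at {location}")
```

If the job is submitted through a scheduler, it may be worth running this inside the job itself, since the PATH seen there can differ from the one in an interactive login shell. That is one possible way for mafft to look installed at the command line and still produce "command not found" in the job output.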

            Mike

Stefanie Farrington

Dec 12, 2017, 1:29:09 PM
to AftrRAD
Hi Mike,
I checked on mafft and it appeared to be installed correctly, so for now I am trying to run my samples with the original script.

I am currently running 95 samples (scaled up from 7 in my original test runs). I allocated 240G and 28 days for the job to run. The job has been running for 12 days and the output hasn't changed in 8 days. The end of the output file reads:

Identifying all unique sequences within the dataset.


Creating file to test mean read counts.


Is this anything to be concerned about?


Thanks once again for all of your help!

Stefanie

Mike Sovic

Dec 12, 2017, 10:57:48 PM
to AftrRAD
Hi Stefanie,

That sounds like a really long time for it to be sitting at that spot.  Do you know how long the last run that worked sat at this step?  

One of the things that I think is happening with the program (in general, not necessarily in your case, but possibly) is that it is using much more RAM than it should. I made some comments in a response to a post some time ago about rough estimates for memory usage. It turns out that what I wrote there is how the program should behave, but in practice I think there are some bugs that cause it to use much more than that. I've had just a couple of users report AftrRAD getting hung up, usually either at the step you're at now or at the next step, "Rapidly searching unique sequence number...", but I figure many more folks have had issues and just haven't reported them. In these cases, I expect they might be taxing their available RAM and bogging down the analysis. This issue is going to be exacerbated with larger numbers of loci in the dataset (e.g. if you're using frequent cutters such as 4-bp cutters for the restriction enzyme digestion). The updates I made in the script I sent you were an attempt to fix this problem, and I'll make sure to incorporate these changes into the next version, though I want to make sure they aren't introducing any new problems (I'm still not sure why you're not having success with it).

Unfortunately, I'm not sure what to suggest in terms of the current run.  If you're pushing your RAM limit, it's possible it will slow it down to the point that it is just not going to finish in any reasonable time.  Alternatively, you might just have a really large number of loci for it to parse through, and it might be running as expected.  In that case, I have no idea whether your 28 day wall time will be sufficient - I don't think I've ever heard of an analysis going that long, but I'm sure it's possible if the dataset is large enough (again, in terms of numbers of loci - this is what has the greatest impact on run time in AftrRAD).  One thing this does bring to mind - you mentioned previously that your data have been trimmed.  Are you confident that all of the reads are at 70 bp?  I just ask because having reads of varying lengths can have the same effect on run times as increasing the number of loci.
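One way to confirm the read lengths is to tally them per file. This is a rough sketch that assumes the demultiplexed files are uncompressed FASTQ with standard 4-line records; the function name `read_length_counts` and the file name in the usage line are hypothetical.

```python
from collections import Counter


def read_length_counts(fastq_path):
    """Tally sequence lengths in an uncompressed FASTQ file with 4-line records (assumed layout)."""
    lengths = Counter()
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # the sequence is the second line of each record
                lengths[len(line.strip())] += 1
    return lengths


# Hypothetical usage: a file trimmed uniformly to 70 nt should report a single length, 70.
# print(read_length_counts("sample.fastq"))
```

More than one key in the result for any file would mean the reads are not all the same length.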

                     Mike 
        

Stefanie Farrington

Jan 3, 2018, 10:25:33 AM
to AftrRAD
Hi Mike,
We checked the reads and they are all at 70nt, so I don't think this is the issue.
The most recent job timed out after 28 days while performing "Rapidly searching unique sequence number 85000 of 9438351 for potentially allelic read pairs."

Is it possible to start a run of AftrRAD at the stage “Performing heuristic search of all pairwise comparisons to identify potentially allelic pairs” so that we don't lose all of the progress from the last 28 days? Alternatively, we could try again with the updated script, but we have still not been able to perform a test run successfully.


Any advice is greatly appreciated as always.

Mike Sovic

Jan 3, 2018, 10:37:58 PM
to AftrRAD
Hi Stefanie,

Sorry you're having these problems. I know this is frustrating, but I'm cautiously optimistic that we can get it figured out. This last run is actually pretty informative for me. It kind of supports my suspicion that the problem is with the RAM, as I described a bit in my last message. I say this because this time around, with your full dataset, it has gotten stuck just slightly before where it did in the smaller dataset - 85,000 as opposed to 86,000. This makes sense if it is running out of RAM.

Basically, as the program runs, it is creating these large data structures (vectors, arrays, etc.) that store information such as sequence reads. Each of these can get very large, as they are storing not only the unique reads in the dataset, but also in some cases pairwise comparisons of these unique reads, and as the dataset gets large, there can be a really big number of these pairwise comparisons. These data structures are stored in RAM. One problem with all of the versions up through 5.0 is that I didn't do a very good job of clearing out these data structures when the program was finished using them - I just kept adding new ones, so they built up unnecessarily. The datasets we ran in our lab weren't large enough to cause any problems, so I didn't catch this issue. I expect that clearing out those data structures once the program is finished using each of them will alleviate a lot of the problem, which is what I did in the version 5.0.1 that I sent earlier in this thread.

In your case, when you hit around sequence number 85000 in the heuristic search, I think those data structures are building up to the point where you are reaching the RAM you've allocated, and at that point I figure the analysis is slowing down to the point where it is basically just not moving. So, two options...

1.) Figure out how to get the 5.0.1 version to work.  I think as a longer-term solution this is probably going to be important.  Work with it with the example dataset, and let me know what specific warnings/error messages you're getting.  I went back and tested it out under several conditions today, including with the parallel option, and everything seemed to work OK on my end.

2.) In the shorter term, I think I can get a script to you that will allow you to just pick up your analysis basically where it left off.  I will e-mail that to you directly with some associated instructions.  Doing that should also get you around the RAM issue.
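To give a feel for how quickly the pairwise comparisons mentioned above can grow, here is a back-of-the-envelope sketch. It only computes the upper bound n(n-1)/2 on distinct pairs among n unique sequences. It is not a description of what AftrRAD actually stores, since the heuristic search presumably avoids keeping every pair. The function name `max_pairwise_comparisons` is made up for the example.

```python
def max_pairwise_comparisons(n_unique):
    """Number of distinct unordered pairs among n_unique sequences: n * (n - 1) / 2."""
    return n_unique * (n_unique - 1) // 2


if __name__ == "__main__":
    # Unique-sequence counts quoted in this thread for the smaller and the full runs.
    for n_unique in (700355, 9438351):
        pairs = max_pairwise_comparisons(n_unique)
        print(f"{n_unique:>9,} unique sequences -> up to {pairs:,} pairs")
```

Because the bound grows with the square of n, going from 700355 to 9438351 unique sequences (about 13 times as many) means roughly 180 times as many possible pairs.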

              Mike