Perl errors in AftrRAD.pl


parisjo...@googlemail.com
Apr 9, 2015, 7:28:33 AM
to aft...@googlegroups.com
Hello Mike and AftrRAD community,

I have just started testing AftrRAD with a test dataset.

Output:

Demultiplexing samples for data file l6_NoIndex_L006_R1_001.fastq.

Identifying unique sequences for each individual.

Identifying all unique sequences within the dataset.

Creating file to test mean read counts.

but then I get the following error:

readline() on closed filehandle SORTED at /users/jrp228/AfterRAD/AftrRADv4.1/AftrRAD.pl line 1268.
No such file or directory at /users/jrp228/AfterRAD/AftrRADv4.1/AftrRAD.pl line 1408.

Any ideas?

Thanks - Josie

Mike Sovic
Apr 9, 2015, 8:31:05 AM
to aft...@googlegroups.com
Hi Josie,

Not sure right off what the problem is, but we'll see if we can get it figured out.  Can you check a few things…

1.)  In the TempFiles directory, are there files named 'AllUniquesSorted.txt' and 'GoodSeqsWithBarcodesSorted.txt'?  If so, what are the sizes of these two files?

2.)  In the Output/RunInfo directory, there should be a file named Report_"YourDataFileName".  Open this file and go to the bottom.  You should see some lines that read "The number of unique sequences…" - there should be one line for each sample (barcode) in the dataset.  The line just above these is "This left X sequences for further analysis".  What is this number?

3.)  Also in the Output/RunInfo directory, there should be a file named BarcodeInfo_"YourDataFileName".  Take a look and see approximately how many reads are being assigned to each barcode. 
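For reference, the three checks above can be run from the AftrRAD working directory with a few shell commands. This is just a convenience sketch: the paths come from this thread, the `check_one` helper is invented here, and the `Report_*`/`BarcodeInfo_*` globs assume the default file-name prefixes.

```shell
# Helper (not part of AftrRAD): print a file's size if it exists, else flag it.
check_one() {
  if [ -e "$1" ]; then ls -lh "$1"; else echo "MISSING: $1"; fi
}

# 1.) The two sorted temp files and their sizes:
check_one TempFiles/AllUniquesSorted.txt
check_one TempFiles/GoodSeqsWithBarcodesSorted.txt

# 2.) The "This left X sequences for further analysis" line in the run report:
grep -h "This left" Output/RunInfo/Report_* 2>/dev/null || true

# 3.) Reads assigned to each barcode:
cat Output/RunInfo/BarcodeInfo_* 2>/dev/null || true
```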

Hopefully one of these checks will point us in the right direction.


                Mike 

parisjo...@googlemail.com
Apr 9, 2015, 10:46:43 AM
to aft...@googlegroups.com
Hi Mike,

Thank you for the prompt response!


1.)  In the TempFiles directory, are there files named 'AllUniquesSorted.txt' and 'GoodSeqsWithBarcodesSorted.txt'?  If so, what are the sizes of these two files?

There is no AllUniquesSorted.txt, but GoodSeqsWithBarcodesSorted.txt exists and is 6.1G in size.


2.)  In the Output/RunInfo directory, there should be a file named Report_"YourDataFileName".  Open this file and go to the bottom.  You should see some lines that read "The number of unique sequences…" - there should be one line for each sample (barcode) in the dataset.  The line just above these is "This left X sequences for further analysis".  What is this number?

This left 68273742 sequences for further analysis.


3.)  Also in the Output/RunInfo directory, there should be a file named BarcodeInfo_"YourDataFileName".  Take a look and see approximately how many reads are being assigned to each barcode.

Total number of matches to each barcode...
ACTGC   TOR_RES_03      338709
ATTAG   TAM_RES_02      2777032
CCTTG   TOR_RES_01      6080590
CCCCA   DAR_RES_03      335997
ACGTA   DAR_ST_01       148750
AAGGG   FRO_SM_01       4408124
CCAAC   FRO_RES_03      4966397
AGAGT   FRO_SM_02       3659630
AATTT   TAM_SM_01       3334698
CCGGT   FRO_SM_03       3203038
CATGA   DAR_RES_01      4131060
CAGTC   DAR_RES_02      3743606
ACACG   TOR_ST_03       6469511
ATCGA   TAM_RES_03      1618779
AAAAA   TAM_SM_03       298169
ATGCT   TOR_ST_02       6608651
ACCAT   TOR_ST_01       6277788
ATATC   DAR_ST_02       5764351
AACCC   TAM_SM_02       1030586
CACAG   FRO_RES_02      3546560
CAACT   TAM_RES_01      4167427
AGGAC   TOR_RES_02      6372222
AGCTG   DAR_ST_03       5610567
AGTCA   FRO_RES_01      2898727

Josie

parisjo...@googlemail.com
Apr 9, 2015, 10:52:16 AM
to aft...@googlegroups.com
Hi Mike,

Just a thought - at line 1260 it says:

system "uniq TempFiles/AllSeqsSorted.txt TempFiles/AllUniquesSorted.txt";

instead of:

system "uniq TempFiles/AllSeqsSorted.txt > TempFiles/AllUniquesSorted.txt";

Maybe that's why the AllUniquesSorted.txt file isn't being created?
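(Worth noting as an aside: POSIX `uniq` also accepts the output file as a second operand, so the two invocations are documented to behave the same. A quick check on throwaway data, using toy file names made up here, shows what a given system's `uniq` does with each form.)

```shell
# Compare the two uniq invocations from AftrRAD.pl line 1260 on toy sorted input.
tmp=$(mktemp -d)
printf 'AAA\nAAA\nCCC\nCCC\nCCC\nGGG\n' > "$tmp/AllSeqsSorted.txt"

uniq "$tmp/AllSeqsSorted.txt" "$tmp/out_operand.txt"      # output file as an operand
uniq "$tmp/AllSeqsSorted.txt" > "$tmp/out_redirect.txt"   # shell redirection

diff "$tmp/out_operand.txt" "$tmp/out_redirect.txt" && echo "identical"
```

If the two outputs differ (or the operand form fails) on a given system, that would point at a non-standard `uniq`; otherwise the missing AllUniquesSorted.txt likely has another cause, such as an earlier step failing.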

Josie

Mike Sovic
Apr 9, 2015, 11:47:25 AM
to aft...@googlegroups.com
Well, from what you sent, everything looks good, so it's kind of odd that it's not creating that AllUniquesSorted.txt file.

Your suggestion above is possible, I guess, but it seems unlikely to me given that we haven't seen this issue before.  You could certainly try editing that line in the Perl script and see if it helps, or, alternatively, try running the example dataset and see if it works with the script as-is.

Is there a chance this is a space issue?  We've run into that before ourselves - basically, you run out of disk space to create and store these large files as the program runs.  See if you can look into this.  If you're on a Mac, check the disk usage in the Activity Monitor.  If you're short on free space, I'd bet this is the issue.

If neither of these solve the problem, let us know and we'll keep trying.

      
                Mike     

parisjo...@googlemail.com
Apr 9, 2015, 12:05:21 PM
to aft...@googlegroups.com
Thanks for your suggestions!

I will try it with the example dataset and also with the edit in the Perl script, and get back to you.

Have a nice day

Josie

parisjo...@googlemail.com
Apr 10, 2015, 10:28:47 AM
to aft...@googlegroups.com
Hello Mike,

I've run the adjusted perl script and now the AllUniquesSorted.txt file exists (size 137M). The output is still hanging on:


Identifying all unique sequences within the dataset.

Creating file to test mean read counts.

No new temp files have been created in the last 24 hours (since 16:30 yesterday UK time) and the output has not updated since then either.

I'm running it on 128 GB of memory, so I'm not sure if memory is the problem.

Could you give me the name of the output file that "Creating file to test mean read counts" is creating?

Thank you!

Josie

Mike Sovic
Apr 10, 2015, 12:27:05 PM
to aft...@googlegroups.com
Hi Josie,

OK, first to clarify one thing...the disk space I referred to is different from RAM, even though both are often called "memory".  Not having enough disk space is a much bigger issue than the amount of RAM.  If your 128 GB is RAM, I'm assuming you're on a Linux system, so where to check the available disk space will depend on your distribution.  You can probably just google something like "find disk space centos" or "find disk space ubuntu" to find out.
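For what it's worth, a minimal way to check this on most Linux systems (and on macOS) is `df`, which is a standard utility rather than anything AftrRAD-specific:

```shell
# Free space on the filesystem holding the current directory;
# -h prints human-readable sizes (e.g. "115G" in the Avail column).
df -h .

# Or all mounted filesystems at once:
df -h
```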

At this point in the script ("Creating file to test mean read counts"), all of the files being written will be in the folder TempFiles/ErrorReadTest/.  There will either be a file named ErrorUpdateX.txt, where X is a number between 1 and the number of samples (barcodes) in your data file, or there will be a file named AllReadsAndDepths.txt.  Take a look and see what is there, and its size.

Also, one more thing. Just above the line "Creating file to test mean read counts" there should be a section "Identifying unique sequences for each individual".  What kind of numbers are you getting there for each individual?

parisjo...@googlemail.com
Apr 10, 2015, 12:52:57 PM
to aft...@googlegroups.com
Hi Mike,

Okay I've checked and I have 115G of disk space - sorry for the confusion!

In TempFiles/ErrorReadTest/ I have:

212M Apr  9 16:24 AllReadsAndDepths.txt
 48M Apr  9 16:24 ErrorTestOut.txt
   0 Apr  9 16:24 SeqsWithZeroCounts.txt
 90M Apr  9 16:24 VarianceErrors.txt

Identifying unique sequences for each individual.
163542 TempFiles/UniqueWithCountsIndividualDAR_RES_01.txt
160671 TempFiles/UniqueWithCountsIndividualDAR_RES_02.txt
30099 TempFiles/UniqueWithCountsIndividualDAR_RES_03.txt
210984 TempFiles/UniqueWithCountsIndividualTOR_RES_01.txt
222770 TempFiles/UniqueWithCountsIndividualTOR_RES_02.txt
19240 TempFiles/UniqueWithCountsIndividualTOR_RES_03.txt
209544 TempFiles/UniqueWithCountsIndividualTOR_ST_01.txt
219606 TempFiles/UniqueWithCountsIndividualTOR_ST_02.txt
201833 TempFiles/UniqueWithCountsIndividualTOR_ST_03.txt
183514 TempFiles/UniqueWithCountsIndividualTAM_SM_01.txt
76711 TempFiles/UniqueWithCountsIndividualTAM_SM_02.txt
9240 TempFiles/UniqueWithCountsIndividualTAM_SM_03.txt
6129 TempFiles/UniqueWithCountsIndividualDAR_ST_01.txt
197676 TempFiles/UniqueWithCountsIndividualDAR_ST_02.txt
209882 TempFiles/UniqueWithCountsIndividualDAR_ST_03.txt
181093 TempFiles/UniqueWithCountsIndividualFRO_SM_01.txt
188746 TempFiles/UniqueWithCountsIndividualFRO_SM_02.txt
152655 TempFiles/UniqueWithCountsIndividualFRO_SM_03.txt
139696 TempFiles/UniqueWithCountsIndividualFRO_RES_01.txt
158998 TempFiles/UniqueWithCountsIndividualFRO_RES_02.txt
182050 TempFiles/UniqueWithCountsIndividualFRO_RES_03.txt
189932 TempFiles/UniqueWithCountsIndividualTAM_RES_01.txt
133176 TempFiles/UniqueWithCountsIndividualTAM_RES_02.txt
99592 TempFiles/UniqueWithCountsIndividualTAM_RES_03.txt

Sorry for being a pain!

Josie

Mike Sovic
Apr 10, 2015, 1:34:14 PM
to aft...@googlegroups.com
No worries, Josie - we're getting there (I think).  

Based on the number of unique reads per sample, you've got a relatively large number of loci in this dataset (at least relative to a lot we've seen).  I'm guessing you're maybe digesting with relatively frequent cutters (e.g., 4-bp or 6-bp restriction sites), or maybe you're working with a very large genome?  Either way, given the number of loci, I wouldn't be too shocked that it's still running after 24 hours.

115GB of free disk space at this point in the run should be good, so indeed that doesn't seem to be an issue.  

I'm a bit concerned about how long it will end up taking to complete the run.  Run times for AftrRAD are not much different between Macs and Linux systems for most of the steps, with the exception of the alignment step (which you haven't gotten to yet).  Here, Macs tend to be much faster than Linux, and the difference will likely be exaggerated as the number of alignments required increases (more loci = more alignments, on average, but the levels of polymorphism in your group will also come into play here).  The issue is with the ACANA alignment program - it has been optimized for Mac, but not yet for Linux.  The ACANA authors have told us they're planning to optimize for Linux too, but it hasn't happened yet.

Everything in the ErrorReadTest directory looks good to me, so I'd say for now, let it run - let's see where it's at in another day or two.

                 Mike      

parisjo...@googlemail.com
May 5, 2015, 7:36:45 AM
to aft...@googlegroups.com
Hello Mike,

Just an update on the processing using AftrRAD.pl with my dataset.
The script would hang at
system "uniq TempFiles/AllSeqsSorted.txt  TempFiles/AllUniquesSorted.txt";

so I ended up adding in the '>'

As for run times on my dataset, I thought the following stats might be useful...
I'm running the job on a 512 GB node; it's been running for a week and is currently at:

Number of pairwise comparisons to align and evaluate is 277197330
Aligning all potentially alleleic read pairs with ACANA.
Working on alignment 1000 of 277197330.
Working on alignment 2000 of 277197330.
Working on alignment 3000 of 277197330.
Working on alignment 4000 of 277197330.
Working on alignment 5000 of 277197330.
Working on alignment 6000 of 277197330.
Working on alignment 7000 of 277197330.
Working on alignment 8000 of 277197330.
Working on alignment 9000 of 277197330.
Working on alignment 10000 of 277197330.
Working on alignment 11000 of 277197330.
Working on alignment 12000 of 277197330.
Working on alignment 13000 of 277197330.
Working on alignment 14000 of 277197330.
Working on alignment 15000 of 277197330.
Working on alignment 16000 of 277197330.
Working on alignment 17000 of 277197330.
Working on alignment 18000 of 277197330.


Will let you know when it finishes!

Thanks,
Josie

Mike Sovic
May 6, 2015, 9:01:17 AM
to aft...@googlegroups.com
Hi Josie, 

Thanks very much for the update.  This sort of feedback is very helpful moving forward.  

This alignment stage is the last major step, so at this point you should be able to pretty well extrapolate how long it will take to complete.  Given the number of alignments you have to do, my guess is that it's going to take a really long time, as this is the part that runs slower on Linux systems.  If you have the option of running in parallel, you might give the parallel version a try - for this alignment step, the run time should decrease almost proportionally to the number of processors you can run (i.e. running 5 processors should take 1/5 of the time, running 10 processors should take one tenth of the time, etc.).  
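A rough extrapolation from the numbers reported earlier in the thread makes the same point (the arithmetic here is mine, and the per-day rate is approximate, since the first week also included the pre-alignment steps):

```shell
total=277197330      # pairwise alignments reported by AftrRAD
done_so_far=18000    # alignments completed after about a week
days_elapsed=7

rate=$(( done_so_far / days_elapsed ))   # ~2571 alignments/day so far
remaining=$(( total - done_so_far ))

echo "days remaining, serial:        $(( remaining / rate ))"
echo "days remaining, 10 processes:  $(( remaining / (rate * 10) ))"  # ~1/10th the time
```

At that observed rate the serial estimate comes out in the tens of thousands of days, which is why the parallel version (or the Mac-optimized ACANA binary) matters so much for a dataset this size.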

I think the 'system "uniq…"' step you referred to above is the next one I'm going to see if I can parallelize.  If I can get that to work, then we should be getting toward a much more reasonable time frame for running datasets with really large numbers of polymorphic loci such as yours.

                        Mike    