Issue with building NLMSA


Vishal Patel

Apr 25, 2013, 1:23:33 PM
to pygr...@googlegroups.com
Hi, 

We usually download the pygr NLMSA from the UCLA server. However, the latest mm10 - multiz60way alignment does not exist there so we decided to build it. 

First I ran into issues with max size, 

       1 msa = cnestedlist.NLMSA(pathstem=pathstem,
       2                             seqDict=genomeUnion,
----> 3                             mafFiles=maflist, mode="w")

site-packages/pygr-0.8.2-py2.7-linux-x86_64.egg/pygr/cnestedlist.so in pygr.cnestedlist.NLMSA.__init__()
site-packages/pygr-0.8.2-py2.7-linux-x86_64.egg/pygr/cnestedlist.so in pygr.cnestedlist.NLMSA.readMAFfiles()

ValueError: MAF block too long!  Increase max size

Setting the maxlen and maxint to a billion also did not help so I set it to sys.maxint. 

However, this code still fails because it opens thousands of ".build" files and then fails because it does not have "write permission" on the next file handle it tries to open. Basically, it hits the max open files limit. I increased the ulimit to 10,000 files, at which point it fails with "Segmentation fault (core dumped)".
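For reference, the per-process open-file limit described above can also be inspected from inside Python with the standard `resource` module, and the soft limit can be raised up to the hard limit. This is only a sketch and not part of pygr; the helper name `raise_open_file_limit` is made up for illustration.

```python
import resource

def raise_open_file_limit(wanted):
    """Try to raise the soft RLIMIT_NOFILE to `wanted`, capped at the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process cannot push the soft limit past the hard limit.
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    if soft == resource.RLIM_INFINITY or wanted <= soft:
        return soft, hard  # already high enough; never lower the limit
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    except (ValueError, OSError):
        pass  # some systems refuse the request; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)

if __name__ == "__main__":
    print(raise_open_file_limit(4096))
```

Raising the limit only postpones the failure if the build needs more files than the hard limit allows, which is what the rest of this thread deals with.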

Here is the open file count from lsof, sampled every 0.1 s, as it failed:
170 # these are file handles to genome files
170
170
170
170
170
170
726 # starts creating .build files
1682
2443
3706
6300
7209
9160 # hits the upper limit again and fails with a core dump. 
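
A similar count can be taken without lsof by listing a process's entries under `/proc`. The sketch below assumes Linux (it relies on `/proc/<pid>/fd`), and the function name `count_open_fds` is hypothetical.

```python
import os

def count_open_fds(pid=None):
    """Count open file descriptors of a process via /proc (Linux only)."""
    if pid is None:
        pid = os.getpid()
    # Each entry in /proc/<pid>/fd is one open descriptor of that process.
    return len(os.listdir("/proc/%d/fd" % pid))

if __name__ == "__main__":
    print(count_open_fds())
```

Calling it in a loop with the pid of the build process would give the same kind of trace as the numbers above.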

I have tried this with py2.6/pygr0.8.1 and py2.7/pygr0.8.2 and get the same error. 

Has anyone experienced this? 
What are the .build files?

Vishal








Christopher Lee

Apr 25, 2013, 2:58:49 PM
to pygr...@googlegroups.com


On Thursday, April 25, 2013 10:23:33 AM UTC-7, Vishal Patel wrote:


ValueError: MAF block too long!  Increase max size

Setting the maxlen and maxint to a billion also did not help so I set it to sys.maxint. 

These settings have nothing to do with the MAF block size limitation you ran into.  All you need to do is boost the buffer size for reading a single MAF block (currently set to 4096).  It's trivial but it requires editing the source code to change that number in two places:

cnestedlist.pyx: 1731, change size of im array from 4096 to something big like 16384
cnestedlist.pyx: 1774, change 4096 to the same increased value you used for im

Then recompile pygr:

python setup.py build
python setup.py install

You can also enter a bug report suggesting that this be made user configurable or automatically resized.

Vishal Patel

Apr 25, 2013, 3:39:15 PM
to pygr...@googlegroups.com
Actually, sorry I wasn't clear. When I set maxlen and maxint to sys.maxint I don't get that error anymore. 

Instead here is the error,

when ulimit for open files = 1024 

INFO:root:MAF FILES:['UCSC/genomes/MOUSE/mm10/maf/test/chr1.maf']
INFO buildNLMSA.main: Processing MAF file: UCSC/genomes/MOUSE/mm10/maf/test/chr1.maf
INFO:pygr-log:Processing MAF file: UCSC/genomes/MOUSE/mm10/maf/test/chr1.maf
Traceback (most recent call last):
  File "/home/vishalrp/codebase/genomics/trunk/code/python/scripts/buildNLMSA.py", line 79, in <module>
  File "/home/vishalrp/codebase/genomics/trunk/code/python/scripts/buildNLMSA.py", line 65, in main
  File "cnestedlist.pyx", line 1508, in pygr.cnestedlist.NLMSA.__init__
  File "cnestedlist.pyx", line 1785, in pygr.cnestedlist.NLMSA.readMAFfiles
  File "cnestedlist.pyx", line 1622, in pygr.cnestedlist.NLMSA.newSequence
  File "cnestedlist.pyx", line 1326, in pygr.cnestedlist.NLMSASequence.__init__
IOError: unable to open in write mode: UCSC/genomes/MOUSE/mm10/maf/pygrdata2/895.build


When the ulimit was increased to 4096, 
 It gave a similar error being unable to open file "3967.build".

When the ulimit was increased to 10224, 
It fails with Segmentation fault (core dump).

Is this linked to the "im" array?

Thanks! 

Vishal




Paul Rigor

Apr 25, 2013, 3:42:34 PM
to pygr...@googlegroups.com
Hi Chris and community,

So after applying the modifications according to pyrex, we're now encountering the following IO error message on a machine with ulimit (soft/hard) already set to 4096. 

The error message follows:

INFO buildNLMSA.main: Processing MAF file: /home/baldig/projects/genomics/nonsvn/data/UCSC/genomes/MOUSE/mm10/maf/chr19.maf
INFO:pygr-log:Processing MAF file: /home/baldig/projects/genomics/nonsvn/data/UCSC/genomes/MOUSE/mm10/maf/chr19.maf
Traceback (most recent call last):
  File "/home/prigor/codebase/genomics/trunk/code/python/scripts/buildNLMSA.py", line 91, in <module>
  File "/home/prigor/codebase/genomics/trunk/code/python/scripts/buildNLMSA.py", line 77, in main
  File "cnestedlist.pyx", line 1508, in pygr.cnestedlist.NLMSA.__init__
  File "cnestedlist.pyx", line 1798, in pygr.cnestedlist.NLMSA.readMAFfiles
  File "cnestedlist.pyx", line 1622, in pygr.cnestedlist.NLMSA.newSequence
  File "cnestedlist.pyx", line 1326, in pygr.cnestedlist.NLMSASequence.__init__
IOError: unable to open in write mode: /extra/baldig1/genomics/pygrdata/alignments/MOUSE/mm10/multiz60way/3967.build



On Thu, Apr 25, 2013 at 12:14 PM, Paul Rigor <paul....@uci.edu> wrote:
Hi Chris,

Nevermind, I guess the warnings were actual errors. So I've changed the __new__ to __cinit__ according to the messages. All seems well and the module compiles. We'll keep you posted on the actual NLMSA building for the MAFs, in case we encounter anything else.

Thanks again,
Paul
On Thu, Apr 25, 2013 at 12:07 PM, Paul Rigor <paul....@uci.edu> wrote:
Hi Chris,

So which version of Pyrex and GCC is the most compatible? Currently, when attempting to recompile using the modifications you recommended, we get the following compilation warnings and error below. We have used pyrex versions 0.8.2.x - 0.9.9 and gcc 4.1.2, 4.3.0, & 4.7.1.

Thanks!
Paul

pyrexc pygr/cnestedlist.pyx --> pygr/cnestedlist.c
/home/prigor/scratch/pygr/pygr-0.8.2/pygr/cnestedlist.pyx:8:2: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/prigor/scratch/pygr/pygr-0.8.2/pygr/cnestedlist.pyx:51:2: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/prigor/scratch/pygr/pygr-0.8.2/pygr/cnestedlist.pyx:167:2: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/prigor/scratch/pygr/pygr-0.8.2/pygr/cnestedlist.pyx:371:2: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/prigor/scratch/pygr/pygr-0.8.2/pygr/cnestedlist.pyx:424:2: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/prigor/scratch/pygr/pygr-0.8.2/pygr/cnestedlist.pyx:446:2: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/prigor/scratch/pygr/pygr-0.8.2/pygr/cnestedlist.pyx:1112:2: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/prigor/scratch/pygr/pygr-0.8.2/pygr/cnestedlist.pyx:1138:2: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/prigor/scratch/pygr/pygr-0.8.2/pygr/cnestedlist.pyx:167:40: Warning: 'not None' will become the default in a future version of Pyrex. Use 'or None' to allow passing None.
building 'pygr.cnestedlist' extension
creating build/temp.linux-x86_64-2.6/pygr/apps
gcc -pthread -pg -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/baldig/shared_libraries/centos64/pkgs/python/2.6.5/include/python2.6 -c pygr/intervaldb.c -o build/temp.linux-x86_64-2.6/pygr/intervaldb.o
gcc -pthread -pg -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/baldig/shared_libraries/centos64/pkgs/python/2.6.5/include/python2.6 -c pygr/cnestedlist.c -o build/temp.linux-x86_64-2.6/pygr/cnestedlist.o
pygr/cnestedlist.c:1:2: error: #error Do not use this file, it is the result of a failed Pyrex compilation.
error: command 'gcc' failed with exit status 1






Christopher Lee

Apr 25, 2013, 5:58:31 PM
to pygr...@googlegroups.com
The size of this alignment is forcing it to generate more build files than your system's open file limit.  The solution will have to be to close some of the files that we are not currently writing to (and reopen them in append mode if we need to write more data to them later).  Presumably it would track the order in which it has written to the files (from most recently to least recently) and close the one that it used least recently.  Basically, there's a build_ifile[] array in NLMSA.readMAFfiles(), and two lines where it writes to those by calling saveInterval().  I have a pretty clear idea of how to implement this, but can't work on this right away.  If you want I can describe how this could be implemented...

Yours,

Chris


Paul Rigor

Apr 25, 2013, 6:19:00 PM
to pygr...@googlegroups.com
Sure, we should have the time to implement this. We await the description =)

Cheers,
Paul



Christopher Lee

Apr 25, 2013, 8:01:16 PM
to pygr...@googlegroups.com, paul....@uci.edu

* define a C struct something like this e.g. in intervaldb.h

typedef struct {
  int previous;
  int next;
  FILE *ifile;
  int nbuild;
} FileQueue;

it will represent a doubly linked list, to keep track of the order in which we most recently wrote to different build files.  I.e. each node in the list has a link to the node  (file) that was written to just before it, and to the node (file) that was written to just after it.  We keep an array of FileQueue nodes; the previous / next fields just store the integer array index of the corresponding nodes (previous or next).  We keep two separate variables that store the index of the newest (most recently written to) node and oldest (least recently written to) node.

When we write to an existing file, we pop it to the newest end of the linked list, by connecting its next and previous nodes to each other, and make current newest node's next field point to it (and set newest variable to its index).

When we have too many open files (larger than some preset limit), we unlink the oldest node from the linked list, close its ifile, and set its ifile to NULL.  Then we will be able to add a new node with a newly opened ifile.

When we want to write to a node that has NULL ifile, again we unlink and close the oldest node, and reopen the desired node in mode binary-append.  Then we can write to it as needed, and add it back to the newest end of the linked list.

changes to NLMSA.readMAFfiles():

* replace build_ifile[] and nbuild[] arrays by FileQueue.  It probably should be dynamically allocated e.g. using CALLOC / REALLOC macros defined in default.h
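
The close-the-least-recently-written-file idea described above can be sketched in plain Python. This is not the pygr implementation: it uses `collections.OrderedDict` in place of the array-indexed doubly linked list, and the class name `LRUFilePool` and its methods are hypothetical. It only illustrates the behaviour of evicting the oldest open file and reopening a closed file in binary-append mode.

```python
import os
import tempfile
from collections import OrderedDict

class LRUFilePool(object):
    """Keep at most `max_open` files open; evict the least recently written."""

    def __init__(self, max_open):
        self.max_open = max_open
        self.open_files = OrderedDict()  # path -> file object, oldest first
        self.seen = set()                # paths that were opened at least once

    def write(self, path, data):
        f = self.open_files.pop(path, None)
        if f is None:
            if len(self.open_files) >= self.max_open:
                # Close the least recently written file to free a descriptor.
                _, oldest = self.open_files.popitem(last=False)
                oldest.close()
            # First use creates the file; later reopenings append to it.
            f = open(path, "ab" if path in self.seen else "wb")
            self.seen.add(path)
        self.open_files[path] = f        # re-insert at the newest end
        f.write(data)

    def close(self):
        for f in self.open_files.values():
            f.close()
        self.open_files.clear()

if __name__ == "__main__":
    tmpdir = tempfile.mkdtemp()
    pool = LRUFilePool(max_open=2)
    for i in range(6):
        # Cycle over three files while only two may be open at a time.
        pool.write(os.path.join(tmpdir, "%d.build" % (i % 3)), b"x")
    pool.close()
    print(sorted(os.listdir(tmpdir)))
```

The point of the design is that the number of files on disk and the number of open descriptors are decoupled: every file keeps all of its data, but only `max_open` descriptors exist at any moment.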

Paul Rigor

Apr 28, 2013, 9:03:30 PM
to pygr...@googlegroups.com
Hi Chris,

I don't think replacing the build_ifile and nbuild arrays with the FileQueue you mentioned will be this straightforward. These two variables are used outside of the readMAFfiles method, e.g., when loading indexes later on.

Also, there are other implicit counters tied to nlmsa objects (and thus their associated interval files) that will need to be maintained, e.g., inlmsa and self.id. The linear id scheme for the interval files is not obviously amenable to LRU caching.

Also, the creation of a new lpo sequence cannot be easily bound to the new file cache: it's used everywhere and I'm not sure about all of the dependencies for other opened file handles. Further, it's unclear how to reopen an associated interval file once it's closed. In other words, the code (from the latest git repo) isn't self-explanatory at the moment. Additionally, the saveInterval() method is quite confusing. What is it actually doing? Its argument list isn't consistent with examples of actual calls. The same goes for the newSequence() method.

I'm trying to piece together a solution using the LRUcache extension from the PyTables project, but not modifying the newSequence() method throws things off because of its implicit id generation.

What is the best way to isolate the changes? Again, I've just gone through the relevant code the past couple of days, so I'm probably misunderstanding a few things ;-)

Thank you again for your time!

Christopher Lee

Apr 29, 2013, 2:03:00 PM
to pygr...@googlegroups.com
Hi Paul,
sorry, this mix of Python, Pyrex and C code is awfully dense and hard to make sense of. However, based on looking at the code a few days ago, I think the task of limiting the number of files that are opened at once during readMAFfiles() can be done in the relatively simple way I outlined. The build_ifile array is completely internal to readMAFfiles(); it is not passed to any other function. Note that the later call to buildFiles() (and hence to each NLMSASequence.build_files()) as its first step simply closes the build_ifile on each NLMSASequence (we'd only need to make very minor adjustments to that code). So once readMAFfiles() is done writing, everything else is done one file at a time. Thus our task really does not extend outside readMAFfiles() itself.

It sounds like it'd be most efficient if I try to write code for this over the next few days, then you can take a look at the changes and see what you think...

The question of limiting the number of files that are opened during regular usage of the NLMSA (i.e. querying the alignment database) is completely separate. I believe the current default mode of opening files only "onDemand" should keep the number of files from getting too big. If we need to, we can later add code for again automatically closing some files if the number gets too big.

Chris

Paul Rigor

Apr 30, 2013, 3:38:05 PM
to pygr...@googlegroups.com
Hi Chris, thanks so much for taking care of this. Looking forward to testing the feature. Cheers,

Christopher Lee

May 7, 2013, 6:40:36 PM
to pygr...@googlegroups.com, paul....@uci.edu
Hi Paul,
I implemented the method I described for limiting the maximum number of files open at any time.  I have only performed very limited testing (on sacCer3 multiz7way MAF files), which simply showed that using this feature (by passing low value  maxOpenFiles2=5 to NLMSA constructor) yielded identical index files vs. the standard mode not applying this limit.

Maybe you could test this on your data.  You don't need to change anything in the parameters you pass to NLMSA construction, since its new default values should work OK for you:

* maxOpenFiles=1024 default is used only for requesting the OS to permit us to have up to 1024 open files;
* maxOpenFiles2=1000 default will limit readMAFfiles() to 1000 open files at any one time, which should work for you.

The code is available from my rebuf branch of my pygr repo on github:
https://github.com/cjlee112/pygr/tree/rebuf

Let me know whatever problems you run into...

Yours,

Chris

Paul Rigor

May 7, 2013, 10:35:03 PM
to pygr...@googlegroups.com
Hi Chris,

Thanks so much for the update. I did have a chance to grab, compile, and run the code. However, in my tests I came across the same problem, and now we've reached an unexpected number of open files, namely 2,097,148. I've set both the maxOpenFiles and maxOpenFiles2 parameters to 4096. Are there supposed to be that many open files?

The error message is below:

INFO buildNLMSA.main: Processing MAF file: /nonsvn/data/UCSC/genomes/MOUSE/mm10/maf/chr17.maf
INFO:pygr-log:Processing MAF file: /nonsvn/data/UCSC/genomes/MOUSE/mm10/maf/chr17.maf
Traceback (most recent call last):
  File "/home/prigor/codebase/genomics/trunk/code/python/scripts/buildNLMSA.py", line 93, in <module>
    main()
  File "/home/prigor/codebase/genomics/trunk/code/python/scripts/buildNLMSA.py", line 79, in main
    maxOpenFiles2=4096)
  File "cnestedlist.pyx", line 1508, in pygr.cnestedlist.NLMSA.__init__
  File "cnestedlist.pyx", line 1794, in pygr.cnestedlist.NLMSA.readMAFfiles
  File "cnestedlist.pyx", line 1622, in pygr.cnestedlist.NLMSA.newSequence
  File "cnestedlist.pyx", line 1326, in pygr.cnestedlist.NLMSASequence.__init__
IOError: unable to open in write mode: /nonsvn/pygrdata/alignments/MOUSE/mm10/multiz60way/2097148.build

All the best,
--

Christopher Lee

May 8, 2013, 1:16:24 AM
to pygr...@googlegroups.com, paul....@uci.edu
Hi Paul,
I don't know what parameters you are using, but let me clarify a few things:

- the number of .build files is NOT the number of open files.  It is writing to the .build files in parallel but automatically limits the number that are actually *open* at any one time based on the value of maxOpenFiles2.

- your error message indicates to me that the linked-list queue method of limiting the number of open files is operating, because your OS is not going to let you open 2 million files (or even more than a few thousand) at once!

- the fact that it's generating such a crazy number of .build files suggests either that you're using an odd set of parameters that's forcing it to split up the union coordinate system in an extremely inefficient way, or that the total coordinate space of all the sequences in the alignment is insanely huge (approximately 4 x 10^15).  The latter doesn't make sense to me, so what exactly are your build parameters?  Based on your comments about setting maxOpenFiles=4096 it sounds like you're not following what I suggested, that is, just calling NLMSA construction with the default parameter settings (no keyword argument values).
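
The order of magnitude quoted above can be reproduced with a quick calculation. The thread does not say how the figure was derived, so the per-file span used here (about 2^31 union coordinates per .build file) is purely an assumption for illustration.

```python
# Number taken from the failing file name in the traceback (2097148.build).
n_build_files = 2097148

# ASSUMPTION (not stated in the thread): each build file covers at most
# 2**31 - 1 positions of the union coordinate system.
assumed_span_per_file = 2**31 - 1

total_coordinates = n_build_files * assumed_span_per_file
print("%.1e" % total_coordinates)
```

Under that assumption the product lands between 4 x 10^15 and 5 x 10^15, far beyond the combined length of 60 genomes, which is why the build parameters are the first thing to check.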

At this point what we need to do is to ensure that the union coordinate system is getting split in the usual, efficient way.  Note that the new "rebuf" code did not alter that aspect of things in any way.  Why not just run NLMSA construction with default values (i.e. don't pass in any kwarg parameter values)...

We may end up needing to debug this "live" e.g. with skype screen sharing, as that will be a lot more efficient than going back and forth with email messages.

-- Chris




Christopher Lee

May 8, 2013, 1:37:12 AM
to pygr...@googlegroups.com, paul....@uci.edu
Hi Paul,
looking back at Vishal's original message it seems to me there is a more basic problem: his error message "MAF block too long" indicates that the hardcoded limit on the number of intervals in a single MAF record is being exceeded.  None of the existing NLMSA kwarg options will get around that; it has to be changed in the source code.  I think I'll change this to use dynamic memory allocation (so it will just expand the buffer if it needs to, no more hardcoded size limit).  I'll let you know when I push the new code.

-- Chris

Christopher Lee

May 8, 2013, 2:30:06 AM
to pygr...@googlegroups.com, paul....@uci.edu
I pushed new changes to my rebuf branch that addresses the original, separate problem that Vishal reported ("MAF block too long").  Please get the latest rebuf code as you need this to avoid that problem (as I explained in the previous post, that limit was hard-coded; none of your parameter settings could resolve that).

-- Chris

Paul Rigor

May 8, 2013, 11:28:06 AM
to pygr...@googlegroups.com
Hi Christopher,

So after increasing the size of the interval map array to 16384 (in cnestedlist.pyx: 1731 and 1774), and using the default parameters for the NLMSA object instantiation, the NLMSA has been successfully built!

Thank you so much for your help,

Christopher Lee

May 8, 2013, 5:31:54 PM
to pygr...@googlegroups.com, paul....@uci.edu
Hi Paul,
great.  Is there a reason you didn't use the new rebuf code I pushed last night, which does away with the hardcoded limit entirely?  I'd be very grateful if you could test the new code on the same MAF data.  Also, what was the total number of .idb index files generated by the build?

Yours,

Chris

Vishal Patel

Jun 5, 2013, 1:52:25 PM
to pygr...@googlegroups.com
Hi Chris, 

We are still having trouble with the MSA. We were able to build it without any errors using pygr/0.8.2

However when we try to load and query it, we get the following error. 

In [1]: from pygr import worldbase

In [2]: genome = worldbase.Bio.Seq.Genome.MOUSE.mm10()

In [3]: msa = worldbase.Bio.Seq.Alignments.MOUSE.mm10.multiz60way()

In [4]: slice = genome['chrY']

In [5]: res = msa[slice]
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-628143efd160> in <module>()
----> 1 res = msa[slice]

pygr/msa-builder/build/lib.linux-x86_64-2.6/pygr/cnestedlist.so in pygr.cnestedlist.NLMSA.__getitem__()

pygr/msa-builder/build/lib.linux-x86_64-2.6/pygr/cnestedlist.so in pygr.cnestedlist.NLMSASlice.__cinit__()

AttributeError: 'pygr.cnestedlist.NLMSA' object has no attribute 'doSlice'

Any clue? 

Vishal




Christopher Lee

Jun 5, 2013, 7:37:43 PM
to pygr...@googlegroups.com
In your example you are querying with the entire Y chromosome, 91 MB in size. Against a 60-genome alignment, that is going to produce a gigantic number of aligned intervals, which may cause serious problems.  I think you should try breaking that up into shorter intervals.  E.g. start by trying a 1 kb interval query, and then try longer queries if you need to.  You should get the same results (querying 1 kb at a time) as you would get querying the whole thing...
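
A minimal sketch of splitting a long query into fixed-size windows follows. The helper `iter_windows` is hypothetical, and the pygr usage shown in the comment assumes the `genome` and `msa` objects from the session above support `len()` and slicing in the way shown; it has not been run against this alignment.

```python
def iter_windows(length, size=1000):
    """Yield (start, stop) pairs covering 0..length in chunks of `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, length, size):
        # The last window is shortened so it never runs past `length`.
        yield start, min(start + size, length)

if __name__ == "__main__":
    # Hypothetical usage with the objects from the session above:
    #     chrom = genome['chrY']
    #     for start, stop in iter_windows(len(chrom)):
    #         res = msa[chrom[start:stop]]
    print(list(iter_windows(2500)))
```

Starting with a single 1 kb window also makes it easier to tell whether a failure comes from the query size or from the build itself.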

This error message is a weird traceback, which appears to show that Pyrex try-except handling is not working right.  The doSlice call is inside a try... except AttributeError clause, so it should be impossible for it to raise an AttributeError!  Can you try re-compiling from source, i.e. delete cnestedlist.c and rerun the build (which should remake cnestedlist.c using pyrexc)?  Something seems to be wrong with your build.

-- Chris

Paul Rigor

Jun 6, 2013, 7:10:37 PM
to pygr...@googlegroups.com
Hi Chris,
The sample command above was the last of a series of slices we attempted. But we'll definitely need to adjust the number of intervals when building the NLMSA. The default settings generate huge files that take a little too long to load.

For the AttributeError, I've actually tried several versions of pyrex (0.9.8, 0.9.8.2, 0.9.9), but I still can't re-compile the pyx files to C. For all versions of pyrex, we get the following errors when attempting to compile cnestedlist.pyx and cdict.pyx:

pyrexc cnestedlist.pyx
/home/prigor/projects/msabuilder2/pygr/cnestedlist.pyx:7:55: Expected ')'

pyrexc cdict.pyx 
/home/prigor/projects/msabuilder2/pygr/cdict.pyx:129:23: Syntax error in C variable declaration

Which version of pyrex do you recommend we use?
Thanks!

Christopher Lee

Jun 6, 2013, 9:32:16 PM
to pygr...@googlegroups.com
Not sure what the problem with your source code is.  Pyrexc 0.9.8.6 runs fine on my latest cnestedlist.pyx, just prints a few warnings:

(vehome)[user@work pygr]$ pyrexc --version
Pyrex version 0.9.8.6
(vehome)[user@work pygr]$ pyrexc cnestedlist.pyx
/home/user/projects/pygr/pygr/cnestedlist.pyx:8:2: Warning: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/user/projects/pygr/pygr/cnestedlist.pyx:51:2: Warning: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/user/projects/pygr/pygr/cnestedlist.pyx:167:2: Warning: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/user/projects/pygr/pygr/cnestedlist.pyx:371:2: Warning: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/user/projects/pygr/pygr/cnestedlist.pyx:424:2: Warning: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/user/projects/pygr/pygr/cnestedlist.pyx:446:2: Warning: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/user/projects/pygr/pygr/cnestedlist.pyx:1112:2: Warning: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
/home/user/projects/pygr/pygr/cnestedlist.pyx:1138:2: Warning: __new__ method of extension type will change semantics in a future version of Pyrex. Use __cinit__ instead.
(vehome)[user@work pygr]$ ls -l cnestedlist.*
-rw-r--r-- 1 user user  936552 Jun  6 18:26 cnestedlist.c
-rw-r--r-- 1 user user    7452 May  7 23:02 cnestedlist.pxd
-rw-r--r-- 1 user user   97515 May  7 23:09 cnestedlist.pyx


BTW I don't know what you mean by "we'll definitely need to adjust the number of intervals when building the NLMSA"...

Paul Rigor

Jun 20, 2013, 4:54:20 AM
to Christopher Lee, pygr...@googlegroups.com
Hi Chris,
So I got in touch with the Pyrex developer and he re-posted version 0.9.8.6 (among other versions). But I still don't get a clean compilation even with the same version of pygr you use. In fact, I would have to rename __new__ to __cinit__ because otherwise cnestedlist.pyx isn't parsed (i.e., the C code is empty). Which version of GCC do you use?
Thanks again!
On Tue, Jun 18, 2013 at 9:23 AM, Paul Rigor <paul....@uci.edu> wrote:
Hi Chris,
Sorry for the late reply, it was finals week last week.
Would you mind emailing me the source for Pyrex 0.9.8.6? It's not available from the pyrex site's old release archive.

When compiling using 0.9.9, 0.9.8.2, 0.9.8 -- there are tons more warning messages than what you've listed. I do hope it's just an issue with the version of pyrex.

By the way, I misspoke regarding the 'number of intervals'. I believe I was referring to both 'maxint' and 'maxlen': we wanted to decrease the file size of the lpo db files because I/O performance becomes a bottleneck for our application. We'd prefer to load several small lpo files (on the order of KB) rather than a giant one (on the order of GB) across thousands of separate processes.

Thanks again,