Scotty's anti-matter inner engine bits of h2o [Re: ArrayIndexOutOfBounds]


Sri Ambati

Nov 26, 2014, 12:57:08 PM
to h2ostream, engr
cc. community.

Begin forwarded message:

From: Cliff Click <ccl...@gmail.com>
Date: November 26, 2014 at 11:30:56 AM CST
To: "Lavdas, Steve" <Steve....@nielsen.com>
Cc: Josephine Wang <jose...@0xdata.com>, support <sup...@0xdata.com>, "en...@0xdata.com" <en...@0xdata.com>
Subject: Re: ArrayIndexOutOfBounds

You want an R-bind like effect.

Actively being worked on in h2o-dev - so is this in h2o-1 or h2o-dev?

... but I can tell you how to do it, 'cause you'll probably get there first:

You're gonna make a new data-layout - and this is core h2o functionality we normally try really hard to paper over for day-to-day users, so it's not easy to do and involves understanding some of the magic.  Think Star Trek:  Scotty just lowered the shields on the core antimatter drives, and is pointing you at all the glowing antimatter engine bits.


Data-layout is composed of 2 parts
- Vec/VectorGroup/Chunk Keys having a magic layout (there are functions to make & tear apart those Keys) - and a hash function on the Keys which defines their home node, and thus the primary copy of the data.  You WON'T need to (can't!) touch the hash function, but you will need to make Keys via the special functions.
- The "espc" array - ElementSPerChunk - which is an array of longs, one entry per Chunk in the Vec, plus a final entry.  The contents are actually row-starts per-Chunk, so always espc[0]==0, and espc[nchunks]==number_of_rows.
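To make the espc invariants concrete, here is a minimal plain-Java sketch. It does not use any H2O classes; the class name EspcSketch and both method names are made up for illustration. It builds the row-start array from per-chunk row counts and looks up which chunk holds a given global row.

```java
import java.util.Arrays;

// Hypothetical helper, not part of H2O: illustrates the espc invariants only.
final class EspcSketch {

    // Build row starts from per-chunk row counts.
    // Result has one entry per chunk plus a final entry:
    // espc[0] == 0 and espc[nchunks] == total number of rows.
    static long[] espcFromChunkLengths(int[] chunkLens) {
        long[] espc = new long[chunkLens.length + 1];
        for (int i = 0; i < chunkLens.length; i++) {
            espc[i + 1] = espc[i] + chunkLens[i];
        }
        return espc;
    }

    // Find the chunk index whose row range [espc[i], espc[i+1]) contains 'row'.
    // Binary search that also copes with empty chunks (repeated row starts).
    static int chunkIndexForRow(long[] espc, long row) {
        int nchunks = espc.length - 1;
        if (row < 0 || row >= espc[nchunks]) {
            throw new IndexOutOfBoundsException("row " + row);
        }
        int lo = 0, hi = nchunks; // invariant: espc[lo] <= row < espc[hi]
        while (hi - lo > 1) {
            int mid = (lo + hi) >>> 1;
            if (espc[mid] <= row) lo = mid; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        long[] espc = espcFromChunkLengths(new int[]{3, 4, 2});
        System.out.println(Arrays.toString(espc));
        System.out.println(chunkIndexForRow(espc, 7));
    }
}
```

The point of the layout is that a Chunk's global start row is just espc[chunkIndex], so changing offsets never requires touching the data inside a Chunk.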

Now things go fast:

- Compute a new ESPC layout based on keeping all the Chunks the same, just changing their row offsets.

- Pick the largest Frame to "not move" - keep its Vec, VectorGroup, Chunks & Keys.  For everybody else, use a MRTask to parallel get all the Chunks.  For each Chunk, make a new Key based on the not-moving Vec, having a Chunk-index past the end of the original Vec (pre-arrange Chunk#'s before you start), update any Chunk internal fields related to the global _start row# and the owning _vec; flush the _chk2 cache.

- Do a DKV.put(chunkKey,Chunk) - which will replicate the data under the new ChunkKey.  Delete the old one.  (If you want the old one to remain, you'll have to clone() the Chunk proper, because you'll need to edit the Chunks before pushing them back into the DKV; if you plan on deleting the old data, you can edit the Chunks in-place).

- After the MRTask, update the Vec(s) with new larger espc arrays; you'll need to do DKV puts again on the modified Vec headers.
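The bookkeeping part of the steps above (new ESPC layout, picking the Frame that stays put, pre-arranging chunk numbers) can be sketched in plain Java with no H2O calls. Everything here is an assumption for illustration: the class RbindLayoutSketch, its method names, and the example chunk boundaries are all made up. The real work of building Keys, editing Chunks and doing the DKV puts is not shown.

```java
import java.util.Arrays;

// Hypothetical planning helper, not part of H2O: computes only the new layout.
final class RbindLayoutSketch {

    static long rows(long[] espc) { return espc[espc.length - 1]; }

    // Index of the frame with the most rows; its chunks keep their indices.
    static int largestFrame(long[][] espcs) {
        int best = 0;
        for (int f = 1; f < espcs.length; f++) {
            if (rows(espcs[f]) > rows(espcs[best])) best = f;
        }
        return best;
    }

    // Visit order: the kept frame first, then the others in their given order.
    static int[] order(int nframes, int keep) {
        int[] ord = new int[nframes];
        int k = 0;
        ord[k++] = keep;
        for (int f = 0; f < nframes; f++) if (f != keep) ord[k++] = f;
        return ord;
    }

    // New chunk index of each frame's chunk 0 in the combined Vec.
    static int[] firstChunkIndex(long[][] espcs, int keep) {
        int[] first = new int[espcs.length];
        int c = 0;
        for (int f : order(espcs.length, keep)) {
            first[f] = c;
            c += espcs[f].length - 1;
        }
        return first;
    }

    // Combined row starts: chunk sizes are unchanged, only offsets shift.
    static long[] combinedEspc(long[][] espcs, int keep) {
        int total = 0;
        for (long[] e : espcs) total += e.length - 1;
        long[] out = new long[total + 1];
        int c = 0;
        for (int f : order(espcs.length, keep)) {
            long[] e = espcs[f];
            for (int i = 0; i + 1 < e.length; i++, c++) {
                out[c + 1] = out[c] + (e[i + 1] - e[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[][] espcs = {
            {0, 500_000, 1_000_000},               // frameA, 1M rows (made-up chunking)
            {0, 2_000_000},                        // frameB, 2M rows
            {0, 1_000_000, 2_500_000, 4_000_000},  // frameC, 4M rows
        };
        int keep = largestFrame(espcs);
        System.out.println(Arrays.toString(firstChunkIndex(espcs, keep)));
        System.out.println(Arrays.toString(combinedEspc(espcs, keep)));
    }
}
```

With this plan in hand, chunk i of a moved frame f would get new chunk index firstChunkIndex[f] + i, and its new global start row would be the combined espc entry at that index.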

... and Bob's Yer Uncle!  Or something like that.
Let me know how it goes, 'cause you're about to head into core core core of H2O.

Cliff


On 11/26/2014 9:08 AM, Lavdas, Steve wrote:
Ah ok, actually I was doing this as a temp workaround anyway. My real goal is to build up a Frame from other existing frames which have the same columns.

Ideally I’d like to create an MRTask2 that can do this in a scalable way. But MRTask2 outputs to a single/new Frame. Is there a way to append to an existing frame?

So I have frameA with say 1 million rows, frameB with 2 million and frameC with 4 million. All with the same layout.
I want to create frameD and append frames A,B,C to create a final frame of 7 million rows.

On Nov 26, 2014, at 12:03 PM, Cliff Click <ccl...@gmail.com> wrote:

Hi Josephine & Steve - yes exactly you hit a limit, which is actually a JVM limit (not a Java language limit, not an H2O limit).  Been there for 20 years...  so probably not getting fixed anytime soon!

The "fix" is indeed to use several Chunks.
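As a rough illustration of "several Chunks", here is a plain-Java sketch that splits a row count into chunks of bounded size and returns the resulting row starts. It uses no H2O classes; the class name ChunkSplitSketch and the maxRowsPerChunk parameter are made up, and the right bound for a real Chunk is not stated in this thread.

```java
import java.util.Arrays;

// Hypothetical helper, not part of H2O: plans chunk boundaries only.
final class ChunkSplitSketch {

    // Row starts for 'nrows' rows split into chunks of at most
    // 'maxRowsPerChunk' rows each; the last chunk may be shorter.
    static long[] splitIntoChunks(long nrows, int maxRowsPerChunk) {
        if (nrows < 0 || maxRowsPerChunk <= 0) {
            throw new IllegalArgumentException("nrows >= 0 and maxRowsPerChunk > 0 required");
        }
        // Ceiling division, checked so an absurd chunk count fails loudly.
        int nchunks = Math.toIntExact((nrows + maxRowsPerChunk - 1) / maxRowsPerChunk);
        long[] espc = new long[nchunks + 1];
        for (int i = 0; i < nchunks; i++) {
            espc[i + 1] = Math.min(nrows, espc[i] + maxRowsPerChunk);
        }
        return espc;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitIntoChunks(10, 4)));
    }
}
```

Rows espc[i] up to (but not including) espc[i+1] would then go into chunk i, so no single chunk has to hold the whole dataset.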

fyi - the frame utils routines you are using we put in for testing infrastructure, they are single-threaded and currently mostly used to make tiny test datasets.  If you're having speed issues, we can probably find a way to parallelize & distribute your work using some of the other H2O infrastructure.

Cliff


On 11/26/2014 8:05 AM, Josephine Wang wrote:
 
Thank you!
 
Josephine Wang
Director of Customer Experience
0xdata Inc.
(847) 827-3637 Office
(917) 861-4242 Mobile
 
From: Lavdas, Steve [mailto:Steve....@nielsen.com] 
Sent: Wednesday, November 26, 2014 9:28 AM
To: Josephine Wang
Subject: ArrayIndexOutOfBounds
 
Hi Josephine,
   I’m getting this error calling 
FrameUtils.frame(String[] names, double[]... rows) 
 
 
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 8388608
        at water.fvec.NewChunk.append2slowd(NewChunk.java:296)
        at water.fvec.NewChunk.addNum(NewChunk.java:190)
        at water.util.FrameUtils.frame(FrameUtils.java:40)
 
It looks like the number of rows I have is too large for that method to handle using one chunk.
I think it may need to create multiple chunks in general (this is using the latest stable release).
 
Thanks,
Steve
PS Have a great Thanksgiving!
 
 



