You want an rbind-like effect (as in R's rbind).
This is actively being worked on in h2o-dev - so is this in h2o-1 or
h2o-dev?
... but I can tell you how to do it, 'cause you'll probably get
there first:
You're gonna make a new data-layout - and this is core h2o
functionality we normally try really hard to paper-over from
day-to-day users, so it's not easy to do and involves
understanding some of the magic. Think StarTrek: Scotty just
lowered the shields on the core antimatter drives, and is pointing
you at all the glowing antimatter engine bits.
Data-layout is composed of 2 parts:
- Vec/VectorGroup/Chunk Keys having a magic layout (there are
functions to make & tear apart those Keys), and a hash
function on the Keys which defines their home node, and thus the
primary copy of the data. You won't need to (can't!) touch the hash
function, but you will need to make Keys via the special functions.
- The "espc" array - ElementSPerChunk - which is an array of
longs, one entry per Chunk in the Vec, plus a final entry. The
contents are actually row-starts per-Chunk, so always espc[0]==0,
and espc[nchunks]==number_of_rows.
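For concreteness, here is a tiny sketch of those espc invariants in plain Java (the helper name is mine, not H2O's):

```java
// Hypothetical illustration: build the espc array for a Vec whose
// Chunks hold 4, 4 and 2 rows.  Invariants: espc[0]==0 and
// espc[nchunks]==total number of rows.
public class EspcDemo {
    static long[] buildEspc(long[] rowsPerChunk) {
        long[] espc = new long[rowsPerChunk.length + 1]; // one entry per Chunk, plus a final entry
        for (int i = 0; i < rowsPerChunk.length; i++)
            espc[i + 1] = espc[i] + rowsPerChunk[i];     // row-start of the next Chunk
        return espc;
    }
    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(buildEspc(new long[]{4, 4, 2})));
        // prints [0, 4, 8, 10]: 3 Chunks, 10 rows total
    }
}
```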
Now things go fast:
- Compute a new ESPC layout based on keeping all the Chunks the
same, just changing their row offsets.
- Pick the largest Frame to "not move" - keep its Vec,
VectorGroup, Chunks & Keys. For everybody else, use an MRTask
to fetch all the Chunks in parallel. For each Chunk, make a new Key
based on the not-moving Vec, with a Chunk-index past the end of
the original Vec (pre-arrange Chunk #'s before you start); update
any Chunk internal fields related to the global _start row# and
the owning _vec; flush the _chk2 cache.
- Do a DKV.put(chunkKey,Chunk) - which will replicate the data
under the new ChunkKey. Delete the old one. (If you want the old
one to remain, you'll have to clone() the Chunk proper, because
you'll need to edit the Chunks before pushing them back into the
DKV; if you plan on deleting the old data, you can edit the Chunks
in-place).
- After the MRTask, update the Vec(s) with new larger espc arrays;
you'll need to do DKV puts again on the modified Vec headers.
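The espc bookkeeping in the steps above (leaving out the Key-making, DKV.put and MRTask parts, which are H2O internals) can be sketched in plain Java - the helper name is mine, not H2O's:

```java
import java.util.Arrays;

// Hypothetical sketch of the rbind espc math only; the real work
// (making Chunk Keys, DKV.put, the MRTask) uses H2O internals not shown.
public class RbindEspc {
    // Concatenate espc layouts: every Chunk keeps its size; its
    // row offset is shifted by the rows of the frames before it.
    static long[] mergeEspc(long[]... espcs) {
        int nchunks = 0;
        for (long[] e : espcs) nchunks += e.length - 1;
        long[] out = new long[nchunks + 1];
        int c = 0;        // next Chunk index in the merged layout
        long rowBase = 0; // rows contributed by earlier frames
        for (long[] e : espcs) {
            for (int i = 1; i < e.length; i++)
                out[++c] = rowBase + e[i];
            rowBase += e[e.length - 1];
        }
        return out;
    }
    public static void main(String[] args) {
        // A 10-row/3-Chunk frame appended with a 3-row/1-Chunk frame:
        System.out.println(Arrays.toString(
            mergeEspc(new long[]{0, 4, 8, 10}, new long[]{0, 3})));
        // prints [0, 4, 8, 10, 13]
    }
}
```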
... and Bob's Yer Uncle! Or something like that.
Let me know how it goes, 'cause you're about to head into core
core core of H2O.
Cliff
On 11/26/2014 9:08 AM, Lavdas, Steve wrote:
Ah ok, actually I was doing this as a temp workaround anyway. My
real goal is to build up a Frame from other existing frames which
have the same columns.
Ideally I’d like to create an MRTask2 that can do this in a
scalable way. But MRTask2 outputs to a single/new Frame. Is
there a way to append to an existing frame?
So I have frameA with say 1 million rows, frameB with 2
million and frameC with 4 million. All with the same layout.
I want to create frameD and append frames A,B,C to create a
final frame of 7 million rows.
Hi Josephine & Steve -
yes, exactly - you hit a limit, which is actually a JVM
limit (not a Java language limit, not an H2O limit).
Been there for 20 years... so probably not getting
fixed anytime soon!
The "fix" is indeed to use several Chunks.
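A minimal sketch of that fix in plain Java, assuming some per-chunk element cap (the 8388608 = 2^23 in the quoted stack trace below; `cap` is my placeholder, not an H2O constant): split the rows over several Chunks instead of one giant one.

```java
// Hypothetical sketch: split nrows over several Chunks of at most
// `cap` elements each, returning espc-style boundaries.
public class ChunkSplit {
    static long[] layout(long nrows, long cap) {
        int nchunks = (int) ((nrows + cap - 1) / cap);  // ceiling division
        long[] espc = new long[nchunks + 1];
        for (int i = 1; i <= nchunks; i++)
            espc[i] = Math.min(nrows, i * cap);
        return espc;
    }
    public static void main(String[] args) {
        // 20 million rows with a 2^23-element cap -> 3 Chunks
        System.out.println(java.util.Arrays.toString(layout(20_000_000L, 1L << 23)));
        // prints [0, 8388608, 16777216, 20000000]
    }
}
```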
FYI - the FrameUtils routines you are using were put in
as testing infrastructure; they are single-threaded and
currently mostly used to make tiny test datasets. If
you're having speed issues, we can probably find a way
to parallelize & distribute your work using some of
the other H2O infrastructure.
Cliff
On 11/26/2014 8:05 AM, Josephine Wang wrote:
Thank you!
Josephine Wang
Director of Customer Experience
0xdata Inc.
(847) 827-3637 Office
(917) 861-4242 Mobile
From: Lavdas, Steve [mailto:Steve....@nielsen.com]
Sent: Wednesday, November 26, 2014 9:28 AM
To: Josephine Wang
Subject: ArrayIndexOutOfBounds
Hi Josephine,
I'm getting this error calling
FrameUtils.frame(String[] names, double[]... rows):
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 8388608
at water.fvec.NewChunk.append2slowd(NewChunk.java:296)
at water.fvec.NewChunk.addNum(NewChunk.java:190)
at water.util.FrameUtils.frame(FrameUtils.java:40)
It looks like the number of rows I have is too
much for that method to handle using one Chunk.
I think it may need to create multiple Chunks in
general (this is using the latest stable release).
PS Have a great Thanksgiving!