Efficient way to store and update large 2D arrays


schaffer

May 14, 2012, 11:18:02 AM
to mongod...@googlegroups.com

Hi,

I am looking for the most efficient way to store multiple large 2D arrays while allowing a large number of updates (increments) per element from multiple processes and servers.

The arrays would be 100k*100k or greater. A new instance of each array would be created every 15 minutes, as we need to record this information over time intervals.

There could be up to 10 of these arrays being updated at any one time.

  

A number of possible options spring to mind:
 

1. Keep the 2D array in application/server memory and write it out to the DB every X period (15 minutes or so). Each server would need to do the same and merge the data with queries (map/reduce, aggregation framework or similar).

   Pros: Simple DB write. No 2D array updates in the DB.

   Cons: If a server crashes, the data is lost. More application memory used.

2. Store the data as a grid matrix: 100k documents per 2D array, each holding an array of 100k elements.

   Pros: Easy to implement and update values. Non-volatile data.

   Cons: Huge number of rows and columns required over time (100k rows every 15 minutes). Potential write/update lock issue?

3. Each array is stored as a document as follows:

   {"2DArray": [
       {
           "0": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,...],
           "1": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,...],
           "2": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,...],
           ...
       }
   ]}

   Pros: One document per 2D array. Non-volatile data.

   Cons: Write locks. Not sure about the efficiency and speed of in-place updates (a rough sketch of such an update follows below).
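
For illustration, a single-cell increment for options 2 and 3 might look roughly like the following with the pymongo driver; the collection names, field names and ids here are just assumptions, not part of any agreed design:

    # Sketch only: collection names, field names and ids are assumptions.
    from pymongo import MongoClient

    db = MongoClient()["heatmap_db"]

    array_id, generation_id = 7, 42   # hypothetical identifiers
    row, col = 123, 456               # cell to increment

    # Option 2: one document per grid row, pre-created with a zero-filled
    # "cells" array so $inc can address an element by index.
    db.rows.update_one(
        {"aid": array_id, "gid": generation_id, "row": row},
        {"$inc": {"cells.%d" % col: 1}},
    )

    # Option 3: the whole 2D array in one pre-created document; dot notation
    # walks 2DArray -> element 0 -> row key -> column index.
    db.arrays.update_one(
        {"aid": array_id, "gid": generation_id},
        {"$inc": {"2DArray.0.%d.%d" % (row, col): 1}},
    )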

                 

Any help and advice on how best to implement this would be appreciated.
Thanks

Andy Schwerin

May 14, 2012, 1:01:28 PM
to mongod...@googlegroups.com
Can you tell us a little more about your use case? Are the arrays sparse? In the course of each 15-minute period, do most of the cells in the array get updated, or do typically only a small number of cells get updated? What are the read patterns like? What queries will you want to make? What kinds of analyses do you plan to run?

-Andy

schaffer

May 14, 2012, 1:29:10 PM
to mongod...@googlegroups.com
Sure.

The arrays are used to create real-time heatmaps of people/item movements on a region basis (one array per region).
There could be as many as 100k updates per second spread over the arrays. Some areas could be sparsely populated, but the pattern will be fairly random, with some areas heavily updated.

Reads will be kept to a minimum. I would estimate < 10 reads per minute. Queries would be to pull back all data by array.

Thanks

Wes Freeman

May 14, 2012, 1:31:58 PM
to mongod...@googlegroups.com
If you're not trying to do any queries on the individual data points while they're in the database, you could just store each array serialized in a BinData-type field.
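
For example, a minimal sketch of that idea, assuming the pymongo driver plus NumPy and hypothetical collection/field names, might be:

    # Sketch of the BinData idea: serialize each array (per 15-minute
    # generation) into one binary blob. Names here are assumptions.
    import numpy as np
    from bson.binary import Binary
    from pymongo import MongoClient

    coll = MongoClient()["heatmap_db"]["array_blobs"]

    arr = np.zeros((1000, 1000), dtype=np.uint32)  # small stand-in for a real array

    # Write: one document per (array, generation).
    coll.insert_one({
        "aid": 7,
        "gid": 42,
        "shape": list(arr.shape),
        "data": Binary(arr.tobytes()),  # stored as BinData in MongoDB
    })

    # Read: pull the blob back and rebuild the array in the application.
    doc = coll.find_one({"aid": 7, "gid": 42})
    restored = np.frombuffer(doc["data"], dtype=np.uint32).reshape(doc["shape"])

One caveat worth checking: a full 100k*100k array of counters would be far larger than MongoDB's 16MB document limit, so it would need to be split into chunks rather than kept as a single document.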

Wes


schaffer

May 14, 2012, 1:41:38 PM
to mongod...@googlegroups.com
I don't know too much about BinData.
Would you store this per array (row) or for the complete 2D array?
Also, is it possible/efficient to update individual array elements?

Wes Freeman

May 14, 2012, 1:55:11 PM
to mongod...@googlegroups.com
I was under the impression you wanted to just dump the full dataset into mongo and then pull it back in full. If that's not the case, my BinData recommendation is probably no good. I don't think you would be able to update individual array elements with BinData. Sorry, I guess I missed the second half of your first sentence above about incrementing elements.


Andy Schwerin

May 14, 2012, 2:47:05 PM
to mongod...@googlegroups.com
My first inclination would be to store each cell of the array in a separate document, and identify the arrays by some kind of array id.  When you bump your generation every 15 minutes, you could either start using a new array id, or you could identify the arrays by an array id and a generation id.  You could get some savings in space by having absent cells stand in for 0 values, if you're really just storing counters in each cell.

Queries would consist of finding all the documents for a particular array.

A cell document would look like: { aid: array_id, gid: generation_id, r: row, c: column, <counters and other values>}

You can get write scalability in a design like this by sharding the collection across many machines.  To get the whole contents of an array, you'd query for all documents matching a particular (aid, gid) pair.  You could observe changes over time for a particular cell by querying for (aid, gid, r, c).
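
A minimal sketch of that pattern with pymongo (the collection name "cells", the counter field and the index choice are assumptions):

    # Sketch of the one-document-per-cell design; names are assumptions.
    from pymongo import MongoClient

    cells = MongoClient()["heatmap_db"]["cells"]
    cells.create_index([("aid", 1), ("gid", 1), ("r", 1), ("c", 1)])

    array_id, generation_id, row, col = 7, 42, 123, 456

    # Increment a cell's counter; the upsert creates the document the first
    # time the cell is touched, so absent cells implicitly count as 0.
    cells.update_one(
        {"aid": array_id, "gid": generation_id, "r": row, "c": col},
        {"$inc": {"count": 1}},
        upsert=True,
    )

    # Pull back the whole array for one (aid, gid) pair.
    for doc in cells.find({"aid": array_id, "gid": generation_id}):
        print(doc["r"], doc["c"], doc["count"])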

Consistent reads of the whole array will be more problematic, since there is no one document representing the state of the whole array.

-Andy

schaffer

May 14, 2012, 8:30:24 PM
to mongod...@googlegroups.com
Thanks. That's an interesting way of doing it.
I'd have to consider the number of documents that would be created per day using this approach, which could be a few billion.
Processing and storing that could be a bit of an issue.

I was looking for something simple and straightforward :-)
The application memory option is looking the most promising.....