Mongo + ETL

365 views
Skip to first unread message

Brandon Parise

unread,
May 17, 2012, 8:48:32 AM5/17/12
to mongodb-user
We want to use MongoDB to store documents of various 3rd party
providers but need to hook up an ETL process for warehousing. These
documents will change infrequently and there will be tons of them (5M
+). What are some ways that we can "flag" documents that were
actually modified and need to be ETL'd?

I was thinking about maintaining a hash in the document itself, which
represents an md5() of the actual data members:

{
_id: ObjectId('1xxx'),
hash: 'md5-of-actual-data-members',

# actual data members start here
name: 'foo',
value: 'bar'
}

# in our import
hash = md5(/** new data members stringified **/);
if (doc.hash !== hash) then flag the document to be ETL'ed (push the
_id onto an array for the ETL to persist)

Thoughts?

Sam Millman

unread,
May 17, 2012, 8:53:45 AM5/17/12
to mongod...@googlegroups.com
Can't you just do a change = true flag? I mean the MD5 wouldn't be awesome since you would have to pull out every doc to understand if it needs to to be ETL'ed (assuming each doc is different and sow ould have its own old md5 and new md5) which wouldn't be nice at all.


--
You received this message because you are subscribed to the Google
Groups "mongodb-user" group.
To post to this group, send email to mongod...@googlegroups.com
To unsubscribe from this group, send email to
mongodb-user...@googlegroups.com
See also the IRC channel -- freenode.net#mongodb

Brandon Parise

unread,
May 17, 2012, 9:17:25 AM5/17/12
to mongod...@googlegroups.com
Well, this would be part of an import process.  We have to "sync" these entities from the 3rd party providers to our system... most of the time they will not change, but some will.  I don't want to se changed=true for all documents because most will actually not be modified and we dont want to re-run those through the ETL process.

The hashing would happen as part of the sync .. maybe a findAndModify using document hash != new hash.  This way I am only modifying those documents that have different computed hashes!?

Up to now we have been using mysql and leveraging a `DateTimeUpdated` ON UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp gets updated and we ETL based on that DateTimeUpdate >= last ETL run timestamp.

B

Sam Millman

unread,
May 17, 2012, 10:08:21 AM5/17/12
to mongod...@googlegroups.com
Ah ok I get it now I thought this was gonna be like a cronjob synchro for a min. Yea a hash can work there.


"The hashing would happen as part of the sync .. maybe a findAndModify using document hash != new hash.  This way I am only modifying those documents that have different computed hashes!?"

Took me a bit of time to decide how to reply to that bit and am still unsure. Maybe you can explain what you mean by that? Maybe better to explain which part of ETL this is. Are you grabbing from the source to put into the Warehouse (transform)? This import; what step of the ETL is it or is it a precursor?


"Up to now we have been using mysql and leveraging a `DateTimeUpdated` ON UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp gets updated and we ETL based on that DateTimeUpdate >= last ETL run timestamp."

That sounds doable from what I can see and should work better than the hashs.

Brandon Parise

unread,
May 17, 2012, 1:11:25 PM5/17/12
to mongod...@googlegroups.com
I guess a simple question at this point is how to emulate the DateTimeUpdated column in MongoDB where it should only update if the document was actually modified.  

$items = array(
    array('externalId' => 12345, 'foo' => 'bar')
);

foreach ($items as $item) {
    $item['_id'] = $item['externalId'];
    $collection->save($item);
}

the second time this would run shouldn't update the document because its the same values.

I would like to either set $item['etl'] = true or something similar...

B

Sam Millman

unread,
May 17, 2012, 1:29:13 PM5/17/12
to mongod...@googlegroups.com
Well this would be ok with ActiveRecord since ActiveRecords standard is to detect changes and update on changes.

But another easy way is to judge when values (through using objects, like in PHP) change. In PHP you can use the __set __get magics, you could make make a wrapper you could also when you load the record into the object store an old version of it self and then on save compare the two, there are a couple of other ways. When you of course judge a change you set the flag

Brandon Parise

unread,
May 22, 2012, 9:44:34 AM5/22/12
to mongod...@googlegroups.com
Sorry for taking so long to respond!!  I don't want to have to pull the down the document in order to maintain the "dirty" state of the document.

This is the logic that I was thinking... 

<?php

// example entity to persist
$entity = array(
    '_id' => 12345,
    'name' => 'foo',
    'bar' => 'baz'
);

// hash all the data members coming from the source
$hash = md5(serialize($entity));
$entity['hash'] = $hash;

// findandmodify the entity
$result = $db->command( array(
    'findAndModify' => 'entity_collection',
    'query' => array(
        '_id' => $entity['_id']
    ),
    'update' => $entity,
    'upsert' => true,
    'fields' => array('hash' => 1) // return the old hash
));

// if the old hash doesn't match the new one then the entity has been updated
if ($result->hash != $entity['hash']) {
    // @todo push this document onto a stack to be ETL'ed
}
?>

Sam Millman

unread,
May 22, 2012, 9:46:47 AM5/22/12
to mongod...@googlegroups.com
Aye that looks good now that you've explained your logic :)

Brandon Parise

unread,
May 22, 2012, 12:14:53 PM5/22/12
to mongod...@googlegroups.com
I guess I should've started off with that :) Been trying to find out how others ETL from MongoDB and most approach it as they would rather take the overhead of re-ETL'ing the same document (which would effectively have no change anyways) than have a complex "middleman" to manage the "needs etl'ed state'.  But, in our case we will have millions of documents so we want to keep the process as lean as possible.

This was an interesting dive and hopefully someone who searches this group finds this thread useful.

Thanks Sammaye!
B

Varela

unread,
May 22, 2012, 1:38:52 PM5/22/12
to mongodb-user
Hi,

I can suggest you to keep last update time and index it. Usually you
know when you change data and for what reason and if updated dataset
is small it would be very fast to check, you can combine this approach
with labels. I have ETL with collection with 2M of docs this works
very fast.

Regards,
Valeriy

On 22 май, 19:14, Brandon Parise <bpar...@gmail.com> wrote:
> I guess I should've started off with that :) Been trying to find out how
> others ETL from MongoDB and most approach it as they would rather take the
> overhead of re-ETL'ing the same document (which would effectively have no
> change anyways) than have a complex "middleman" to manage the "needs etl'ed
> state'.  But, in our case we will have millions of documents so we want to
> keep the process as lean as possible.
>
> This was an interesting dive and hopefully someone who searches this group
> finds this thread useful.
>
> Thanks Sammaye!
> B
>
>
>
>
>
>
>
> On Tuesday, May 22, 2012 9:46:47 AM UTC-4, Sammaye wrote:
>
> > Aye that looks good now that you've explained your logic :)
>
> >>>>> On 17 May 2012 14:17, Brandon Parise <bpar...@gmail.com> wrote:
>
> >>>>>> Well, this would be part of an import process.  We have to "sync"
> >>>>>> these entities from the 3rd party providers to our system... most of the
> >>>>>> time they will not change, but some will.  I don't want to se changed=true
> >>>>>> for all documents because most will actually not be modified and we dont
> >>>>>> want to re-run those through the ETL process.
>
> >>>>>> The hashing would happen as part of the sync .. maybe a findAndModify
> >>>>>> using document hash != new hash.  This way I am only modifying those
> >>>>>> documents that have different computed hashes!?
>
> >>>>>> Up to now we have been using mysql and leveraging a `DateTimeUpdated`
> >>>>>> ON UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp
> >>>>>> gets updated and we ETL based on that DateTimeUpdate >= last ETL run
> >>>>>> timestamp.
>
> >>>>>> B
>
> >>>>>> On Thursday, May 17, 2012 8:53:45 AM UTC-4, Sammaye wrote:
>
> >>>>>>> Can't you just do a change = true flag? I mean the MD5 wouldn't be
> >>>>>>> awesome since you would have to pull out every doc to understand if it
> >>>>>>> needs to to be ETL'ed (assuming each doc is different and sow ould have its
> >>>>>>> own old md5 and new md5) which wouldn't be nice at all.
>
> >>>>>>>> mongodb-user+unsubscribe@**googl****egroups.com<mongodb-user%2Bunsubscribe@ googlegroups.com>
> >>>>>>>> See also the IRC channel -- freenode.net#mongodb
>
> >>>>>>>  --
> >>>>>> You received this message because you are subscribed to the Google
> >>>>>> Groups "mongodb-user" group.
> >>>>>> To post to this group, send email to mongod...@googlegroups.com
> >>>>>> To unsubscribe from this group, send email to
> >>>>>> mongodb-user+unsubscribe@**googl**egroups.com<mongodb-user%2Bunsubscribe@go oglegroups.com>
> >>>>>> See also the IRC channel -- freenode.net#mongodb
>
> >>>>>  --
> >>>> You received this message because you are subscribed to the Google
> >>>> Groups "mongodb-user" group.
> >>>> To post to this group, send email to mongod...@googlegroups.com
> >>>> To unsubscribe from this group, send email to
> >>>> mongodb-user+unsubscribe@**googlegroups.com<mongodb-user%2Bunsubscribe@goog legroups.com>
> >>>> See also the IRC channel -- freenode.net#mongodb
>
> >>>  --
> >> You received this message because you are subscribed to the Google
> >> Groups "mongodb-user" group.
> >> To post to this group, send email to mongod...@googlegroups.com
> >> To unsubscribe from this group, send email to
> >> mongodb-user...@googlegroups.com

Sam Millman

unread,
May 22, 2012, 2:34:15 PM5/22/12
to mongod...@googlegroups.com
Yea mongo isn't like SQL, where in SQL you are trying your darn hardest to ensure you do as few queries and as complete as possible Mongo is more..streamable on that front.

With an index on hash yours should be that bad but Varela makes a valid point which is why I said the timestamp would work initially.
Reply all
Reply to author
Forward
0 new messages