Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Mongo + ETL
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  11 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Brandon Parise  
View profile  
 More options May 17 2012, 8:48 am
From: Brandon Parise <bpar...@gmail.com>
Date: Thu, 17 May 2012 05:48:32 -0700 (PDT)
Local: Thurs, May 17 2012 8:48 am
Subject: Mongo + ETL
We want to use MongoDB to store documents of various 3rd party
providers but need to hook up an ETL process for warehousing.  These
documents will change infrequently and there will be tons of them (5M
+).  What are some ways that we can "flag" documents that were
actually modified and need to be ETL'd?

I was thinking about maintaining a hash in the document itself, which
represents an md5() of the actual data members:

{
  _id: ObjectId('1xxx'),
  hash: 'md5-of-actual-data-members',

  # actual data members start here
  name: 'foo',
  value: 'bar'

}

# in our import
hash = md5(/** new data members stringified **/);
if (doc.hash !== hash) then flag the document to be ETL'ed (push the
_id onto an array for the ETL to persist)

Thoughts?


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Sam Millman  
View profile  
 More options May 17 2012, 8:53 am
From: Sam Millman <sam.mill...@gmail.com>
Date: Thu, 17 May 2012 13:53:45 +0100
Local: Thurs, May 17 2012 8:53 am
Subject: Re: [mongodb-user] Mongo + ETL

Can't you just do a change = true flag? I mean the MD5 wouldn't be awesome
since you would have to pull out every doc to understand if it needs to to
be ETL'ed (assuming each doc is different and sow ould have its own old md5
and new md5) which wouldn't be nice at all.

On 17 May 2012 13:48, Brandon Parise <bpar...@gmail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Brandon Parise  
View profile  
 More options May 17 2012, 9:17 am
From: Brandon Parise <bpar...@gmail.com>
Date: Thu, 17 May 2012 06:17:25 -0700 (PDT)
Local: Thurs, May 17 2012 9:17 am
Subject: Re: [mongodb-user] Mongo + ETL

Well, this would be part of an import process.  We have to "sync" these
entities from the 3rd party providers to our system... most of the time
they will not change, but some will.  I don't want to se changed=true for
all documents because most will actually not be modified and we dont want
to re-run those through the ETL process.

The hashing would happen as part of the sync .. maybe a findAndModify using
document hash != new hash.  This way I am only modifying those documents
that have different computed hashes!?

Up to now we have been using mysql and leveraging a `DateTimeUpdated` ON
UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp
gets updated and we ETL based on that DateTimeUpdate >= last ETL run
timestamp.

B


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Sam Millman  
View profile  
 More options May 17 2012, 10:08 am
From: Sam Millman <sam.mill...@gmail.com>
Date: Thu, 17 May 2012 15:08:21 +0100
Local: Thurs, May 17 2012 10:08 am
Subject: Re: [mongodb-user] Mongo + ETL

Ah ok I get it now I thought this was gonna be like a cronjob synchro for a
min. Yea a hash can work there.

"The hashing would happen as part of the sync .. maybe a findAndModify
using document hash != new hash.  This way I am only modifying those
documents that have different computed hashes!?"

Took me a bit of time to decide how to reply to that bit and am still
unsure. Maybe you can explain what you mean by that? Maybe better to
explain which part of ETL this is. Are you grabbing from the source to put
into the Warehouse (transform)? This import; what step of the ETL is it or
is it a precursor?

"Up to now we have been using mysql and leveraging a `DateTimeUpdated` ON
UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp
gets updated and we ETL based on that DateTimeUpdate >= last ETL run
timestamp."

That sounds doable from what I can see and should work better than the
hashs.

On 17 May 2012 14:17, Brandon Parise <bpar...@gmail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Brandon Parise  
View profile  
 More options May 17 2012, 1:11 pm
From: Brandon Parise <bpar...@gmail.com>
Date: Thu, 17 May 2012 10:11:25 -0700 (PDT)
Local: Thurs, May 17 2012 1:11 pm
Subject: Re: [mongodb-user] Mongo + ETL

I guess a simple question at this point is how to emulate the
DateTimeUpdated column in MongoDB where it should only update if the
document was actually modified.  

$items = array(
    array('externalId' => 12345, 'foo' => 'bar')
);

foreach ($items as $item) {
    $item['_id'] = $item['externalId'];
    $collection->save($item);

}

the second time this would run shouldn't update the document because its
the same values.

I would like to either set $item['etl'] = true or something similar...

B


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Sam Millman  
View profile  
 More options May 17 2012, 1:29 pm
From: Sam Millman <sam.mill...@gmail.com>
Date: Thu, 17 May 2012 18:29:13 +0100
Local: Thurs, May 17 2012 1:29 pm
Subject: Re: [mongodb-user] Mongo + ETL

Well this would be ok with ActiveRecord since ActiveRecords standard is to
detect changes and update on changes.

But another easy way is to judge when values (through using objects, like
in PHP) change. In PHP you can use the __set __get magics, you could make
make a wrapper you could also when you load the record into the object
store an old version of it self and then on save compare the two, there are
a couple of other ways. When you of course judge a change you set the flag

On 17 May 2012 18:11, Brandon Parise <bpar...@gmail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Brandon Parise  
View profile  
 More options May 22 2012, 9:44 am
From: Brandon Parise <bpar...@gmail.com>
Date: Tue, 22 May 2012 06:44:34 -0700 (PDT)
Local: Tues, May 22 2012 9:44 am
Subject: Re: [mongodb-user] Mongo + ETL

Sorry for taking so long to respond!!  I don't want to have to pull the
down the document in order to maintain the "dirty" state of the document.

This is the logic that I was thinking...

<?php

// example entity to persist
$entity = array(
    '_id' => 12345,
    'name' => 'foo',
    'bar' => 'baz'
);

// hash all the data members coming from the source
$hash = md5(serialize($entity));
$entity['hash'] = $hash;

// findandmodify the entity
$result = $db->command( array(
    'findAndModify' => 'entity_collection',
    'query' => array(
        '_id' => $entity['_id']
    ),
    'update' => $entity,
    'upsert' => true,
    'fields' => array('hash' => 1) // return the old hash
));

// if the old hash doesn't match the new one then the entity has been
updated
if ($result->hash != $entity['hash']) {
    // @todo push this document onto a stack to be ETL'ed

}

?>


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Sam Millman  
View profile  
 More options May 22 2012, 9:46 am
From: Sam Millman <sam.mill...@gmail.com>
Date: Tue, 22 May 2012 14:46:47 +0100
Local: Tues, May 22 2012 9:46 am
Subject: Re: [mongodb-user] Mongo + ETL

Aye that looks good now that you've explained your logic :)

On 22 May 2012 14:44, Brandon Parise <bpar...@gmail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Brandon Parise  
View profile  
 More options May 22 2012, 12:14 pm
From: Brandon Parise <bpar...@gmail.com>
Date: Tue, 22 May 2012 09:14:53 -0700 (PDT)
Local: Tues, May 22 2012 12:14 pm
Subject: Re: [mongodb-user] Mongo + ETL

I guess I should've started off with that :) Been trying to find out how
others ETL from MongoDB and most approach it as they would rather take the
overhead of re-ETL'ing the same document (which would effectively have no
change anyways) than have a complex "middleman" to manage the "needs etl'ed
state'.  But, in our case we will have millions of documents so we want to
keep the process as lean as possible.

This was an interesting dive and hopefully someone who searches this group
finds this thread useful.

Thanks Sammaye!
B


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Varela  
View profile  
 More options May 22 2012, 1:38 pm
From: Varela <zmievsko...@gmail.com>
Date: Tue, 22 May 2012 10:38:52 -0700 (PDT)
Local: Tues, May 22 2012 1:38 pm
Subject: Re: Mongo + ETL
Hi,

I can suggest you to keep last update time and index it. Usually you
know when you change data and for what reason and if updated dataset
is small it would be very fast to check, you can combine this approach
with labels. I have ETL with collection with 2M of docs this works
very fast.

Regards,
Valeriy

On 22 май, 19:14, Brandon Parise <bpar...@gmail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Sam Millman  
View profile  
 More options May 22 2012, 2:34 pm
From: Sam Millman <sam.mill...@gmail.com>
Date: Tue, 22 May 2012 19:34:15 +0100
Local: Tues, May 22 2012 2:34 pm
Subject: Re: [mongodb-user] Re: Mongo + ETL

Yea mongo isn't like SQL, where in SQL you are trying your darn hardest to
ensure you do as few queries and as complete as possible Mongo is
more..streamable on that front.

With an index on hash yours should be that bad but Varela makes a valid
point which is why I said the timestamp would work initially.

On 22 May 2012 18:38, Varela <zmievsko...@gmail.com> wrote:

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »