We want to use MongoDB to store documents of various 3rd party
providers but need to hook up an ETL process for warehousing. These
documents will change infrequently and there will be tons of them (5M
+). What are some ways that we can "flag" documents that were
actually modified and need to be ETL'd?
I was thinking about maintaining a hash in the document itself, which
represents an md5() of the actual data members:
# actual data members start here
name: 'foo',
value: 'bar'
}
# in our import
hash = md5(/** new data members stringified **/);
if (doc.hash !== hash) then flag the document to be ETL'ed (push the
_id onto an array for the ETL to persist)
Can't you just do a change = true flag? I mean the MD5 wouldn't be awesome
since you would have to pull out every doc to understand if it needs to to
be ETL'ed (assuming each doc is different and sow ould have its own old md5
and new md5) which wouldn't be nice at all.
On 17 May 2012 13:48, Brandon Parise <bpar...@gmail.com> wrote:
> We want to use MongoDB to store documents of various 3rd party
> providers but need to hook up an ETL process for warehousing. These
> documents will change infrequently and there will be tons of them (5M
> +). What are some ways that we can "flag" documents that were
> actually modified and need to be ETL'd?
> I was thinking about maintaining a hash in the document itself, which
> represents an md5() of the actual data members:
> # actual data members start here
> name: 'foo',
> value: 'bar'
> }
> # in our import
> hash = md5(/** new data members stringified **/);
> if (doc.hash !== hash) then flag the document to be ETL'ed (push the
> _id onto an array for the ETL to persist)
> Thoughts?
> --
> You received this message because you are subscribed to the Google
> Groups "mongodb-user" group.
> To post to this group, send email to mongodb-user@googlegroups.com
> To unsubscribe from this group, send email to
> mongodb-user+unsubscribe@googlegroups.com
> See also the IRC channel -- freenode.net#mongodb
Well, this would be part of an import process. We have to "sync" these entities from the 3rd party providers to our system... most of the time they will not change, but some will. I don't want to se changed=true for all documents because most will actually not be modified and we dont want to re-run those through the ETL process.
The hashing would happen as part of the sync .. maybe a findAndModify using document hash != new hash. This way I am only modifying those documents that have different computed hashes!?
Up to now we have been using mysql and leveraging a `DateTimeUpdated` ON UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp gets updated and we ETL based on that DateTimeUpdate >= last ETL run timestamp.
On Thursday, May 17, 2012 8:53:45 AM UTC-4, Sammaye wrote:
> Can't you just do a change = true flag? I mean the MD5 wouldn't be awesome > since you would have to pull out every doc to understand if it needs to to > be ETL'ed (assuming each doc is different and sow ould have its own old md5 > and new md5) which wouldn't be nice at all.
> On 17 May 2012 13:48, Brandon Parise <bpar...@gmail.com> wrote:
>> We want to use MongoDB to store documents of various 3rd party >> providers but need to hook up an ETL process for warehousing. These >> documents will change infrequently and there will be tons of them (5M >> +). What are some ways that we can "flag" documents that were >> actually modified and need to be ETL'd?
>> I was thinking about maintaining a hash in the document itself, which >> represents an md5() of the actual data members:
>> # actual data members start here >> name: 'foo', >> value: 'bar' >> }
>> # in our import >> hash = md5(/** new data members stringified **/); >> if (doc.hash !== hash) then flag the document to be ETL'ed (push the >> _id onto an array for the ETL to persist)
>> Thoughts?
>> -- >> You received this message because you are subscribed to the Google >> Groups "mongodb-user" group. >> To post to this group, send email to mongodb-user@googlegroups.com >> To unsubscribe from this group, send email to >> mongodb-user+unsubscribe@googlegroups.com >> See also the IRC channel -- freenode.net#mongodb
Ah ok I get it now I thought this was gonna be like a cronjob synchro for a
min. Yea a hash can work there.
"The hashing would happen as part of the sync .. maybe a findAndModify
using document hash != new hash. This way I am only modifying those
documents that have different computed hashes!?"
Took me a bit of time to decide how to reply to that bit and am still
unsure. Maybe you can explain what you mean by that? Maybe better to
explain which part of ETL this is. Are you grabbing from the source to put
into the Warehouse (transform)? This import; what step of the ETL is it or
is it a precursor?
"Up to now we have been using mysql and leveraging a `DateTimeUpdated` ON
UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp
gets updated and we ETL based on that DateTimeUpdate >= last ETL run
timestamp."
That sounds doable from what I can see and should work better than the
hashs.
On 17 May 2012 14:17, Brandon Parise <bpar...@gmail.com> wrote:
> Well, this would be part of an import process. We have to "sync" these
> entities from the 3rd party providers to our system... most of the time
> they will not change, but some will. I don't want to se changed=true for
> all documents because most will actually not be modified and we dont want
> to re-run those through the ETL process.
> The hashing would happen as part of the sync .. maybe a findAndModify
> using document hash != new hash. This way I am only modifying those
> documents that have different computed hashes!?
> Up to now we have been using mysql and leveraging a `DateTimeUpdated` ON
> UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp
> gets updated and we ETL based on that DateTimeUpdate >= last ETL run
> timestamp.
> B
> On Thursday, May 17, 2012 8:53:45 AM UTC-4, Sammaye wrote:
>> Can't you just do a change = true flag? I mean the MD5 wouldn't be
>> awesome since you would have to pull out every doc to understand if it
>> needs to to be ETL'ed (assuming each doc is different and sow ould have its
>> own old md5 and new md5) which wouldn't be nice at all.
>> On 17 May 2012 13:48, Brandon Parise <bpar...@gmail.com> wrote:
>>> We want to use MongoDB to store documents of various 3rd party
>>> providers but need to hook up an ETL process for warehousing. These
>>> documents will change infrequently and there will be tons of them (5M
>>> +). What are some ways that we can "flag" documents that were
>>> actually modified and need to be ETL'd?
>>> I was thinking about maintaining a hash in the document itself, which
>>> represents an md5() of the actual data members:
>>> # actual data members start here
>>> name: 'foo',
>>> value: 'bar'
>>> }
>>> # in our import
>>> hash = md5(/** new data members stringified **/);
>>> if (doc.hash !== hash) then flag the document to be ETL'ed (push the
>>> _id onto an array for the ETL to persist)
>>> Thoughts?
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "mongodb-user" group.
>>> To post to this group, send email to mongodb-user@googlegroups.com
>>> To unsubscribe from this group, send email to
>>> mongodb-user+unsubscribe@**googlegroups.com<mongodb-user%2Bunsubscribe@goog legroups.com>
>>> See also the IRC channel -- freenode.net#mongodb
>> --
> You received this message because you are subscribed to the Google
> Groups "mongodb-user" group.
> To post to this group, send email to mongodb-user@googlegroups.com
> To unsubscribe from this group, send email to
> mongodb-user+unsubscribe@googlegroups.com
> See also the IRC channel -- freenode.net#mongodb
I guess a simple question at this point is how to emulate the DateTimeUpdated column in MongoDB where it should only update if the document was actually modified.
On Thursday, May 17, 2012 10:08:21 AM UTC-4, Sammaye wrote:
> Ah ok I get it now I thought this was gonna be like a cronjob synchro for > a min. Yea a hash can work there.
> "The hashing would happen as part of the sync .. maybe a findAndModify > using document hash != new hash. This way I am only modifying those > documents that have different computed hashes!?"
> Took me a bit of time to decide how to reply to that bit and am still > unsure. Maybe you can explain what you mean by that? Maybe better to > explain which part of ETL this is. Are you grabbing from the source to put > into the Warehouse (transform)? This import; what step of the ETL is it or > is it a precursor?
> "Up to now we have been using mysql and leveraging a `DateTimeUpdated` ON > UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp > gets updated and we ETL based on that DateTimeUpdate >= last ETL run > timestamp."
> That sounds doable from what I can see and should work better than the > hashs.
> On 17 May 2012 14:17, Brandon Parise <bpar...@gmail.com> wrote:
>> Well, this would be part of an import process. We have to "sync" these >> entities from the 3rd party providers to our system... most of the time >> they will not change, but some will. I don't want to se changed=true for >> all documents because most will actually not be modified and we dont want >> to re-run those through the ETL process.
>> The hashing would happen as part of the sync .. maybe a findAndModify >> using document hash != new hash. This way I am only modifying those >> documents that have different computed hashes!?
>> Up to now we have been using mysql and leveraging a `DateTimeUpdated` ON >> UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp >> gets updated and we ETL based on that DateTimeUpdate >= last ETL run >> timestamp.
>> B
>> On Thursday, May 17, 2012 8:53:45 AM UTC-4, Sammaye wrote:
>>> Can't you just do a change = true flag? I mean the MD5 wouldn't be >>> awesome since you would have to pull out every doc to understand if it >>> needs to to be ETL'ed (assuming each doc is different and sow ould have its >>> own old md5 and new md5) which wouldn't be nice at all.
>>> On 17 May 2012 13:48, Brandon Parise <bpar...@gmail.com> wrote:
>>>> We want to use MongoDB to store documents of various 3rd party >>>> providers but need to hook up an ETL process for warehousing. These >>>> documents will change infrequently and there will be tons of them (5M >>>> +). What are some ways that we can "flag" documents that were >>>> actually modified and need to be ETL'd?
>>>> I was thinking about maintaining a hash in the document itself, which >>>> represents an md5() of the actual data members:
>>>> # actual data members start here >>>> name: 'foo', >>>> value: 'bar' >>>> }
>>>> # in our import >>>> hash = md5(/** new data members stringified **/); >>>> if (doc.hash !== hash) then flag the document to be ETL'ed (push the >>>> _id onto an array for the ETL to persist)
>>>> Thoughts?
>>>> -- >>>> You received this message because you are subscribed to the Google >>>> Groups "mongodb-user" group. >>>> To post to this group, send email to mongodb-user@googlegroups.com >>>> To unsubscribe from this group, send email to >>>> mongodb-user+unsubscribe@**googlegroups.com<mongodb-user%2Bunsubscribe@goog legroups.com> >>>> See also the IRC channel -- freenode.net#mongodb
>>> -- >> You received this message because you are subscribed to the Google >> Groups "mongodb-user" group. >> To post to this group, send email to mongodb-user@googlegroups.com >> To unsubscribe from this group, send email to >> mongodb-user+unsubscribe@googlegroups.com >> See also the IRC channel -- freenode.net#mongodb
Well this would be ok with ActiveRecord since ActiveRecords standard is to
detect changes and update on changes.
But another easy way is to judge when values (through using objects, like
in PHP) change. In PHP you can use the __set __get magics, you could make
make a wrapper you could also when you load the record into the object
store an old version of it self and then on save compare the two, there are
a couple of other ways. When you of course judge a change you set the flag
On 17 May 2012 18:11, Brandon Parise <bpar...@gmail.com> wrote:
> I guess a simple question at this point is how to emulate the
> DateTimeUpdated column in MongoDB where it should only update if the
> document was actually modified.
> the second time this would run shouldn't update the document because its
> the same values.
> I would like to either set $item['etl'] = true or something similar...
> B
> On Thursday, May 17, 2012 10:08:21 AM UTC-4, Sammaye wrote:
>> Ah ok I get it now I thought this was gonna be like a cronjob synchro for
>> a min. Yea a hash can work there.
>> "The hashing would happen as part of the sync .. maybe a findAndModify
>> using document hash != new hash. This way I am only modifying those
>> documents that have different computed hashes!?"
>> Took me a bit of time to decide how to reply to that bit and am still
>> unsure. Maybe you can explain what you mean by that? Maybe better to
>> explain which part of ETL this is. Are you grabbing from the source to put
>> into the Warehouse (transform)? This import; what step of the ETL is it or
>> is it a precursor?
>> "Up to now we have been using mysql and leveraging a `DateTimeUpdated` ON
>> UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp
>> gets updated and we ETL based on that DateTimeUpdate >= last ETL run
>> timestamp."
>> That sounds doable from what I can see and should work better than the
>> hashs.
>> On 17 May 2012 14:17, Brandon Parise <bpar...@gmail.com> wrote:
>>> Well, this would be part of an import process. We have to "sync" these
>>> entities from the 3rd party providers to our system... most of the time
>>> they will not change, but some will. I don't want to se changed=true for
>>> all documents because most will actually not be modified and we dont want
>>> to re-run those through the ETL process.
>>> The hashing would happen as part of the sync .. maybe a findAndModify
>>> using document hash != new hash. This way I am only modifying those
>>> documents that have different computed hashes!?
>>> Up to now we have been using mysql and leveraging a `DateTimeUpdated` ON
>>> UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp
>>> gets updated and we ETL based on that DateTimeUpdate >= last ETL run
>>> timestamp.
>>> B
>>> On Thursday, May 17, 2012 8:53:45 AM UTC-4, Sammaye wrote:
>>>> Can't you just do a change = true flag? I mean the MD5 wouldn't be
>>>> awesome since you would have to pull out every doc to understand if it
>>>> needs to to be ETL'ed (assuming each doc is different and sow ould have its
>>>> own old md5 and new md5) which wouldn't be nice at all.
>>>> On 17 May 2012 13:48, Brandon Parise <bpar...@gmail.com> wrote:
>>>>> We want to use MongoDB to store documents of various 3rd party
>>>>> providers but need to hook up an ETL process for warehousing. These
>>>>> documents will change infrequently and there will be tons of them (5M
>>>>> +). What are some ways that we can "flag" documents that were
>>>>> actually modified and need to be ETL'd?
>>>>> I was thinking about maintaining a hash in the document itself, which
>>>>> represents an md5() of the actual data members:
>>>>> # actual data members start here
>>>>> name: 'foo',
>>>>> value: 'bar'
>>>>> }
>>>>> # in our import
>>>>> hash = md5(/** new data members stringified **/);
>>>>> if (doc.hash !== hash) then flag the document to be ETL'ed (push the
>>>>> _id onto an array for the ETL to persist)
>>>>> Thoughts?
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "mongodb-user" group.
>>>>> To post to this group, send email to mongodb-user@googlegroups.com
>>>>> To unsubscribe from this group, send email to
>>>>> mongodb-user+unsubscribe@**googl**egroups.com<mongodb-user%2Bunsubscribe@go oglegroups.com>
>>>>> See also the IRC channel -- freenode.net#mongodb
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "mongodb-user" group.
>>> To post to this group, send email to mongodb-user@googlegroups.com
>>> To unsubscribe from this group, send email to
>>> mongodb-user+unsubscribe@**googlegroups.com<mongodb-user%2Bunsubscribe@goog legroups.com>
>>> See also the IRC channel -- freenode.net#mongodb
>> --
> You received this message because you are subscribed to the Google
> Groups "mongodb-user" group.
> To post to this group, send email to mongodb-user@googlegroups.com
> To unsubscribe from this group, send email to
> mongodb-user+unsubscribe@googlegroups.com
> See also the IRC channel -- freenode.net#mongodb
// if the old hash doesn't match the new one then the entity has been updated if ($result->hash != $entity['hash']) { // @todo push this document onto a stack to be ETL'ed
On Thursday, May 17, 2012 1:29:13 PM UTC-4, Sammaye wrote:
> Well this would be ok with ActiveRecord since ActiveRecords standard is to > detect changes and update on changes.
> But another easy way is to judge when values (through using objects, like > in PHP) change. In PHP you can use the __set __get magics, you could make > make a wrapper you could also when you load the record into the object > store an old version of it self and then on save compare the two, there are > a couple of other ways. When you of course judge a change you set the flag
> On 17 May 2012 18:11, Brandon Parise <bpar...@gmail.com> wrote:
>> I guess a simple question at this point is how to emulate the >> DateTimeUpdated column in MongoDB where it should only update if the >> document was actually modified.
>> the second time this would run shouldn't update the document because its >> the same values.
>> I would like to either set $item['etl'] = true or something similar...
>> B
>> On Thursday, May 17, 2012 10:08:21 AM UTC-4, Sammaye wrote:
>>> Ah ok I get it now I thought this was gonna be like a cronjob synchro >>> for a min. Yea a hash can work there.
>>> "The hashing would happen as part of the sync .. maybe a findAndModify >>> using document hash != new hash. This way I am only modifying those >>> documents that have different computed hashes!?"
>>> Took me a bit of time to decide how to reply to that bit and am still >>> unsure. Maybe you can explain what you mean by that? Maybe better to >>> explain which part of ETL this is. Are you grabbing from the source to put >>> into the Warehouse (transform)? This import; what step of the ETL is it or >>> is it a precursor?
>>> "Up to now we have been using mysql and leveraging a `DateTimeUpdated` >>> ON UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp >>> gets updated and we ETL based on that DateTimeUpdate >= last ETL run >>> timestamp."
>>> That sounds doable from what I can see and should work better than the >>> hashs.
>>> On 17 May 2012 14:17, Brandon Parise <bpar...@gmail.com> wrote:
>>>> Well, this would be part of an import process. We have to "sync" these >>>> entities from the 3rd party providers to our system... most of the time >>>> they will not change, but some will. I don't want to se changed=true for >>>> all documents because most will actually not be modified and we dont want >>>> to re-run those through the ETL process.
>>>> The hashing would happen as part of the sync .. maybe a findAndModify >>>> using document hash != new hash. This way I am only modifying those >>>> documents that have different computed hashes!?
>>>> Up to now we have been using mysql and leveraging a `DateTimeUpdated` >>>> ON UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp >>>> gets updated and we ETL based on that DateTimeUpdate >= last ETL run >>>> timestamp.
>>>> B
>>>> On Thursday, May 17, 2012 8:53:45 AM UTC-4, Sammaye wrote:
>>>>> Can't you just do a change = true flag? I mean the MD5 wouldn't be >>>>> awesome since you would have to pull out every doc to understand if it >>>>> needs to to be ETL'ed (assuming each doc is different and sow ould have its >>>>> own old md5 and new md5) which wouldn't be nice at all.
>>>>> On 17 May 2012 13:48, Brandon Parise <bpar...@gmail.com> wrote:
>>>>>> We want to use MongoDB to store documents of various 3rd party >>>>>> providers but need to hook up an ETL process for warehousing. These >>>>>> documents will change infrequently and there will be tons of them (5M >>>>>> +). What are some ways that we can "flag" documents that were >>>>>> actually modified and need to be ETL'd?
>>>>>> I was thinking about maintaining a hash in the document itself, which >>>>>> represents an md5() of the actual data members:
>>>>>> # actual data members start here >>>>>> name: 'foo', >>>>>> value: 'bar' >>>>>> }
>>>>>> # in our import >>>>>> hash = md5(/** new data members stringified **/); >>>>>> if (doc.hash !== hash) then flag the document to be ETL'ed (push the >>>>>> _id onto an array for the ETL to persist)
>>>>>> Thoughts?
>>>>>> -- >>>>>> You received this message because you are subscribed to the Google >>>>>> Groups "mongodb-user" group. >>>>>> To post to this group, send email to mongodb-user@googlegroups.com >>>>>> To unsubscribe from this group, send email to >>>>>> mongodb-user+unsubscribe@**googl**egroups.com<mongodb-user%2Bunsubscribe@go oglegroups.com> >>>>>> See also the IRC channel -- freenode.net#mongodb
>>>>> -- >>>> You received this message because you are subscribed to the Google >>>> Groups "mongodb-user" group. >>>> To post to this group, send email to mongodb-user@googlegroups.com >>>> To unsubscribe from this group, send email to >>>> mongodb-user+unsubscribe@**googlegroups.com<mongodb-user%2Bunsubscribe@goog legroups.com> >>>> See also the IRC channel -- freenode.net#mongodb
>>> -- >> You received this message because you are subscribed to the Google >> Groups "mongodb-user" group. >> To post to this group, send email to mongodb-user@googlegroups.com >> To unsubscribe from this group, send email to >> mongodb-user+unsubscribe@googlegroups.com >> See also the IRC channel -- freenode.net#mongodb
> // if the old hash doesn't match the new one then the entity has been
> updated
> if ($result->hash != $entity['hash']) {
> // @todo push this document onto a stack to be ETL'ed
> }
> ?>
> On Thursday, May 17, 2012 1:29:13 PM UTC-4, Sammaye wrote:
>> Well this would be ok with ActiveRecord since ActiveRecords standard is
>> to detect changes and update on changes.
>> But another easy way is to judge when values (through using objects, like
>> in PHP) change. In PHP you can use the __set __get magics, you could make
>> make a wrapper you could also when you load the record into the object
>> store an old version of it self and then on save compare the two, there are
>> a couple of other ways. When you of course judge a change you set the flag
>> On 17 May 2012 18:11, Brandon Parise <bpar...@gmail.com> wrote:
>>> I guess a simple question at this point is how to emulate the
>>> DateTimeUpdated column in MongoDB where it should only update if the
>>> document was actually modified.
>>> the second time this would run shouldn't update the document because its
>>> the same values.
>>> I would like to either set $item['etl'] = true or something similar...
>>> B
>>> On Thursday, May 17, 2012 10:08:21 AM UTC-4, Sammaye wrote:
>>>> Ah ok I get it now I thought this was gonna be like a cronjob synchro
>>>> for a min. Yea a hash can work there.
>>>> "The hashing would happen as part of the sync .. maybe a findAndModify
>>>> using document hash != new hash. This way I am only modifying those
>>>> documents that have different computed hashes!?"
>>>> Took me a bit of time to decide how to reply to that bit and am still
>>>> unsure. Maybe you can explain what you mean by that? Maybe better to
>>>> explain which part of ETL this is. Are you grabbing from the source to put
>>>> into the Warehouse (transform)? This import; what step of the ETL is it or
>>>> is it a precursor?
>>>> "Up to now we have been using mysql and leveraging a `DateTimeUpdated`
>>>> ON UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp
>>>> gets updated and we ETL based on that DateTimeUpdate >= last ETL run
>>>> timestamp."
>>>> That sounds doable from what I can see and should work better than the
>>>> hashs.
>>>> On 17 May 2012 14:17, Brandon Parise <bpar...@gmail.com> wrote:
>>>>> Well, this would be part of an import process. We have to "sync"
>>>>> these entities from the 3rd party providers to our system... most of the
>>>>> time they will not change, but some will. I don't want to se changed=true
>>>>> for all documents because most will actually not be modified and we dont
>>>>> want to re-run those through the ETL process.
>>>>> The hashing would happen as part of the sync .. maybe a findAndModify
>>>>> using document hash != new hash. This way I am only modifying those
>>>>> documents that have different computed hashes!?
>>>>> Up to now we have been using mysql and leveraging a `DateTimeUpdated`
>>>>> ON UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp
>>>>> gets updated and we ETL based on that DateTimeUpdate >= last ETL run
>>>>> timestamp.
>>>>> B
>>>>> On Thursday, May 17, 2012 8:53:45 AM UTC-4, Sammaye wrote:
>>>>>> Can't you just do a change = true flag? I mean the MD5 wouldn't be
>>>>>> awesome since you would have to pull out every doc to understand if it
>>>>>> needs to to be ETL'ed (assuming each doc is different and sow ould have its
>>>>>> own old md5 and new md5) which wouldn't be nice at all.
>>>>>> On 17 May 2012 13:48, Brandon Parise <bpar...@gmail.com> wrote:
>>>>>>> We want to use MongoDB to store documents of various 3rd party
>>>>>>> providers but need to hook up an ETL process for warehousing. These
>>>>>>> documents will change infrequently and there will be tons of them (5M
>>>>>>> +). What are some ways that we can "flag" documents that were
>>>>>>> actually modified and need to be ETL'd?
>>>>>>> I was thinking about maintaining a hash in the document itself, which
>>>>>>> represents an md5() of the actual data members:
>>>>>>> # actual data members start here
>>>>>>> name: 'foo',
>>>>>>> value: 'bar'
>>>>>>> }
>>>>>>> # in our import
>>>>>>> hash = md5(/** new data members stringified **/);
>>>>>>> if (doc.hash !== hash) then flag the document to be ETL'ed (push the
>>>>>>> _id onto an array for the ETL to persist)
>>>>>>> Thoughts?
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "mongodb-user" group.
>>>>>>> To post to this group, send email to mongodb-user@googlegroups.com
>>>>>>> To unsubscribe from this group, send email to
>>>>>>> mongodb-user+unsubscribe@**googl****egroups.com<mongodb-user%2Bunsubscribe@ googlegroups.com>
>>>>>>> See also the IRC channel -- freenode.net#mongodb
>>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "mongodb-user" group.
>>>>> To post to this group, send email to mongodb-user@googlegroups.com
>>>>> To unsubscribe from this group, send email to
>>>>> mongodb-user+unsubscribe@**googl**egroups.com<mongodb-user%2Bunsubscribe@go oglegroups.com>
>>>>> See also the IRC channel -- freenode.net#mongodb
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "mongodb-user" group.
>>> To post to this group, send email to mongodb-user@googlegroups.com
>>> To unsubscribe from this group, send email to
>>> mongodb-user+unsubscribe@**googlegroups.com<mongodb-user%2Bunsubscribe@goog legroups.com>
>>> See also the IRC channel -- freenode.net#mongodb
>> --
> You received this message because you are subscribed to the Google
> Groups "mongodb-user" group.
> To post to this group, send email to mongodb-user@googlegroups.com
> To unsubscribe from this group, send email to
> mongodb-user+unsubscribe@googlegroups.com
> See also the IRC channel -- freenode.net#mongodb
I guess I should've started off with that :) Been trying to find out how others ETL from MongoDB and most approach it as they would rather take the overhead of re-ETL'ing the same document (which would effectively have no change anyways) than have a complex "middleman" to manage the "needs etl'ed state'. But, in our case we will have millions of documents so we want to keep the process as lean as possible.
This was an interesting dive and hopefully someone who searches this group finds this thread useful.
>> // if the old hash doesn't match the new one then the entity has been >> updated >> if ($result->hash != $entity['hash']) { >> // @todo push this document onto a stack to be ETL'ed >> } >> ?>
>> On Thursday, May 17, 2012 1:29:13 PM UTC-4, Sammaye wrote:
>>> Well this would be ok with ActiveRecord since ActiveRecords standard is >>> to detect changes and update on changes.
>>> But another easy way is to judge when values (through using objects, >>> like in PHP) change. In PHP you can use the __set __get magics, you could >>> make make a wrapper you could also when you load the record into the object >>> store an old version of it self and then on save compare the two, there are >>> a couple of other ways. When you of course judge a change you set the flag
>>> On 17 May 2012 18:11, Brandon Parise <bpar...@gmail.com> wrote:
>>>> I guess a simple question at this point is how to emulate the >>>> DateTimeUpdated column in MongoDB where it should only update if the >>>> document was actually modified.
>>>> the second time this would run shouldn't update the document because >>>> its the same values.
>>>> I would like to either set $item['etl'] = true or something similar...
>>>> B
>>>> On Thursday, May 17, 2012 10:08:21 AM UTC-4, Sammaye wrote:
>>>>> Ah ok I get it now I thought this was gonna be like a cronjob synchro >>>>> for a min. Yea a hash can work there.
>>>>> "The hashing would happen as part of the sync .. maybe a findAndModify >>>>> using document hash != new hash. This way I am only modifying those >>>>> documents that have different computed hashes!?"
>>>>> Took me a bit of time to decide how to reply to that bit and am still >>>>> unsure. Maybe you can explain what you mean by that? Maybe better to >>>>> explain which part of ETL this is. Are you grabbing from the source to put >>>>> into the Warehouse (transform)? This import; what step of the ETL is it or >>>>> is it a precursor?
>>>>> "Up to now we have been using mysql and leveraging a `DateTimeUpdated` >>>>> ON UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp >>>>> gets updated and we ETL based on that DateTimeUpdate >= last ETL run >>>>> timestamp."
>>>>> That sounds doable from what I can see and should work better than the >>>>> hashs.
>>>>> On 17 May 2012 14:17, Brandon Parise <bpar...@gmail.com> wrote:
>>>>>> Well, this would be part of an import process. We have to "sync" >>>>>> these entities from the 3rd party providers to our system... most of the >>>>>> time they will not change, but some will. I don't want to se changed=true >>>>>> for all documents because most will actually not be modified and we dont >>>>>> want to re-run those through the ETL process.
>>>>>> The hashing would happen as part of the sync .. maybe a findAndModify >>>>>> using document hash != new hash. This way I am only modifying those >>>>>> documents that have different computed hashes!?
>>>>>> Up to now we have been using mysql and leveraging a `DateTimeUpdated` >>>>>> ON UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp >>>>>> gets updated and we ETL based on that DateTimeUpdate >= last ETL run >>>>>> timestamp.
>>>>>> B
>>>>>> On Thursday, May 17, 2012 8:53:45 AM UTC-4, Sammaye wrote:
>>>>>>> Can't you just do a change = true flag? I mean the MD5 wouldn't be >>>>>>> awesome since you would have to pull out every doc to understand if it >>>>>>> needs to to be ETL'ed (assuming each doc is different and sow ould have its >>>>>>> own old md5 and new md5) which wouldn't be nice at all.
>>>>>>> On 17 May 2012 13:48, Brandon Parise <bpar...@gmail.com> wrote:
>>>>>>>> We want to use MongoDB to store documents of various 3rd party >>>>>>>> providers but need to hook up an ETL process for warehousing. These >>>>>>>> documents will change infrequently and there will be tons of them >>>>>>>> (5M >>>>>>>> +). What are some ways that we can "flag" documents that were >>>>>>>> actually modified and need to be ETL'd?
>>>>>>>> I was thinking about maintaining a hash in the document itself, >>>>>>>> which >>>>>>>> represents an md5() of the actual data members:
>>>>>>>> # actual data members start here >>>>>>>> name: 'foo', >>>>>>>> value: 'bar' >>>>>>>> }
>>>>>>>> # in our import >>>>>>>> hash = md5(/** new data members stringified **/); >>>>>>>> if (doc.hash !== hash) then flag the document to be ETL'ed (push the >>>>>>>> _id onto an array for the ETL to persist)
>>>>>>>> Thoughts?
>>>>>>>> -- >>>>>>>> You received this message because you are subscribed to the Google >>>>>>>> Groups "mongodb-user" group. >>>>>>>> To post to this group, send email to mongodb-user@googlegroups.com >>>>>>>> To unsubscribe from this group, send email to >>>>>>>> mongodb-user+unsubscribe@**googl****egroups.com<mongodb-user%2Bunsubscribe@ googlegroups.com> >>>>>>>> See also the IRC channel -- freenode.net#mongodb
>>>>>>> -- >>>>>> You received this message because you are subscribed to the Google >>>>>> Groups "mongodb-user" group. >>>>>> To post to this group, send email to mongodb-user@googlegroups.com >>>>>> To unsubscribe from this group, send email to >>>>>> mongodb-user+unsubscribe@**googl**egroups.com<mongodb-user%2Bunsubscribe@go oglegroups.com> >>>>>> See also the IRC channel -- freenode.net#mongodb
>>>>> -- >>>> You received this message because you are subscribed to the Google >>>> Groups "mongodb-user" group. >>>> To post to this group, send email to mongodb-user@googlegroups.com >>>> To unsubscribe from this group, send email to >>>> mongodb-user+unsubscribe@**googlegroups.com<mongodb-user%2Bunsubscribe@goog legroups.com> >>>> See also the IRC channel -- freenode.net#mongodb
>>> -- >> You received this message because you are subscribed to the Google >> Groups "mongodb-user" group. >> To post to this group, send email to mongodb-user@googlegroups.com >> To unsubscribe from this group, send email to >> mongodb-user+unsubscribe@googlegroups.com >> See also the IRC channel -- freenode.net#mongodb
I can suggest you to keep last update time and index it. Usually you
know when you change data and for what reason and if updated dataset
is small it would be very fast to check, you can combine this approach
with labels. I have ETL with collection with 2M of docs this works
very fast.
Regards,
Valeriy
On 22 май, 19:14, Brandon Parise <bpar...@gmail.com> wrote:
> I guess I should've started off with that :) Been trying to find out how
> others ETL from MongoDB and most approach it as they would rather take the
> overhead of re-ETL'ing the same document (which would effectively have no
> change anyways) than have a complex "middleman" to manage the "needs etl'ed
> state'. But, in our case we will have millions of documents so we want to
> keep the process as lean as possible.
> This was an interesting dive and hopefully someone who searches this group
> finds this thread useful.
> Thanks Sammaye!
> B
> On Tuesday, May 22, 2012 9:46:47 AM UTC-4, Sammaye wrote:
> > Aye that looks good now that you've explained your logic :)
> > On 22 May 2012 14:44, Brandon Parise <bpar...@gmail.com> wrote:
> >> Sorry for taking so long to respond!! I don't want to have to pull the
> >> down the document in order to maintain the "dirty" state of the document.
> >> // if the old hash doesn't match the new one then the entity has been
> >> updated
> >> if ($result->hash != $entity['hash']) {
> >> // @todo push this document onto a stack to be ETL'ed
> >> }
> >> ?>
> >> On Thursday, May 17, 2012 1:29:13 PM UTC-4, Sammaye wrote:
> >>> Well this would be ok with ActiveRecord since ActiveRecords standard is
> >>> to detect changes and update on changes.
> >>> But another easy way is to judge when values (through using objects,
> >>> like in PHP) change. In PHP you can use the __set __get magics, you could
> >>> make make a wrapper you could also when you load the record into the object
> >>> store an old version of it self and then on save compare the two, there are
> >>> a couple of other ways. When you of course judge a change you set the flag
> >>> On 17 May 2012 18:11, Brandon Parise <bpar...@gmail.com> wrote:
> >>>> I guess a simple question at this point is how to emulate the
> >>>> DateTimeUpdated column in MongoDB where it should only update if the
> >>>> document was actually modified.
> >>>> the second time this would run shouldn't update the document because
> >>>> its the same values.
> >>>> I would like to either set $item['etl'] = true or something similar...
> >>>> B
> >>>> On Thursday, May 17, 2012 10:08:21 AM UTC-4, Sammaye wrote:
> >>>>> Ah ok I get it now I thought this was gonna be like a cronjob synchro
> >>>>> for a min. Yea a hash can work there.
> >>>>> "The hashing would happen as part of the sync .. maybe a findAndModify
> >>>>> using document hash != new hash. This way I am only modifying those
> >>>>> documents that have different computed hashes!?"
> >>>>> Took me a bit of time to decide how to reply to that bit and am still
> >>>>> unsure. Maybe you can explain what you mean by that? Maybe better to
> >>>>> explain which part of ETL this is. Are you grabbing from the source to put
> >>>>> into the Warehouse (transform)? This import; what step of the ETL is it or
> >>>>> is it a precursor?
> >>>>> "Up to now we have been using mysql and leveraging a `DateTimeUpdated`
> >>>>> ON UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp
> >>>>> gets updated and we ETL based on that DateTimeUpdate >= last ETL run
> >>>>> timestamp."
> >>>>> That sounds doable from what I can see and should work better than the
> >>>>> hashs.
> >>>>> On 17 May 2012 14:17, Brandon Parise <bpar...@gmail.com> wrote:
> >>>>>> Well, this would be part of an import process. We have to "sync"
> >>>>>> these entities from the 3rd party providers to our system... most of the
> >>>>>> time they will not change, but some will. I don't want to se changed=true
> >>>>>> for all documents because most will actually not be modified and we dont
> >>>>>> want to re-run those through the ETL process.
> >>>>>> The hashing would happen as part of the sync .. maybe a findAndModify
> >>>>>> using document hash != new hash. This way I am only modifying those
> >>>>>> documents that have different computed hashes!?
> >>>>>> Up to now we have been using mysql and leveraging a `DateTimeUpdated`
> >>>>>> ON UPDATE CURRENT_TIMESTAMP column so if any column changed that timestamp
> >>>>>> gets updated and we ETL based on that DateTimeUpdate >= last ETL run
> >>>>>> timestamp.
> >>>>>> B
> >>>>>> On Thursday, May 17, 2012 8:53:45 AM UTC-4, Sammaye wrote:
> >>>>>>> Can't you just do a change = true flag? I mean the MD5 wouldn't be
> >>>>>>> awesome since you would have to pull out every doc to understand if it
> >>>>>>> needs to to be ETL'ed (assuming each doc is different and sow ould have its
> >>>>>>> own old md5 and new md5) which wouldn't be nice at all.
> >>>>>>> On 17 May 2012 13:48, Brandon Parise <bpar...@gmail.com> wrote:
> >>>>>>>> We want to use MongoDB to store documents of various 3rd party
> >>>>>>>> providers but need to hook up an ETL process for warehousing. These
> >>>>>>>> documents will change infrequently and there will be tons of them
> >>>>>>>> (5M
> >>>>>>>> +). What are some ways that we can "flag" documents that were
> >>>>>>>> actually modified and need to be ETL'd?
> >>>>>>>> I was thinking about maintaining a hash in the document itself,
> >>>>>>>> which
> >>>>>>>> represents an md5() of the actual data members:
> >>>>>>>> # actual data members start here
> >>>>>>>> name: 'foo',
> >>>>>>>> value: 'bar'
> >>>>>>>> }
> >>>>>>>> # in our import
> >>>>>>>> hash = md5(/** new data members stringified **/);
> >>>>>>>> if (doc.hash !== hash) then flag the document to be ETL'ed (push the
> >>>>>>>> _id onto an array for the ETL to persist)
> >>>>>>>> Thoughts?
> >>>>>>>> --
> >>>>>>>> You received this message because you are subscribed to the Google
> >>>>>>>> Groups "mongodb-user" group.
> >>>>>>>> To post to this group, send email to mongodb-user@googlegroups.com
> >>>>>>>> To unsubscribe from this group, send email to
> >>>>>>>> mongodb-user+unsubscribe@**googl****egroups.com<mongodb-user%2Bunsubscribe@ googlegroups.com>
> >>>>>>>> See also the IRC channel -- freenode.net#mongodb
> >>>>>>> --
> >>>>>> You received this message because you are subscribed to the Google
> >>>>>> Groups "mongodb-user" group.
> >>>>>> To post to this group, send email to mongodb-user@googlegroups.com
> >>>>>> To unsubscribe from this group, send email to
> >>>>>> mongodb-user+unsubscribe@**googl**egroups.com<mongodb-user%2Bunsubscribe@go oglegroups.com>
> >>>>>> See also the IRC channel -- freenode.net#mongodb
> >>>>> --
> >>>> You received this message because you are subscribed to the Google
> >>>> Groups "mongodb-user" group.
> >>>> To post to this group, send email to mongodb-user@googlegroups.com
> >>>> To unsubscribe from this group, send email to
> >>>> mongodb-user+unsubscribe@**googlegroups.com<mongodb-user%2Bunsubscribe@goog legroups.com>
> >>>> See also the IRC channel -- freenode.net#mongodb
> >>> --
> >> You received this message because you are subscribed to the Google
> >> Groups "mongodb-user" group.
> >> To post to this group, send email to mongodb-user@googlegroups.com
> >> To unsubscribe from this group, send email to
> >> mongodb-user+unsubscribe@googlegroups.com
> >> See also the IRC channel -- freenode.net#mongodb
Yea mongo isn't like SQL, where in SQL you are trying your darn hardest to
ensure you do as few queries and as complete as possible Mongo is
more..streamable on that front.
With an index on hash yours should be that bad but Varela makes a valid
point which is why I said the timestamp would work initially.
On 22 May 2012 18:38, Varela <zmievsko...@gmail.com> wrote:
> I can suggest you to keep last update time and index it. Usually you
> know when you change data and for what reason and if updated dataset
> is small it would be very fast to check, you can combine this approach
> with labels. I have ETL with collection with 2M of docs this works
> very fast.
> Regards,
> Valeriy
> On 22 май, 19:14, Brandon Parise <bpar...@gmail.com> wrote:
> > I guess I should've started off with that :) Been trying to find out how
> > others ETL from MongoDB and most approach it as they would rather take
> the
> > overhead of re-ETL'ing the same document (which would effectively have no
> > change anyways) than have a complex "middleman" to manage the "needs
> etl'ed
> > state'. But, in our case we will have millions of documents so we want
> to
> > keep the process as lean as possible.
> > This was an interesting dive and hopefully someone who searches this
> group
> > finds this thread useful.
> > Thanks Sammaye!
> > B
> > On Tuesday, May 22, 2012 9:46:47 AM UTC-4, Sammaye wrote:
> > > Aye that looks good now that you've explained your logic :)
> > > On 22 May 2012 14:44, Brandon Parise <bpar...@gmail.com> wrote:
> > >> Sorry for taking so long to respond!! I don't want to have to pull
> the
> > >> down the document in order to maintain the "dirty" state of the
> document.
> > >> // if the old hash doesn't match the new one then the entity has been
> > >> updated
> > >> if ($result->hash != $entity['hash']) {
> > >> // @todo push this document onto a stack to be ETL'ed
> > >> }
> > >> ?>
> > >> On Thursday, May 17, 2012 1:29:13 PM UTC-4, Sammaye wrote:
> > >>> Well this would be ok with ActiveRecord since ActiveRecords standard
> is
> > >>> to detect changes and update on changes.
> > >>> But another easy way is to judge when values (through using objects,
> > >>> like in PHP) change. In PHP you can use the __set __get magics, you
> could
> > >>> make make a wrapper you could also when you load the record into the
> object
> > >>> store an old version of it self and then on save compare the two,
> there are
> > >>> a couple of other ways. When you of course judge a change you set
> the flag
> > >>> On 17 May 2012 18:11, Brandon Parise <bpar...@gmail.com> wrote:
> > >>>> I guess a simple question at this point is how to emulate the
> > >>>> DateTimeUpdated column in MongoDB where it should only update if the
> > >>>> document was actually modified.
> > >>>> the second time this would run shouldn't update the document because
> > >>>> its the same values.
> > >>>> I would like to either set $item['etl'] = true or something
> similar...
> > >>>> B
> > >>>> On Thursday, May 17, 2012 10:08:21 AM UTC-4, Sammaye wrote:
> > >>>>> Ah ok I get it now I thought this was gonna be like a cronjob
> synchro
> > >>>>> for a min. Yea a hash can work there.
> > >>>>> "The hashing would happen as part of the sync .. maybe a
> findAndModify
> > >>>>> using document hash != new hash. This way I am only modifying
> those
> > >>>>> documents that have different computed hashes!?"
> > >>>>> Took me a bit of time to decide how to reply to that bit and am
> still
> > >>>>> unsure. Maybe you can explain what you mean by that? Maybe better
> to
> > >>>>> explain which part of ETL this is. Are you grabbing from the
> source to put
> > >>>>> into the Warehouse (transform)? This import; what step of the ETL
> is it or
> > >>>>> is it a precursor?
> > >>>>> "Up to now we have been using mysql and leveraging a
> `DateTimeUpdated`
> > >>>>> ON UPDATE CURRENT_TIMESTAMP column so if any column changed that
> timestamp
> > >>>>> gets updated and we ETL based on that DateTimeUpdate >= last ETL
> run
> > >>>>> timestamp."
> > >>>>> That sounds doable from what I can see and should work better than
> the
> > >>>>> hashs.
> > >>>>> On 17 May 2012 14:17, Brandon Parise <bpar...@gmail.com> wrote:
> > >>>>>> Well, this would be part of an import process. We have to "sync"
> > >>>>>> these entities from the 3rd party providers to our system... most
> of the
> > >>>>>> time they will not change, but some will. I don't want to se
> changed=true
> > >>>>>> for all documents because most will actually not be modified and
> we dont
> > >>>>>> want to re-run those through the ETL process.
> > >>>>>> The hashing would happen as part of the sync .. maybe a
> findAndModify
> > >>>>>> using document hash != new hash. This way I am only modifying
> those
> > >>>>>> documents that have different computed hashes!?
> > >>>>>> Up to now we have been using mysql and leveraging a
> `DateTimeUpdated`
> > >>>>>> ON UPDATE CURRENT_TIMESTAMP column so if any column changed that
> timestamp
> > >>>>>> gets updated and we ETL based on that DateTimeUpdate >= last ETL
> run
> > >>>>>> timestamp.
> > >>>>>> B
> > >>>>>> On Thursday, May 17, 2012 8:53:45 AM UTC-4, Sammaye wrote:
> > >>>>>>> Can't you just do a change = true flag? I mean the MD5 wouldn't
> be
> > >>>>>>> awesome since you would have to pull out every doc to understand
> if it
> > >>>>>>> needs to to be ETL'ed (assuming each doc is different and sow
> ould have its
> > >>>>>>> own old md5 and new md5) which wouldn't be nice at all.
> > >>>>>>> On 17 May 2012 13:48, Brandon Parise <bpar...@gmail.com> wrote:
> > >>>>>>>> We want to use MongoDB to store documents of various 3rd party
> > >>>>>>>> providers but need to hook up an ETL process for warehousing.
> These
> > >>>>>>>> documents will change infrequently and there will be tons of
> them
> > >>>>>>>> (5M
> > >>>>>>>> +). What are some ways that we can "flag" documents that were
> > >>>>>>>> actually modified and need to be ETL'd?
> > >>>>>>>> I was thinking about maintaining a hash in the document itself,
> > >>>>>>>> which
> > >>>>>>>> represents an md5() of the actual data members:
> > >>>>>>>> # actual data members start here
> > >>>>>>>> name: 'foo',
> > >>>>>>>> value: 'bar'
> > >>>>>>>> }
> > >>>>>>>> # in our import
> > >>>>>>>> hash = md5(/** new data members stringified **/);
> > >>>>>>>> if (doc.hash !== hash) then flag the document to be ETL'ed
> (push the
> > >>>>>>>> _id onto an array for the ETL to persist)
> > >>>>>>>> Thoughts?
> > >>>>>>>> --
> > >>>>>>>> You received this message because you are subscribed to the
> Google
> > >>>>>>>> Groups "mongodb-user" group.
> > >>>>>>>> To post to this group, send email to
> mongodb-user@googlegroups.com
> > >>>>>>>> To unsubscribe from this group, send email to
> > >>>>>>>> mongodb-user+unsubscribe@**googl****egroups.com
> <mongodb-user%2Bunsubscribe@ googlegroups.com>
> > >>>>>>>> See also the IRC channel -- freenode.net#mongodb
> > >>>>>>> --
> > >>>>>> You received this message because you are subscribed to the Google
> > >>>>>> Groups "mongodb-user" group.
> > >>>>>> To post to this group, send email to
> mongodb-user@googlegroups.com
> > >>>>>> To unsubscribe from this group, send email to
> > >>>>>> mongodb-user+unsubscribe@**googl**egroups.com
> <mongodb-user%2Bunsubscribe@go oglegroups.com>
> > >>>>>> See also the IRC channel -- freenode.net#mongodb
> > >>>>> --
> > >>>> You received this message because you are subscribed to the Google
> > >>>> Groups "mongodb-user" group.
> > >>>> To post to this group, send email to mongodb-user@googlegroups.com
> > >>>> To unsubscribe from this group, send email to
> > >>>> mongodb-user+unsubscribe@**googlegroups.com
> <mongodb-user%2Bunsubscribe@goog legroups.com>
> > >>>> See also the IRC channel -- freenode.net#mongodb
> > >>> --
> > >> You received this message because you are subscribed to the Google
> > >> Groups "mongodb-user" group.
> > >> To post to this group, send email to mongodb-user@googlegroups.com
> > >> To unsubscribe from this group, send email to
> > >> mongodb-user+unsubscribe@googlegroups.com
> > >> See also the IRC channel -- freenode.net#mongodb
> --
> You received this message because you are subscribed to the Google
> Groups "mongodb-user" group.
> To post to this group, send email to mongodb-user@googlegroups.com
> To unsubscribe from this group, send email to
> mongodb-user+unsubscribe@googlegroups.com
> See also the IRC channel --