Hi,
I suggest keeping a last-update time on each document and indexing it.
You usually know when you change data and why, and if the updated
dataset is small the check is very fast. You can combine this approach
with labels. I have an ETL over a collection of 2M docs and this works
very fast.
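
A minimal sketch of the timestamp idea, assuming pymongo as the driver and a hypothetical field name `updated_at` (neither is named in this thread). The helpers only build the arguments, so they run without a server; the driver calls are shown as comments.

```python
from datetime import datetime, timezone

# Hypothetical field name; the thread does not say what the field is called.
UPDATED_FIELD = "updated_at"


def index_spec():
    """Key spec for a single-field ascending index on the timestamp."""
    return [(UPDATED_FIELD, 1)]


def touch(update, now=None):
    """Return a copy of an update document whose $set also stamps the time."""
    now = now or datetime.now(timezone.utc)
    stamped = dict(update)
    stamped["$set"] = {**update.get("$set", {}), UPDATED_FIELD: now}
    return stamped


def changed_since_filter(last_run):
    """Filter matching documents updated at or after the last ETL run."""
    return {UPDATED_FIELD: {"$gte": last_run}}


# With pymongo and a running server (assumed, not executed here):
#   coll.create_index(index_spec())
#   coll.update_one({"_id": doc_id}, touch({"$set": {"name": "x"}}))
#   for doc in coll.find(changed_since_filter(last_run)):
#       ...  # run the ETL step for this document
```

The ETL job would then store the start time of each run and pass it as `last_run` on the next one.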
Regards,
Valeriy
On 22 May, 19:14, Brandon Parise <bpar...@gmail.com> wrote:
> I guess I should've started off with that :) I've been trying to find out how
> others ETL from MongoDB, and most would rather take the overhead of
> re-ETL'ing the same document (which would effectively have no change anyway)
> than have a complex "middleman" to manage the "needs ETL'ed" state. But in
> our case we will have millions of documents, so we want to keep the process
> as lean as possible.
>
> This was an interesting dive and hopefully someone who searches this group
> finds this thread useful.
>
> Thanks Sammaye!
> B
>
> On Tuesday, May 22, 2012 9:46:47 AM UTC-4, Sammaye wrote:
>
> > Aye that looks good now that you've explained your logic :)
>
> >>>>> On 17 May 2012 14:17, Brandon Parise <bpar...@gmail.com> wrote:
>
> >>>>>> Well, this would be part of an import process. We have to "sync"
> >>>>>> these entities from the 3rd party providers to our system... most of the
> >>>>>> time they will not change, but some will. I don't want to set changed=true
> >>>>>> for all documents because most will not actually be modified and we don't
> >>>>>> want to re-run those through the ETL process.
>
> >>>>>> The hashing would happen as part of the sync .. maybe a findAndModify
> >>>>>> using document hash != new hash. This way I am only modifying those
> >>>>>> documents that have different computed hashes!?
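
A rough sketch of the hash comparison described above, assuming pymongo and hypothetical field names `hash` and `updated_at` (the thread names neither). The helpers only build the query and update documents, so they run without a server.

```python
import hashlib
import json


def doc_hash(fields):
    """Stable MD5 over the synced fields; keys are sorted so order is ignored."""
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def conditional_update(doc_id, fields, now):
    """Build (query, update) that only touches a document whose hash differs."""
    new_hash = doc_hash(fields)
    # $ne also matches documents that have no stored hash yet.
    query = {"_id": doc_id, "hash": {"$ne": new_hash}}
    update = {"$set": {**fields, "hash": new_hash, "updated_at": now}}
    return query, update


# With pymongo and a running server (assumed, not executed here):
#   query, update = conditional_update(doc_id, fields, now)
#   result = coll.update_one(query, update)
#   result.modified_count is 1 if the document changed, 0 if the hash matched
```

This sketch deliberately avoids `upsert=True`: when the stored hash already matches, the query finds nothing, and an upsert would then try to insert a second document with the same `_id`.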
>
> >>>>>> Up to now we have been using MySQL and leveraging a `DateTimeUpdated`
> >>>>>> ON UPDATE CURRENT_TIMESTAMP column, so if any column changes that timestamp
> >>>>>> gets updated, and we ETL based on DateTimeUpdated >= last ETL run
> >>>>>> timestamp.
>
> >>>>>> B
>
> >>>>>> On Thursday, May 17, 2012 8:53:45 AM UTC-4, Sammaye wrote:
>
> >>>>>>> Can't you just do a change = true flag? I mean the MD5 wouldn't be
> >>>>>>> awesome since you would have to pull out every doc to understand if it
> >>>>>>> needs to be ETL'ed (assuming each doc is different and so would have its
> >>>>>>> own old MD5 and new MD5), which wouldn't be nice at all.
>
> >>>>>>> --
> >>>>>> You received this message because you are subscribed to the Google
> >>>>>> Groups "mongodb-user" group.
> >>>>>> To post to this group, send email to mongod...@googlegroups.com
> >>>>>> To unsubscribe from this group, send email to
> >>>>>> mongodb-user+unsubscribe@googlegroups.com
> >>>>>> See also the IRC channel -- freenode.net#mongodb