app-engine-data-pipelines session video


Jaroslav Záruba

Jun 6, 2010, 10:02:42 PM
to Google App Engine
Hello

I'm reading through the PDF that Brett Slatkin published for the
app-engine-data-pipelines session.
http://tinyurl.com/3523mej

In the video (the Fan-in part) Brett says that the work_index has to
be a hash, so that 'you distribute the load across the BigTable'
http://www.youtube.com/watch?v=zSDC_TU7rtc#t=48m44

And this is how work_index is created:
work_index = '%s-%d' % (sum_name, knuth_hash(index))
...which I guess creates something like 'votesMovieXYZ-54657651321987'
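
As a rough illustration of the line above, here is a minimal sketch. The thread does not show the body of knuth_hash, so the definition below (Knuth's multiplicative hash with the constant 2654435761, reduced modulo 2**32) is an assumption, and make_work_index is a hypothetical helper name.

```python
# Assumption: knuth_hash is Knuth's multiplicative hash, reduced mod 2**32.
KNUTH_MULTIPLIER = 2654435761


def knuth_hash(number):
    """Scatter a sequential integer across the 32-bit range."""
    return (number * KNUTH_MULTIPLIER) % 2**32


def make_work_index(sum_name, index):
    """Hypothetical helper: readable prefix plus hashed sequence number."""
    return '%s-%d' % (sum_name, knuth_hash(index))


if __name__ == '__main__':
    # Consecutive sequence numbers produce widely separated suffixes.
    for index in (1, 2, 3):
        print(make_work_index('votesMovieXYZ', index))
```

Under this assumption the hashed part is never longer than ten decimal digits, while consecutive index values land far apart in the key space.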

My question is: why is only one half of work_index hashed? Is it
important?
Would it be bad to do md5('%s-%d' % (sum_name, index)) instead, so that
the whole key would look like '6gw8....hq6'?

Regards
J. Záruba

Jaroslav Záruba

Jun 7, 2010, 3:40:49 AM
to Google App Engine
Also, if someone knows the purpose of "now / 30" in the task name, please tell me:
http://www.youtube.com/watch?v=zSDC_TU7rtc#t=41m35

Regards
  J. Záruba


Tristan

Jun 7, 2010, 5:14:14 PM
to Google App Engine
I'm not a Python guy, but the purpose of int(now / 30) is to come up
with the same name for a whole span of time (30 seconds, if now is
measured in seconds).

Notice that int(1/30) = 0, int(3/30) = 0, int(29/30) = 0 and
int(32/30) = 1. This way every task created within the same window
gets the same time component in its name.
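
The arithmetic above can be checked with a small sketch. The function name time_bucket and the sample timestamps are made up for illustration; the only thing taken from the thread is the int(now / 30) expression.

```python
import time

WINDOW = 30  # window length, in whatever unit `now` is measured in


def time_bucket(now, window=WINDOW):
    """Return the same integer for every timestamp inside one window."""
    return int(now / window)


if __name__ == '__main__':
    # Timestamps 0..29 share bucket 0, 30..59 share bucket 1, and so on.
    for now in (1, 3, 29, 30, 32, 59, 60):
        print(now, time_bucket(now))
    # With a real clock the bucket changes once per window.
    print(time_bucket(time.time()))
```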

Although now I'm confused, because doesn't he say later on that time is
a bad thing to use for synchronization and that sequence numbers should
be used instead?


Brett Slatkin

Jun 7, 2010, 5:21:09 PM
to google-a...@googlegroups.com
Hey all,

The int(time.time()/30) part of the task name is to prevent queue stalls. When memcache gets evicted the work index counter will be reset to zero. That means new fork-join work items may insert tasks that are named the same as tasks that were already inserted. By including a time window of ~30 seconds in the task name, we ensure that this problem can only last for about thirty seconds. This is also why you should raise an exception when you see a TombstonedTaskError exception.

Worst-case scenario if the clocks are wonky is that two tasks are run to do the fan-in work instead of just one, which is an acceptable trade-off in many cases and a fundamental possibility when using the task queue API. This can be mitigated using pigeon-hole acknowledgment entities, like I use in my materialized view example.
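
To make the stall scenario concrete, here is a minimal sketch that does not use the real task queue API. FakeQueue is an in-memory stand-in that rejects reused names, and the name layout in fan_in_task_name (sum name, time bucket, index) is an assumption for illustration rather than the exact format from the talk.

```python
def fan_in_task_name(sum_name, index, now, window=30):
    """Hypothetical name layout: <sum_name>-<time bucket>-<index>."""
    return '%s-%d-%d' % (sum_name, int(now / window), index)


class FakeQueue(object):
    """In-memory stand-in for a queue that refuses reused task names."""

    def __init__(self):
        self.seen = set()

    def add(self, name):
        if name in self.seen:
            # The real task queue signals this case with an exception
            # such as TombstonedTaskError; here we just report failure.
            return False
        self.seen.add(name)
        return True


if __name__ == '__main__':
    queue = FakeQueue()
    # First fan-in task for work index 1.
    print(queue.add(fan_in_task_name('votes', 1, now=1000)))
    # Counter was evicted and restarted at 1 within the same window:
    # the name collides, so no new task is inserted.
    print(queue.add(fan_in_task_name('votes', 1, now=1010)))
    # Thirty seconds later the bucket has changed, so inserts work again.
    print(queue.add(fan_in_task_name('votes', 1, now=1030)))
```

Without the time bucket in the name, the second and third inserts would both collide with the first one, and they would keep colliding until the counter passed every index that had already been used.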

Hope that helps,

-Brett


--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.


Jaroslav Záruba

Jun 7, 2010, 5:38:54 PM
to google-a...@googlegroups.com
Thank you, Brett.

Would it be wrong to hash the whole work_index instead of hashing only its second half, i.e. (sum_name, knuth_hash(index))?
By md5-ing only the sequence number I get a work_index of 'mySumName' + 32B. If I hashed mySumName together with the sequence number, the key would be only 32B. (Still quite huge, though.)
Given how many vote entities there will be, I would like to keep the keys as short as possible.

Regards
  J. Záruba

Brett Slatkin

Jun 7, 2010, 5:44:58 PM
to google-a...@googlegroups.com
I was using an integer hash to reduce the key size. You don't need to hash the whole thing. Bigtable will split tablets based on a string prefix, so all that matters is the data distribution beyond that prefix. So "foo-<hash>" is just as effective as "<hash of foo + number>", or even better since it's shorter.
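
A quick way to compare the two key shapes is sketched below. As before, the body of knuth_hash is an assumption (Knuth's multiplicative hash mod 2**32), and prefix_key and md5_key are hypothetical helper names.

```python
import hashlib


def knuth_hash(number):
    # Assumption: Knuth's multiplicative hash, reduced mod 2**32.
    return (number * 2654435761) % 2**32


def prefix_key(sum_name, index):
    """'foo-<hash>' style: plain prefix, hashed sequence number."""
    return '%s-%d' % (sum_name, knuth_hash(index))


def md5_key(sum_name, index):
    """'<hash of foo + number>' style: md5 hex digest of the whole key."""
    raw = '%s-%d' % (sum_name, index)
    return hashlib.md5(raw.encode('utf-8')).hexdigest()


if __name__ == '__main__':
    for index in (1, 2, 3):
        p = prefix_key('foo', index)
        m = md5_key('foo', index)
        print(len(p), p, len(m), m)
```

The md5 hex digest is always 32 characters, while the prefixed form is the name plus a dash and at most ten digits, so it is the shorter of the two whenever the name is under 21 characters.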


Jaroslav Záruba

Jun 7, 2010, 5:49:40 PM
to google-a...@googlegroups.com
Thanks a lot!

james lesorg

Jun 14, 2010, 3:56:51 AM
to Google App Engine
Brett, any plans to turn this talk into an article?

This feels like such a key strategy for getting stuff done on the
datastore that it should be part of the SDK.


