dealing with data migrations/heavylifting

Jeremie Miller

Jan 29, 2012, 2:46:47 AM
to singl...@googlegroups.com
There are times when a bunch of data work has to be done, such as processing the full history of tweets, checkins, or any of the larger connector datasets.  Often this happens when a new collection is being introduced, as I'm facing now trying to get the Timeline one ready: when it first turns on it needs to go back and index a *lot* of data, but it doesn't necessarily have to do it all at once or immediately.

So, I've been thinking that it might be time for a mechanism to let services do this kind of "background" work.  The simplest thing I can think of would be two new events from core, work://me/#stop and #start, that can be dynamically listened to whenever there's background work to be done.  A service can't start work until it gets a #start event, must stop upon receiving a #stop, and then waits again for the next #start.
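
Very roughly, the service side would look something like this (just a sketch; the work emitter, queue, and indexItem below are made-up placeholders for however a service actually hooks into core and does its per-item work):

  var events = require('events');
  var work = new events.EventEmitter(); // stand-in for however core exposes its events

  var working = false;
  var queue = [];                       // the backlog this service needs to chew through

  work.on('work://me/#start', function () {
    working = true;
    processNext();
  });

  work.on('work://me/#stop', function () {
    working = false;                    // finish the current item, then wait for the next #start
  });

  function processNext() {
    if (!working || queue.length === 0) return;
    indexItem(queue.shift(), function () {
      process.nextTick(processNext);    // yield between items so we don't hog the event loop
    });
  }

  function indexItem(item, callback) {
    // the actual per-item indexing work would go here
    callback();
  }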

Core could, for now, very rudimentarily just use os.loadavg() and a configured threshold to generate the events; that way the background work would always quiet itself on any system that was busy.
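
On the core side that could be as simple as something like this (untested sketch; the threshold and check interval are just guesses and would really come from config):

  var os = require('os');
  var events = require('events');

  var work = new events.EventEmitter(); // shared with the services listening above
  var THRESHOLD = 2.0;                  // would be configured per box
  var stopped = true;                   // services wait for the first #start

  setInterval(function () {
    var load = os.loadavg()[0];         // 1-minute load average
    if (!stopped && load > THRESHOLD) {
      stopped = true;
      work.emit('work://me/#stop');
    } else if (stopped && load <= THRESHOLD) {
      stopped = false;
      work.emit('work://me/#start');
    }
  }, 10000);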

I'd really like to get the new timeline collection merged in soon and am kinda blocked on resolving this (the alternative being to force all existing lockers through some up-front headaches to update), so any/all thoughts are welcome, and barring any blockers I'll try implementing it in the next day or two :)

Thanks!

Jer

Forrest Norvell

Jan 29, 2012, 6:54:24 PM
to singl...@googlegroups.com
Have you thought about throttling the number of events being processed when load gets high? My concern with anything that just stops and starts execution threads dead is that it will lead to a bunched-up execution pattern where lots of things end up starting and stopping at once. It seems like introducing a delay between the handling of individual events in a stream would give us a much smoother-looking execution profile.
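
Something along these lines, in other words (sketch only; the 50ms is pulled out of the air, and handleEvent/pendingEvents stand in for whatever the real handler and stream are):

  // work through a queue of events one at a time, pausing between each
  function drain(queue, delayMs) {
    if (queue.length === 0) return;
    handleEvent(queue.shift(), function () {
      setTimeout(function () {
        drain(queue, delayMs);          // bump delayMs up when the box gets busy
      }, delayMs);
    });
  }

  drain(pendingEvents, 50);             // e.g. 50ms of breathing room between events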

F

Jeremie Miller

Jan 29, 2012, 7:38:24 PM
to singl...@googlegroups.com
I wondered about the same thing, whether unhealthy waves could form, since that can happen very accidentally with this pattern in a server env.

It's really easy to slow down the processing granularly with node, but it's more subtle to signal what rate is good, so any suggestions here are welcome!

If these events happen about as fast as the 1-min load average is recalculated (10sec?), that might be as close as we can get to throttling anyway.

Jer

Forrest Norvell

Jan 29, 2012, 7:46:30 PM
to singl...@googlegroups.com
On Sun, Jan 29, 2012 at 4:38 PM, Jeremie Miller <j...@singly.com> wrote: 
It's really easy to slow down the processing granularly with node, but it's more subtle to signal what rate is good, so any suggestions here are welcome!

If these events happen about as fast as the 1-min load average is recalculated (10sec?), that might be as close as we can get to throttling anyway.
 
Ideally, there'd be a small shared state file (JSON on disk?) containing the current delay. Have an event handler on a setTimeout that updates it periodically (every 10 seconds sounds good) and does some butt-simple calculation based on the 1-minute load average taken from running uptime (sorry, for this as well as so many other things, Windows developers!). Most of the time the event handlers would be loading the shared state directly out of the buffer cache, and since only one thing is updating it, the client threads could read from it as often as they like without fear.
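
In code it'd be roughly this (sketch; the path and the delay formula are made up, and I'm cheating with os.loadavg() instead of actually shelling out to uptime since it's the same number):

  var fs = require('fs');
  var os = require('os');

  var STATE_FILE = '/tmp/background-delay.json'; // wherever shared locker state should live

  // exactly one process owns the file and refreshes it every 10 seconds
  setInterval(function () {
    var load = os.loadavg()[0];
    var delayMs = Math.min(5000, Math.round(load * 500)); // butt-simple: scale delay with load
    fs.writeFile(STATE_FILE, JSON.stringify({ delayMs: delayMs }), function () {});
  }, 10000);

  // everyone else reads it whenever they're about to schedule more work
  function currentDelay(callback) {
    fs.readFile(STATE_FILE, 'utf8', function (err, raw) {
      if (err) return callback(100);    // sane default if the file isn't there yet
      callback(JSON.parse(raw).delayMs);
    });
  }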

This has been your moment of Software Design: The djb Way®.

F

Jeremie Miller

Jan 30, 2012, 12:09:39 AM
to singl...@googlegroups.com
Hmm, I don't think the mechanics are too much different in the end result. I'm wondering more about the calculation of the delay based on the load avg, the most rudimentary of which is, I think, a simple on/off state relative to the sample period. Wouldn't that be a place to start?

Forrest L Norvell

Jan 30, 2012, 12:19:02 AM
to singl...@googlegroups.com
I still think it's more deterministic to have a short delay between the processing of each individual event instead of stopping between processing batches of events. Or are we talking at cross purposes?

Sent from my iPhone

Jeremie Miller

Jan 30, 2012, 12:32:10 AM
to singl...@googlegroups.com
For fetching the existing data to process (in the present need) it's inherently page/batch based, but in general I totally agree, state-tracking costs aside.

This thread makes me think a third event, #warn, would be very helpful as a granular slow-down signal.  I'll prototype it up soon in the timeline pull req :)
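
Roughly like so (sketch only; the 250ms is arbitrary, `work` is the same stand-in emitter as before, and processNextBatch is whatever pager the collection already uses):

  // three levels: #start = full speed, #warn = slow down, #stop = pause entirely
  var paused = true;
  var delayMs = 0;

  work.on('work://me/#start', function () { paused = false; delayMs = 0; tick(); });
  work.on('work://me/#warn',  function () { delayMs = 250; });
  work.on('work://me/#stop',  function () { paused = true; });

  function tick() {
    if (paused) return;
    processNextBatch(function () {
      setTimeout(tick, delayMs);        // breathe a little between pages when #warn'd
    });
  }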

Thx!

Jer

Matt Zimmerman

Feb 6, 2012, 6:56:46 PM
to singl...@googlegroups.com
Load average is not a good indicator of whether to throttle back. It can be
both deceptively high and deceptively low, and the same number can mean
different things in different conditions. I recommend against using it for
this purpose.

How about measuring how long each chunk takes, and slowing down or speeding
up according to that? That should adapt well to different conditions and
types of workloads.
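
For example, something like this (rough sketch; processChunk stands in for the real
work, the 3x multiplier is arbitrary, and the point is just to cap the fraction of
wall-clock time spent on background work):

  // sleep proportionally to how long the last chunk took, so heavy chunks or a
  // busy/slow machine automatically lead to longer pauses (~25% duty cycle here)
  function processChunks(chunks) {
    if (chunks.length === 0) return;
    var started = Date.now();
    processChunk(chunks.shift(), function () {
      var took = Date.now() - started;
      setTimeout(function () {
        processChunks(chunks);
      }, took * 3);
    });
  }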

--
- mdz

Jeremie Miller

Feb 6, 2012, 9:41:51 PM
to singl...@googlegroups.com
The chunks are raw "wild" data and really uneven, particularly in how much processing work each one requires. Until there's a more accurate measure of available capacity than load that locker core can use, this is better than nothing, and it's also nicely abstracted so that the data crunching isn't tightly bound to any one methodology (core can change it anytime, or even be fed from a system-wide signal).

Jer