weighted distributed processing.

53 views
Skip to first unread message

Joseph Perry

unread,
May 2, 2012, 6:05:23 PM5/2/12
to archiv...@googlegroups.com, gea...@googlegroups.com, ceph-...@vger.kernel.org
Hello All,
First off, I'm sending this email to three discussion groups:
gea...@googlegroups.com - distributed processing library
ceph-...@vger.kernel.org - distributed file system
archiv...@googlegroups.com - my project's discussion list, a distributed processing system.

I'd like to start a discussion about something I'll refer to as weighted distributed task based processing.
Presently, we are using gearman's library's to meet our distributed processing needs. The majority of our processing is file based, and our processing stations are accessing the files over an nfs share. We are looking at replacing the nfs server share with a distributed file systems, like ceph.

It occurs to me that our processing times could theoretically be reduced by by assigning tasks to processing clients where the file resides, over places where it would need to be copied over the network. In order for this to happen, the gearman server would need to get file location information from the ceph system.

pseudo:
gearman client creates a task & includes a weight, of type ceph file
gearman server identifies the file & polls the ceph system for clients that have this file
ceph system returns a list of clients that have the file locally
gearman assigns the task
.    if there is a client available for processing that has the file locally
.        assign it there
.        (that client has local access to the file, still on the ceph system)
.    else
.        assign to other client
.        (that processing client will pull the file from the ceph system over the network)


I call it a weighted distributed processing system, because it reminds me of a weighted die: The outcome is influenced to a certain direction (in the task assignment).

I wanted to start this as a discussion, rather than filing feature requests, because of the complex nature of the requests, and the nicer medium for feedback, clarification and refinement.

I'd be very interested to hear feedback on the idea,
Joseph Perry


Clint Byrum

unread,
May 2, 2012, 6:42:51 PM5/2/12
to Joseph Perry, archivematica, gearman, ceph-devel
Excerpts from Joseph Perry's message of Wed May 02 15:05:23 -0700 2012:
> Hello All,
> First off, I'm sending this email to three discussion groups:
> gea...@googlegroups.com - distributed processing library
> ceph-...@vger.kernel.org - distributed file system
> archiv...@googlegroups.com - my project's discussion list, a
> distributed processing system.
>
> I'd like to start a discussion about something I'll refer to as weighted
> distributed task based processing.
> Presently, we are using gearman's library's to meet our distributed
> processing needs. The majority of our processing is file based, and our
> processing stations are accessing the files over an nfs share. We are
> looking at replacing the nfs server share with a distributed file
> systems, like ceph.
>
> It occurs to me that our processing times could theoretically be reduced
> by by assigning tasks to processing clients where the file resides, over
> places where it would need to be copied over the network. In order for
> this to happen, the gearman server would need to get file location
> information from the ceph system.
>

If I understand the design of CEPH completely, it spreads I/O at the
block level, not the file level.

So there is little point in weighting since it seeks to spread the whole
file across all the machines/block devices in the cluster. Even if you
do ask ceph "which servers is file X on", which I'm sure it could tell
you, You will end up with high weights for most of the servers, and no
real benefit.

In this scenario, you're just better off having a really powerful network
and CEPH will balance the I/O enough that you can scale out the I/O
independently of the compute resources. This seems like a huge win, as
I don't believe most workloads scale at a 1:1 I/O:CPU ratio. 10Gigabit
switches are still not super cheap, but they are probably cheaper than
software engineer hours.

If your network is not up to the task of transferring all those blocks
around, you probably need to focus instead on something that keeps whole
files in a certain place. One such system would be MogileFS. This has a
database with a list of keys that say where the data lives, and in fact
the protocol the MogileFS tracker uses will tell you all the places a
key lives. You could then place a hint in the payload and have 2 levels
of workers. The pseudo becomes:

-workers register two queues. 'dispatch_foo', and 'do_foo_$hostname'
-client sends task w/ filename to 'dispatch_foo'
-dispatcher looks at filename, asks mogile where the file is, looks at
recent queue lengths in gearman, and decides whether or not it is enough
of a win to direct the job to the host where the file is, or to farm it
out to somewhere that is less busy.

This will take a lot of poking at to get tuned right, but it should be
tunable to a single number, the ratio of localized queue length versus
non-localized queue length.

berwin22

unread,
May 3, 2012, 8:10:42 AM5/3/12
to gearman
Greg Farnum wrote the following:

Clint is mostly correct: Ceph does not store files in a single
location. It's not block-based in the sense of 4K disk blocks though —
instead it breaks up files into (by default) 4MB chunks. It's possible
to change this default to a larger number though; our Hadoop bindings
break files into 64MB chunks. And it is possible to retrieve this
location data using the cephfs tool:
./cephfs
not enough parameters!
usage: cephfs path command [options]*
Commands:
show_layout -- view the layout information on a file or dir
set_layout -- set the layout on an empty file,
or the default layout on a directory
show_location -- view the location information on a file
Options:
Useful for setting layouts:
--stripe_unit, -u: set the size of each stripe
--stripe_count, -c: set the number of objects to stripe across
--object_size, -s: set the size of the objects to stripe across
--pool, -p: set the pool to use
Useful for getting location data:
--offset, -l: the offset to retrieve location data for

I suspect this provides the information you're looking for?

-Greg

On May 2, 3:42 pm, Clint Byrum <cl...@fewbar.com> wrote:
> Excerpts from Joseph Perry's message of Wed May 02 15:05:23 -0700 2012:
>
>
>
>
>
>
>
>
>
> > Hello All,
> > First off, I'm sending this email to three discussion groups:
> > gea...@googlegroups.com - distributed processing library
> > ceph-de...@vger.kernel.org - distributed file system

berwin22

unread,
May 3, 2012, 8:12:59 AM5/3/12
to gearman
I'm finding it easier if I break this down into sections.
The concepts
The use cases
The technical feasibility
The cost vs. benefit


Concepts:
If a distributed processing system is processing files on a
distributed file system, processing times would be improved by
assigning tasks to systems that contain the binaries for the files,
and won't have to pull them over the network.

Gearman is great for distributed processing when the data to be
processed is small enough to be contained in the task: Send the data
to the resources to process it.
I feel Gearman starts to have scalability problems when the data is
too large to reasonably include in each task. If you can distribute
the data into chunks related to tasks (in our case files), and
distribute these chunks "close to" processing resources, then there is
an opertunity for improved performance.

Use Cases:
Our use case is specific to file processing.
I'm curious if anyone has a different use case that would benefit from
a similar weighted task distrobution.

Technical feasibility:
Give a bunch of coders enough time, money, encouragement, and caffeine
they can accomplish amazing things.

That being said, I don't believe the tools for this exist yet, but I
think the concepts outlined here are feasible with a lot of work, that
would need a lot of thought. Clint's suggestion is worth looking at.

My dream file system for a distributed processing file system would
include the following features:
-redundancy
--At least one copy residing on a single node (for processing
availibility)
--A copy broken into chunks (for fast dissemination across the
network)
-kernel level event notification (useable by inotify)
-active monitoring for bit-rot
-a mechanism to identify files, and their data on the network

Cost vs. benefit
Clint is right: "10Gigabit switches are still not super cheap, but
they are probably cheaper than software engineer hours." Where this
would start to loose ground is where there are enough deployments,
where the costs of all those non-blocking switches and nics exceeds
the costs of open source development.
The other caviot is when the network scales to be too large to fit on
a single switch. The traffic on the backbone, between the switches,
will include the normal file system traffic and all the pulls of the
files.

Then question then arises: how much of a decrease in network traffic
would this result in?
For our system, this would be a significant decrease in traffic. We
are using gearman to distribute multiple steps in our workflows, where
multiple tasks/steps read the entire file.
However, benefit currently wouldn't justify the cost for us to develop
or fund these features. Maybe in a few years there will be enough
community demand/support. In the mean time, I'm enjoying the
theoretical discussion.
Reply all
Reply to author
Forward
0 new messages