Issue 71: Multithread batch task - how to proceed


Martin Wunderlich

Jun 30, 2015, 1:39:23 AM
to dkpro-lab-...@googlegroups.com
Hi,

I am moving this question here from the issue thread on Github:
I was wondering what the next step would be once the MultiThreadBatchTask is working OK. Would it be necessary to create multi-threaded versions of all the existing BatchTasks, such as "ExperimentTrainTest"? That seems like too much work. After all, AFAICS the only difference between BatchTask and MultiThreadBatchTask is in the way executeConfiguration(...) is implemented. Also, it would be handy if the number of threads to use could be set as a parameter when instantiating the BatchTask. I will have a think about it. It sounds as if the factory pattern might be handy here, or perhaps injecting some kind of BatchTaskExecutor object.
What are your thoughts? How should this be modeled?

Cheers,

Martin
 

Richard Eckart de Castilho

Jun 30, 2015, 8:29:37 AM
to dkpro-lab-...@googlegroups.com
Hi,

actually, DKPro Lab does feature a factory model for tasks, but in the case of the BatchTask, I unfortunately didn't apply it.

A task in DKPro Lab is modeled as:

- a "task" class - holds configuration, maps parameters from the parameter space to parameters of the underlying tools, e.g.
some "ngramSize" parameter on the experiment level would be injected into a "PARAM_NGRAM_SIZE" on the UIMA level.

- a "task engine" class - uses the configuration to actually run a task, e.g. uses the UIMA descriptor produced from a
"task" class and hands it over to uimaFIT for execution

- a "task execution" service - locates the correct "task engine" given a task, provides the task with an execution context, and
runs the task using the engine.

So we have a UimaTask (UimaTaskBase abstract class) and multiple "engines" for UIMA, e.g. using uimaFIT or using the CPE.

We also have a very simple ExecutableTask (ExecutableTaskBase) which basically is just a runnable.
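
The three roles above could be sketched roughly as follows. All names in this sketch (`Task`, `TaskEngine`, `NgramTask`, `SimpleEngine`, `TaskExecutionService`) are hypothetical stand-ins chosen for illustration; they are not the actual DKPro Lab interfaces.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical "task": holds configuration and maps experiment-level
// parameters to tool-level parameters. It contains no run logic.
interface Task {
    void configure(Map<String, Object> experimentParameters);
    Map<String, Object> getToolParameters();
}

// Hypothetical "task engine": knows how to run one kind of task.
interface TaskEngine {
    boolean supports(Task task);
    String run(Task task);
}

class NgramTask implements Task {
    private final Map<String, Object> toolParameters = new HashMap<>();

    @Override
    public void configure(Map<String, Object> experimentParameters) {
        // "ngramSize" on the experiment level is injected as "PARAM_NGRAM_SIZE".
        toolParameters.put("PARAM_NGRAM_SIZE", experimentParameters.get("ngramSize"));
    }

    @Override
    public Map<String, Object> getToolParameters() {
        return toolParameters;
    }
}

class SimpleEngine implements TaskEngine {
    @Override
    public boolean supports(Task task) {
        return task instanceof NgramTask;
    }

    @Override
    public String run(Task task) {
        // A real engine would hand a descriptor to the underlying tool here.
        return "ran with " + task.getToolParameters();
    }
}

// Hypothetical "task execution" service: picks the first engine that
// supports the task and runs the task with it.
class TaskExecutionService {
    private final List<TaskEngine> engines;

    TaskExecutionService(List<TaskEngine> engines) {
        this.engines = engines;
    }

    String execute(Task task) {
        for (TaskEngine engine : engines) {
            if (engine.supports(task)) {
                return engine.run(task);
            }
        }
        throw new IllegalStateException("No engine for " + task.getClass().getName());
    }
}
```

Because the task never runs itself, swapping the engine list handed to the service changes how tasks are executed without touching any task class.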

Now what I should have done would be to implement the parameter sweeping logic in a BatchTaskEngine and just keep the information
about which subtasks exist in the BatchTask. Then somebody could at some point have implemented a
MultiThreadedBatchTaskEngine and just swapped it in as a replacement for the BatchTaskEngine (e.g. by replacing a Maven dependency),
and that would be it.

However, I implemented the BatchTask in terms of the ExecutableTask and didn't make a proper separation between configuration and
run logic. So maybe creating that separation is what you are looking for.

Cheers,

-- Richard

Martin Wunderlich

Jun 30, 2015, 4:02:03 PM
to dkpro-lab-...@googlegroups.com
Hi Richard, 

Thanks a lot for the additional explanation. 
Do you have a rough estimate how much work it would be to convert the current implementation of BatchTasks to the factory-based one? Maybe an initial solution could just sub-class the MultiThreadBatchTask with a default thread count of 1, as suggested by Johannes (in the issue 71 thread). 

BTW, talking of multi-threading, here is a nice visualization of multithreaded programming - theory and practice. 
(I hope the image gets through in the google group list). 

Cheers, 

Martin
 




Martin Wunderlich

Jul 1, 2015, 1:25:58 AM
to dkpro-lab-...@googlegroups.com
I've submitted the pull request with my changes to MultiThreadBatchTask on github.

Richard Eckart de Castilho

Jul 1, 2015, 8:40:36 AM
to dkpro-lab-...@googlegroups.com
On 30.06.2015, at 22:01, Martin Wunderlich <mar...@wunderlich.com> wrote:

> Do you have a rough estimate how much work it would be to convert the current implementation of BatchTasks to the factory-based one?

In theory, it should be straightforward:

- all code except the list containing the subtasks should be moved to the BatchTaskEngine
- the existing BatchTask should be renamed to BatchTaskBase or DefaultBatchTask
- a new interface BatchTask should be introduced
- the new interface and engine should be registered in src/main/resources/META-INF/lab/engines.properties

That said, the BatchTask probably would also require new callbacks. Currently, we have some cases where we override the "execute" method of the BatchTask and inside it, we may create a new sub-parameter-space and even dynamically create subtasks (for multi-threading, this is going to be essential!). So the BatchTask should define methods "beforeExecute" and "afterExecute" which can be customized and which the BatchTaskEngine would be expected to call at the beginning/end of task execution.
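
A minimal sketch of how such callbacks might fit together, assuming a batch task that keeps only its subtask list and an engine that owns the run logic. `SketchBatchTask` and `SketchBatchTaskEngine` are hypothetical names, and subtasks are reduced to plain `Runnable`s for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batch task: holds the subtask list plus two overridable hooks.
class SketchBatchTask {
    private final List<Runnable> subtasks = new ArrayList<>();

    void addSubtask(Runnable subtask) {
        subtasks.add(subtask);
    }

    List<Runnable> getSubtasks() {
        return subtasks;
    }

    // Hooks the engine is expected to call. A subclass may, for example,
    // create additional subtasks in beforeExecute().
    protected void beforeExecute() {
    }

    protected void afterExecute() {
    }
}

// Hypothetical engine: owns the run logic and calls the hooks around it.
class SketchBatchTaskEngine {
    void run(SketchBatchTask task) {
        task.beforeExecute();
        try {
            // Iterate over a copy so subtasks added during the run do not
            // invalidate the iteration.
            for (Runnable subtask : new ArrayList<>(task.getSubtasks())) {
                subtask.run();
            }
        } finally {
            // Called even if a subtask throws.
            task.afterExecute();
        }
    }
}
```

With this split, code that currently overrides "execute" would override the hooks instead, and a different engine could be substituted without changing the task.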

Why is this important? As I said before, a Task represents an adapter between the parameter space and the underlying parameterized code. This adapter is *reused*. That means, when a new point in the parameter space is reached, no new task instance is created, but an existing task instance is *reconfigured* with the new parameters. This would of course not work in a multi-threaded environment. So the idea here is that task instances are created dynamically inside a BatchTask, and that either there is a pool of instances that can be reused or a new instance is created for each parameter configuration. Thus, the BatchTask would become a kind of factory for its own subtasks.

It's been a while since I was deep in the code... I might be missing details.

Cheers,

-- Richard

Martin Wunderlich

Jul 3, 2015, 4:28:51 PM
to dkpro-lab-...@googlegroups.com
Thanks a lot, Richard, for the detailed explanation. I might have some time over the weekend to dig into the class structure and see how much effort it might be to convert the BatchTask to the factory implementation.

In the meantime, I have created a multi-threaded version of the ExperimentTrainTest batch task. It is currently running an experiment from my project, and it is taking more time than expected. I was assuming that the individual preprocessing and feature extraction steps would be modeled as sub-tasks of the respective overall tasks and should therefore benefit from the multi-threaded version by being executed in parallel, one CAS per thread. However, this doesn't seem to be the case: at least the preprocessing was still run sequentially (it hasn't gotten as far as the FE steps yet). I suppose I need to dig a bit deeper to understand why that is. In the meantime, maybe someone can think up a quick hack that would run each individual preprocessing and feature extraction subtask in a separate thread using the MultiThreadBatchTask.

Cheers,

Martin

Richard Eckart de Castilho

Jul 7, 2015, 7:04:38 PM
to dkpro-lab-...@googlegroups.com
I refactored the BatchTask stuff into a Task/Engine pair - but it is not complete yet.

The next step in the refactoring would move the BatchTaskEngine and MultiThreadedBatchTaskEngine into separate modules so that depending on which module is on the classpath one or the other would be used. Right now, only the BatchTaskEngine is used.

But that's just cosmetic really. The crux of the matter is a different one.

When an experimental setup is assembled in DKPro Lab, it is done using "Tasks". Every task is an object instance created before the experiment is run. When a batch task does parameter sweeping, there is an outer loop which takes these instances and injects the parameter configurations into them. When all parameters have been injected, there is an inner loop which tries to execute the configured task (using a suitable task engine). The MultiThreadedBatchTask(Engine) parallelizes the inner loop but not the outer one. That is the reason why we see no speed-ups yet.

The solution would obviously be to parallelize the outer loop, but that is not possible because the outer loop *reconfigures* existing task instances. This cannot be done in parallel because we only have one task instance per task and the threads would concurrently reconfigure that single instance.
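
The problem can be illustrated with a small sketch. `SharedTask` and `LoopSketch` are hypothetical names, and the inner loop is collapsed to a single task. The second method does not use real threads; it deterministically simulates the interleaving a naively parallel outer loop could produce, where every worker reconfigures the shared instance before any of them executes it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical task with exactly one instance per experiment.
class SharedTask {
    private Map<String, Object> configuration = Collections.emptyMap();

    void configure(Map<String, Object> configuration) {
        this.configuration = configuration;
    }

    String execute() {
        return "executed with " + configuration;
    }
}

class LoopSketch {
    // Sequential sweep: each point is injected and executed before the next one.
    static List<String> sweepSequentially(SharedTask task, List<Map<String, Object>> space) {
        List<String> results = new ArrayList<>();
        for (Map<String, Object> point : space) { // outer loop: reconfigure
            task.configure(point);
            results.add(task.execute());          // inner loop: execute (one task here)
        }
        return results;
    }

    // Simulated bad interleaving: all "threads" configure the shared
    // instance first, then all of them execute it.
    static List<String> sweepWithInterleaving(SharedTask task, List<Map<String, Object>> space) {
        for (Map<String, Object> point : space) {
            task.configure(point);
        }
        List<String> results = new ArrayList<>();
        for (int i = 0; i < space.size(); i++) {
            results.add(task.execute()); // every run sees the last configuration
        }
        return results;
    }
}
```

In the interleaved variant every result reflects the last configuration only, which is the kind of corruption that concurrent reconfiguration of a single instance risks.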

I see three approaches to mitigate this:

1) require that all tasks use ThreadLocals to store discriminator values / property values

2) apply a factory pattern and turn a Task into a TaskTemplate. Instead of executing a task directly, the template would be used to create the real task instance which can then be configured. Obviously, we can then create as many instances as we would want.

3) make the MultiThreadedBatchTask(Engine) smart in such a way that it handles clonable Tasks differently. Instead of simply running a clonable Task, first a clone would be created from the original (the prototype) and the clone would then be configured. Again, we could create any number of clones, allowing us to parallelize the outer loop (or merge the two loops into one).
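
For comparison, approach 1 might look roughly like the following sketch, in which a single task instance keeps its parameter values in a `ThreadLocal` so that each thread sees only what it configured itself. `ThreadLocalTask` and `ThreadLocalDemo` are hypothetical names; the `java.util.concurrent` calls are standard JDK API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical task that stores its parameter values per thread.
class ThreadLocalTask {
    private final ThreadLocal<Map<String, Object>> parameters =
            ThreadLocal.withInitial(HashMap::new);

    void setParameter(String name, Object value) {
        parameters.get().put(name, value);
    }

    Object getParameter(String name) {
        return parameters.get().get(name);
    }
}

class ThreadLocalDemo {
    // Two threads configure the same instance; each must read back its own value.
    static boolean eachThreadSeesOwnValue() {
        ThreadLocalTask shared = new ThreadLocalTask();
        CountDownLatch bothConfigured = new CountDownLatch(2);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Boolean> first = pool.submit(worker(shared, bothConfigured, 1));
            Future<Boolean> second = pool.submit(worker(shared, bothConfigured, 2));
            return first.get() && second.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    private static Callable<Boolean> worker(ThreadLocalTask task, CountDownLatch latch, int value) {
        return () -> {
            task.setParameter("ngramSize", value);
            latch.countDown();
            // Wait (bounded) until the other thread has configured the instance too.
            latch.await(5, TimeUnit.SECONDS);
            return Integer.valueOf(value).equals(task.getParameter("ngramSize"));
        };
    }
}
```

The cost of this approach is visible in the sketch: every task implementation has to route all of its state through `ThreadLocal`s, which is a burden on client code that approach 3 avoids.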

I personally tend towards 3. I believe it is a smooth transition and requires the fewest changes in the architecture and in the client code.

Any opinions or alternative ideas?

Cheers,

-- Richard

Johannes Daxenberger

Jul 8, 2015, 5:25:56 AM
to dkpro-lab-...@googlegroups.com
Why exactly do you suggest option 3? I do not have personal experience with clones in Java, but most people seem to discourage its usage.

- Johannes

Richard Eckart de Castilho

Jul 8, 2015, 5:40:58 AM
to dkpro-lab-...@googlegroups.com
>
> 3) make the MultiThreadedBatchTask(Engine) smart in such a way that it handles clonable Tasks differently. Instead of simply running a clonable Task, first a clone would be created from the original (the prototype) and the clone would then be configured. Again, we could create any number of clones, allowing us to parallelize the outer loop (or merge the two loops into one).

On 08.07.2015, at 11:25, Johannes Daxenberger <daxen...@ukp.informatik.tu-darmstadt.de> wrote:

> Why exactly do you suggest option 3? I do not have personal experience with clones in Java, but most people seem to discourage its usage.

Basically, because there is a default implementation of the clone() method that does what it's supposed to do in our case (create a shallow clone). So it doesn't impose additional work on the implementer of a Task to handle the copying (as e.g. a copy() method or a copy constructor would).
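
What the default shallow clone gives us can be seen in a small sketch. `PrototypeTask` is a hypothetical class; `Object.clone()` and `Cloneable` are standard Java. Note that a class still has to implement `Cloneable` and expose `clone()` publicly for an engine to call it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical task that relies on the default Object.clone() behaviour.
class PrototypeTask implements Cloneable {
    // Per-configuration value: each clone gets its own copy of the field.
    private int ngramSize;
    // Reference field: the clone copies the reference, not the list itself.
    private final List<String> sharedResources = new ArrayList<>();

    void setNgramSize(int ngramSize) {
        this.ngramSize = ngramSize;
    }

    int getNgramSize() {
        return ngramSize;
    }

    List<String> getSharedResources() {
        return sharedResources;
    }

    @Override
    public PrototypeTask clone() {
        try {
            // Object.clone() performs a field-by-field shallow copy.
            return (PrototypeTask) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError("cannot happen: class implements Cloneable", e);
        }
    }
}
```

Reconfiguring the clone leaves the prototype untouched, which is what the parameter sweep needs. Mutable objects reachable through reference fields remain shared between prototype and clones, so tasks holding such state would still need care.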

There are certainly good reasons to avoid clone [1], but I see no reason to avoid it at all cost if it solves exactly the problem that we have.

At this point in time, I believe that cloning provides the most painless transition.

Do you see any reason why we should stay clear of it in this particular case?

Cheers,

-- Richard

[1] http://www.artima.com/intv/bloch13.html

Johannes Daxenberger

Jul 8, 2015, 5:56:46 AM
to dkpro-lab-...@googlegroups.com
No, at least at the moment, I do not see a reason to avoid clones in our case. I just wanted to understand whether it is really the best solution. Given that we want to minimize changes on clients, it seems like the best option; so I agree.

- Johannes

Martin Wunderlich

Aug 1, 2015, 5:45:30 AM
to dkpro-lab-developers
Sorry, guys, I missed this message related to my work on the MultiThreadBatchTask. Thanks, Richard, for explaining why the performance improvements are not as substantial as hoped for.
I would also favor solution 3, since (as far as I can tell) it seems to be the most straightforward variant to implement. I don't know how much work it would be to make the outer loop parallel, but this would need to be done anyways, so that effort is a constant.
How would you differentiate between clonable and non-clonable tasks, though? What are the criteria for distinguishing one from the other? Would the clone() method simply return the instance itself in case of non-clonable tasks?

Cheers,

Martin

Richard Eckart de Castilho

Aug 15, 2015, 4:33:35 PM
to dkpro-lab-...@googlegroups.com
On 01.08.2015, at 11:45, Martin Wunderlich <mar...@wunderlich.com> wrote:

> I would also favor solution 3, since (as far as I can tell) it seems to be the most straightforward variant to implement. I don't know how much work it would be to make the outer loop parallel, but this would need to be done anyways, so that effort is a constant.
> How would you differentiate between clonable and non-clonable tasks, though? What are the criteria for distinguishing one from the other? Would the clone() method simply return the instance itself in case of non-clonable tasks?

Non-clonable tasks could not be parallelized in the outer loop. The MultiThreadBatchEngine would just not even try to clone them. They would be reconfigured, as is the case now.
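
A rough sketch of that distinction, assuming a marker base class that exposes clone() publicly. All type names here (`SweepTask`, `ClonableSweepTask`, `SketchMultiThreadEngine`, `NgramSweepTask`, `LegacySweepTask`) are hypothetical and the parameter space is reduced to a list of integers; only the `java.util.concurrent` calls are real JDK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

interface SweepTask {
    void configure(int parameter);
    String execute();
}

// Hypothetical base class for tasks that may be cloned by the engine.
abstract class ClonableSweepTask implements SweepTask, Cloneable {
    @Override
    public ClonableSweepTask clone() {
        try {
            return (ClonableSweepTask) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e);
        }
    }
}

class NgramSweepTask extends ClonableSweepTask {
    private int ngramSize;
    @Override public void configure(int parameter) { ngramSize = parameter; }
    @Override public String execute() { return "ngramSize=" + ngramSize; }
}

class LegacySweepTask implements SweepTask {
    private int ngramSize;
    @Override public void configure(int parameter) { ngramSize = parameter; }
    @Override public String execute() { return "legacy ngramSize=" + ngramSize; }
}

class SketchMultiThreadEngine {
    private final int threads;

    SketchMultiThreadEngine(int threads) {
        this.threads = threads;
    }

    List<String> sweep(SweepTask prototype, List<Integer> parameterSpace) {
        List<String> results = new ArrayList<>();
        if (!(prototype instanceof ClonableSweepTask)) {
            // Non-clonable: reconfigure the single instance sequentially.
            for (int parameter : parameterSpace) {
                prototype.configure(parameter);
                results.add(prototype.execute());
            }
            return results;
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int parameter : parameterSpace) {
                // One clone per configuration; only execution runs concurrently.
                SweepTask copy = ((ClonableSweepTask) prototype).clone();
                copy.configure(parameter);
                Callable<String> job = copy::execute;
                futures.add(pool.submit(job));
            }
            for (Future<String> future : futures) {
                results.add(future.get()); // collected in submission order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Existing tasks keep working unchanged on the sequential path, and a task opts into parallel execution simply by becoming clonable.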

-- Richard