How to parallelize tests executed by @DataProvider?

Vitaliy Pomazyonkov

Feb 24, 2009, 5:17:41 AM
to testng-users
If my test uses a @DataProvider that returns a number of data sets, then
all executions of my test run sequentially (testng.xml has
parallel="tests" at <suite/> and parallel="methods" at <test/>).
If I don't use a @DataProvider, everything works fine and different
methods are successfully executed in parallel.

Cédric Beust ♔

Feb 24, 2009, 10:20:02 AM
to testng...@googlegroups.com
Yes, it's a known limitation of @DataProvider...

--
Cédric


Помазёнков Виталий

Feb 24, 2009, 4:02:38 PM
to testng...@googlegroups.com
Is there a reason for this limitation, or is it just difficult to implement?
-- 
Виталий


Cédric Beust ♔

Feb 24, 2009, 4:34:24 PM
to testng...@googlegroups.com
I would have to double check, but I'd say that it would be a bit difficult to implement as it is right now.

--
Cédric


Помазёнков Виталий

Feb 24, 2009, 4:47:42 PM
to testng...@googlegroups.com
Thanks, I hope it will be implemented, because it would be very useful and a natural fit for TestNG.
Right now I have some tests that must be run on more than 200 different data sets, and running them sequentially takes about 40 minutes.

David Garcia

Apr 6, 2009, 8:08:21 PM
to testng-users
Are there any plans to implement this in the near future?
I agree it would be a huge improvement for data based testing.

loneranger

Apr 7, 2009, 1:47:47 PM
to testng-users
If there is any sort of voting on feature requests, I would
vote for this one.

Thanks,
Tilak

Cédric Beust ♔

Apr 8, 2009, 3:09:44 PM
to testng...@googlegroups.com, Pomazyonkov Vitaliy, david.g...@gmail.com, tilak...@gmail.com

On Mon, Apr 6, 2009 at 5:08 PM, David Garcia <david.garcia.mx@gmail.com> wrote:
Are there any plans to implement this in the near future?
I agree it would be a huge improvement for data based testing.

I made some progress with multithreaded data providers, please try:


This version will run all your data providers in their own thread pool of ten threads, so the basic logic is in place.  Please test it and let me know how it works for you.

My question now is:  how do we configure this?

Right now, I'm considering adding attributes to @DataProvider:

@DataProvider(threadPoolSize = 10, timeOut = 500)
public Object[][] dp() { ... }

I am also considering adding attributes in the XML files that would apply to all the data providers:

<suite data-provider-thread-pool-size="10" data-provider-time-out="500">

but I wonder if this would be really useful since suddenly making *all* your data providers multithreaded can turn out to be problematic.  If we decide to go down that path, we should probably consider adding yet another attribute to @DataProvider to turn off multithreading for this particular data provider.

Opinions welcome.

--
Cédric


Erik Putrycz

Apr 8, 2009, 3:37:29 PM
to testng...@googlegroups.com, Pomazyonkov Vitaliy, david.g...@gmail.com, tilak...@gmail.com
I would suggest making the threadpool size the number of CPUs by default.
And could this parameter be on the @Test directly and use a "global" thread pool for all testng?
Just worried in case the test runner is already executing tests in parallel and then the @DataProvider floods the number of threads.
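
To make the suggested default concrete, here is a minimal sketch using only the JDK. The class and method names (DataProviderPoolSizing, resolve, newPool) are hypothetical and are not part of TestNG; the only assumption is that a non-positive configured size means "not set".

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper, not TestNG API: picks a size for the data provider pool.
final class DataProviderPoolSizing {

    // Returns the configured size when it is positive, otherwise the CPU count.
    static int resolve(int configuredSize) {
        return configuredSize > 0
                ? configuredSize
                : Runtime.getRuntime().availableProcessors();
    }

    // Creates one fixed-size pool that every data provider would share.
    static ExecutorService newPool(int configuredSize) {
        return Executors.newFixedThreadPool(resolve(configuredSize));
    }
}
```

With a single shared pool sized this way, the number of data provider threads stays bounded no matter how many tests run in parallel.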

Erik.
-- 
Erik Putrycz, Ph.D - http://blog.erikputrycz.net - Mobile: 613-286-6365

Cédric Beust ♔

Apr 8, 2009, 5:48:38 PM
to testng...@googlegroups.com, Pomazyonkov Vitaliy, david.g...@gmail.com, tilak...@gmail.com
On Wed, Apr 8, 2009 at 12:37 PM, Erik Putrycz <erik.p...@gmail.com> wrote:
I would suggest making the threadpool size the number of CPUs by default.
And could this parameter be on the @Test directly and use a "global" thread pool for all testng?

It seems to make more sense to put this attribute on @DataProvider, but I haven't really thought about the pros and cons of each approach.  What are your thoughts?

As for the global pool, I think it's a bit tricky from an implementation perspective since I'm not sure it's possible to start an ExecutorService with a set of tasks, make it block until all the threads terminate and while it's waiting, add more tasks to it.

Can somebody more familiar with java.util.concurrent comment on this scenario?

-- 
Cédric


Erik Putrycz

Apr 9, 2009, 1:00:38 AM
to testng...@googlegroups.com
On 08/04/2009 5:48 PM, Cédric Beust ♔ wrote:


On Wed, Apr 8, 2009 at 12:37 PM, Erik Putrycz <erik.p...@gmail.com> wrote:
I would suggest making the threadpool size the number of CPUs by default.
And could this parameter be on the @Test directly and use a "global" thread pool for all testng?

It seems to make more sense to put this attribute on @DataProvider, but I haven't really thought about the pros and cons of each approach.  What are your thoughts?

My concern is that if you run n tests in parallel, each with a @DataProvider using 10 threads, you end up with n * 10 threads. Whether the thread count is set on the data provider or globally isn't so much the issue; the point is to limit the total number of threads.
Another reason would be to keep things simple. In the end, I'd like to be able to simply say @DataProvider(parallel = true) instead of specifying the number of threads, which is not necessarily easy to figure out when you write a test.



As for the global pool, I think it's a bit tricky from an implementation perspective since I'm not sure it's possible to start an ExecutorService with a set of tasks, make it block until all the threads terminate and while it's waiting, add more tasks to it.

I did quite a bit of work with the concurrent API. As long as you haven't called shutdown() on the ThreadPoolExecutor, you can still submit new tasks. However, I don't remember whether I found a good way to wait for the tasks to terminate; I'll need to check. In the end, I redesigned all the concurrent code in my application with Scala and actors.

Anyway, ThreadPoolExecutor has a getCompletedTaskCount() method, so it should be possible to loop and check whether all tasks have completed.
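
A rough sketch of that polling idea, assuming a plain ThreadPoolExecutor of three threads. The class name CompletedCountPolling and the batch sizes are made up for illustration. Note that the Javadoc describes the completed count as approximate while tasks are still running, and that this counter cannot tell one batch from another.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: polls getCompletedTaskCount() to detect when all work is done.
final class CompletedCountPolling {

    // Submits two batches (3 tasks, then 5 more) to one pool that is not shut
    // down in between, then polls until every task is reported as completed.
    static int runTwoBatches() throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                3, 3, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
        AtomicInteger ran = new AtomicInteger();
        try {
            long submitted = 0;
            for (int i = 0; i < 3; i++, submitted++) {
                pool.execute(() -> ran.incrementAndGet());
            }
            // The pool is still open, so a second batch can simply be added.
            for (int i = 0; i < 5; i++, submitted++) {
                pool.execute(() -> ran.incrementAndGet());
            }
            while (pool.getCompletedTaskCount() < submitted) {
                Thread.sleep(5); // crude polling; fine for a sketch
            }
            return ran.get();
        } finally {
            pool.shutdown();
        }
    }
}
```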

Erik.

Bill Michell

Apr 9, 2009, 4:50:55 AM
to testng...@googlegroups.com
You can't do this with the standard classes without adding some code, but ThreadPoolExecutor provides a number of extension points. You would have to override these to get this functionality. However, the javadoc at http://java.sun.com/javase/6/docs/api/java/util/concurrent/ThreadPoolExecutor.html shows an example of creating a pausable executor that gets pretty close to what you would need. 
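
As a hedged illustration of those extension points (this is not the pausable executor from the Javadoc), the sketch below overrides execute() and the afterExecute() hook to count tasks that have been accepted but not yet finished. TrackingExecutor and awaitIdle are hypothetical names, and the default rejection policy (which throws) is assumed.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical subclass: lets callers wait until the pool has no work left,
// while still accepting new tasks at any time.
final class TrackingExecutor extends ThreadPoolExecutor {
    private final Object lock = new Object();
    private int pending; // tasks accepted but not yet finished

    TrackingExecutor(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
    }

    @Override
    public void execute(Runnable task) {
        synchronized (lock) {
            pending++;
        }
        try {
            super.execute(task);
        } catch (RuntimeException rejected) {
            taskDone(); // the task never ran, so undo the count
            throw rejected;
        }
    }

    // Called by the worker thread after each task, even if the task threw.
    @Override
    protected void afterExecute(Runnable task, Throwable failure) {
        super.afterExecute(task, failure);
        taskDone();
    }

    private void taskDone() {
        synchronized (lock) {
            if (--pending == 0) {
                lock.notifyAll();
            }
        }
    }

    // Blocks until every accepted task has finished.
    void awaitIdle() throws InterruptedException {
        synchronized (lock) {
            while (pending > 0) {
                lock.wait();
            }
        }
    }
}
```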


--
bill.m...@googlemail.com
bi...@mics.org.uk

Cédric Beust ♔

Apr 9, 2009, 11:42:13 AM
to testng...@googlegroups.com
On Wed, Apr 8, 2009 at 10:00 PM, Erik Putrycz <erik.p...@gmail.com> wrote:

My concern is if you end up running in parallel n tests with each a @DataProvider with 10 threads - you end up with n * 10 threads. Having the thread count on data provider or global isn't so much the issue, it is rather to limit the number of threads.
Another reason would be to keep things simple. In the end, I'd like to be able to simply say @DataProvider(parallel = true) instead of the number of threads - which is not necessarily something easy to figure out when you write a test.

Yes, I agree that this would make things simpler.  However, having all the data providers share a common pool is a bit tricky, see below.
 

I did quite a bit of work with the concurrent API. Until you don't call shutdown on the ThreadPoolExecutor, you can still submit new tasks. However, I don't remember exactly if I figured out a good way to wait for the tasks to terminate - I'll need to check. In the end, I ended up redesigning all concurrent stuff in my application with Scala and actors.

That's unfortunate.

The main problem here is to know when to exit.  For example, let's say we start by running a DataProvider that returns 3 results, so 3 threads get allocated.  Then another DataProvider returns 5 results, 5 more threads get allocated.

Now we have an Executor that contains a mix of data provider worker threads and it's no longer obvious when to complete the test method that triggered the first batch and when to do the same for the second test method...

It looks like I need to separate the concept of an Executor from that of a thread pool.  The thread pool would always be shared between data providers, but each submission would monitor its own set of workers.

Gonna have to think about this more.

--
Cédric


Bill Michell

Apr 9, 2009, 11:49:10 AM
to testng...@googlegroups.com


2009/4/9 Cédric Beust ♔ <cbe...@google.com>

The main problem here is to know when to exit.  For example, let's say we start by running a DataProvider that returns 3 results, so 3 threads get allocated.  Then another DataProvider returns 5 results, 5 more threads get allocated.

That is easier. invokeAll <http://java.sun.com/javase/6/docs/api/java/util/concurrent/ExecutorService.html#invokeAll(java.util.Collection)> blocks until all the child threads have run. Its sister method will terminate early if a timeout expires.

What that doesn't necessarily do is ensure that tasks submitted from different threads don't interleave, but if you don't care about that, you're cooking on gas.
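
A minimal sketch of that, assuming one shared pool and a stand-in "test method" that just doubles its input. InvokeAllBatch and runBatch are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Illustrative only: one blocking batch per data provider on a shared pool.
final class InvokeAllBatch {

    // Runs a stand-in test method (doubling) once per data row on the given
    // pool and blocks the caller until every invocation has finished.
    static List<Integer> runBatch(ExecutorService pool, int[] rows)
            throws InterruptedException, ExecutionException {
        List<Callable<Integer>> invocations = new ArrayList<>();
        for (int row : rows) {
            invocations.add(() -> row * 2);
        }
        List<Integer> results = new ArrayList<>();
        // invokeAll returns only once all tasks are done; the Futures come
        // back in the same order as the tasks, whatever order they ran in.
        for (Future<Integer> done : pool.invokeAll(invocations)) {
            results.add(done.get());
        }
        return results;
    }
}
```

Two callers can use runBatch on the same pool at the same time; each one only waits for its own rows.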

--
bill.m...@googlemail.com
bi...@mics.org.uk

Cédric Beust ♔

Apr 9, 2009, 12:01:11 PM
to testng...@googlegroups.com


On Thu, Apr 9, 2009 at 8:49 AM, Bill Michell <bill.m...@googlemail.com> wrote:
That is easier. invokeAll <http://java.sun.com/javase/6/docs/api/java/util/concurrent/ExecutorService.html#invokeAll(java.util.Collection)> blocks until all the child threads have run. Its sister method will terminate early if a timeout expires.

Yes, but this is not good enough:  I don't want these to block or else I won't be invoking other test methods while we wait for the result.

What I need is to get a Future for each of these DataProvider worker threads and wrap up the test method when all the Futures for this specific data provider have completed, but this all needs to be done asynchronously.
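
For readers on a newer JDK, here is one way this could look. It is only a sketch: it uses CompletableFuture, which arrived in Java 8 and was not available when this was written, and AsyncBatch, testMethod and wrapUp are hypothetical names rather than TestNG internals.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.IntConsumer;

// Illustrative only: one Future per data row, with a non-blocking wrap-up.
final class AsyncBatch {

    // Submits one invocation per data row to the shared pool and returns
    // immediately. wrapUp runs once all invocations of THIS batch are done,
    // whether they succeeded or failed; the caller never blocks.
    static CompletableFuture<Void> submit(Executor pool, int[] rows,
                                          IntConsumer testMethod, Runnable wrapUp) {
        CompletableFuture<?>[] invocations = new CompletableFuture<?>[rows.length];
        for (int i = 0; i < rows.length; i++) {
            int row = rows[i];
            invocations[i] = CompletableFuture.runAsync(() -> testMethod.accept(row), pool);
        }
        return CompletableFuture.allOf(invocations)
                .whenComplete((ignored, failure) -> wrapUp.run());
    }
}
```

Each call tracks only its own Futures, so several data providers can share one pool and still be wrapped up independently.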

--
Cédric


Bill Michell

Apr 9, 2009, 12:07:30 PM
to testng...@googlegroups.com


So you wrap the method that calls invokeAll in a task, which you then submit to an ExecutorService; once invokeAll returns, you do the method tidying up in the same task.
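
A sketch of that arrangement, under the assumption that the coordinating tasks and the workers run on two different executors. If both shared one fixed-size pool, the coordinating tasks could occupy every thread while waiting for workers that can never start. WrappedInvokeAll, coordinators, workers and tidyUp are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Illustrative only: the blocking invokeAll call is moved into its own task.
final class WrappedInvokeAll {

    // Submits ONE coordinating task to `coordinators`. That task fans the rows
    // out to the shared `workers` pool with invokeAll, waits for them there,
    // then does the tidying up, so the original caller never blocks.
    static Future<Integer> submit(ExecutorService coordinators, ExecutorService workers,
                                  int[] rows, Runnable tidyUp) {
        return coordinators.submit(() -> {
            List<Callable<Integer>> invocations = new ArrayList<>();
            for (int row : rows) {
                invocations.add(() -> row * 2); // stand-in for the test method
            }
            int total = 0;
            try {
                for (Future<Integer> done : workers.invokeAll(invocations)) {
                    total += done.get();
                }
            } finally {
                tidyUp.run(); // runs in the coordinating task, after the batch
            }
            return total;
        });
    }
}
```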

--
bill.m...@googlemail.com
bi...@mics.org.uk

Cédric Beust ♔

Apr 9, 2009, 12:12:42 PM
to testng...@googlegroups.com

Yup, that's exactly what I'm experimenting with right now...

--
Cédric


Cédric Beust ♔

Apr 9, 2009, 2:21:15 PM
to testng...@googlegroups.com
Ok, I was able to implement the "global data provider thread pool" and it seems to be working well.  It also interacts well with the "test thread pool".

Let me explain.

Consider the following example where two methods, f() and f2(), use data providers that feed them the values 1,2,3,4 (for f()) and 11,12,13,14 (for f2()).

Here is the execution with no threading at all:

Thread:1 f2(11)
Thread:1 f2(12)
Thread:1 f2(13)
Thread:1 f2(14)
Thread:1 f(1)
Thread:1 f(2)
Thread:1 f(3)
Thread:1 f(4)

The two methods are invoked sequentially and the trace shows the order in which their respective data providers are supplying the values.

Here is the execution with no test threading but data provider threading on (the thread pool size is 3):

Thread:9 f2(11)
Thread:9 f2(14)
Thread:11 f2(13)
Thread:10 f2(12)  (pause here)
Thread:9 f(1)
Thread:9 f(2)
Thread:10 f(3)
Thread:9 f(4)

f2() and f() are still invoked sequentially, but their data provider invocations now run on separate threads, which is why the values received are no longer in sequence.  Since each data provider returns four values and only three threads are available, we notice a pause whenever all the threads are allocated.  The overall number of threads is still 3 (the data provider thread pool size).
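
The same effect can be reproduced outside TestNG with a plain fixed thread pool. This is only a sketch of the behavior described above, not the actual implementation; FixedPoolTrace is a made-up name, and the sleep stands in for a test that takes a while.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: four invocations squeezed through a pool of three threads.
final class FixedPoolTrace {

    // Invokes a stand-in for f2() once per value on a pool of `poolSize`
    // threads and returns the names of the threads that did the work.
    static Set<String> run(int poolSize, int[] values) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Set<String> threadsUsed = ConcurrentHashMap.newKeySet();
        try {
            List<Callable<Void>> invocations = new ArrayList<>();
            for (int value : values) {
                invocations.add(() -> {
                    String thread = Thread.currentThread().getName();
                    threadsUsed.add(thread);
                    System.out.println(thread + " f2(" + value + ")");
                    Thread.sleep(50); // pretend the test takes a while
                    return null;
                });
            }
            pool.invokeAll(invocations); // waits for all four invocations
        } finally {
            pool.shutdown();
        }
        return threadsUsed;
    }
}
```

Calling FixedPoolTrace.run(3, new int[] {11, 12, 13, 14}) never uses more than three threads; the fourth value has to wait until one of them is free, and the print order varies from run to run.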

Now, here is the execution with "maximal threading":  test thread on (parallel="methods") and data provider threading on:

Thread:16 f(2)
Thread:15 f2(13)
Thread:14 f2(12)
Thread:12 f2(11)
Thread:17 f(3)
Thread:13 f(1)
Thread:15 f2(14)
Thread:17 f(4)

This time, each method is invoked in its own thread and in turn, each data provider invocation is running in its own thread, so we are seeing 2*3=6 threads in action.  Both the methods and the values they receive are interleaved.

So far so good.

Now I still need to figure out how to configure the data provider thread pool and how to give users as much flexibility as possible.

I think <suite> is a good location since this is where we configure test threading as well.  The data provider thread pool will be a singleton since we don't want to run the risk of saturating the OS thread pool:

<suite name="foo" data-provider-thread-pool-size="15">

Now, users might still not want this threading to apply to all their data providers, so I need to provide a way to turn this threading off or on at the data provider level.  The question is:  what would be a good default?

@DataProvider(parallel = true)

or

@DataProvider(parallel = false)

?


A default of true seems to make sense at first:  if you specify the thread pool size in your XML file, you probably don't want to go to all your @DataProvider and set this attribute to true manually on top of that.  However, turning threading on for all your providers with this one line addition to your XML file might cause some of your tests to fail, and then you will have to go to all these tests and turn threading off manually.

What do you guys think?

--
Cédric


Cédric Beust ♔

Apr 9, 2009, 3:17:38 PM
to testng...@googlegroups.com


2009/4/9 Cédric Beust ♔ <cbe...@google.com>


Now, here is the execution with "maximal threading":  test thread on (parallel="methods") and data provider threading on:

Thread:16 f(2)
Thread:15 f2(13)
Thread:14 f2(12)
Thread:12 f2(11)
Thread:17 f(3)
Thread:13 f(1)
Thread:15 f2(14)
Thread:17 f(4)

This time, each method is invoked in its own thread and in turn, each data provider invocation is running in its own thread, so we are seeing 2*3=6 threads in action.  Both the methods and the values they receive are interleaved.

Actually, this behavior was a bug.  The number of threads used to run data provider invocations should always be the same, or we run the risk of starving OS threads, which is why we went with a "fixed thread pool for data providers" in the first place.  The behavior now shows:

Thread:13 f(2)
Thread:12 f2(11)
Thread:11 f(1)
Thread:11 f2(12)
Thread:13 f2(13)
Thread:12 f2(14)
Thread:13 f(3)
Thread:12 f(4)

Which is what I wanted:
  • Methods interleaved (f, f2, f, f2, f2, etc...)
  • Data provider invocations interleaved (2,1,3,4)
  • Running on the data provider thread pool, regardless of the test thread pool setting.

Now, take a look at what happens if I add two regular test methods, f3() and f4():

Thread:8 f4()
Thread:10 f3()
Thread:13 f2(11)
Thread:14 f(1)
Thread:15 f2(12)
Thread:13 f(2)
Thread:15 f(3)
Thread:14 f(4)
Thread:15 f2(13)
Thread:13 f2(14)

While the data provider invocations share threads 13, 14 and 15, the non-parameterized test methods have each been invoked on their own separate thread, picked from the test thread pool.

If I turn off test threading, I get:

Thread:1 f4()
Thread:9 f2(11)
Thread:10 f2(12)
Thread:11 f2(13)
Thread:11 f2(14)
Thread:1 f3()
Thread:10 f(3)
Thread:9 f(2)
Thread:11 f(1)
Thread:9 f(4)

f3() and f4() are now running on the same thread, which is still separate from the data provider thread pool.


--
Cédric


Stevo Slavić

Apr 10, 2009, 10:42:25 AM
to testng...@googlegroups.com
More test parallelization, sweet.

Are the new developments in java.util.concurrent, in the form of the fork/join library, applicable to this DataProvider parallelization scenario and/or to TestNG test parallelization as a whole? This library is supposed to be part of Java 7, but it is open source and can already be used and tested. More info can be found at the Concurrency Interest site, and Brian Goetz gave a nice presentation about it at Devoxx 2008.

Regards,
Stevo.


Cédric Beust ♔

Apr 10, 2009, 11:49:16 AM
to testng...@googlegroups.com
Hi Stevo,

I'll look it up but I think I have all I need with java.util.concurrent as it stands right now.

I was considering sending my code, or at least my design, to Brian for review anyway, so we'll see what he says.

--
Cédric
