Re-use intermediate results


ulrik...@hft-stuttgart.de

unread,
Sep 1, 2015, 7:47:15 AM
to dkpro-tc-users
Hello,

I've just started to play with dkpro-tc for teaching purposes. I was wondering - is there a way to re-use intermediate results (e.g., the CASes generated during pre-processing or the .arff files) in later experiments?
For now, I understand how to set up an experiment end-to-end using the demos (very straightforward, if I may say), but data processing is taking several hours for realistic data sets (e.g. for Germeval NER training data). This is fine to do once, but I'd like to skip this step in class to save time.
I'd be grateful for any hints.

Best regards,

Ulrike

Richard Eckart de Castilho

unread,
Sep 1, 2015, 8:01:46 AM
to ulrik...@hft-stuttgart.de, dkpro-tc-users
Hi Ulrike,

TC is based on DKPro Lab, which models an experiment basically as a DAG of tasks.
Results are persisted to disk by each task. If one task needs data from another
task, it needs to "import" it. On top of this, parameter sweeping is supported
such that for one task you typically have multiple results (also called contexts).

When an experiment is run, it is possible to set an execution policy as "USE_EXISTING"
which means that if a task was already executed under a compatible parameter configuration,
it is loaded from disk instead of being recalculated.
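The USE_EXISTING mechanism can be sketched roughly like this. This is a plain-JDK toy illustration of the caching semantics, not DKPro Lab's actual API; the class, folder, and file names are made up for the example:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of the USE_EXISTING idea: a task context on disk is keyed by
// its parameter configuration; if a compatible stored context exists, its
// result is loaded instead of being recalculated.
public class UseExistingSketch {

    public static String runTask(Path repo, Map<String, String> params) {
        try {
            // Derive a stable context folder name from the parameter set
            Path context = repo.resolve("PreprocessTask-" + params.hashCode());
            Path result = context.resolve("result.txt");
            if (Files.exists(result)) {
                // USE_EXISTING path: load the persisted result from disk
                return "reused: " + Files.readString(result);
            }
            Files.createDirectories(context);
            String computed = "processed " + params; // stands in for the expensive step
            Files.writeString(result, computed);
            return "computed: " + computed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path repo = Files.createTempDirectory("lab-repo-demo");
        Map<String, String> params = new TreeMap<>(Map.of("language", "de"));
        System.out.println(runTask(repo, params)); // first run computes and persists
        System.out.println(runTask(repo, params)); // second run hits the stored context
    }
}
```

Running a task twice with the same parameters takes the "reused" branch the second time; changing any parameter yields a different context folder and forces recomputation.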

I *guess* that there should be some way to set up a DKPro TC experiment such that
it only runs the preprocessing task and stops before running train/test. If not,
it might be easy to add. Under such a scenario, you could prepare a "repository"
(that is basically a folder where DKPro Lab stores all the results / contexts),
distribute that to the machines you use in class and then have them run the
full experiment (preprocessing + traintest). Under a USE_EXISTING policy,
TC/Lab should be able to pick up the pre-computed results and just run the
train/test step.

I'm sure some of the core TC developers can provide some additional insight.

Cheers,

-- Richard (more the DKPro Lab person on this list)

Johannes Daxenberger

unread,
Sep 1, 2015, 8:19:50 AM
to ulrik...@hft-stuttgart.de, Richard Eckart de Castilho, dkpro-tc-users
Hi Ulrike,

thanks for your interest in using DKPro TC :)

Richard already pointed out the most important points regarding your request. To use DKPro TC, you must specify a variable called DKPRO_HOME, which is basically the directory where the output of your experiments is stored (if you did not set this directory yourself, it is hard-coded in the demo experiment you were running). When you run an experiment, everything that is a DKPro Lab Task (see below) produces its output in a folder which stores the data produced by that task (and potentially re-used by the next task).

So after you have successfully run your full experiment, you can distribute its output to, say, another machine, set the corresponding DKPRO_HOME, and activate the USE_EXISTING execution policy. If you delete parts of the experiment (e.g. the output of the TestTask, which performs the machine learning), only this part will be re-run; if you do not delete anything, only the top-level outer task will be re-run (that should be a matter of seconds).
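The "delete parts of the experiment" step amounts to removing one task's context folder(s) under DKPRO_HOME so that only that task is re-executed on the next run. A hedged plain-JDK sketch; the convention that context folder names start with the task name is an assumption for illustration, so check your actual repository layout:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class InvalidateTask {

    // Delete every context folder under the repository whose name starts
    // with the given task name, so only that task is re-run next time.
    public static int invalidate(Path dkproHome, String taskName) {
        try (Stream<Path> entries = Files.list(dkproHome)) {
            int removed = 0;
            for (Path context : (Iterable<Path>) entries::iterator) {
                if (Files.isDirectory(context)
                        && context.getFileName().toString().startsWith(taskName)) {
                    deleteRecursively(context);
                    removed++;
                }
            }
            return removed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Delete children before parents (deepest paths first)
    private static void deleteRecursively(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            for (Path p : walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList())) {
                Files.delete(p);
            }
        }
    }
}
```

For example, `InvalidateTask.invalidate(Path.of(System.getenv("DKPRO_HOME")), "TestTask")` would clear only the machine-learning step while keeping the expensive preprocessing output intact.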

Please let us know if something is not working as expected.

Best,
Johannes

--
You received this message because you are subscribed to the Google Groups "dkpro-tc-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dkpro-tc-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ulrik...@hft-stuttgart.de

unread,
Sep 1, 2015, 8:33:16 AM
to dkpro-tc-users, ulrik...@hft-stuttgart.de, richard...@gmail.com
Thanks a lot, Richard and Johannes! That's exactly the kind of thing I was looking for.

Best,

Ulrike

Ulrike Pado

unread,
Sep 3, 2015, 6:04:34 AM
to Johannes Daxenberger, Richard Eckart de Castilho, dkpro-tc-users
Hello Johannes,

I finished the InitTask step of an experiment and wanted to build on that in a second run (no other execution steps had been done yet). When I re-start the experiment, pre-processing also starts again from the beginning. Why is this?
Is it because the first experiment didn't finish? Or is the path not set correctly? (I kept it unchanged from the first run, so I figured the data would be found).
Also, I have so far generated about 20 GB worth of .bin files for about 10k sentences - is this normal? It seems a bit much given the usual corpus sizes we work with in challenges etc.

Best,

Ulrike
Hochschule für Technik Stuttgart
Fakultät Vermessung, Informatik und Mathematik

Prof. Dr. Ulrike Pado
Professorin für Informatik, Fachgebiet Computerlinguistik

Büro: Bau 2, Zimmer 449

Schellingstr. 24
70174 Stuttgart
www.hft-stuttgart.de

T +49 (0)711 8926 2811
F +49 (0)711 8926 2553
ulrik...@hft-stuttgart.de

Johannes Daxenberger

unread,
Sep 3, 2015, 6:53:03 AM
to Ulrike Pado, Richard Eckart de Castilho, dkpro-tc-users
Hi Ulrike,

first of all: USE_EXISTING will only work on tasks which have been properly executed and finished, i.e. the InitTask needs to be fully finished. You can verify this by looking for the corresponding output in DKPRO_HOME: the task context needs to contain a file named DISCRIMINATORS.txt, which holds all parameters you set for this task.
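The check described above can be approximated in code. This is a sketch only; it assumes DISCRIMINATORS.txt can be read in Java Properties format, which may not match DKPro Lab's actual serialization:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class ContextCheck {

    // A context is only reusable if the task finished (DISCRIMINATORS.txt
    // exists) and its stored parameters match the current configuration.
    public static boolean isReusable(Path context, Properties current) {
        Path disc = context.resolve("DISCRIMINATORS.txt");
        if (!Files.isRegularFile(disc)) {
            return false; // task never finished properly
        }
        Properties stored = new Properties();
        try (InputStream in = Files.newInputStream(disc)) {
            stored.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Any parameter mismatch means the context is incompatible
        return stored.equals(current);
    }
}
```

An interrupted run (as in Ulrike's case) leaves a context without the discriminators file, so the check fails and the task is recomputed from scratch.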

In order to further debug your problem, I need some more information about your experiments:
- which version of DKPro TC are you using (0.8.0-SNAPSHOT or 0.7.0 stable or an older version)?
- which learning and feature mode do you use (e.g. single- vs. multi-label, document vs. unit etc.)?

With regard to the output you get: 20 GB does indeed sound like too much. How many .bin files were generated? The same number as you have sentences? What do those files contain - one sentence per file? How many annotations do you add during preprocessing?

Best,
Johannes


Ulrike Pado

unread,
Sep 8, 2015, 4:27:13 AM
to Johannes Daxenberger, Richard Eckart de Castilho, dkpro-tc-users
Dear Johannes (and Richard),

just a quick note to indicate success.

Johannes pointed to my underlying problem of trying to read several thousand sentences from one input file. Re-formatting the input as one sentence per file solved my time and space issues completely. Now that tasks are finishing correctly, I can also use USE_EXISTING as expected.

Thanks again for your help! It is greatly appreciated.

Ulrike

mar...@wunderlich.com

unread,
Sep 8, 2015, 8:24:27 AM
to Ulrike Pado, dkpro-tc-users

Hi Ulrike,

Thanks a lot for reporting back. I might have a situation similar to the one you were facing. Could you perhaps elaborate on the difference between reading all sentences from one file and having one file per sentence? In what way does this affect processing time and space?

Thanks a lot.

Cheers,

Martin

Ulrike Pado

unread,
Sep 8, 2015, 8:46:54 AM
to mar...@wunderlich.com, dkpro-tc-users
Hello Martin,

the main point is that you need to use the correct input format for the reader you're using. I was breaking the NERDemoReader's assumption that there would be one file per sentence. Compared to the correct input format, this really bloated my .bin files (17 kB -> 2.1 MB) and drove preprocessing time up from a few minutes to many hours (it appears that the whole input file was being saved in each CAS).
I hope this helps solve your problem!
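For anyone hitting the same issue, the fix amounts to splitting the multi-sentence input into one file per sentence before handing it to the reader. A minimal sketch, assuming one sentence per line in the source file; the output file naming is arbitrary:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OneSentencePerFile {

    // Write each non-blank line of the input file into its own output file,
    // matching a reader that expects one file per sentence.
    public static int split(Path input, Path outDir) {
        try {
            Files.createDirectories(outDir);
            int n = 0;
            for (String sentence : Files.readAllLines(input)) {
                if (sentence.isBlank()) {
                    continue;
                }
                Files.writeString(outDir.resolve(String.format("sent-%05d.txt", n)), sentence);
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("sentences", ".txt");
        Files.writeString(in, "Erster Satz.\nZweiter Satz.\n");
        Path out = Files.createTempDirectory("one-per-file");
        System.out.println(split(in, out) + " files written to " + out);
    }
}
```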

Best,

Ulrike


Martin Wunderlich

unread,
Sep 8, 2015, 1:44:50 PM
to Ulrike Pado, dkpro-tc-users
I see. Thanks for the clarification, Ulrike.
My case is a bit different. I have a source text with approx. 60k words, which gets annotated on the sentence level as a training document, based on some additional data in JSON format. This also results in a lot of disk space being used, but mostly due to the Lucene indices that are generated at the MetaInfo stage.

Cheers,

Martin