Test Datasets for Author Profiling, Author Diarization, and Author Masking are available on TIRA

Martin Potthast

Mar 12, 2016, 3:02:01 PM

to pan-workshop-series

Hi everyone,

Many of you have already claimed your virtual machines for the early bird evaluation. The test datasets for the two tasks author profiling and author diarization are now also available on TIRA.

All you have to do to proceed is the following:

- Install your software in your virtual machine

- Go to TIRA at www.tira.io and sign in with the same credentials you use for your VM

- Browse to your task

- Hit "Add Software" and fill in the form

- Hit "Run" and TIRA will execute your software on the selected dataset

After your software has processed the dataset, you will see a list of runs appearing on TIRA.

- Runs on training datasets can be inspected (click on the eye icon) and downloaded (click on the down arrow).

- Runs on test datasets are hidden from view to avoid data leaks.

When you have successfully tested your software on one of the training datasets, please run it on the 2016 test datasets for your task.

When you are finished, please let us know so that we can review your run and release the results to you.

If you have any questions, please don't hesitate to contact us.

Best,

Martin

PS: The test datasets of author clustering are on their way.

Martin Potthast

Mar 12, 2016, 6:20:34 PM

to Ivan Bilan, pan-workshop-series

Hi Ivan,

> I have a question: on tira.io, after pressing "Add software", one of the command variables is "-r $inputRun". What does it stand for? There is an option below called "Input run", but it only has "none" as a single option.

The $inputRun variable can be used if you submit a software that can also be trained within the virtual machine:

- You install two pieces of software: the training software and the test software

- The training software takes as input $inputDataset and $outputDir

- When run on a training dataset, the training software trains a model based on $inputDataset and stores the trained model in $outputDir

- The test software takes as input parameters $inputDataset, $outputDir, and $inputRun

- When run on a test dataset, the test software reads the model from $inputRun (i.e., the run produced by the training software above), then processes all the problem instances in the $inputDataset, and stores the results in $outputDir.

This way, your software may also be retrained for other datasets in the future, increasing the reproducibility of results.
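As a rough illustration of the train/test split described above, here is a minimal Python sketch. It assumes that TIRA substitutes $inputDataset, $outputDir, and $inputRun into the flags -i, -o, and -r, as in the commands quoted in this thread. Everything else is hypothetical and not prescribed by TIRA or PAN: the file name model.json, the word-count "model", the function names, and the JSON result files (the actual tasks define their own output formats).

```python
import argparse
import json
from pathlib import Path

MODEL_FILE = "model.json"  # hypothetical file name, not a TIRA convention


def train_model(input_dataset: Path, output_dir: Path) -> None:
    """Training software: build a model from input_dataset and store it in output_dir."""
    # Placeholder model: word counts over all .txt files in the dataset.
    # A real submission would fit its actual classifier here.
    counts = {}
    for path in sorted(input_dataset.glob("*.txt")):
        for word in path.read_text(encoding="utf-8").split():
            counts[word] = counts.get(word, 0) + 1
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / MODEL_FILE).write_text(json.dumps(counts), encoding="utf-8")


def apply_model(input_dataset: Path, input_run: Path, output_dir: Path) -> None:
    """Test software: read the model from input_run, process input_dataset, write results."""
    model = json.loads((input_run / MODEL_FILE).read_text(encoding="utf-8"))
    output_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(input_dataset.glob("*.txt")):
        words = path.read_text(encoding="utf-8").split()
        result = {
            "file": path.name,
            "known_words": sum(1 for w in words if w in model),
            "total_words": len(words),
        }
        # One result file per problem instance (illustrative output format only).
        (output_dir / (path.stem + ".json")).write_text(json.dumps(result), encoding="utf-8")


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Sketch of a train/test software pair")
    parser.add_argument("-i", dest="input_dataset", type=Path, required=True)
    parser.add_argument("-o", dest="output_dir", type=Path, required=True)
    parser.add_argument("-r", dest="input_run", type=Path, default=None)
    args = parser.parse_args(argv)
    if args.input_run is None:
        # No input run given: act as the training software.
        train_model(args.input_dataset, args.output_dir)
    else:
        # Input run given: act as the test software and load the trained model.
        apply_model(args.input_dataset, args.input_run, args.output_dir)
```

In this sketch a single entry point covers both roles: calling main(["-i", "train", "-o", "run1"]) trains, and main(["-i", "test", "-r", "run1", "-o", "run2"]) applies the stored model. Two separate programs, as described in the list above, would work the same way.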

> I have noticed that the PAN14 description page says the command should look like this:
>
> myTrainingSoftware -i path/to/training/corpus -o path/to/output/directory
>
> Is this still relevant? Is it similar for PAN16? Do I need to output the classification model for PAN16 or XML files based on the Test Set?

The description is very much the same for PAN-16:

http://pan.webis.de/clef16/pan16-web/author-identification.html

Best,

Martin

Dr. Martin Potthast
Bauhaus-Universität Weimar
Digital Bauhaus Lab
Bauhausstr. 9a
99423 Weimar
Germany

+49 3643 58 3567
+49 171 809 1945

www.potthast.net
