errors in German test data


Dominik Schlechtweg

May 7, 2024, 1:07:49 PM
to AXOLOTL-24

Hi everyone,

Something I just noticed that may explain some people's low results for German: quite frequently, quotes inside the usages were escaped by doubling them, and extra quotes were added around the whole usage, as in this example:

Test set: "Statt die Hallstein-Doktrin so unauffällig und rasch wie möglich abzubauen, will"" dieser sture Flügel im Bundeskabinett sie verschärfen."
Original: Statt die Hallstein-Doktrin so unauffällig und rasch wie möglich abzubauen, will" dieser sture Flügel im Bundeskabinett sie verschärfen.

This probably happened when loading or storing the CSV data. If participants did not remove these extra quotes when loading the data, the target word indices probably select the right substring as target word.
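A minimal sketch of the offset problem, using only Python's standard csv module. The target word "Flügel" and its character offsets are chosen purely for illustration and are not taken from the test set; the raw field is rebuilt here from the original usage rather than read from the actual file.

```python
import csv
import io

# The usage from the example above, as it appears in the original data.
original = ('Statt die Hallstein-Doktrin so unauffällig und rasch wie möglich '
            'abzubauen, will" dieser sture Flügel im Bundeskabinett sie verschärfen.')

# Hypothetical target word and character offsets (illustration only).
target = "Flügel"
start = original.index(target)
end = start + len(target)

# The same usage as stored in the raw CSV text: wrapped in quotes,
# with the inner quote doubled.
raw_field = '"' + original.replace('"', '""') + '"'

# Applying the offsets to the unparsed text selects a shifted substring.
naive = raw_field[start:end]

# Parsing the field with a CSV reader (default quoting) restores the
# original text, so the same offsets select the target word again.
parsed = next(csv.reader(io.StringIO(raw_field + "\n")))[0]

print(repr(naive))
print(repr(parsed[start:end]))
```

The two extra characters (the opening quote and the doubled inner quote) shift every offset that comes after them, which is why indices only line up once the field has been parsed.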

Best,
Dominik

Dominik Schlechtweg

May 7, 2024, 2:04:12 PM
to AXOLOTL-24
*select the **wrong** substring as target word

Dominik

Mariia Fedorova

May 7, 2024, 2:26:19 PM
to axolo...@googlegroups.com

Dear Dominik,

all our datasets for all three languages use the default pandas quoting (csv.QUOTE_MINIMAL, which means that only fields containing special characters are quoted). It is, and always has been, used in the baseline and scoring scripts. We probably should have emphasized this in the data descriptions, but it is the default setting: the pandas methods read_csv and to_csv use it when no additional arguments are given.
We double-checked, and your example is loaded into the baselines and scorers without extra quotes.

Note also that GitHub parses our datasets in the same way, so the extra quotes are only visible in the "raw" view there.
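As a rough illustration of what csv.QUOTE_MINIMAL does, the sketch below uses Python's standard csv module rather than pandas, so it runs without extra dependencies. The rows and column names are made up for the example and are not taken from the datasets.

```python
import csv
import io

# Hypothetical rows: only the last field contains a special character.
rows = [
    ["id", "usage"],
    ["1", "no special characters here"],
    ["2", 'an inner " quote'],
]

# Write with QUOTE_MINIMAL: only fields that need it are quoted,
# and an inner quote is escaped by doubling it.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerows(rows)
raw_text = buffer.getvalue()
print(raw_text)

# Reading with the default settings reverses the quoting again.
round_tripped = list(csv.reader(io.StringIO(raw_text)))
print(round_tripped == rows)
```

The doubled quotes therefore exist only in the raw file text; a reader using the same quoting convention never sees them.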

Best, 
Maria


From: 'Dominik Schlechtweg' via AXOLOTL-24 <axolo...@googlegroups.com>
Sent: 07 May 2024 20:04:12
To: AXOLOTL-24
Subject: [axolotl] Re: errors in German test data
 
--
You received this message because you are subscribed to the Google Groups "AXOLOTL-24" group.
To unsubscribe from this group and stop receiving emails from it, send an email to axolotl-24+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/axolotl-24/1ea0e3aa-3a79-4bda-b4bb-8090db470654n%40googlegroups.com.

Dominik Schlechtweg

May 10, 2024, 10:17:55 AM
to axolo...@googlegroups.com
*select the **wrong** substring as target word

Dominik


Mariia Fedorova

May 10, 2024, 10:28:06 AM
to AXOLOTL-24

Yes, as said in the previous email, the datasets for ALL THREE languages use csv.QUOTE_MINIMAL. The original WUG was read with csv.QUOTE_NONE, reformatted to our format (removing None and 'andere' ('others') sense description entries), and saved with the default pandas csv.QUOTE_MINIMAL.
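The effect of this conversion on the quoting can be sketched as follows. This is not the actual conversion script: it uses the standard csv module instead of pandas, assumes a tab-separated layout, and works on a made-up two-column sample.

```python
import csv
import io

# Hypothetical source text in which the quote is part of the usage itself
# (assumption: tab-separated, no CSV-level quoting).
source = 'id\tusage\n1\twill" dieser sture Flügel\n'

# Read without any quote processing, as described for the original data.
rows = list(csv.reader(io.StringIO(source), delimiter="\t",
                       quoting=csv.QUOTE_NONE))

# Save with QUOTE_MINIMAL: the field containing a quote gets wrapped
# in quotes and its inner quote is doubled.
out = io.StringIO()
csv.writer(out, delimiter="\t", quoting=csv.QUOTE_MINIMAL,
           lineterminator="\n").writerows(rows)
converted = out.getvalue()

# Reading the converted text with the default quoting restores the usage;
# reading it with QUOTE_NONE keeps the extra quotes.
as_minimal = list(csv.reader(io.StringIO(converted), delimiter="\t"))
as_none = list(csv.reader(io.StringIO(converted), delimiter="\t",
                          quoting=csv.QUOTE_NONE))

print(as_minimal[1][1])
print(as_none[1][1])
```

A loader written for the original, unquoted files would therefore see the extra quotes in the converted files unless it is switched to the default quoting.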




From: axolo...@googlegroups.com <axolo...@googlegroups.com> on behalf of nick arefyev <nick.a...@gmail.com>
Sent: 09 May 2024 16:12
To: AXOLOTL-24
Subject: Re: [axolotl] Re: errors in German test data
 
I had some issues with this before as well. The original DWUG datasets do not use extra quoting; all quotes there come from the original texts. When using pandas, they should be loaded with the quoting=csv.QUOTE_NONE flag. Maria, have you converted the quoting of the DWUG Sense de to be compatible with the default quoting=csv.QUOTE_MINIMAL?