About hidden dataset!

kos...@alcinia.net

Feb 27, 2020, 2:04:38 PM

to SIGMOD 2020 Contest Google Group

Hello,

I am Kostas of team ENdiTY! Interesting theme this year! Even though it's not as performance-oriented as we expected it to be, we are excited to experiment with machine learning approaches in the field of databases! We appreciate the challenge and we will give it our best effort!

Our team has the following questions for the organisers!

1) For the hidden dataset, on which you will evaluate our solutions, will the camera specifications change along with the labels dataset provided? Or will you provide only a different labels dataset?

2) Can you provide any information about the size of the hidden dataset? Specifically, the number of specifications and the number of labels?

3) What is the main focus of this task?

It seems some teams have already achieved near-perfect scores (~1). Does this mean that this task is focused on performance optimization, or do you believe these scores are the result of over-fitting or other factors that will make the results not reproducible?

Thanks in advance,

Kostas and ENdiTY team!

marios mintzis

Feb 27, 2020, 7:29:57 PM

to SIGMOD 2020 Contest Google Group

Good luck getting an answer from the organizers, Kosta

alaska benchmark

Feb 28, 2020, 11:32:42 AM

to SIGMOD 2020 Contest Google Group

Hello Kostas,

Thank you for your appreciation; we hope you will enjoy the competition :)

1) For the hidden dataset, on which you will evaluate our solutions, will the camera specifications change along with the labels dataset provided? Or will you provide only a different labels dataset?

The camera dataset available on the website (that is, dataset X) is the full dataset with all specifications (29'787). The "hidden dataset" (which we call the "evaluation dataset" on the website) is composed of manually labelled matching and non-matching pairs drawn from a subset of the specifications in dataset X. We have already provided a portion of this evaluation dataset (the "Medium Labelled Dataset", a.k.a. "Dataset Y"). Dataset X will not change during the contest; on March 1st we will simply provide a larger part of the evaluation dataset, the "Large Labelled Dataset", which will include the "Medium Labelled Dataset".

2) Can you provide any information about the size of the hidden dataset? Specifically, the number of specifications and the number of labels?

The evaluation dataset contains ~150k matching pairs (pairs of specifications which refer to the same product) and ~7M non-matching pairs (pairs of specifications which don't refer to the same product). The total number of unique specifications involved in the labelled pairs is 3'865.
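
As a rough sanity check on these figures, the short Python sketch below counts how many unordered pairs 3'865 specifications can form and estimates the share of matching pairs among the labelled ones. The variable names are illustrative, and the ~150k and ~7M values are the rounded figures quoted above, so the output is only indicative.

```python
from math import comb

n_specs = 3865            # unique specifications in the labelled pairs
matching = 150_000        # approximate figure quoted above
non_matching = 7_000_000  # approximate figure quoted above

# Number of unordered pairs that 3'865 specifications can form.
all_pairs = comb(n_specs, 2)

# Share of matching pairs among the (approximate) labelled pairs.
labelled = matching + non_matching
positive_share = matching / labelled

print(all_pairs)                 # 7467180
print(f"{positive_share:.1%}")   # 2.1%
```

With roughly 2% positive pairs, the labels are heavily imbalanced, which is one reason a measure such as F-measure on the matching class is more informative than plain accuracy.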

3) What is the main focus of this task?

Note that the solution need not be a machine learning algorithm; you can use rule-based systems or any other kind of approach you prefer. The challenge has a twofold focus. On the one hand, the main aspect is finding the best approach for Entity Resolution in terms of F-measure. On the other hand, efficiency is also important, as ties will be broken based on the running time for creating the solution file.
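
For readers unfamiliar with the metric, here is a minimal sketch of an F-measure over matching pairs. It assumes the balanced F1 variant (harmonic mean of precision and recall) and treats pairs as unordered; the thread does not spell out the exact evaluation script, so the function names and these details are assumptions, not the organisers' implementation.

```python
def normalize(pair):
    """Treat (a, b) and (b, a) as the same unordered pair."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def f_measure(predicted, gold_matches):
    """F1 of predicted matching pairs against gold matching pairs (assumed variant)."""
    pred = {normalize(p) for p in predicted}
    gold = {normalize(p) for p in gold_matches}
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical specification identifiers, for illustration only.
gold = [("a", "b"), ("a", "c"), ("b", "c")]
pred = [("b", "a"), ("a", "c"), ("a", "d")]
print(round(f_measure(pred, gold), 3))  # 0.667
```

In this toy example two of the three predicted pairs are correct and two of the three gold pairs are found, so precision, recall and F1 are all 2/3.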

The final leaderboard results will determine the finalists of the competition. The final stage, in which we will run your code on our machines, is used to:

- verify that your code actually produces results comparable to those you provided in the submissions (± 0.05 difference in F-measure) and is actually an automated process (i.e. it does not simply take results from an external CSV);
- identify the most time-efficient system (the one that produces the output in the least time) in case of a draw between two or more participants.
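
The two checks above could be expressed roughly as follows. This is only an illustrative sketch: the `Entry` fields, the team names and the choice to rank by the submitted F-measure are assumptions made for the example, not a description of the organisers' actual procedure.

```python
from dataclasses import dataclass

TOLERANCE = 0.05  # allowed F-measure difference, as stated above


@dataclass
class Entry:
    team: str            # hypothetical team name
    submitted_f: float   # F-measure reported on the leaderboard
    reproduced_f: float  # F-measure obtained when the code is re-run
    seconds: float       # running time of the re-run


def is_reproducible(entry, tol=TOLERANCE):
    # Small epsilon guards against floating-point noise at the boundary.
    return abs(entry.submitted_f - entry.reproduced_f) <= tol + 1e-9


def rank(entries):
    """Drop non-reproducible entries, then sort by F-measure, breaking ties by time."""
    valid = [e for e in entries if is_reproducible(e)]
    return sorted(valid, key=lambda e: (-e.submitted_f, e.seconds))


entries = [
    Entry("team_a", 0.99, 0.98, 120.0),
    Entry("team_b", 0.99, 0.99, 45.0),
    Entry("team_c", 0.99, 0.80, 10.0),  # differs by more than 0.05
]
print([e.team for e in rank(entries)])  # ['team_b', 'team_a']
```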

We are planning to ask the top participants to submit their code before the final contest deadline and to publish the running times on the leaderboard, so that you can also work on the running time of your solution and compare it with others.

Best regards,

Donatella, Andrea, Maurizio, Federico - Programming Contest Co-Chairs

kos...@alcinia.net

Mar 1, 2020, 11:02:52 AM

to SIGMOD 2020 Contest Google Group

Thank you for the in-depth response!

... and is actually an automated process (i.e. it does not simply take results from an external CSV);

What about auxiliary data? Could we use a list of camera manufacturers?

What about task-specific information? Could we implement the extraction of camera-specific features (i.e. resolution, brand, etc.)?

alaska benchmark

Mar 1, 2020, 11:35:39 AM

to SIGMOD 2020 Contest Google Group

Hello Kostas,

Sure, you can use auxiliary data and you can also extract features from the specifications. The important thing is that these data are provided to us so that we can reproduce your results.
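
To illustrate the kind of approach discussed here, the sketch below extracts a brand and a resolution from a specification title using an auxiliary manufacturer list. The brand list, the regular expression and the sample title are all hypothetical and chosen only for the example; the actual fields and formats in dataset X may differ.

```python
import re

# Hypothetical auxiliary data: a small list of camera manufacturers.
# Per the organisers' answer, such data would need to ship with the code.
KNOWN_BRANDS = ["canon", "nikon", "sony", "fujifilm", "olympus"]

# Matches values such as "20.2 MP" or "16 megapixels" (assumed formats).
MEGAPIXEL_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:mp|megapixels?)\b", re.IGNORECASE)


def extract_features(title):
    """Return the first known brand and the megapixel count found in a title."""
    lowered = title.lower()
    brand = next(
        (b for b in KNOWN_BRANDS if re.search(rf"\b{re.escape(b)}\b", lowered)),
        None,
    )
    match = MEGAPIXEL_RE.search(title)
    megapixels = float(match.group(1)) if match else None
    return {"brand": brand, "megapixels": megapixels}


# Hypothetical title, for illustration only.
print(extract_features("Canon EOS 70D 20.2 MP Digital SLR Camera"))
# {'brand': 'canon', 'megapixels': 20.2}
```

Keeping the manufacturer list inside the submitted code (or in a file submitted with it) is one simple way to make sure the auxiliary data is available when the solution is re-run.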