Should I use Y_i in my program?

Jerry Lu

Mar 10, 2021, 8:40:55 AM

to SIGMOD 21 Contest

Hello! I am a little confused about the task. I know X_i are all the records, and Y_i are labels. Should my program use the Y_i labels as training data to help it decide future entity resolutions? Or should I only use Y_i as a reference to check my answers, with X_i as the sole input? Thanks.

team KAIDONG2008

giovanni.simonini

Mar 10, 2021, 9:39:23 AM

to SIGMOD 21 Contest

Hi, KAIDONG2008 team!

I guess you can see the task this way:

1. X_i + Y_i is your labeled data set (Y_i are actually the labels).
2. You can split X_i + Y_i into train/validation/test (just an example, you decide what to do) and train a classifier.
3. You submit your trained classifier with ReproZip, and we will give it a data set Z_i as input (same format as X_i) to predict its labels (same format as Y_i).

That said, you are free not to use Y_i, or to use a different approach, i.e., one not based on a classical ML classifier.
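The train/validation/test split mentioned above could look roughly like the following. This is only a sketch: the function name `split_labeled_pairs`, the 70/15/15 proportions, and the `(left id, right id, label)` row layout are assumptions for illustration, not something the contest prescribes.

```python
import random

def split_labeled_pairs(pairs, train=0.7, validation=0.15, seed=0):
    """Shuffle labeled pairs and cut them into train/validation/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever is left becomes the test set
    )

# Hypothetical rows derived from Y_i: (left id, right id, label), 1 = same entity.
labeled = [(f"a{i}", f"b{i}", i % 2) for i in range(20)]
train_set, val_set, test_set = split_labeled_pairs(labeled)
```

Every labeled pair ends up in exactly one of the three lists, so the validation and test pairs are never seen during training.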

Jerry Lu

Mar 10, 2021, 11:04:49 AM

to SIGMOD 21 Contest

Thanks for your prompt reply! Does that mean that I can

a. Build a model with X_i and Y_i locally, then upload the model and a model runner that only runs the model on Z_i but doesn't modify the model at runtime?

b. Build a model locally, but allow the model to be dynamically updated at runtime on your judge machine?

From your explanation it seems that I can do both; the only constraint is the compute time.

Jerry Lu

Mar 10, 2021, 3:51:13 PM

to SIGMOD 21 Contest

Also, I am wondering whether I am required to use only one model for all datasets, or whether I can build different models (probably with domain knowledge and some feature engineering) for each dataset and choose the right model at runtime based on the input file name.

Jerry Lu

Mar 15, 2021, 2:49:40 PM

to SIGMOD 21 Contest

Hello, is anyone there?

giovanni.simonini

Mar 15, 2021, 3:21:02 PM

to SIGMOD 21 Contest

Hi Jerry,

sorry for the delay!

Your submission will not have access to the ground truth at evaluation time (i.e., you cannot retrain an ML model during the evaluation, since it will not have access to the new labels). So case b is not possible, if I understood it correctly.

For the rest, we are going to treat your submission as a black box, i.e., what is inside is up to you.

The only constraints are that, at evaluation time, we give the Z_i datasets as input (possibly in a shuffled order), and that the ReproZip bundle must be < 3GB (with any code/models you want).
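To make the "train locally, only predict at evaluation time" workflow concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the "model" is just a similarity threshold, the file name `model.json` is hypothetical, and the scores are made up. The point is only that the evaluation step loads a frozen artifact and never reads labels.

```python
import json
import os
import tempfile

def train_threshold(labeled_scores):
    """Offline step: pick the similarity cut-off that best fits the labels."""
    candidates = sorted({score for score, _ in labeled_scores})

    def correct(threshold):
        return sum((score >= threshold) == bool(label)
                   for score, label in labeled_scores)

    return max(candidates, key=correct)

def save_model(threshold, path):
    with open(path, "w") as f:
        json.dump({"threshold": threshold}, f)

def load_model(path):
    with open(path) as f:
        return json.load(f)["threshold"]

def predict(threshold, scores):
    """Evaluation step: no labels are read and the model is not updated."""
    return [int(score >= threshold) for score in scores]

# Offline: hypothetical (similarity score, label) pairs derived from X_i + Y_i.
training = [(0.9, 1), (0.8, 1), (0.4, 0), (0.2, 0)]
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(train_threshold(training), model_path)

# Evaluation time: load the frozen model and label unseen scores from Z_i.
frozen = load_model(model_path)
predictions = predict(frozen, [0.95, 0.1])
```

In a real submission the saved model file would be packed into the bundle together with the code, and only the last two lines would run on the judge machine.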

Jerry Lu

Mar 15, 2021, 3:45:01 PM

to SIGMOD 21 Contest

Thanks for your reply. That sounds more like option a in my description: we use the limited X_i and Y_i (across all datasets) to train a model, and the same model is then used to perform entity resolution on all of the full datasets (all Z_i), given in shuffled order. So basically we can only have one model, or we can have multiple models but some method needs to be developed to dynamically choose one model that works the best without looking at the file name?

Also, I am wondering what input file name my program should use. Previously I assumed it would be something like "X1.csv", yet that doesn't give any result in your environment. Should I just use "X.csv" or "Z.csv"? Sorry if the question is stupid, but I have read the instructions multiple times and cannot figure out which file name to read from.

Jerry Lu

Mar 15, 2021, 3:48:44 PM

to SIGMOD 21 Contest

For example, there is a line saying "Inside your code, it is important to refer to each input dataset Xi using its original name "Xi.csv".", but I am having no luck with that. Could you give a minimal code example that just reads the input files and outputs nothing?

Jerry Lu

Mar 15, 2021, 5:30:40 PM

to SIGMOD 21 Contest

Or in other words, can we expect all dataset files (X1.csv, X2.csv, ...) to be present in the program folder at runtime, so that we need to produce multiple output.csv files? Or will our program be executed multiple times per submission, each time with a different input file name, but always writing to output.csv? How can we know which input file belongs to a specific execution? Via stdin? Or do we iterate through all files in the current path that end with ".csv"?

giovanni.simonini

Mar 16, 2021, 6:40:32 AM

to SIGMOD 21 Contest

"So basically we can only have one model, or we can have multiple models but some method needs to be developed to dynamically choose one model that works the best without looking at the file name?"

**Yes**, exactly.
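One possible way to choose among several models without looking at the file name is to inspect the header row of the input. The sketch below assumes that each data set has at least one column the others lack; the column name `cpu_brand` and both model functions are hypothetical placeholders, not part of the contest material.

```python
import csv
import io

# Hypothetical per-domain models; real ones would return predicted pairs.
def match_notebooks(rows):
    return "notebook-model"

def match_generic(rows):
    return "generic-model"

# A column that (by assumption) appears only in one data set's schema.
MODELS_BY_SIGNATURE_COLUMN = {"cpu_brand": match_notebooks}

def pick_model(csv_text):
    """Select a model from the header row, never from the file name."""
    header = next(csv.reader(io.StringIO(csv_text)))
    for column, model in MODELS_BY_SIGNATURE_COLUMN.items():
        if column in header:
            return model
    return match_generic  # fallback when no known schema is recognized
```

For example, `pick_model("instance_id,title,cpu_brand\n")` returns `match_notebooks`, while an input with an unrecognized header falls back to `match_generic`. This keeps working even if the data sets are presented in shuffled order.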

giovanni.simonini

Mar 16, 2021, 6:42:34 AM

to SIGMOD 21 Contest

We will soon publish some details about the evaluation process on the website, so these aspects (how to handle the future situation involving more data sets) will become clearer.
For the time being, since there is only one data set, please just name it X2.csv.

Note that the toy data set (i.e., X1) is no longer evaluated; you just have X2 now.
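A minimal program of the kind asked for earlier in the thread might look like the sketch below. It assumes that X2.csv sits in the working directory and that results go to output.csv with the column names shown; the thread confirms only the input name X2.csv, so the location, the output format, and the function names are all assumptions.

```python
import csv
import os

def read_records(path="X2.csv"):
    """Return the rows of the input data set, or [] if the file is absent."""
    if not os.path.exists(path):
        return []
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def write_matches(pairs, path="output.csv"):
    """Write predicted matching pairs; the column names are assumptions."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["left_instance_id", "right_instance_id"])
        writer.writerows(pairs)

def main(input_path="X2.csv", output_path="output.csv"):
    records = read_records(input_path)
    # No matching logic yet: emit a header-only file so the pipeline runs end to end.
    write_matches([], output_path)
    return len(records)
```

Calling `main()` with no arguments reads X2.csv from the current directory and writes a header-only output.csv, which is enough to check that the file name is being picked up before adding any matching logic.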
