Including new datasets in the future?

Nick Doiron

May 16, 2021, 2:01:57 PM
to gem-benchmark
Hi GEM benchmark team,

I recently processed thousands of questions and answers from /r/AskNYC, a forum for questions from residents and visitors of New York City.

I'm continuing to work on this dataset to improve its overall quality, and I hope to build a process that removes more of the toxic or forum-specific answers.
I saw that GEM has a couple of fact-based generation benchmarks for restaurants, and wondered whether this dataset could be adapted for your uses, or whether that could be a future collaboration.
Some issues with the dataset: the questions are long, there are multiple valid answers to the same question (i.e. not just q -> a), copyright could be an issue, and filtering by Reddit votes cannot remove all offensive content.
I have some more info on my GPT-NYC model page.

-- Nick Doiron

Sebastian Gehrmann

May 20, 2021, 9:48:42 AM
to Nick Doiron, gem-benchmark
Hi Nick, 

Thank you for reaching out about this. We will start the selection process for the next iteration of GEM right after our workshop in August, and you will be able to submit a proposal then.

In the meantime, if you would like to take advantage of our infrastructure for evaluation and challenge set creation, you could add your dataset to Hugging Face Datasets in a format similar to that of GEM, and most of our scripts should run with only minor modifications.

Best
Sebastian
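
For anyone following the suggestion above, here is a rough sketch of how forum Q&A rows with several answers per question might be reshaped into GEM-style records. The input keys (`question`, `answer`, `score`), the `asknyc` name, the vote threshold, and the `gem_id` pattern are all assumptions made for illustration, not the actual GPT-NYC or GEM schema. The `target` / `references` layout only mirrors how GEM tasks generally expose one target plus a list of references, so please check it against the real GEM loaders before relying on it.

```python
from collections import defaultdict


def to_gem_like_records(rows, min_score=1, dataset_name="asknyc", split="train"):
    """Group answers by question and emit one record per question.

    `rows` is an iterable of dicts with the hypothetical keys
    "question", "answer" and "score" (the Reddit vote count).
    """
    grouped = defaultdict(list)
    for row in rows:
        # Vote filtering is only a crude quality signal; as noted in the
        # thread, it cannot remove all offensive content.
        if row["score"] >= min_score:
            grouped[row["question"]].append((row["score"], row["answer"]))

    records = []
    # Questions whose answers were all filtered out never enter `grouped`,
    # so they are dropped here.
    for i, (question, scored) in enumerate(grouped.items()):
        # Highest-voted answer first; the sort is stable, so ties keep input order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        answers = [answer for _, answer in scored]
        records.append(
            {
                "gem_id": f"{dataset_name}-{split}-{i}",  # assumed ID pattern
                "question": question,
                "target": answers[0],   # single answer used as the training target
                "references": answers,  # every surviving answer counts as valid
            }
        )
    return records


# Made-up rows, for illustration only.
example_rows = [
    {"question": "Where can I store luggage for a day?", "answer": "A", "score": 12},
    {"question": "Where can I store luggage for a day?", "answer": "B", "score": 30},
    {"question": "Where can I store luggage for a day?", "answer": "C", "score": -4},
    {"question": "Is the ferry free?", "answer": "D", "score": 5},
]
records = to_gem_like_records(example_rows)
```

Keeping every surviving answer in `references` is one way to handle the "not just q -> a" issue: a generated answer can then be scored against all of them. If the `datasets` library is installed, a list like `records` can be turned into a dataset with `datasets.Dataset.from_list(records)`.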
