How good datasets shape state-of-the-art AI models - Open Source AI Hackathon talk by Abhishek Mishra

Zainab Bawa

Jan 8, 2024, 2:35:17 AM
to data...@googlegroups.com

Building state-of-the-art models is less a matter of having a superior architecture than of having superior datasets. Everybody says that we need good datasets, but nobody clearly defines what a good dataset actually is. In this talk, Abhishek will cover exactly what it means to say a dataset is good, and what goes into curating and refining datasets to build state-of-the-art models. He will also discuss the best existing pretraining/SFT datasets in the field of LLMs.
Abhishek will cover the following topics:

  • Challenges in dataset curation and in understanding what good data is.
  • Dataset curation/refining techniques.
  • Desired properties of datasets in current SoTA tracks.
  • Top openly available SFT datasets.

The primary takeaway from this talk is the skill of curating and refining your own high-quality datasets.

This talk is useful for anybody in the industry who is looking to build the best in-house models for their tasks, or generic state-of-the-art models in the field of LLMs.

The talk will be held online at 6 PM, via Zoom and YouTube. RSVP to participate: http://has.gy/at1k
