Dear everyone,
I am excited to share that the Data in Generative Models: The Bad, the Ugly, and the Greats workshop will be hosted at ICML 2025 in Vancouver (workshop date to be announced soon). Please find the workshop's details in the Call for Papers below.
=== Building reliable and trustworthy Generative AI must start with high-quality and responsibly managed training data. ===
Sincerely,
Khoa D Doan, on behalf of the Organizing Committee
********************************************************************************
The 2025 Data in Generative Models: The Bad, the Ugly, and the Greats workshop @ ICML 2025 (DIG-BUGS, 2nd Edition)
********************************************************************************
We cordially invite submissions to and participation in our “Data in Generative Models (The Bad, the Ugly, and the Greats)” workshop, which will be held on July 18th or July 19th, 2025, at the Forty-Second International Conference on Machine Learning (ICML 2025) in Vancouver, Canada. Website: https://icml2025digbugs.github.io/
Motivation and Topics
Building upon the success of our previous workshop, BUGS@NeurIPS 2023, we will continue to explore the impact of data on AI beyond backdoor attacks. A key enhancement of the 2025 edition is a substantial expansion of the 2023 research themes, focusing on the interaction between datasets and generative models, including large language models, diffusion models, and vision-language models. Examples of research areas include:
Data-Centric Approach to the Safety of Generative Models (general theme): Most research in safe machine learning concentrates primarily on evaluating model properties (Papernot et al.). We instead approach the safety of generative models, such as LLMs and diffusion models, from a data-centric perspective. This strategy will yield new methodologies and reveal vulnerabilities and potential threats that the research community has not yet sufficiently recognized.
Data Memorization in Generative Models: Current approaches to preventing data memorization in generative models offer limited solutions, such as removing, collecting, or augmenting specific data samples (Wen et al.). This area of research aims at a comprehensive understanding of data memorization, extending the analysis beyond training-set distributions and model outputs. It requires examining the characteristics of individual data samples that make them susceptible to memorization, focusing on their representations across multiple internal model layers rather than solely on model outputs, as done in previous studies.
Data Contamination in Generative Models: The inclusion of test data from downstream tasks and benchmarks in the training data of generative models makes it difficult to accurately measure the models’ true effectiveness. Developing new, effective methods for detecting data contamination within generative models is therefore essential. Such approaches should identify potential contamination at the individual-instance level and then use this information to evaluate broader contamination at the dataset level.
Data Verification in Generative Models: This area focuses on the inputs and outputs of generative models. Current methods for identifying training samples and verifying model outputs are typically assessed on academic benchmark datasets and small-scale models, as highlighted in recent work (Dubiński et al.; Duan et al.). We aim to bridge this gap by focusing on data verification for large-scale generative models trained on extensive datasets, ensuring privacy protection at scale in real-world scenarios.
We welcome submissions related to all aspects of Data in Generative AI, including but not limited to:
Data-Centric Approach to the Safety of Generative Models
Data Memorization in Generative Models
Data Contamination in Generative Models
Data Verification in Generative Models
Data Poisoning and Backdoors in Generative Models
Generative Data for Trustworthy AI Research (e.g., synthetic datasets for security studies, synthetic augmentation for robust models, etc.)
The workshop will employ a double-blind review process. Each submission will be evaluated based on the following criteria:
Soundness of the methodology
Relevance to the workshop
Societal impacts
We only consider submissions that have not been published in any peer-reviewed venue, including the ICML 2025 conference. Dual submissions with other workshops or conferences are allowed. The workshop is non-archival and will not have official proceedings. All accepted papers will be allocated either a poster presentation or a talk slot.
Important Dates (Tentative)
Submission deadline: May 20th, 2025, 11:59 PM Anywhere on Earth (AoE)
Author notification: June 9th, 2025
Camera-ready deadline: June 30th, 2025, 11:59 PM Anywhere on Earth (AoE)
Workshop date: TBD (full-day event)
Please visit the workshop’s website (https://icml2025digbugs.github.io/) for more details, including the submission instructions.
Organizers
Khoa D Doan | Franziska Boenisch | Adam Dziedzic | Aniruddha Saha | Viet Anh Nguyen | Zhenting Wang | Bo Li | Heather Zheng