Fwd: Data is for life (and not just International Data Week)

Prash

Oct 16, 2025, 7:55:48 AM

to bioc...@googlegroups.com, Scott C Edmunds via Cassyni

Data is for life (and not just International Data Week)

This week is International Data Week in Brisbane, and having watched along and participated I've decided to share a brief write-up of what I thought the highlights and themes of the event were.

Scott C Edmunds

Oct 16

The International Data Week conference in Brisbane has just ended, although co-located events such as DataCite Connect are continuing for a few more days. Every two years IDW brings together the Research Data Alliance and CODATA meetings (also bringing World Data System into the organisational mix), and while I’ve been to the smaller individual meetings I do prefer the combined meetings to get a bigger (data) bang for my buck. I was privileged to attend in Botswana in 2018 and Salzburg in 2023, and virtually attended the (logistically tricky because of COVID) Korean meeting in 2022. As a nerd for all things data, I was very disappointed to have to pull out of going to Brisbane this year for reasons outside my control, but as some consolation I could still watch the conference online.

Being in Bangkok, the timezone wasn’t as bad for me as for many of the virtual attendees (around 100 people watching online while 700 were in-person), although I still didn’t have the energy to get up at 6am for the morning plenary talks. Of the remaining sessions, I thought it would be useful to try to digest what I saw and provide a brief write-up of what I thought the highlights and themes of the event were. In the closing plenary CODATA President Mercè Crosas presented a wordcloud generated from the submissions this year, and she noted that (as with every conference now) there was an increase in talks on the topic of AI. AI was now mentioned roughly as often as FAIR, although there were also many mentions of AI-readiness and Interoperability, which are also key components of FAIR.

I presented in two sessions on Tuesday, both on the experiments GigaScience has been doing to work Machine Learning (ML) standards into the peer review and publishing process. My main talk, accepted in the SciDataCon tracks of IDW, was in the afternoon “Rigorous, responsible and reproducible science in the era of FAIR data and AI / Infrastructures to Support Data-Intensive Research” session. This covered lots of areas relating to AI: Francis Crawley presented quite a legalistic talk on the challenges of data sovereignty and AI in the European Health Data Space, and there were case studies in FAIR use of AI for climate simulation data as well as my particular ML-publishing case study. My mind was most blown by the talk by Lynn Woolfrey from the University of Cape Town on the DataFirst African research data service, which has done sterling work over the years providing data rescue services when regime changes in African countries have threatened access to government data. This year they have had to step in and rescue data from US government funded projects, and Lynn presented a few examples of the 190 USAID education projects that have had to be moved to secure and stable servers at UCT. Truly heroic work there.

My other talk presented a trimmed down version of our ML standards case study in the “Bridging the FAIR gap: transforming the long tail of supplementary data & generalist repositories into FAIR datasets” morning track. Despite the title making it sound like the session was about static supplemental files, the session was very AI-centred as well. Thomas Lemberger of EMBO Press showed the latest AI enhancements to their Source Data platform, such as integrating multimodal quality checks and open libraries of other AI data checks (for example this library for imaging data). I also really liked Slava Tykhonov’s talk on the Croissant metadata format for ML-ready datasets, and its integration into Kaggle, HuggingFace, OpenML and Dataverse. You can see the prerecorded video of the shorter version of my case study on YouTube, and also read about it in more detail in our preprint.

I was supposed to represent Dryad at the GREI (Generalist Repository Ecosystem Initiative) “Stronger together: Advancing the data repository ecosystem through strategic coopetition” session, and although I could not be physically present it was still nice to watch online. Kristi Holmes, representing Zenodo, gave a great talk on GREI’s collaborative framework, and Mark Hahnel, representing figshare, presented GREI’s work on metadata. As an NIH-funded project it was no surprise that they recommend MeSH for subject headings, but if GREI expanded beyond biomedicine it would be interesting to see how well this worked for other disciplines. Matt Buys from DataCite covered metrics integration and standardisation. Being involved in the Make Data Count (MDC) project, I was pleased that Matt and Kristi both gave the MDC best practice recommendations a plug and talked about integration of the MDC/DataCite Usage Tracker, which has been very useful for aligning and standardising metrics across data repositories.

Having the meeting in the Asia Pacific was great for boosting regional representation from the East, but unfortunately some of the Asia-Pacific focused sessions overlapped with each other. I enjoyed the Research data stewardship in the Asia Pacific track as it showed how much librarians have stepped into the data stewardship role in Asia. I particularly liked the talk by Jennifer Gu from HKUST Library, which systematically studied Data Availability Statements (DAS) in their institution’s papers. This is a topic of interest to me, as when I was teaching Data Management and Curation at HKU I set my students a similar project to manually quantify reproducibility and compliance with HKU’s open data policies as a literature curation exercise. I’ve previously presented a poster on this at WCRI, and also have the data in Figshare, but this is a nice reminder that I should probably write this up in more detail. One interesting finding from Jennifer’s work is a clearer citation advantage from having a DAS than from OA compliance.

IDW is a combined “bringing together of the data tribes” conference, so on top of the more general CODATA SciDataCon tracks, RDA runs meetups for their various Working and Interest Groups. I’ve been signed up to many of these over the years, but as the timezones are tricky in Asia my attendance at their calls hasn’t been great, so it was nice to catch up with a lot of these efforts. The RDA FAIR for Machine Learning (FAIR4ML) Interest Group chaired by Dan Katz was interesting, and brought together a lot of the ML and FAIR threads and topics raised elsewhere in the meeting. This was quite an interactive session where the participants got to drill into the current draft of the FAIR for Machine Learning Model and say what was missing and what did and didn’t make sense.

It was good to see from Susanna Sansone’s talk on the last day how much the FAIRsharing platform has developed since it first launched as Biosharing in 2011. There are growing numbers of third party integrations of this service, including the ELIXIR DSW (Data Stewardship Wizard) that has been the vehicle GigaScience has used to carry out ML annotation and peer review. It was also interesting to hear there are now over 30 FAIR assessment tools out there, and FAIRsharing assists these by providing consistent and well curated policy data to aid standardisation and accuracy. Susanna ended with an update on the Tier2 project investigating reproducibility across research disciplines. GigaScience has acted as a positive control in this effort, and you should look out for new outputs from this project that are about to be announced (watch this space).

The next IDW in 2027 will be hosted in Cape Town, and I really hope I won’t miss that one.

References

Akhtar M et al. 2024. Croissant: A Metadata Format for ML-Ready Datasets. In Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning (DEEM ’24). Association for Computing Machinery, New York, NY, USA, 1–6. https://doi.org/10.1145/3650203.3663326

Edmunds SC, Nogoy N, Lan Q, Zhang H, Fan Y, Zhou H, et al. 2025. Integrating Machine Learning Standards in Disseminating Machine Learning Research. MetaArXiv. https://doi.org/10.31222/osf.io/y6jh2_v1

Prashanth N Suravajhala, Ph.D.

Professor, Systems Genomics Group

Department of Biosciences, Room # 323D, AB-3

Manipal University Jaipur, Dehmi Kalan 303007, India.

Founder, Bioclues.org

Group page: http://www.bioinformatics.org/wiki/Prash

Twitter: @prashbio

"One rule is important in science- only courageous people win " ~ Max Planck
