Handling spurious or invalid datasets during an experimental session (data curation)


Allan Pinto

May 7, 2026, 2:17:48 PM (6 days ago)
to icatgroup
Dear all,
I hope you are all doing well!

I would like to know whether you have any mechanism or procedure for collecting user feedback on which parts of the data produced during an experimental session should be sent to the catalogue. I am thinking of a data curation procedure to identify which datasets should be preserved, catalogued, remeasured, or discarded.

I am asking because users may generate data that later needs to be remeasured for several reasons, such as sample misalignment, acquisition problems, or other experimental issues. In such cases, this spurious or invalid data could potentially be discarded without compromising the overall experiment.

My impression is that some facilities choose to catalogue everything by default, but I am not sure whether this is the most common practice. I think that BLISS, for instance, has some mechanisms for users to manage transient data, i.e., data living between acquisition and cataloguing, but I do not know exactly whether, or how, this kind of feedback is collected from users.

Have you dealt with this kind of issue before? If so, could you please share how your facility handles the distinction between valid data, data to be remeasured, and data that should not be preserved or catalogued? I would be very grateful if anyone could share information, experiences, or suggestions on this matter.

Best regards,
Allan Pinto

Marjolaine Bodin

10:34 AM (6 hours ago)
to Allan Pinto, icatgroup

Hello Allan,

Thanks for your message, this is indeed an important topic for us as well.

At the moment, our policy is to catalogue everything by default. The only exception is data placed under a specific "nobackup" folder, which is excluded from cataloguing and therefore not preserved in the catalogue.
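The "nobackup" exclusion described above could be sketched as a simple path filter; the folder name comes from Marjolaine's message, but the function and example paths below are purely illustrative, not actual facility code:

```python
from pathlib import PurePosixPath

def should_catalogue(dataset_path: str) -> bool:
    """Return False for datasets under a 'nobackup' folder (illustrative)."""
    return "nobackup" not in PurePosixPath(dataset_path).parts

# Hypothetical dataset paths, for illustration only.
paths = [
    "/data/visitor/ma1234/id00/sample1/scan0001",
    "/data/visitor/ma1234/id00/nobackup/testscan",
]
to_catalogue = [p for p in paths if should_catalogue(p)]
```

With this filter, only the first path would be sent to the catalogue; anything under a `nobackup` component is skipped.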

In practice, cataloguing everything does not represent a significant cost for us; the main constraint is actually storage, rather than the cataloguing step itself.

We are currently considering a mechanism to tag datasets as "good" or "bad" (or with similar quality/selection flags). In that approach, even "bad" data would still be catalogued, so the quality information is preserved for traceability. As a first implementation, the flag would be supplied by the user through the data portal, with a later evolution toward automated workflows that detect such cases.
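A minimal sketch of what such a quality flag might look like, assuming a per-dataset record; the class, field names, and values here are hypothetical, not an existing ICAT or data-portal API:

```python
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    """Illustrative dataset record carrying a user-supplied quality flag."""
    name: str
    quality: str = "unrated"   # e.g. "good", "bad", "remeasure" (assumed values)
    comment: str = ""

def tag(record: DatasetRecord, quality: str, comment: str = "") -> DatasetRecord:
    """Attach a quality flag; the record stays in the catalogue either way."""
    record.quality = quality
    record.comment = comment
    return record

# The user marks a dataset as bad via the portal; it is still catalogued.
ds = tag(DatasetRecord("align_check_0007"), "bad", "sample misaligned")
```

The point of the sketch is that tagging adds metadata rather than deleting anything, which keeps the full acquisition history traceable.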

One challenge we also foresee is that within a single dataset containing multiple scans, only a subset of scans might be considered invalid or worth discarding. This makes the granularity of any tagging or curation mechanism more complex than a simple dataset-level decision.
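The scan-level granularity problem could be represented by flags per scan inside one dataset; the structure and names below are a hypothetical sketch, not an existing schema:

```python
# One dataset containing several scans, each with its own quality flag.
dataset = {
    "name": "tomo_series_01",
    "scans": {
        "scan0001": "good",
        "scan0002": "bad",   # e.g. an acquisition glitch mid-series
        "scan0003": "good",
    },
}

# The dataset as a whole stays catalogued, but downstream tools can
# select only the scans flagged "good".
good_scans = [s for s, q in dataset["scans"].items() if q == "good"]
```

This makes the curation decision per scan rather than per dataset, at the cost of a more complex tagging interface.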

We would like to work on this in the coming months, and we would be happy to continue the discussion.

All the best,

  Marjolaine


