Handling spurious or invalid datasets during experimental session (data curation)

Allan Pinto

May 7, 2026, 2:17:48 PM
to icatgroup
Dear all,
I hope you are all doing well!

I would like to know whether you have any mechanism or procedure for collecting user feedback on which parts of the data produced during an experimental session should be sent to the catalogue. I am thinking of a data curation procedure to identify which datasets should be preserved, catalogued, remeasured, or discarded.

I am asking because users may generate data that later needs to be remeasured for several reasons, such as sample misalignment, acquisition problems, or other experimental issues. In such cases, this spurious or invalid data could potentially be discarded without compromising the overall experiment.

My impression is that some facilities choose to catalogue everything by default, but I am not sure whether this is the most common practice. I think that BLISS, for instance, has some mechanisms for users to manage transient data, i.e., data living between acquisition and cataloguing, but I do not know exactly whether, or how, this kind of feedback is collected from users.

Have you dealt with this kind of issue before? If so, could you please share how your facility handles the distinction between valid data, data to be remeasured, and data that should not be preserved or catalogued? I would be very grateful if anyone could share information, experiences, or suggestions on this matter.

Best regards,
Allan Pinto

Marjolaine Bodin

May 13, 2026, 10:34:41 AM
to Allan Pinto, icatgroup

Hello Allan,

Thanks for your message. This is indeed an important topic for us as well.

At the moment, our policy is to catalogue everything by default. The only exception is data placed under a specific "nobackup" folder, which is excluded from cataloguing and therefore not preserved in the catalogue.
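A minimal sketch of such an exclusion rule, assuming it is applied to file paths before cataloguing. Only the folder name "nobackup" comes from the message above; the function name and the example paths are hypothetical, and the real ingestion logic may work differently.

```python
from pathlib import PurePosixPath

# Folder name taken from the message above; everything else is hypothetical.
EXCLUDED_FOLDER = "nobackup"


def should_catalogue(path: str) -> bool:
    """Return False when any component of the path is exactly the excluded folder."""
    return EXCLUDED_FOLDER not in PurePosixPath(path).parts


print(should_catalogue("/data/proposal/sample/scan0001.h5"))    # True
print(should_catalogue("/data/proposal/nobackup/alignment.h5"))  # False
```

Matching on whole path components means a folder such as "nobackup_old" would still be catalogued.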

In practice, cataloguing everything does not represent a significant cost for us; the main constraint is actually storage, rather than the cataloguing step itself.

We are currently thinking about a possible solution to allow tagging of datasets as "good" or "bad" (or similar quality/selection flags). In that approach, even "bad" data would still be catalogued for traceability. As a first implementation, the flag would be set by the user through the data portal, with a later evolution toward automated workflows that detect such cases.

One challenge we also foresee is that within a single dataset containing multiple scans, only a subset of scans might be considered invalid or worth discarding. This makes the granularity of any tagging or curation mechanism more complex than a simple dataset-level decision.
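To illustrate the granularity problem, here is a rough sketch of one possible model: a dataset-level flag that individual scans can override. The class, field names, and flag values are assumptions made for illustration only, not an existing ICAT or data portal schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetQuality:
    """Hypothetical quality flag for a dataset, with optional per-scan overrides."""
    dataset_flag: str = "good"
    scan_flags: Dict[int, str] = field(default_factory=dict)

    def flag_for(self, scan_number: int) -> str:
        # A scan-level flag wins; otherwise the scan inherits the dataset flag.
        return self.scan_flags.get(scan_number, self.dataset_flag)

    def scans_with(self, flag: str, scan_numbers: List[int]) -> List[int]:
        # List the scans whose effective flag equals the requested one.
        return [n for n in scan_numbers if self.flag_for(n) == flag]


quality = DatasetQuality()
quality.scan_flags[3] = "bad"  # only scan 3 of this dataset is invalid
print(quality.scans_with("bad", [1, 2, 3, 4]))  # [3]
```

With this shape, a simple dataset-level decision remains the default, and scan-level tagging is only needed for the exceptions.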

We would like to work on this in the coming months, and we would be happy to continue the discussion.

All the best,

  Marjolaine




Allan Pinto

May 19, 2026, 12:01:09 PM
to icatgroup
Hello Marjolaine,

Thank you very much for your response. It is very helpful to understand how you are currently handling this issue.

I fully agree that keeping even "bad data", or at least its metadata, catalogued can be important for traceability, especially when the information comes from the user or from the experimental context. I can imagine that this could provide useful metadata related to operational efficiency, such as recurring causes of remeasurement or beamline-specific bottlenecks.

At Sirius, we are working on a solution in a similar direction, but with a specific focus on the data lifecycle during the period between acquisition and long-term cataloguing. We call the data in this period transient datasets. In our case, we are considering the development of a data management layer for this purpose, as our Bluesky-based control system currently has no features to consistently manage this kind of information throughout a long sequence of measurements.

The idea would be to capture, during the experiment, information such as whether a measurement is valid, should be remeasured, should be preserved, or could eventually be discarded. Users already make these decisions during experiments, and capturing this feedback as it happens could make the curation process more efficient, especially when dealing with individual scans or subsets of scans within a larger dataset, which I fully agree is probably the most challenging case. If this information is collected only later, users may no longer remember the decisions they made, or the experimental context that motivated them.
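As a rough sketch of what capturing these decisions during the experiment could look like, the snippet below records each decision with a timestamp and a free-text reason in an append-only log. The four states come from the paragraph above; the class names, fields, and scan identifiers are hypothetical, and nothing here is part of Bluesky or of the design being discussed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, List


class Decision(Enum):
    VALID = "valid"
    REMEASURE = "remeasure"
    PRESERVE = "preserve"
    DISCARD = "discard"


@dataclass(frozen=True)
class CurationEntry:
    scan_id: str
    decision: Decision
    reason: str
    recorded_at: datetime


class CurationLog:
    """Hypothetical append-only log of user decisions taken during a session."""

    def __init__(self) -> None:
        self._entries: List[CurationEntry] = []

    def record(self, scan_id: str, decision: Decision, reason: str = "") -> CurationEntry:
        # Timestamp the decision when it is made, so the context is not lost later.
        entry = CurationEntry(scan_id, decision, reason, datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def latest(self) -> Dict[str, CurationEntry]:
        # Later entries overwrite earlier ones, so users can change their mind.
        latest: Dict[str, CurationEntry] = {}
        for entry in self._entries:
            latest[entry.scan_id] = entry
        return latest

    def to_remeasure(self) -> List[str]:
        return [sid for sid, e in self.latest().items() if e.decision is Decision.REMEASURE]


log = CurationLog()
log.record("scan_0001", Decision.VALID)
log.record("scan_0002", Decision.REMEASURE, "sample misaligned")
log.record("scan_0003", Decision.VALID, "repeat of scan_0002")
print(log.to_remeasure())  # ['scan_0002']
```

Keeping the full history rather than a single mutable flag preserves the reasons behind each decision, which is the traceability information discussed earlier in the thread.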

Regarding the interface for collecting user feedback in practice, I agree that the Data Portal could be the best place to do this, since users are already used to interacting with this interface and it would avoid asking them to use yet another system. We are currently discussing this conceptual project, and we expect to start developing a proof of concept soon for a component that will interact with transient data. We are also considering presenting or discussing this idea at NOBUGS.

I would be very happy to share our thoughts and discuss how we could contribute to the ICAT community, especially by thinking beyond our local case at Sirius with Bluesky-based control systems and considering how such a mechanism could support other acquisition environments. Once I have material showing the main idea and architectural design of the solution we are devising, I would be glad to schedule a presentation for anyone interested in this topic.

Best regards,
Allan

Marjolaine Bodin

May 22, 2026, 7:49:04 AM
to icatgroup
Hello Allan,
Thank you for the detailed perspective from Sirius. It is very interesting to see this direction.
I will also be attending NOBUGS, so it would be a great opportunity to continue the discussion there. I would also be interested in a presentation on this topic.
All the best,
  Marjolaine
