Best Practices for Automated Data Ingestion into iRODS (Metadata, Failure Handling, Directory Units)

PJ Pretorius

Apr 8, 2025, 6:13:58 AM
to iRODS-Chat

Hi all,

I'm working on a system that automates the ingestion of scientific files into iRODS, and we're hoping to align with best practices and possibly adopt existing community solutions. I’d really appreciate your guidance on the following key areas — especially if there are recommended tools, workflows, or design patterns already in use.


Our Setup
  • Using Python iRODS Client (python-irodsclient)

  • Files arrive in local directories (e.g., via external systems or sensors)

  • Final storage is in an S3-backed iRODS resource

  • Target zone and user setup already complete (running iRODS 4.3.4)


Questions We’d Like to Answer:

1. Flagging data as “ready” for ingestion

What are robust ways to avoid ingesting a file before it's fully written?

  • We're considering using .ready marker files, checksums, or inotify hooks.

  • Any built-in or iRODS-native approaches to solving this?

  • Are there community-standard ingestion pipelines for this kind of flow?


2. Ingesting directories as single logical units

We’d like to ingest entire directories and treat them as a logical group.

  • How does iRODS handle directory-level ingestion vs. individual files?

  • Is there a good approach to tagging a collection with metadata, or ingesting a collection atomically?

  • Would iput -r or batching via PRC (Python) be appropriate?


3. Metadata Extraction and Tagging

We aim to extract metadata from scientific files (e.g., .clk timestamp data) and associate that with the object or collection.

  • Is there a recommended format or schema for metadata ingestion? JSON? AVUs?

  • Is there tooling to help extract/transform metadata and apply it to iRODS objects programmatically?


4. Dealing with Failures (Network/iRODS/API)
  • How do most systems handle:

    • iRODS being temporarily unavailable?

    • Network failures during iput/upload?

    • Partial uploads or mid-transfer terminations?

  • Should we consider retry logic, transaction queues, or staging to local cache?


5. Data Verification
  • Best practice for verifying the integrity of ingested files?

  • Should we be calculating and storing hashes (e.g., SHA256) pre- and post-ingest?

  • Is there a built-in mechanism in iRODS to support this, or do we need custom logic?


Our Goal

We want a robust, automated ingestion pipeline that:

  • Uploads data safely

  • Attaches metadata (at file or collection level)

  • Handles transient issues

  • Allows for audit or re-ingest in case of failure


Any advice, example implementations, or pointers to documentation/tools would be greatly appreciated. If anyone has solved this pattern already, we’d love to learn from your experience before reinventing the wheel 😊

Thank you!

Terrell Russell

Apr 16, 2025, 6:44:55 PM
to irod...@googlegroups.com
Hi PJ,

I want to make sure you are aware of the Automated Ingest Framework (Python + Celery + Redis)...

1. Flagging files - we have seen a few different approaches. Mostly it depends on what your arrival patterns look like. You could fire up ingestion jobs when a new .flag file appears, and you could monitor for those with inotify or another external tool. That's probably where I would start.
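A minimal standard-library sketch of that starting point: a marker-file check plus a size-stability fallback. The ".ready" naming and the stability heuristic are conventions assumed here for illustration, not anything iRODS-specific.

```python
import os
import time
from pathlib import Path

def find_ready(landing_dir):
    """Yield data files whose sibling '<name>.ready' marker exists.
    The producer is assumed to write the marker only after the data
    file is completely written (a local convention, not an iRODS one)."""
    for marker in sorted(Path(landing_dir).glob("*.ready")):
        data = marker.with_suffix("")  # "x.clk.ready" -> "x.clk"
        if data.is_file():
            yield data

def wait_until_stable(path, interval=2.0, required_checks=2):
    """Fallback when the producer cannot write a marker: treat the
    file as complete once its size is unchanged for `required_checks`
    consecutive polls. Returns the final size in bytes."""
    last_size, stable = -1, 0
    while stable < required_checks:
        size = os.path.getsize(path)
        stable = stable + 1 if size == last_size else 0
        last_size = size
        time.sleep(interval)
    return last_size
```

inotify (via a package such as watchdog) can replace the polling loop, but the marker convention stays the same either way.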

2. Collections as single logical units - this is not as straightforward. Generally, the answer is going to be no: files are handled one-by-one because policy fires per object. But you can definitely add metadata to collections with the ingest tool as appropriate. It has the concept of 'client-side policy', where you can do whatever you want in Python for each file being ingested, so you have full programmatic freedom to design a pattern that works for you.

3. Metadata extraction - iRODS does not currently ship with any extractors. There are too many file formats, and too many scientific domains moving too quickly, for us to try to keep up. There are probably open source extraction/manipulation libraries for most file formats in most languages. Which AVUs to add, and how to manage them, will be a determination for your scientific and/or management teams. Unfortunately, there are no standards here, as everyone tends to solve their own scenarios with their own patterns.
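As an example of the roll-your-own-extractor approach, here is a sketch for a line-oriented .clk file whose records begin with a timestamp. This layout is assumed purely for illustration; substitute whatever your instruments actually emit.

```python
import os

def extract_clk_metadata(path):
    """Illustrative extractor: pull first/last timestamp and a record
    count from a hypothetical .clk file whose lines start with an ISO
    timestamp. The file layout here is assumed, not a published format."""
    timestamps = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts:
                timestamps.append(parts[0])
    return {
        "source_file": os.path.basename(path),
        "first_timestamp": timestamps[0] if timestamps else "",
        "last_timestamp": timestamps[-1] if timestamps else "",
        "record_count": str(len(timestamps)),
    }

def apply_avus(obj, avus):
    """Attach extracted key/value pairs as AVUs; `obj` is a PRC data
    object or collection, whose .metadata manager supports add()."""
    for attr, value in avus.items():
        obj.metadata.add(attr, value)
```

The same dict could equally be serialized to JSON and stored alongside the data; AVUs just make it queryable through the catalog.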

4. Dealing with failures - iRODS is still relatively dependent on a good network and access to storage. The synchronous nature of a filesystem leaves little room for retrying before behavior becomes 'magical' and hard to reason about. We have had many discussions about adding some asynchronicity to iRODS, but it's a bit all-or-none before it's useful. I would recommend doing error handling and retries in your applications, or in the wrappers you put around the iRODS experience for non-power users.
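Application-side retry can be as small as a backoff decorator around each upload; files that still fail simply stay in the local landing directory for a later re-ingest pass. A sketch, with parameters you would tune for your environment:

```python
import functools
import time

def with_retries(attempts=5, base_delay=1.0, exceptions=(Exception,)):
    """Retry decorator with exponential backoff for transient
    network/iRODS failures. Re-raises after the final attempt so the
    caller can leave the file staged locally for re-ingest."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == attempts - 1:
                        raise
                    time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
        return wrapper
    return decorate
```

In practice you would narrow `exceptions` to the network-related ones (e.g. python-irodsclient's NetworkException) so that genuine errors such as permission denials fail fast instead of retrying.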

5. Data verification - yes, iRODS provides built-in facilities for checksums, stored per-replica. These are usually exposed as -k or -K for the iCommands. If you build applications that talk to iRODS or to the HTTP API, you can turn on checksums for everything, though that comes at some cost in speed and CPU. You can run periodic integrity checks via the delay server or cron to make sure things have not moved on disk while under management.
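For client-side verification, iRODS records SHA-256 checksums as "sha2:" followed by the base64-encoded raw digest, so a local pre-ingest hash can be compared directly against the catalog value. A sketch; `chksum()` here refers to the data-object manager's checksum call in recent python-irodsclient releases, so check availability for your version:

```python
import base64
import hashlib

def irods_sha256(path):
    """Compute the checksum string iRODS stores for SHA-256:
    'sha2:' plus the base64 encoding of the raw digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return "sha2:" + base64.b64encode(h.digest()).decode()

def verify(session, local_path, logical_path):
    """Compare the local file's checksum against the server-side one.
    `session` is an irods.session.iRODSSession; chksum() asks the
    server to compute/return the stored checksum for the object."""
    remote = session.data_objects.chksum(logical_path)
    return remote == irods_sha256(local_path)
```

Storing the pre-ingest hash yourself (e.g. as an AVU) additionally lets the periodic delay-server/cron checks detect drift against the original, not just against the current replica.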

Happy to talk more about your design process and planning - in...@irods.org.

Terrell

--
The Integrated Rule-Oriented Data System (iRODS) - https://irods.org
iROD-Chat: http://groups.google.com/group/iROD-Chat