Hi all,
I'm working on a system that automates the ingestion of scientific files into iRODS. We'd like to align with best practices and, where possible, adopt existing community solutions. I'd really appreciate your guidance on the areas below, especially any recommended tools, workflows, or design patterns already in use.
Our current setup:
Using the Python iRODS Client (python-irodsclient)
Files arrive in local directories (e.g., via external systems or sensors)
Final storage is in an S3-backed iRODS resource
Target zone and user setup are already complete (running iRODS 4.3.4)
What are robust ways to avoid ingesting a file before it's fully written?
We're considering using .ready marker files, checksums, or inotify hooks.
Any built-in or iRODS-native approaches to solving this?
Are there community-standard ingestion pipelines for this kind of flow?
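To make the marker-file option concrete, here is a rough sketch of what we have in mind, using only the Python standard library. The `.ready` suffix convention (e.g. `sample.clk.ready` next to `sample.clk`), the function names, and the wait interval are our own assumptions and not anything iRODS provides. The size/mtime check is only a heuristic fallback.

```python
import time
from pathlib import Path


def has_ready_marker(data_file: Path, suffix: str = ".ready") -> bool:
    """True if a sibling marker such as sample.clk.ready exists (hypothetical convention)."""
    return data_file.with_name(data_file.name + suffix).exists()


def is_stable(data_file: Path, wait_seconds: float = 2.0) -> bool:
    """True if size and mtime are unchanged across a short wait.

    Heuristic only: a writer that pauses longer than wait_seconds
    would still look finished.
    """
    before = data_file.stat()
    time.sleep(wait_seconds)
    after = data_file.stat()
    return (before.st_size, before.st_mtime_ns) == (after.st_size, after.st_mtime_ns)


def ready_files(directory: Path, suffix: str = ".ready"):
    """Yield data files that have a marker, skipping the marker files themselves."""
    for entry in sorted(directory.iterdir()):
        if not entry.is_file() or entry.name.endswith(suffix):
            continue
        if has_ready_marker(entry, suffix):
            yield entry
```

This relies on the producer writing the marker only after closing the data file, which is something we would have to arrange with the upstream systems.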
We’d like to ingest entire directories and treat them as a logical group.
How does iRODS handle directory-level ingestion vs. individual files?
Is there a good approach to tagging a collection with metadata, or ingesting a collection atomically?
Would iput -r or batching via PRC (Python) be appropriate?
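For the PRC batching option, this is roughly the shape we are picturing. `plan_upload` and `upload_directory` are hypothetical helpers of ours, and the `ingest_status` attribute is just an example name. The `session` argument is expected to be a PRC `iRODSSession`; the sketch uses only `collections.create`, `collections.get`, `data_objects.put`, and `metadata.add`. Tagging the collection last is our stand-in for atomicity, not a real transaction.

```python
import posixpath
from pathlib import Path


def plan_upload(local_root: Path, target_collection: str):
    """Map a local tree to logical paths.

    Returns (collections, files): collections to create, parents first,
    and (local_path, logical_path) pairs to upload.
    """
    collections = [target_collection]
    files = []
    for path in sorted(local_root.rglob("*")):
        relative = path.relative_to(local_root).as_posix()
        logical = posixpath.join(target_collection, relative)
        if path.is_dir():
            collections.append(logical)
        elif path.is_file():
            files.append((path, logical))
    return collections, files


def upload_directory(session, local_root: Path, target_collection: str, group_avus: dict):
    """Upload a directory, then tag the top-level collection once everything is in."""
    collections, files = plan_upload(local_root, target_collection)
    for collection in collections:
        # Assumption: creating an already existing collection is harmless here.
        session.collections.create(collection)
    for local_path, logical_path in files:
        session.data_objects.put(str(local_path), logical_path)
    # Only reached if every put returned without raising.
    top = session.collections.get(target_collection)
    for attribute, value in group_avus.items():
        top.metadata.add(attribute, str(value))
```

A consumer could then treat any collection without the completion attribute as still in flight, e.g. after calling `upload_directory(session, Path("/data/incoming/run42"), "/tempZone/home/alice/run42", {"ingest_status": "complete"})` (paths made up for illustration).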
We aim to extract metadata from scientific files (e.g., .clk timestamp data) and associate that with the object or collection.
Is there a recommended format or schema for metadata ingestion? JSON? AVUs?
Is there tooling to help extract/transform metadata and apply it to iRODS objects programmatically?
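In case it helps frame the question, this is the kind of mapping we would otherwise write ourselves: flatten whatever the extractor produces into attribute/value/units triples and attach them via PRC's `metadata.add`. The dotted attribute naming, the `{"value": ..., "units": ...}` convention, and both function names are assumptions of ours, and the example keys are invented rather than taken from any real `.clk` layout.

```python
def to_avus(metadata: dict, prefix: str = ""):
    """Flatten a nested dict into (attribute, value, units) triples.

    Values are stringified. A dict with exactly the keys "value" and
    "units" becomes a single triple carrying units; lists become one
    triple per item under the same attribute name.
    """
    avus = []
    for key, value in metadata.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict) and set(value) == {"value", "units"}:
            avus.append((name, str(value["value"]), str(value["units"])))
        elif isinstance(value, dict):
            avus.extend(to_avus(value, prefix=name + "."))
        elif isinstance(value, (list, tuple)):
            avus.extend((name, str(item), None) for item in value)
        else:
            avus.append((name, str(value), None))
    return avus


def apply_avus(target, avus):
    """Attach triples to a PRC data object or collection (anything with .metadata.add)."""
    for attribute, value, units in avus:
        if units is None:
            target.metadata.add(attribute, value)
        else:
            target.metadata.add(attribute, value, units)
```

For example, `to_avus({"clk": {"start": "2024-01-01T00:00:00Z"}})` yields a single triple with attribute `clk.start` and no units.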
How do most systems handle:
iRODS being temporarily unavailable?
Network failures during iput/upload?
Partial uploads or mid-transfer terminations?
Should we consider retry logic, transaction queues, or staging to local cache?
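On the retry side, our starting point would be a plain exponential backoff wrapper like the sketch below. The helper name, the defaults, and the choice of which exceptions count as transient are all assumptions; in practice we would pass the relevant PRC exception classes in through `retry_on`.

```python
import random
import time


def with_retries(operation, attempts=5, base_delay=1.0, max_delay=60.0,
                 retry_on=(ConnectionError, TimeoutError, OSError),
                 sleep=time.sleep):
    """Call operation() until it succeeds or attempts are exhausted.

    Waits base_delay * 2**n (capped at max_delay, with jitter) between
    tries and re-raises the last error once attempts run out.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries from several workers.
            sleep(delay * random.uniform(0.5, 1.0))
```

Usage would look something like `with_retries(lambda: session.data_objects.put(local_path, logical_path))`. This assumes that repeating a put after a partial upload is safe, which is part of what we are asking.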
What is the best practice for verifying the integrity of ingested files?
Should we be calculating and storing hashes (e.g., SHA256) pre- and post-ingest?
Is there a built-in mechanism in iRODS to support this, or do we need custom logic?
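If custom logic turns out to be needed, this is the comparison we would try first. It assumes the zone's hash scheme is SHA256, in which case iRODS reports checksums as `sha2:` followed by the base64-encoded digest, and it uses PRC's `chksum()` on the data object to get the server-side value. The function names are ours.

```python
import base64
import hashlib


def irods_style_sha256(path, chunk_size=1024 * 1024):
    """Compute a local SHA256 in the 'sha2:<base64 digest>' form."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha2:" + base64.b64encode(digest.digest()).decode("ascii")


def matches_catalog(local_path, data_object):
    """Compare the local checksum with the one iRODS returns for the object.

    Assumption: the server is configured for SHA256; with another scheme
    the two strings would not be comparable.
    """
    return irods_style_sha256(local_path) == data_object.chksum()
```

We are unsure whether the checksum of an object on an S3-backed resource is computed against the bytes actually stored in S3 or against a cached replica, so pointers on that would be welcome too.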
We want a robust, automated ingestion pipeline that:
Uploads data safely
Attaches metadata (at file or collection level)
Handles transient issues
Allows for audit or re-ingest in case of failure
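For the audit and re-ingest requirement, absent a better suggestion we would probably keep a small local journal next to the watcher, along these lines. The table layout, status values, and function names are purely our own invention; it uses only the standard-library `sqlite3` module.

```python
import sqlite3
import time


def open_journal(path=":memory:"):
    """Open (or create) the journal; one row per local file."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS ingest ("
        "local_path TEXT PRIMARY KEY, logical_path TEXT, "
        "status TEXT, detail TEXT, updated REAL)"
    )
    db.commit()
    return db


def record(db, local_path, logical_path, status, detail=""):
    """Insert or overwrite the current state of one file."""
    db.execute(
        "INSERT OR REPLACE INTO ingest VALUES (?, ?, ?, ?, ?)",
        (local_path, logical_path, status, detail, time.time()),
    )
    db.commit()


def pending(db):
    """Files whose last recorded status is anything other than 'done'."""
    rows = db.execute(
        "SELECT local_path, logical_path FROM ingest "
        "WHERE status != 'done' ORDER BY local_path"
    )
    return rows.fetchall()
```

On restart, the pipeline would re-run everything returned by `pending(db)`, and the table doubles as a simple audit trail.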
Any advice, example implementations, or pointers to documentation/tools would be greatly appreciated. If anyone has solved this pattern already, we’d love to learn from your experience before reinventing the wheel 😊
Thank you!