Introducing FedHarv


Pascal Calarco

Mar 27, 2026, 3:48:26 PM
to DSpace Community
Hi folks,

I am releasing a set of Python scripts called FedHarv (short for federated harvesting) that I have been working on since late November. It's now publicly available under the AGPL v3 license for all to use, modify and build upon, provided it stays free and open source software.


FedHarv is a sophisticated, production-ready federated harvester for open access academic content, designed to automatically discover, enrich, and harvest scholarly articles with PDF availability from multiple sources. 

The problem we are trying to solve is to identify, to the extent possible, Creative Commons-licensed scholarly works (journal articles, letters to the editor, retractions, errata, book chapters, conference proceedings, and open access books) authored by the researchers, faculty and students of an institution of higher education or research, and to harvest the metadata and associated PDF from a variety of API services. Where we can't find a non-paywalled version, we use Unpaywall to identify author manuscripts and preprints that can be deposited.

The script then places these metadata and PDFs in a series of folders so the repository manager can quickly check them (for departmental and institutional affiliation and CC license correctness) and package them into Simple Archive Format (SAF), ready for batch ingest into DSpace institutional repositories.
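For readers unfamiliar with SAF, here is a minimal sketch of what one packaged item looks like on disk: a folder holding a `dublin_core.xml` file, a `contents` manifest, and the bitstream itself. This is not FedHarv's own code; the function name `write_saf_item`, the `item_001` folder naming, and the shape of the `metadata` argument are assumptions made for illustration.

```python
import shutil
from pathlib import Path
from xml.etree import ElementTree as ET


def write_saf_item(archive_dir, index, metadata, pdf_path):
    """Write one SAF item folder (hypothetical helper, not part of FedHarv).

    metadata maps (element, qualifier) tuples to lists of values,
    e.g. {("title", None): ["My article"]}.
    """
    item_dir = Path(archive_dir) / f"item_{index:03d}"
    item_dir.mkdir(parents=True, exist_ok=True)

    # dublin_core.xml holds one <dcvalue> element per metadata value.
    root = ET.Element("dublin_core")
    for (element, qualifier), values in metadata.items():
        for value in values:
            dcvalue = ET.SubElement(
                root, "dcvalue", element=element, qualifier=qualifier or "none"
            )
            dcvalue.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy the bitstream next to the metadata and list it in "contents".
    pdf_path = Path(pdf_path)
    shutil.copy(pdf_path, item_dir / pdf_path.name)
    (item_dir / "contents").write_text(pdf_path.name + "\n", encoding="utf-8")
    return item_dir
```

Calling this once per harvested work, with an increasing `index`, would produce a directory of item folders laid out for a DSpace batch import.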

The harvester isn't perfect, and you should still check that closed or bronze OA items were not harvested in error. The author has made every effort to exclude them and has encountered few such errors after much iteration.
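As an illustration of the kind of check involved, the sketch below screens a single record shaped like an Unpaywall API response, using its `oa_status` and `best_oa_location` fields. It is a deliberately conservative example, not FedHarv's actual logic: the function name `is_depositable` is hypothetical, and requiring a CC license would also exclude unlicensed author manuscripts that a real workflow might want to review by hand.

```python
def is_depositable(record):
    """Return True if an Unpaywall-style record looks safe to harvest.

    Hypothetical helper for illustration only.
    """
    # Closed and bronze items carry no reusable open license.
    if record.get("oa_status") in {"closed", "bronze"}:
        return False
    location = record.get("best_oa_location") or {}
    license_name = (location.get("license") or "").lower()
    # Require a Creative Commons license and a direct PDF link.
    return license_name.startswith("cc") and bool(location.get("url_for_pdf"))
```

Records that fail this check are the ones a repository manager would most want to inspect manually before deposit.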

With this tool, you'll be able to gather as many as possible of the Open Access scholarly works that your community has formally written and legally deposit them in your organization's institutional repository. If you find this software useful, please drop me an email!

## 🤖 AI Assistance & Authorship Disclosure

**FedHarv** was designed, architected, and verified by **Pascal Calarco**.

During the development process, AI-augmented coding tools (Google Gemini and GitHub Copilot) were utilized to:
* Generate boilerplate code and initial function structures.
* Refactor logic for performance (e.g., implementing multi-threading).
* Assist with documentation, licensing (AGPL-v3), and testing suites.

All AI-generated suggestions have been manually reviewed, tested, and integrated by the author to ensure technical accuracy,
scholarly metadata standards, and adherence to best practices in library and information science.

All best wishes,

Pascal



 

 

Pascal Calarco¦ Scholarly Communications Librarian and Systems Librarian

Lead, Discovery Team

Research & Publishing Services Unit

Librarian IV

University of Windsor ¦ J. Francis Leddy Library
401 Sunset Avenue ¦ Windsor, Ontario   N9B 3P4
(519)-253-3000 ¦ leddy.uwindsor.ca

 


 

The University of Windsor is situated on the traditional territory of the Three Fires Confederacy of First Nations: the Ojibwa, the Odawa, and the Potawatomi.

 

Join the fight for post-secondary education at Education2025.ca.  

Fatih Güneş

Apr 1, 2026, 6:26:29 PM
to DSpace Community
Hi Pascal,
Thank you very much for sharing your work. I maintain nearly 12 DSpace instances at different universities, and I am trying to standardize my automation scripts for ingesting items into DSpace through the REST API. Your project has definitely caught my attention, and I will give it a try very soon. One question: have you considered supporting the Entity Model in your project?
Best regards,
Fatih

Pascal Calarco

Apr 1, 2026, 11:16:25 PM
to Fatih Güneş, DSpace Community
Hi Fatih,

Thank you for reaching out. A bit of context before I address your question.

My institution is part of a DSpace consortium in Canada known as Scholaris. I wanted to release this now because research funders in Canada will soon release a revised immediate (0-day) open access publication policy, and I expect many of our researchers will take advantage of the transformative agreements our library has with publishers (Plan S), so much more of Canadian scholarly output will be available under a Creative Commons license. We need a tool to help manage this incoming deluge, as the policy strongly encourages deposit in IRs.

Yes, we have a group working on Configurable Entities, which we expect to implement after we upgrade to DSpace 9 this summer.

It would make a lot of sense to evolve FedHarv to use Configurable Entities once they are in place, certainly for authors and departments. ORCID integration also opens up with this release, which I'd like to incorporate as well.

All best regards,

Pascal

--
All messages to this mailing list should adhere to the Code of Conduct: https://lyrasis.org/code-of-conduct/

Fatih Güneş

Apr 2, 2026, 4:32:39 AM
to DSpace Community
Pascal, thank you very much for sharing the roadmap. I have a DSpace repository at Anadolu University whose content I ingested using the OpenAlex API, so all items have an OpenAlex work ID. Can I use FedHarv to harvest bitstreams and upload them to the existing items?
Is there a specific configuration to use it only for PDFs/bitstreams?
Is there a command to use it for this purpose?
Or should I fork the project and modify it to work this way?

Best regards,
Fatih.

Kevon Muhoozi

Apr 2, 2026, 8:45:07 AM
to Fatih Güneş, DSpace Community
Good day, all!
Pascal's harvesting Python scripts are a very interesting project, and I appeal to all of us to carry it forward by supporting it, contributing to the repo, and growing it into a more sophisticated tool.
For example, we could open a conversation on GitHub about the functionality we would like to see in the scripts, and post our questions and requests there, so that it becomes a growing open source tool and project.

Thank you. Regards

Pascal Calarco

Apr 2, 2026, 9:57:34 AM
to Fatih Güneş, DSpace Community
Hi Fatih,

I wasn't aware one could load only bitstreams in the web batch ingest. That's a useful addition, I think, providing the user with more options. What is the match point between the bitstream and the metadata record? I actually have another collection of 3,400 graduate major papers where I have MARC21 records and want to programmatically attach the bitstreams to them. I suppose one exports the metadata-only records from DSpace and then builds SAF packages that overlay the existing metadata records? One could add a 'mode' setting to config.ini that controls how FedHarv runs: either metadata and bitstream harvesting and packaging, or adding bitstreams only. Alternatively, it could complement the main script as a stand-alone secondary script added to the main branch. I'd welcome either, if you want to collaborate.
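To make the config.ini idea concrete, here is a rough sketch of how such a mode switch might be read with Python's standard `configparser`. Everything specific in it is hypothetical: the `[fedharv]` section, the `mode` key, and the two mode names `full` and `bitstreams_only` are assumptions for illustration, not existing FedHarv settings.

```python
import configparser

# Hypothetical mode names; FedHarv does not define these today.
VALID_MODES = {"full", "bitstreams_only"}


def read_mode(config_text):
    """Return the run mode from config.ini text, defaulting to 'full'."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # fallback covers both a missing [fedharv] section and a missing key.
    mode = parser.get("fedharv", "mode", fallback="full").strip().lower()
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown mode: {mode!r}")
    return mode
```

A config.ini containing `[fedharv]` followed by `mode = bitstreams_only` would then select the bitstream-only path, while existing configurations without the key would keep the current behavior.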

All best wishes,

Pascal

Pascal Calarco

Apr 2, 2026, 10:03:16 AM
to Kevon Muhoozi, Fatih Güneş, DSpace Community
Hello Kevon,

Yes, I would welcome that and it would be fun and useful to collaborate.

All best wishes,

Pascal
