Introducing FedHarv


Pascal Calarco

Mar 27, 2026, 3:48:26 PM
to DSpace Community
Hi folks,

I am releasing a set of Python scripts called FedHarv (short for federated harvesting) that I have been working on since late last November. It's now publicly available under the AGPL v3 license for all to use, modify, and build upon, provided it stays free and open source software.


FedHarv is a sophisticated, production-ready federated harvester for open access academic content, designed to automatically discover, enrich, and harvest scholarly articles with PDF availability from multiple sources. 

The problem we are trying to solve is this: to the extent possible, identify Creative Commons-licensed scholarly works (journal articles, letters to the editor, retractions, errata, book chapters, conference proceedings, and open access books) authored by the researchers, faculty, and students of an institution of higher education or research, then harvest the metadata and associated PDF from a variety of API services. Where we can't find a non-paywalled version, we use Unpaywall to identify author manuscripts and preprints that can be deposited.
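To illustrate the Unpaywall fallback, here is a minimal sketch, not FedHarv's actual code. It assumes the record is the JSON returned by the Unpaywall v2 endpoint for a DOI, with its `oa_locations`, `host_type`, `version`, and `url_for_pdf` fields. The helper names `unpaywall_url` and `pick_manuscript_location`, and the rule of preferring an accepted manuscript over a preprint, are my own assumptions.

```python
from typing import Optional
from urllib.parse import quote

UNPAYWALL_BASE = "https://api.unpaywall.org/v2"

# Assumed preference order: accepted manuscript first, then preprint.
VERSION_RANK = {"acceptedVersion": 0, "submittedVersion": 1}


def unpaywall_url(doi: str, email: str) -> str:
    """Build the lookup URL for one DOI (Unpaywall asks for a contact email)."""
    return f"{UNPAYWALL_BASE}/{quote(doi)}?email={quote(email)}"


def pick_manuscript_location(record: dict) -> Optional[dict]:
    """Return the best repository-hosted manuscript with a PDF link, or None.

    `record` is an already-fetched Unpaywall response; no network call
    happens here, so the selection rule can be tested offline.
    """
    candidates = [
        loc
        for loc in (record.get("oa_locations") or [])
        if loc.get("host_type") == "repository"
        and loc.get("version") in VERSION_RANK
        and loc.get("url_for_pdf")
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda loc: VERSION_RANK[loc["version"]])
```

The location's `license` field is returned untouched rather than filtered on, since the license still has to be confirmed by a person before deposit.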

The script then places the metadata and PDFs in a series of folders so the repository manager can quickly check them (for departmental and institutional affiliation and CC license correctness) before they are packaged into Simple Archive Format (SAF), ready for batch ingest into DSpace institutional repositories.
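For readers unfamiliar with SAF, here is a rough sketch of what one item looks like on disk: a folder holding `dublin_core.xml`, a `contents` file listing the bitstreams, and the PDF itself. The function name `write_saf_item`, the `item_NNN` folder naming, and the shape of the `metadata` argument are illustrative assumptions, not FedHarv's own interface.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def write_saf_item(saf_root, index, metadata, pdf_path):
    """Write one SAF item folder and return its path.

    `metadata` maps (element, qualifier) pairs to lists of values,
    e.g. {("title", None): ["An Example"]} (hypothetical shape).
    """
    item_dir = Path(saf_root) / f"item_{index:03d}"
    item_dir.mkdir(parents=True, exist_ok=True)

    # dublin_core.xml: one <dcvalue> per metadata value.
    root = ET.Element("dublin_core")
    for (element, qualifier), values in metadata.items():
        for value in values:
            node = ET.SubElement(
                root, "dcvalue", element=element, qualifier=qualifier or "none"
            )
            node.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy the PDF in and list it in the contents file.
    pdf_path = Path(pdf_path)
    shutil.copy2(pdf_path, item_dir / pdf_path.name)
    (item_dir / "contents").write_text(
        f"{pdf_path.name}\tbundle:ORIGINAL\n", encoding="utf-8"
    )
    return item_dir
```

Calling this once per reviewed article under a common root folder gives a directory that DSpace's batch import can take as its source.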

The harvester isn't perfect, and you should still check that closed or bronze OA items were not harvested in error. The author has made every effort to prevent this, though, and has encountered few such errors after much iteration.
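One way to make that final check less tedious is a small audit pass over the harvested records. The sketch below is only an illustration under stated assumptions: each record is a dict with `doi`, `oa_status`, and `license` keys using Unpaywall-style values (such as `bronze` or `cc-by`), and the names `needs_manual_review` and `flag_for_review` are hypothetical.

```python
# OA statuses that should not have been harvested.
REVIEW_STATUSES = {"closed", "bronze"}


def needs_manual_review(record: dict) -> bool:
    """True if the record is closed or bronze, or lacks a CC license string."""
    status = (record.get("oa_status") or "").lower()
    license_value = (record.get("license") or "").lower()
    has_cc = license_value.startswith("cc-by") or license_value == "cc0"
    return status in REVIEW_STATUSES or not has_cc


def flag_for_review(records: list) -> list:
    """Return the DOIs of records a person should look at before ingest."""
    return [r.get("doi") for r in records if needs_manual_review(r)]
```

A check like this only narrows down where to look; it does not replace reading the license statement on the item itself.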

With this tool, you'll be able to gather as many as possible of the Open Access scholarly works that your community has formally written and legally deposit them in your organization's institutional repository. If you find this software useful, please drop me an email!

## 🤖 AI Assistance & Authorship Disclosure

**FedHarv** was designed, architected, and verified by **Pascal Calarco**.

During the development process, AI-augmented coding tools (Google Gemini and GitHub Copilot) were utilized to:
* Generate boilerplate code and initial function structures.
* Refactor logic for performance (e.g., implementing multi-threading).
* Assist with documentation, licensing (AGPL-v3), and testing suites.

All AI-generated suggestions have been manually reviewed, tested, and integrated by the author to ensure technical accuracy, conformance to scholarly metadata standards, and adherence to best practices in library and information science.

All best wishes,

Pascal



 

 

Pascal Calarco¦ Scholarly Communications Librarian and Systems Librarian

Lead, Discovery Team

Research & Publishing Services Unit

Librarian IV

University of Windsor ¦ J. Francis Leddy Library
401 Sunset Avenue ¦ Windsor, Ontario   N9B 3P4
(519)-253-3000 ¦ leddy.uwindsor.ca

 


 

The University of Windsor is situated on the traditional territory of the Three Fires Confederacy of First Nations: the Ojibwa, the Odawa, and the Potawatomi.

 

Join the fight for post-secondary education at Education2025.ca.  
