VistA & RPMS archive + AI-friendly search index


Owen Barton

Apr 20, 2026, 5:48:11 PM
to hard...@googlegroups.com
Hi all,

I demoed this briefly at the WorldVistA meeting and have finally got it to a state that I think is complete and useful enough to share.

This is a ~2 TB archive of VistA and RPMS material with a vector search index and MCP server - it's hosted by CivicActions and freely available to anyone working on VistA or RPMS in government or open source.

What's in it:
  • Mirrors of the VA VDL, IHS RPMS site/FTP, WorldVistA sites (including FOIA), HardHats, Nancy's VistA Server, VistApedia, and all the WorldVistA GitHub repos: about 6.3 million files.
  • All of the above indexed into Qdrant as ~9.6 million chunks across four collections: vista, vista-source, rpms, rpms-source.
  • All documents, source code, and data are included, including any inside archives. Documents are converted to Markdown; PDFs and images are OCR'd where needed; MUMPS routines are split at label boundaries; meeting recordings and demo videos are transcribed.
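To illustrate what "split at label boundaries" could look like, here is a minimal Python sketch. It is not the archive's actual pipeline (that lives in the repo); it only assumes the usual MUMPS convention that a label starts in column 1 while unlabeled lines start with a space or tab. The routine name GMRADEMO and its contents are made up for the example.

```python
import re

# A MUMPS label sits in column 1: a name (optionally starting with %) or an
# integer. Lines without a label begin with a space or tab, so they never match.
LABEL_RE = re.compile(r"^(%[A-Za-z0-9]*|[A-Za-z][A-Za-z0-9]*|[0-9]+)")


def split_at_labels(routine_text):
    """Split routine source into (label, text) chunks, one chunk per label."""
    chunks = []
    current_label = None  # stays None for any unlabeled lines before the first label
    current_lines = []
    for line in routine_text.splitlines():
        match = LABEL_RE.match(line)
        if match and current_lines:
            # A new label begins: close the chunk collected so far.
            chunks.append((current_label, "\n".join(current_lines)))
            current_lines = []
        if match:
            current_label = match.group(1)
        current_lines.append(line)
    if current_lines:
        chunks.append((current_label, "\n".join(current_lines)))
    return chunks


# Hypothetical routine, for illustration only.
sample = "\n".join([
    "GMRADEMO ;demo routine (hypothetical)",
    " Q",
    "EN(DFN) ;entry point with a parameter",
    ' W "hi",!',
    " Q",
])

chunks = split_at_labels(sample)
for label, text in chunks:
    print(label, len(text.splitlines()))
```

Chunking this way keeps each entry point together with its own code, which is presumably why it suits routine source better than fixed-size windows.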
My primary goal is to use this as a data source for guiding functional tests and other code generation, but I hope it is also a useful archive and research resource for others.

I am using a vector search index as it lets you search by meaning rather than exact keywords - a query like "how patient allergies are stored and validated" will surface the relevant chunks of GMRA documentation, the Patient Allergies (#120.8) file references, and related routines, even if none of those exact words appear in your query. It's focused on unstructured text - not a replacement for structured data like Vivian. It uses the open-source tool Qdrant for the index. There isn't a web frontend, but you can query it via a REST API or MCP.
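For anyone curious what a REST query might look like, here is a rough Python sketch that builds (but does not send) a request to Qdrant's `points/search` endpoint. The host URL, the key value, and the three-number query vector are placeholders I made up; a real query vector has to come from the same embedding model used to build the index, and the index may use a different vector configuration, so check the repo for the actual setup.

```python
import json
import urllib.request

# Hypothetical host: the real endpoint comes with your access details.
QDRANT_URL = "https://qdrant.example.invalid"


def build_search_request(collection, query_vector, api_key, limit=5):
    """Build a Qdrant REST search request without sending it."""
    body = {
        "vector": query_vector,  # embedding of the natural-language query
        "limit": limit,          # number of nearest chunks to return
        "with_payload": True,    # include stored metadata with each hit
    }
    return urllib.request.Request(
        url=f"{QDRANT_URL}/collections/{collection}/points/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Placeholder vector and key, for illustration only.
req = build_search_request("vista", [0.1, 0.2, 0.3], api_key="YOUR_READ_ONLY_KEY")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the sketch stops short of that so it runs without network access or credentials.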

MCP (Model Context Protocol) is an open standard that enables an LLM to call external tools. So if you're using Claude Code, VS Code Copilot, Cursor, etc., the model can search the archive directly while you're working - it can pull the relevant chunks itself and generate summaries without you needing to copy and paste between tools. Each response snippet includes the source path, so you can keep track of sources in your research, and so you (or the AI) can go and read the source document in full context. The index supports filtering by source path (hostname/path), so it can follow breadcrumbs and drill down over multiple requests.
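As a rough picture of the drill-down, the sketch below adds a Qdrant payload filter to a search body and narrows it over two requests. The payload key `source_path` is a guess on my part (the real field name is in the repo), the path values are only illustrative, and the `match`/`text` condition is one of several ways Qdrant can match a payload field.

```python
def with_path_filter(search_body, path_fragment, payload_key="source_path"):
    """Return a copy of a Qdrant search body restricted to one source path.

    payload_key is a hypothetical field name, not confirmed by the post.
    """
    filtered = dict(search_body)  # shallow copy so the original stays unfiltered
    filtered["filter"] = {
        "must": [{"key": payload_key, "match": {"text": path_fragment}}]
    }
    return filtered


# Placeholder query vector; a real one comes from the index's embedding model.
base = {"vector": [0.1, 0.2, 0.3], "limit": 5, "with_payload": True}

# First request: anything from one host. Second request: one path under it.
broad = with_path_filter(base, "www.va.gov")
narrow = with_path_filter(base, "www.va.gov/vdl")
print(narrow["filter"])
```

With an MCP client the model does this narrowing itself, using the source paths returned with earlier results.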

To request access (read-only Qdrant API key + Google Cloud bucket access), please fill out this 1-question form, and I will set you up:

https://forms.gle/BEU58m5ttraSwKFT9

Setup instructions for the MCP client with different AI tools, the full list of collections, the archive script, and the indexing pipeline config are in this repo:

https://github.com/CivicActions/vista-rpms-archive

The archive itself is hosted in a Google Cloud bucket and includes all the files, extracted archives, Markdown, DoclingDocument files, and an export of the Qdrant vector search index data in case you want to run your own index (not too hard; it just needs ~20 GiB of RAM).

Let me know if you have any questions!

Thanks!
Owen

Sam Habiel

Apr 20, 2026, 6:00:29 PM
to hard...@googlegroups.com
Bravo!

--
--
http://groups.google.com/group/Hardhats
To unsubscribe, send email to Hardhats+u...@googlegroups.com

---
You received this message because you are subscribed to the Google Groups "Hardhats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hardhats+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/hardhats/CAB7AT9A5h8JkBFcZ3AdzA1ajn59e7uONC9E%3DtiZfxvOA%3Ddj8BA%40mail.gmail.com.

Nancy Anthracite

Apr 20, 2026, 6:33:23 PM
to hard...@googlegroups.com, Sam Habiel

WOW!! Thanks. I hope LLMs love it.


--

Nancy Anthracite


Christian Caldwell

Apr 20, 2026, 6:38:01 PM
to hard...@googlegroups.com, Sam Habiel
Thank you, Owen. I’ve set it up with my repo in Cursor, and it’s working like a charm. This is amazing work, love it.

_________________________________________________________________________________________________
Christian A. Caldwell, M.S. | President and CEO
Office of the President
My Brother’s and Sister’s Keeper Colorado (MBSK Colorado)

2500 S. Abilene St. 441659 | Aurora, Colorado 80014
Cell Phone 720-519-9434 | E-mail christian...@mbskco.org



Kimball Bighorse

Apr 20, 2026, 7:25:56 PM
to Hardhats
Thanks, Owen!

Ar “art”

Apr 20, 2026, 7:42:23 PM
to Hardhats
Thank you