Based on my experience, locally deploying large language models requires one or both of two things: 1) a computer with many cores, or 2) a computer with (almost definitely) an NVIDIA GPU. Why? Because generative AI is computationally intensive, and it only really scales if the process is run in parallel.
I have gotten away with a 64-core computer running Linux to do generative AI, but that was only good for my specific applications. I don't think it would scale to an entire institution. Having a computer with an NVIDIA card makes this MUCH more scalable.
Either way, you might consider using Open WebUI:
https://openwebui.com/
Open WebUI is an open source tool/interface that allows you to run large language model applications on a central computer while letting many people use them. Remember, as open source software, you get what you pay for. It works, but installation and deployment take practice.
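For what it's worth, the deployment I have seen documented is a single Docker container. The image name, port mapping, and volume below follow the project's quickstart as I remember it; double-check them against the current Open WebUI documentation before running:

```shell
# Run Open WebUI in Docker; the interface then appears at http://localhost:3000
# (Image name and flags are from the project's quickstart; verify before use.)
docker run -d \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```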
Another suggestion is to use Ollama. Ollama is a server that runs on just about any computer:
http://ollama.com
You then install large language models and interact with the server through a Web interface or any number of programming languages. (I use Python.) Ollama now supports "cloud" models. These work exactly like the locally deployed models, but they run on Ollama's hardware, and the response times are very fast.
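Interacting with Ollama from Python can be as simple as posting JSON to its HTTP API. The sketch below uses only the standard library and assumes a server running on Ollama's default port (11434) with a model already pulled; the model name "llama3.2" is just an example, so substitute whatever you have installed:

```python
import json
import urllib.request

# Default Ollama address; adjust if your server lives elsewhere
OLLAMA = "http://localhost:11434"

def build_payload(prompt, model="llama3.2"):
    """Assemble the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="llama3.2"):
    """Send a prompt to a locally running Ollama server; return the response text."""
    request = urllib.request.Request(
        OLLAMA + "/api/generate",
        data=json.dumps(build_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling something like generate("What is honor?") then returns the model's answer as a plain string.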
Once you get this far, I think RAG (retrieval-augmented generation) is the best use case for large language models in LAM. More specifically, RAG takes search results and then either summarizes them or answers questions posed against them. Simple examples might include:
* search a library catalog and return the
  results as a JSON stream which then gets
  converted to any number of citation formats
* search a set of EAD files and ask the system
to summarize the results
* given a pile o' plain text files, output the
names of people, places, and/or organizations
mentioned in the files (but this can easily
be done sans the use of generative AI)
Personally, I have created collections of classic literature and entire runs of scholarly journals. I have then posed questions to the collections such as "What is honor?", "How has librarianship changed over time?", or "Who is Ishmael and why should I care?" The responses I get are more than plausible, but I never accept them as truth. Instead, the responses are intended as discussion points and food for thought.
HTH
--
Eric Lease Morgan
Librarian Emeritus, Hesburgh Libraries
University of Notre Dame