Hey!
Currently, we have to connect to a locally running process that serves the Llama 2 model.
Over the last few days I've been researching how to set up offscreen documents and web workers to run the Llama 7B model with the power of WebGPU.
But unfortunately I don't have much expertise in this field, so I don't properly understand how to design an architecture that satisfies these criteria:
1) The LLM model weights (let's say around 4 GB) are downloaded from the internet only once and then cached on the user's computer/browser
2) Workers/offscreen scripts can run the LLM inference, with access to WebGPU so the computation is efficient
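For the first criterion, one option is the browser's Cache API: fetch the weights on the first visit, store the response, and serve it from disk afterwards. This is just a minimal sketch under my own assumptions; `MODEL_URL`, `CACHE_NAME`, and `loadModelOnce` are illustrative names, not a real API, and the `cacheStorage` parameter is injectable only so the logic can be exercised outside a browser:

```javascript
// Sketch of criterion 1: download the weights once, then serve them from the
// browser's Cache API on every later visit. MODEL_URL is a placeholder.
const MODEL_URL = "https://example.com/llama-2-7b-q4.bin";
const CACHE_NAME = "llm-weights-v1";

// `cacheStorage` defaults to the browser's CacheStorage (`caches`), but is
// injectable so the same function can run outside a browser.
async function loadModelOnce(url = MODEL_URL, cacheStorage = globalThis.caches) {
  const cache = await cacheStorage.open(CACHE_NAME);
  let response = await cache.match(url);
  if (!response) {
    response = await fetch(url);
    // Clone before reading: a Response body can only be consumed once.
    await cache.put(url, response.clone());
  }
  return response.arrayBuffer(); // raw bytes, ready to transfer to a worker
}
```

Note that browsers can evict cached entries under storage pressure, which matters for a ~4 GB file; requesting persistent storage via `navigator.storage.persist()` reduces (but doesn't eliminate) that risk. The Origin Private File System is another place people store large model files.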
I really want to push forward the direction of open-source personal LLM models for everyone, and in the future also fine-tune them.
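For the second criterion above, WebGPU is exposed inside dedicated workers (at least in current Chromium-based browsers), so inference can run off the main thread. A minimal sketch, where `initGpu` is an illustrative name and the `gpu` parameter is injected so the function can be exercised outside a browser:

```javascript
// Sketch of criterion 2: acquire a WebGPU device from inside a worker.
// Pass `navigator.gpu` in a real worker; injectable here for testability.
async function initGpu(gpu) {
  const adapter = await gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU not available");
  return adapter.requestDevice();
}

// In a real module worker, the wiring might look like this:
// self.onmessage = async (e) => {
//   const device = await initGpu(navigator.gpu);
//   const weights = e.data; // ArrayBuffer transferred zero-copy from the page
//   // ...create GPU buffers from `weights` and dispatch compute shaders...
// };
```

From the page you would create the worker with `new Worker("worker.js", { type: "module" })` and post the weights buffer in the transfer list so the 4 GB isn't copied.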
So I ask for help!
If anyone has a good example, knowledge of how to make this work, or any resources on setting up such an architecture properly and efficiently, then I kindly ask you to share!
Thanks for your time
Robert