On-computer AI software without limits that chats with docs to give us data / answers

Nikhil VJ

Jul 9, 2024, 6:08:35 AM
to datameet
Hi folks,

About time we started applying AI in bigger ways to our common data problems, eh? 
I've come across a good starting place, which can work for the folks who want data, have leads, but aren't AI programmers and won't be going down that path.
Sharing what I've found, and inviting you to share from your side also.

---
Common problem statement : 
- We have large, dense PDFs and other kinds of documentation that hold some data or answers that we need,
- But it's a needle in a haystack problem plus a convertibility problem.
- So we want an AI to dig through all of it and fish out the things we need, plus get it in a format that we can easily use.
- We need proper attribution to source
- Our method of obtaining the data should be replicable.
- We need to do this a lot and we don't have money for paid AI services.

---

GPT4All Install and use


This is free / open source software that, like QGIS, we can download an installer for, double-click and install on our system.
Once done, we have a program on our computer that looks like ChatGPT.

Now, one thing to learn: this software is a container. The actual AI is a "model" that we have to download, like we do with plugins in QGIS, to work in this container. And there are many models available - I'll come to this later.

The first thing we have to do after installing is go to the "Models" tab, click + Add Model, then browse and download one.
You can ignore the ones titled "ChatGPT" etc. which have an "install" button and a text field for adding an API key. These are 3rd party services, with limits and costs.
Go for the entries with a "Download" button instead. Download one model. The GPT4All site lists some that we can start with.

Then, go to the "Chats" section, where you can select the model you downloaded and start a conversation just like we do in ChatGPT. Be prepared for slow responses: this thing is now running on your machine instead of a supercomputer somewhere.
  
After this, go to "LocalDocs" section. Here's where our core problem statement gets acted on. 
Start a collection, add some documents and proceed. It will take some time to do the "embedding" (indexing the documents so they can be searched), after which the collection is ready.

Now go to the Chats section again, and this time open the "LocalDocs" sidebar on the right and tick your newly set up document collection.
Now you can ask it questions and it'll dig through your documents and then answer based on the docs.
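
To give a feel for what "dig through your documents" roughly involves, here is a toy sketch of the retrieval idea: score each chunk of text against the question and hand the best match to the model. This is an illustration only and not GPT4All's actual implementation. It uses plain word counts where the real thing uses an embedding model, and the file name, page texts and function names are all made up for the example.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: a crude stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


# Hypothetical mini "collection": (filename, page, text) records.
pages = [
    ("report.pdf", 1, "The district has 412 primary schools and 38 secondary schools."),
    ("report.pdf", 2, "Annual rainfall in the district averaged 900 mm over the decade."),
]
# Keep the source label attached to each chunk so the answer can cite it.
chunks = [f"[{name} p.{page}] {text}" for name, page, text in pages]

best = top_chunks("How many primary schools are there?", chunks, k=1)
print(best[0])
```

Keeping the filename and page attached to every chunk is what makes attribution possible later on.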

Screenshot:
Screenshot from 2024-07-09 14-01-42.png
In my limited attempt it was even able to tell me the page number it got the answer from. For that, I had to include this instruction in my prompt: "Write the doc filename and page numbers for reference."
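
If you want every question to carry that attribution instruction (handy for replicability), a tiny helper can append it for you. A minimal sketch; the function name and the sample question are just for illustration, only the instruction text comes from my attempt above.

```python
ATTRIBUTION_NOTE = "Write the doc filename and page numbers for reference."


def build_prompt(question, note=ATTRIBUTION_NOTE):
    """Append the attribution instruction to a question, but only once."""
    question = question.strip()
    if note in question:
        return question
    return f"{question}\n{note}"


# Hypothetical question, just to show the shape of the final prompt.
print(build_prompt("What was the total budget allocation for 2023?"))
```

Paste the printed text into the chat box, or keep the helper for use in scripts.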

I think this is a good place to start with. The whole thing happened on my computer, without internet, no usage limits. Took time, but hey, some folks have more time than money.

----
Models market
As I mentioned earlier, there are many models available. That's where you come in.

- I've only just started reading into this stuff, so I don't know what differentiates the various models being published by different folks.
- It's possible that we might get some models doing our specific job very nicely, and others not so much.
- What performs well for one use case may not do so for another.
- There are also smaller models published which run better on normal laptops, and what they can do is uncharted.
- And then, there are the Settings. There's a ton of technical options (go to the "Model" tab in Settings), and the outputs we get can vary wildly with them.
- So there's a lot of room for model-shopping and settings fine-tuning here.
- What can help is: different folks trying out different combinations and reporting back what worked and what didn't, with their specific use case.
- So, inviting you to jump in and start tinkering with this, apply it to your work area, and share your experiences.
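
To make the reporting-back part easier, one option is to generate a sheet of model + settings combinations up front and fill in the results as you try them. A rough sketch: the model filenames are placeholders, and the setting names (`temp`, `top_k`) and values are assumptions for illustration, so use whatever appears in your own Settings screen.

```python
import csv
import io
import itertools

# Hypothetical values to try; replace with your own models and settings.
grid = {
    "model": ["model-a.gguf", "model-b.gguf"],  # placeholder filenames
    "temp": [0.1, 0.7],
    "top_k": [20, 40],
}


def combinations(grid):
    """Yield one dict per combination of the values in the grid."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def to_csv(rows, extra_columns=("use_case", "worked", "notes")):
    """Render trial rows as CSV text, with blank columns to fill in by hand."""
    rows = list(rows)
    fieldnames = list(rows[0]) + list(extra_columns)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


trials = list(combinations(grid))
print(to_csv(trials))
```

Save the output as a .csv, open it in a spreadsheet, and share the filled-in version with the group.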

------

Size vs usefulness of AI models 
(note: don't confuse with the Instagram kind of AI models :D ):
- The "everything" AI models which grab the headlines have ingested a myriad of info like science and history textbooks, past news, TV serial plots, food recipes, coding libraries, religious texts, celebrity gossip, social media noise - you name it. That's what makes them huge and resource-hungry.
- But you might have a very specific use case for which all the other info is utterly useless.
- So, there is an opportunity here: if we can find specific small models that do our specific work, we can do a lot with far fewer resources.
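
One rough way to put numbers on "resource-hungry": the size of a model's weights is approximately the number of parameters times the bits stored per parameter, divided by 8 to get bytes. The sketch below just does that arithmetic. The parameter counts and bit widths are illustrative assumptions, and real memory use will be somewhat higher than the weights alone.

```python
def approx_model_size_gb(params_billion, bits_per_weight):
    """Rough size of the weights alone: parameters x bits / 8 bytes, in GB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Illustrative combinations only, not specific published models.
for params, bits in [(7, 16), (7, 4), (3, 4)]:
    size = approx_model_size_gb(params, bits)
    print(f"{params}B parameters at {bits}-bit: ~{size:.1f} GB")
```

Comparing the result against your laptop's RAM gives a first guess at whether a model is worth downloading.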

-----

Bulk use
- For use cases where you have a specific task, are able to make this AI do it, and now want it repeated for several inputs, there's a Python library that can be used to do your task in bulk. Link: https://docs.gpt4all.io/gpt4all_python/home.html
- So you can first explore with the desktop software, narrow down to the specific input params / settings, then do it in bulk with code.
- I've also posted this feature request for a code export feature similar to what's there in Postman : https://github.com/nomic-ai/gpt4all/discussions/2610 
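
A minimal sketch of what the bulk step could look like with the Python library (`pip install gpt4all`). The helper names, the sample task and the model filename are assumptions for illustration. Note that this passes each input text directly in the prompt; as far as I know the Python library does not use the desktop app's LocalDocs collections. The model is swapped for a stand-in function first, so the flow can be dry-run without downloading anything.

```python
import csv
import io


def run_bulk(ask, inputs, task):
    """Apply one task prompt to many inputs; `ask` is any prompt -> text function."""
    rows = []
    for item in inputs:
        prompt = f"{task}\n\n{item}"
        rows.append({"input": item, "answer": ask(prompt).strip()})
    return rows


def rows_to_csv(rows):
    """Render the collected answers as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["input", "answer"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def make_gpt4all_ask(model_file, max_tokens=200, temp=0.1):
    """Build an `ask` function backed by a local GPT4All model."""
    # Imported here so the helpers above still work without gpt4all installed.
    from gpt4all import GPT4All

    model = GPT4All(model_file)

    def ask(prompt):
        return model.generate(prompt, max_tokens=max_tokens, temp=temp)

    return ask


# Dry run with a stand-in for the model, to check the flow first.
def echo_ask(prompt):
    return f"(dry run) got a prompt of {len(prompt)} characters"


demo_rows = run_bulk(
    echo_ask,
    ["Text of notice 1", "Text of notice 2"],
    "List every date mentioned in this text.",
)
print(rows_to_csv(demo_rows))

# With a real model (placeholder filename, use the one you downloaded):
# ask = make_gpt4all_ask("your-downloaded-model.gguf")
# rows = run_bulk(ask, my_texts, "List every date mentioned in this text.")
```

Keeping the model behind a plain `ask` function means the same loop works with any model or settings you settle on.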

----

More links
The software's github: https://github.com/nomic-ai/gpt4all
Discussions where people might have posted something you're looking for: https://github.com/nomic-ai/gpt4all/discussions
The place where all these open source AI models are published: https://huggingface.co/models


--
Cheers,
Nikhil VJ
https://nikhilvj.co.in

Gunngunn

Jul 9, 2024, 8:28:05 AM
to data...@googlegroups.com
You could have tried notebook.ml
--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
---
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datameet+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/datameet/CAH7jeuPzFzsZ8986TQhpY_R5rBYQ12ngYnWPMxLWmAGYSCTGJQ%40mail.gmail.com.

Nikhil VJ

Jul 9, 2024, 9:25:31 AM
to data...@googlegroups.com
Hi Gunngunn,

Thanks for sharing, but I think you meant Google's NotebookLM?

I gave it a spin, it's pretty good too.
Doesn't meet all the criteria though - it's not open source / self-hostable, and I was searching more in that zone as I want to do other things downstream as well, like bulk tasks and automation.

But yes I guess it would be good for lighter use cases. Thanks for sharing if this was the one you meant.

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in

Gunngunn

Jul 9, 2024, 9:46:05 AM
to data...@googlegroups.com
Yes sir.