Answer: How are you using LLMs in your SearchResearch?

Dan Russell

Jan 17, 2024, 5:14:05 PM
to searchresearch-we...@googlegroups.com

Wednesday, January 17, 2024

Answer: How do you use LLMs in your SearchResearch?

So.... 

P/C Dalle3. [an evocative picture of data, data tables, line charts, histograms]

It looks like many of us are using LLMs (Bard, ChatGPT, etc.) to ask SRS-style questions, especially ones that are a little more difficult to shape into a simple Google-style query.  

That's why I asked this week:  

1.  How have you found yourself using an LLM system (or more generally, any GenAI system) to help solve a real SearchResearch question that you've had?  

I've also gotten LLM help in finding answers to SRS questions that I couldn't figure out how to frame as a "regular" Google search.  Bard has answered a couple of such questions for me (which I then fact-checked immediately!).  While the answers might have a few inaccuracies, they're often great lead-ins to regular SRS.  

For instance, I asked Pi (via its home page) a question that has been in the news recently:  "What's the percentage of homeless people in the US?"  Here's what it told me: 

[Screenshot of Pi's answer: it claims that about 18% of people in the US are homeless.]

My first reaction was the obvious one--that can't possibly be correct!  18%???

The strangeness of the answer drove me to do a regular Google search for some data.  At the Housing and Urban Development website I found this: "The Department of Housing and Urban Development (HUD) counted around 582,000 Americans experiencing homelessness in 2022. That's about 18 per 10,000 people in the US, up about 2,000 people from 2020."  

Or, if you do the math, that's around 0.18% of the US population.  

When you look at multiple sites, you'll keep seeing that figure "18 per 10,000 people," which I could imagine an LLM rendering as "18%" in some twisted way.  Arggh!!  
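To make the scale error concrete, here's a quick back-of-the-envelope check in Python (the ~333 million US population figure is my own rough assumption, not something quoted on HUD's page):

```python
# HUD's figure: about 18 homeless people per 10,000 US residents in 2022.
per_10k = 18
print(f"{per_10k / 10_000:.2%}")   # -> 0.18%  (NOT 18%)

# Cross-check against the raw count and an approximate population figure:
homeless_count = 582_000            # HUD 2022 point-in-time count
us_population = 333_000_000         # rough 2022 US population (assumption)
print(f"{homeless_count / us_population:.2%}")   # -> ~0.17%, same ballpark
```

Either way you slice it, the true figure is two orders of magnitude smaller than "18%."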

Interestingly, when I did this on Bard and ChatGPT on the same day, they ALSO quoted 18% as the homeless rate. 

However... if you ask all of the LLMs this same question today (1 week after they told me it was 18%), they all get it right: the real percentage is 0.18%.  

That's great... but also worrying.  I'm glad they're in alignment with more official sources, but the underlying data didn't change--some training must have kicked in.  

You can't count on consistency with an LLM.  
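If you want to watch that inconsistency for yourself, it's easy to re-ask the same question on a schedule and log what comes back.  Here's a minimal sketch using the OpenAI Python client; the model name and the question are just placeholders, and any chat-capable LLM API would do:

```python
# Re-ask the same question over time and log the answers, to spot drift.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
from datetime import date
from openai import OpenAI

client = OpenAI()
QUESTION = "What percentage of people in the US are homeless?"

resp = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model name
    messages=[{"role": "user", "content": QUESTION}],
)
answer = resp.choices[0].message.content
print(f"{date.today()}: {answer}")

# Run this once a day (e.g., from cron) and diff the log after a few weeks.
```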

But... I find myself increasingly turning to LLMs to answer questions that really are difficult to frame as a search query.  Previously I wrote about finding "Other words that are like cottagecore" as an example of the kind of task that an LLM can do well.  They can take fairly abstract language ideas (e.g., "what words are like these words...<insert list>") and come up with something fairly useful.  This is a great capability that will let us take SearchResearch questions in new directions.  

You can ask follow-up questions without having to include everything you've found out thus far.  THAT is a pretty big advantage, especially for more complex tasks that take more than 2 or 3 queries to answer.  

This past week I found a really interesting paper on our topic by some folks at Microsoft Research.  In their arXiv paper, "Comparing Traditional and LLM-based Search for Consumer Choice: A Randomized Experiment," they compared ordinary search tasks in which one group of people used a "traditional" search engine while another used an "LLM-based" search engine.  (The LLM was ChatGPT, primed with an extensive prompt before people started their searching.) 

Intriguingly, they found several things: 

* people spent less time doing LLM search than trad search,

* they did fewer, but more complex, searches than trad search,

* on the whole, the two conditions were just about as accurate (but see below), 

* searchers' perceptions of accuracy were pretty high, even when the answers weren't great.

They then tried a nice intervention: color-coding uncertain results in the LLM's output to see if that would make people more accurate.  The result?  Yes, significantly so.  (Although they still had to have a magic oracle that knew the results were low-quality!)  
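The paper doesn't come with its highlighting code, but the idea is easy to sketch.  Here's a hypothetical illustration (the sentences and confidence scores below are made up, and in the study the "low quality" judgment came from that oracle, not from the model itself):

```python
# Hypothetical sketch: color-code low-confidence sentences in an LLM answer.
LOW_CONFIDENCE = 0.5

def highlight(scored_sentences: list[tuple[str, float]]) -> str:
    """Return HTML with low-confidence sentences wrapped in a highlighted span."""
    parts = []
    for text, score in scored_sentences:
        if score < LOW_CONFIDENCE:
            parts.append(f'<mark title="low confidence">{text}</mark>')
        else:
            parts.append(text)
    return " ".join(parts)

answer = [
    ("HUD counted about 582,000 homeless Americans in 2022.", 0.92),
    ("That is roughly 18% of the US population.", 0.15),   # the bogus claim
]
print(highlight(answer))
```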

However, one of the challenges they had their participants do was intentionally a hard question, one they knew the LLM would answer terribly.  But it didn't matter much--people believe LLMs beyond what they should. 

That's not a huge surprise, but a good cautionary note for us.  You still need to check to make sure that the results you're getting actually jibe with reality. 

So... what do I use LLM search for these days?  I'm asking it a lot of knowledge summarization questions.  Things like "Give me an overview of how epigenetics works in humans.  Tell me assuming I have a college level degree of genetics knowledge."  

With this prompt, ChatGPT does a pretty good job.  Here's the opening paragraph of its response: 

Epigenetics is a fascinating field of study that explores the mechanisms by which gene expression is regulated without altering the underlying DNA sequence. It plays a crucial role in development, cellular differentiation, and the adaptation of organisms to their environment. Here's an overview of how epigenetics works in humans, tailored to someone with a college-level understanding of genetics... 

On the other hand, if you ask for an explanation at a 6th grade level of biology knowledge, you'll see: 

Alright, let's imagine our body is like a huge library. This library (our body) has lots of books (cells), and each book has the same set of instructions or recipes (DNA) for making you, well, you! But even though all the books have the same recipes, not all recipes are used in every book. Some recipes are for making eye color, some for hair color, and others for how tall you'll be. Now, epigenetics is like having special bookmarks or notes in those books that decide which recipes are going to be used or ignored....

These kinds of summaries are incredibly useful, and much more sophisticated than the abstracts / snippets beneath each web result on the SERP.  Being able to dial the level of detail up or down in the prompt is a big part of why.  

On the other hand... you still have to be careful about the simplest of things.  

If you ask Bard [ how many cups are in a gallon? ] it will reply: "There are 16 cups in a gallon. This applies to both US liquid gallons and imperial gallons, although they have slightly different volumes." 

Of course, that's a classic measurement muddle. An Imperial gallon is about 1.2 US gallons, so the two can't hold the same number of the same-sized cup!  The "16 cups" answer only works because each system has its own cup: there are 16 US cups in a US gallon, and 16 Imperial cups (of 10 Imperial fluid ounces each) in an Imperial gallon--but there are 19.2 US cups in an Imperial gallon.  As always, check your facts... and check your units!  (For the record, ChatGPT explains this: "However, it's worth noting that the imperial cup and the US cup are not the same in terms of volume: 1 Imperial gallon = 16 Imperial cups,  1 Imperial cup = 10 Imperial fluid ounces.") 
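If you want to check that arithmetic yourself, the standard definitions of each unit (in milliliters) make it a one-liner per conversion:

```python
# Cup/gallon sanity check, using standard unit definitions in milliliters.
US_GALLON_ML = 3785.41
US_CUP_ML = 236.588          # 8 US fluid ounces
IMPERIAL_GALLON_ML = 4546.09
IMPERIAL_CUP_ML = 284.131    # 10 Imperial fluid ounces

print(US_GALLON_ML / US_CUP_ML)              # ~16.0  US cups per US gallon
print(IMPERIAL_GALLON_ML / IMPERIAL_CUP_ML)  # ~16.0  Imperial cups per Imperial gallon
print(IMPERIAL_GALLON_ML / US_CUP_ML)        # ~19.2  US cups per Imperial gallon
print(IMPERIAL_GALLON_ML / US_GALLON_ML)     # ~1.2   US gallons per Imperial gallon
```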

SearchResearch Lessons

1. It's worth reiterating that you shouldn't assume that the output of an LLM is accurate.  Think of it as incredibly handy and useful, but NOT authoritative. Tattoo that on your hand if you need to, but never forget it. 

2. Always double-check.  As we saw, LLMs will make bone-headed mistakes that sound good... so since you're not assuming the output of an LLM is accurate, make sure you double- or triple-source everything.  Do it now, more than ever.  



Keep searching!   


