Here's a good source describing how large language models (the kind
usually used in voice assistant systems, which may produce
unattributed content) can memorize and reproduce verbatim text from
their training data, which these days almost always includes the full
text of the English Wikipedia:
https://arxiv.org/pdf/2205.10770.pdf
See in particular the first paragraph of the Background and Related
Work section on page 2. It's fascinating that document extraction is
considered an "attack" against such systems, which may suggest that
the researchers understand they are dealing with copyright issues on
an enormous scale.

On a lighter note, here's what LaMDA had to say about today's
teleconference:
https://ibb.co/album/syK3fN
Sorry that the screenshots are out of chronological order. The LaMDA
beta doesn't allow copying text.