Attribution for voice assistant systems?

Kat Walsh

Sep 14, 2022, 6:37:52 PM
to licenses-m...@creativecommons.org
Greetings! Many of you have seen our revised suggestions for attribution that will be posted to the CC website soon.

But one area where we don't currently have specific guidance to share is voice assistant systems (things like Alexa, Siri, others). Many of these systems draw from CC-licensed sources of information, and this is an area where we want to hear more from you about what a reasonable attribution would look like in practice.

For example, is it reasonable to simply give the name of the source site so that users can navigate to it on their own? What if the material comes from a site that isn't well-known? What if there are multiple sources of information?

-Kat

James Salsman

Sep 14, 2022, 10:16:38 PM
to Kat Walsh, licenses-m...@creativecommons.org
Hi Kat,

I have strong opinions about the questions you asked, many of which are resolved by the ongoing work at https://meta.wikimedia.org/wiki/Communications/Sound_Logo
which I can't help but think you would likely find satisfactory, at least in the limited case of the Wikipedias.

May I please ask how you see attribution in the context of foundational natural language models such as GPT-2, GPT-3, and LaMDA, all of which incorporate CC-licensed materials, including the Wikipedias, but are usually allowed to produce output without any attribution? I can point you to research finding that these advanced transformer models effectively incorporate an inverted index of the documents on which they are trained, meaning that the original full text is almost always recoverable from their billions, and often trillions, of neural network weight parameters.

I ask because I've been able to measure exactly which Wikipedia articles these NLP systems want to edit, and how. There was some recent discussion about this at https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard/Archive343?wprov=srpw1_0#Extended_discussion_on_economic_bias_with_GPT-3, and I would strongly suggest studying both of the collapsed excerpts in that section when considering whether and how to provide attribution for GPT-3 and LaMDA output, including output that results in further CC content development.

Best regards,
Jim Salsman


--
You received this message because you are subscribed to the Google Groups "Licenses Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to licenses-mailing...@creativecommons.org.
To view this discussion on the web visit https://groups.google.com/a/creativecommons.org/d/msgid/licenses-mailing-list/CABhDQTxJkvD-m_WZUNryCv_%3Du-Vbkkrt-kQQM6BoMW1fCgHLSQ%40mail.gmail.com.
For more options, visit https://groups.google.com/a/creativecommons.org/d/optout.

James Salsman

Sep 16, 2022, 11:50:01 AM
to Kat Walsh, licenses-m...@creativecommons.org
Here's a good source describing how these large language models (which are usually used in the voice assistant systems that may produce unattributed content) actually contain the full text of the documents on which they were trained, which these days almost always includes the full text of the English Wikipedia: https://arxiv.org/pdf/2205.10770.pdf -- in particular the first paragraph of the Background and Related Work section on page 2. It's fascinating that document extraction is considered an "attack" against such systems, which may speak to the researchers' own awareness that they are involved with copyright issues on an enormous scale.
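To make the idea concrete, here is a minimal sketch of the kind of memorization probe those extraction papers describe: prompt a model with a prefix taken from a training document and measure how much of the held-out continuation it reproduces verbatim. The `generate` callable here is a hypothetical stand-in for any greedy-decoding model API, not a real library; it's an assumption for illustration only.

```python
def memorization_score(generate, document, prefix_len=50, suffix_len=50):
    """Fraction of a held-out suffix that a model reproduces verbatim
    when prompted with the text immediately preceding it.

    `generate(prefix, max_chars=n)` is assumed to return the model's
    greedy continuation of `prefix`, truncated to n characters.
    """
    prefix = document[:prefix_len]
    true_suffix = document[prefix_len:prefix_len + suffix_len]
    completion = generate(prefix, max_chars=len(true_suffix))
    # Character-level verbatim overlap between model output and ground truth.
    matched = sum(a == b for a, b in zip(completion, true_suffix))
    return matched / len(true_suffix)
```

A model that has fully memorized the document scores 1.0; a model that emits unrelated text scores near 0.0. The real attacks are more sophisticated (sampling strategies, perplexity filtering, deduplication), but this is the core measurement.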

On a lighter note, here's what LaMDA had to say about today's
teleconference: https://ibb.co/album/syK3fN
Sorry the screenshots are out of chronological order; the LaMDA beta doesn't allow copying text.

James Salsman

Sep 16, 2022, 12:31:47 PM
to Kat Walsh, licenses-m...@creativecommons.org
P.S. Here's a good popular treatment, showing how to extract full pages of copyrighted works from GPT-3: https://bair.berkeley.edu/blog/2020/12/20/lmmem/