Big Publishers Selling Authors' Works for LLM Training

Jerry Harris

Sep 29, 2024, 2:24:02 PM
to Dinosaur Mailing Group
Maybe this isn't news to some, but it was to me, and it gives me (personally) another reason never to publish with these corporations again. Quoted below from Mastodon posts by John Carlos Baez and Blake C. Stacey [comments in brackets mine]:

"Academic publisher Taylor & Francis [publisher of _Journal of Vertebrate Paleontology_, _Historical Biology_, and others] recently sold many of its authors’ works to Microsoft for $10 million, without asking or paying the authors—to train Microsoft’s large language models!

Taylor & Francis asked their journal "Learning, Media and Technology" to cut peer review time to 15 days—absurdly little time—to crank out more content.

And Taylor & Francis's subsidiary Routledge told staff that it was “extra important” to meet publishing targets for 2024. It moved some book deadlines from 2025 to 2024. Why? To meet its deadline with Microsoft.

Another academic publisher, Wiley [publisher of many, MANY journals that include vert-paleo-relevant papers], made a $44 million deal to feed academic books to LLMs — with no way for authors to opt out. They say “it is in the public interest for these emerging technologies to be trained on high-quality, reliable information.”

When you publish with one of the big academic publishers, they try to make you sign a contract saying they can do whatever they want with your work. That means anything."

Sources: https://pivot-to-ai.com/2024/08/04/more-academic-publishers-are-doing-ai-deals/  and  https://pivot-to-ai.com/2024/09/28/routledge-nags-academics-to-finish-books-asap-to-feed-microsofts-ai/

Thomas Richard Holtz

Sep 29, 2024, 3:24:48 PM
to DinosaurMa...@googlegroups.com
"They say “it is in the public interest for these emerging technologies to be trained on high-quality, reliable information.”"

Except that isn't how LLMs work. Nothing in their training objective prioritizes truthfulness: they are simply human-language emulators. The emphasis is on the natural flow of the text, not on the accuracy of its statements.
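To make that concrete, here is a deliberately tiny toy sketch in Python (a word-bigram sampler; real LLMs are enormous transformer networks, but the training signal is the same flavor). The model is rewarded only for predicting a statistically likely next word, so a false sentence in the training text is exactly as learnable as a true one, and nothing in the procedure ever checks accuracy.

import random
from collections import Counter, defaultdict

# Toy training text: the second sentence is flatly false, but the
# counting below treats it exactly like the true ones.
corpus = (
    "tyrannosaurus was a large theropod dinosaur . "
    "tyrannosaurus was a small marine invertebrate . "
    "stegosaurus was a plated ornithischian dinosaur ."
).split()

# "Training": count how often each word follows each other word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def sample_next(word):
    """Pick a next word in proportion to how often it followed `word`."""
    counts = bigram_counts[word]
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

def generate(start, length=8):
    """Emit a fluent-looking continuation; truth never enters into it."""
    out = [start]
    for _ in range(length):
        if out[-1] not in bigram_counts:
            break
        out.append(sample_next(out[-1]))
    return " ".join(out)

random.seed(0)
for _ in range(3):
    print(generate("tyrannosaurus"))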




--

Thomas R. Holtz, Jr.
Email: tho...@umd.edu         Phone: 301-405-4084
Principal Lecturer, Vertebrate Paleontology

Office: CHEM 1225B, 8051 Regents Dr., College Park MD 20742

Dept. of Geology, University of Maryland
http://www.geol.umd.edu/~tholtz/

Phone: 301-405-6965
Fax: 301-314-9661              

Faculty Director, Science & Global Change Program, College Park Scholars

Office: Centreville 1216, 4243 Valley Dr., College Park MD 20742
http://www.geol.umd.edu/sgc
Fax: 301-314-9843

Mailing Address: 

                        Thomas R. Holtz, Jr.
                        Department of Geology
                        Building 237, Room 1117

                        8000 Regents Drive
                        University of Maryland
                        College Park, MD 20742-4211 USA

Raven Amos

Sep 29, 2024, 3:53:05 PM
to DinosaurMa...@googlegroups.com
Thank goodness someone else is sounding an alarm about this; it happens to be a subject about which I have acquired a lot of rather arcane knowledge.

People on the art side of paleontology have already been feeling the squeeze from what happens when you pivot to large language models and text-to-image generators over actual thinking human beings. Over the past year, several other artists and I have tried to explain these problems to scientific journals and editors, and we have been muted or blocked outright for our trouble, including by the Journal of Integrative Organismal Biology. I'm still blocked by them for asking what species of bird was supposed to be depicted in a press-release image that was obviously produced by a text-to-image generator and had things like extra toes and nonsensical labels pointing at nothing; a genetics researcher also pointed out that a double-helix image in the same press release made no sense.

There's another, more infamous paper, published and later retracted in Frontiers in Cell and Developmental Biology, which left people wondering how on earth it got past peer review: a figure in the paper depicted a rodent with comically large genitalia and a label pointing to it that said "DCK". There's a write-up in Ars Technica: https://arstechnica.com/science/2024/02/scientists-aghast-at-bizarre-ai-rat-with-huge-genitals-in-peer-reviewed-article/

There are so many other examples I could give. Even paleontologists themselves are starting to use this mess, and are blocking or ignoring the people who raise an alarm. On the text-to-image side of things, artists were likewise never given an opportunity to opt out, and when some tried to contact LAION, the "non-profit" responsible for creating the major datasets used by text-to-image generators, to get their copyrighted works removed, they were countersued for "false copyright claims". Now people like Altman and Zuckerberg are saying individual creators' work is "not that valuable", yet it was valuable enough to scrape en masse, including from RedBubble product photos featuring artwork (as in my case), in a way that confounds copyright claims. And because this particular group is based in Germany, it isn't beholden to US copyright law; in fact, a German court has essentially said "yeah, keep scraping, it's fine with us": https://the-decoder.com/german-court-allows-non-profit-laion-to-scrape-copyrighted-images-for-ai-training/

These apps cannot "think" or distinguish right from wrong. In fact, the very FOUNDATION of "training" these apps includes swathes of child sexual abuse material (CSAM) in the datasets, as Stanford researchers found last year: https://www.404media.co/laion-datasets-removed-stanford-csam-child-abuse/ The app creators have known about CSAM in their datasets, and that some users have created CSAM with their products, since at least 2021: https://luddite.pro/the-lost-penny-files-midjourneys-beginning/

Likewise, ChatGPT and the other text-to-text generators are built on $2/hr labor in Kenya, where workers are subjected to the most horrific things the internet has to offer, because that was these companies' idea of "a world of knowledge": one that gives the same weight to posts on Reddit, 4chan, and Stormfront as to anything any of you have produced: https://time.com/6247678/openai-chatgpt-kenya-workers/

There have even been instances of people publishing books written by ChatGPT about foraging for wild mushrooms. The results are about what you'd expect, because, again, these machines don't think, they don't know right from wrong, and they don't actually "know" anything. They just spit out plausible-sounding word salad that _is going to get people hurt_: https://www.vox.com/24141648/ai-ebook-grift-mushroom-foraging-mycological-society

This is galvanizing me to finish the podcast I recorded a couple months ago about this very topic. I’ll be sure to share it here.

Mike Taylor

Sep 30, 2024, 1:57:38 AM
to DinosaurMa...@googlegroups.com
I am strongly in agreement that LLMs (or "AI" as we seem to have decided to call them) are very dangerous. The problem is that the one thing they really ARE good at is sounding plausible — even when spouting utter bullshit. For one disturbing example, here's a write-up of what happened when I asked ChatGPT to tell me about my own papers: https://svpow.com/2023/04/12/more-on-the-disturbing-plausibility-of-chatgpt/

In summary, every single article it mentioned was wrong: sometimes just by making the title different, sometimes by muffing page-ranges or editor names, sometimes by making up papers that simply do not exist at all. But — and here is the disturbing part — every paper it listed sounds exactly like something I WOULD have written. They would look just fine if inserted in a bibliography ... right up to the point where someone tried to follow one of the references.
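For anyone who wants to automate that last step, here is a rough sketch (assuming Python with the third-party requests package) of asking the Crossref API whether a cited title resolves to a real, indexed paper; the title below is a made-up, hypothetical example for illustration, not one of the titles ChatGPT actually produced.

import requests  # assumes `pip install requests`

def crossref_candidates(title, rows=3):
    """Return (title, DOI) pairs for the closest Crossref matches."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return [((item.get("title") or [""])[0], item.get("DOI", "")) for item in items]

# Hypothetical hallucinated-looking citation, used only as an example.
suspect_title = "Neck posture and feeding envelopes in brachiosaurid sauropods"
for found_title, doi in crossref_candidates(suspect_title):
    print(f"{found_title}  ->  https://doi.org/{doi}")
# A fabricated reference typically comes back with only loosely related
# papers, none matching the cited title, authors, or page range.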

In conclusion, m'lud, LLMs are exceptionally plausible bullshitters that consistently say things that bear no resemblance to actual reality, and neither know nor care that this is the case. I do not feel that one of society's problems right now is the lack of such bullshitters.

-- Mike.


Nick Gardner

Sep 30, 2024, 12:43:04 PM
to Dinosaur Mailing Group
In this case, they are conceivably using the works to train custom LLMs on only the corpus they hold. However, when we also see reports of publishers pressuring authors to finish academic books early in order to fulfill quotas promised in AI training-data sales, we can view this very skeptically. I will say that a small, focused training corpus will typically produce better LLM outputs for a narrow domain (this is discussed in the literature); bigger isn't automatically better, performance does not simply scale with the amount of training data, and so on.
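For what "a custom LLM trained only on the corpus they hold" might look like in practice, here is a minimal sketch of fine-tuning a small causal language model on a single focused text file, assuming the Hugging Face transformers and datasets packages; the base model, file name, and hyperparameters are illustrative placeholders, not anything the publishers have described.

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "gpt2"  # small stand-in; a real deployment would differ
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base_model)

# Hypothetical focused corpus: one plain-text file of domain articles.
dataset = load_dataset("text", data_files={"train": "focused_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="focused-llm",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()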

That aside, none of us consented to this use of our content, and it's disgusting that academic publishers have found yet another way to monetize our work without any kickbacks or benefits to authors, societies, etc.

Nick
