Is AI running out of training data?


John Clark

Dec 12, 2024, 1:00:58 PM
to extro...@googlegroups.com, 'Brent Meeker' via Everything List
The number of "tokens" (words or parts of words) used to train LLMs is 100 times larger than it was in 2020; the largest models now train on tens of trillions. Counting only text, the entire Internet contains about 3,100 trillion tokens. The amount of text LLMs train on is doubling every year, but the amount of human-generated text on the Internet is growing at only about 10% a year. If that trend continues, AIs will run out of text somewhere around 2028. Does that mean AI progress is about to hit a wall? I don't think so, for the following reasons:

For one thing, because of improvements in algorithms, the computing power a Large Language Model needs to achieve the same performance has halved about every 8 months.
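To put the 8-month halving figure in concrete terms, here is a minimal Python sketch. The function name `compute_fraction` is made up for illustration, and the sketch assumes the halving continues at a steady rate, which the post reports only as an observed trend.

```python
def compute_fraction(months: float, halving_months: float = 8.0) -> float:
    """Fraction of the original compute still needed after `months`,
    assuming the requirement halves every `halving_months` months."""
    if halving_months <= 0:
        raise ValueError("halving_months must be positive")
    return 0.5 ** (months / halving_months)


# Show how the required compute shrinks over a few multiples of the halving time.
for months in (8, 16, 24, 48):
    print(f"after {months} months: {compute_fraction(months):.4f} of the original compute")
```

Under that assumption, three halvings (24 months) leave one eighth of the original compute requirement.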



And computer chips specialized for AI rather than general computing, like those made by Nvidia and other companies, are improving even faster than Moore's Law. Also, specialized data sets, such as astronomical and biological data, are growing much more quickly than text is; that's how AIs got so good at predicting how proteins fold.

And there is vastly more information available if AIs are trained on types of data other than text, and some AIs are already being trained on unlabeled images and videos. Yann LeCun, chief AI scientist at Meta, said that "although the 10^13 tokens used to train an LLM sounds like a lot (it would take a human 170,000 years to read that much), a 4-year-old child has absorbed a volume of data 50 times greater than that just by looking at objects during his waking hours. We’re never going to get to human-level AI by just training on language, that’s just not happening".
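The reading-time figure in the quote can be sanity-checked with a few lines of Python. This is only a rough sketch: the reading speed of 335 tokens per minute and the 8 hours of reading per day are assumptions chosen for illustration, not numbers given in the quote, and `years_to_read` is a hypothetical helper.

```python
def years_to_read(tokens: float, tokens_per_minute: float, hours_per_day: float) -> float:
    """Years needed to read `tokens` at a steady pace, reading every day of a 365-day year."""
    if tokens_per_minute <= 0 or hours_per_day <= 0:
        raise ValueError("reading rate and hours per day must be positive")
    tokens_per_year = tokens_per_minute * 60 * hours_per_day * 365
    return tokens / tokens_per_year


# 10^13 tokens comes from the quote; 335 tokens/minute and 8 hours/day are assumptions.
estimate = years_to_read(1e13, tokens_per_minute=335, hours_per_day=8)
print(f"roughly {estimate:,.0f} years")
```

With those assumed inputs the result lands close to the 170,000 years quoted; a faster or slower reader shifts it proportionally.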

And then there's synthetic data. AlphaGeometry was trained to solve geometry problems using 100 million computer-generated synthetic examples with no human demonstrations, and it ended up being as good at solving difficult geometry problems as the very best high school students in the entire nation.


AI researchers are also starting to change their strategy and have their AIs reread their training set many times; because AIs operate in a statistical way, rereading improves performance.




Andy Zou at Carnegie Mellon University says, “Once an AI has got a foundational knowledge base that’s probably greater than any single person could have, it no longer needs more data to get smarter. It just needs to sit and think. I think we’re probably pretty close to that point.”
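The run-out estimate at the top of this post comes down to two exponential curves crossing: training sets that double every year and a stock of text that grows about 10% a year. A minimal Python sketch of that calculation follows. The helper `years_until_crossover` is hypothetical, and the 15 trillion starting figure is an assumed stand-in for "tens of trillions", so the output is illustrative rather than a forecast.

```python
import math


def years_until_crossover(train_tokens: float, stock_tokens: float,
                          train_growth: float, stock_growth: float) -> float:
    """Years until the training set catches up with the available stock of text.

    Growth arguments are yearly multipliers: 2.0 means doubling, 1.1 means +10%.
    Solves train_tokens * train_growth**t == stock_tokens * stock_growth**t for t.
    """
    if train_tokens <= 0 or stock_tokens <= 0:
        raise ValueError("token counts must be positive")
    if train_growth <= 0 or stock_growth <= 0:
        raise ValueError("growth multipliers must be positive")
    if train_tokens >= stock_tokens:
        return 0.0  # the training set already matches or exceeds the stock
    if train_growth <= stock_growth:
        return math.inf  # the training set never catches up
    return math.log(stock_tokens / train_tokens) / math.log(train_growth / stock_growth)


# Illustrative inputs only: 15e12 is an assumption, 3100e12 is the post's total for Internet text.
years = years_until_crossover(15e12, 3100e12, train_growth=2.0, stock_growth=1.1)
print(f"crossover after about {years:.1f} years")
```

The answer shifts by years depending on how much of the Internet's text is counted as usable and how large current training sets are taken to be, so any single output is a rough guide at best.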

John K Clark    See what's on my new list at  Extropolis






Cosmin Visan

Dec 12, 2024, 4:39:47 PM
to Everything List
Magic!

Brent Meeker

Dec 12, 2024, 9:38:11 PM
to everyth...@googlegroups.com
Magic is always the explanation of those who can't understand.

Brent
--
You received this message because you are subscribed to the Google Groups "Everything List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to everything-li...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/everything-list/87d36fd7-9b3d-44e7-8bf7-885e87eca4e4n%40googlegroups.com.

Alan Grayson

Dec 13, 2024, 2:29:37 AM
to Everything List
On Thursday, December 12, 2024 at 7:38:11 PM UTC-7 Brent Meeker wrote:
Magic is always the explanation of those who can't understand.

Brent

There's plenty of magic, under a different name, in physics. Another pitfall is relegating hidden knowledge, aka occult knowledge, such as the Chakras in Yoga, to de facto magic or someone's overactive imagination. AG

Cosmin Visan

Dec 13, 2024, 4:18:32 AM
to Everything List
When you base an invention on the world of finite forms, of course that invention will be limited. You will never replicate the powers of consciousness, because consciousness draws its powers from the infinite world of the formless. And drawing from an infinite source, it is able to produce infinite forms and it doesn't need quazillions of forms to learn. A baby learns to speak from just a few examples, because what the parents do is not to provide raw data to the baby, but to stimulate the baby's consciousness to access the formless source and to draw from there whatever forms it needs in order to be able to speak and generally learn anything.

Terren Suydam

Dec 13, 2024, 12:31:49 PM
to everyth...@googlegroups.com
Babies don't exist.


Quentin Anciaux

Dec 13, 2024, 12:34:44 PM
to everyth...@googlegroups.com

John Clark

Dec 13, 2024, 12:37:09 PM
to everyth...@googlegroups.com
On Fri, Dec 13, 2024 at 12:31 PM Terren Suydam <terren...@gmail.com> wrote:

Babies don't exist.

And existence doesn't exist. Unfortunately Cosmin Visan does.

John K Clark



Cosmin Visan

Dec 13, 2024, 1:58:49 PM
to Everything List
It would have been such a nice world if adults were actually adults.

Quentin Anciaux

Dec 13, 2024, 2:06:24 PM
to everyth...@googlegroups.com
You can render mine better by leaving this list forever. Obviously you don't give a damn fuck about reason, science, the everything, or discussion. Just go to the troll list where you belong.


Brent Meeker

Dec 13, 2024, 6:46:07 PM
to everyth...@googlegroups.com



On 12/13/2024 1:18 AM, 'Cosmin Visan' via Everything List wrote:
When you base an invention on the world of finite forms, of course that invention will be limited. You will never replicate the powers of consciousness, because consciousness draws its powers from the infinite world of the formless. And drawing from an infinite source, it is able to produce infinite forms and it doesn't need quazillions of forms to learn.
Let's see you produce an infinite form or two.


A baby learns to speak from just a few examples, because what the parents do is not to provide raw data to the baby,
Twins often invent their own language, which they speak to each other. Evolution has provided the raw data to create language.


but to stimulate the baby's consciousness to access the formless source and to draw from there whatever forms it needs in order to be able to speak and generally learn anything.
Woo-Woo magic.

Brent



Brent Meeker

Dec 13, 2024, 7:57:29 PM
to everyth...@googlegroups.com
Wonder what he looks like?

https://www.youtube.com/@ROForeverMan

Brent

Cosmin Visan

Dec 14, 2024, 4:21:05 AM
to Everything List
@Brent The only woo-woo is your belief in "matter".

Cosmin Visan

Dec 14, 2024, 4:22:13 AM
to Everything List
@Brent Thank you for promoting me. Did you listen to the consciousness presentation? What did you understand?

Bonus for my fans, a song that I composed:


Cosmin Visan

Dec 14, 2024, 4:22:48 AM
to Everything List
@Quentin Why are you sad? No girlfriend in your life? Yes, I know, it is hard. Hang in there, my friend!

Brent Meeker

Dec 14, 2024, 3:57:55 PM
to 'Cosmin Visan' via Everything List
I can successfully test my belief in matter. The fact that you did not already know this casts strong doubt on your telepathic powers.

"The second reason was that I already knew that I have telepathies when I’m in relationships, thus I wanted to see what kind of telepathies appear if I involve more than one girl."

From https://philpapers.org/archive/VISMAC-3.pdf

I guess if I were to write such stupid drivel I wouldn't use my real name either.

Brent

Cosmin Visan

Dec 15, 2024, 5:37:11 AM
to Everything List
You are making the classic confusion between epistemology and ontology. Just because you can use something doesn't mean that it exists. Just because you watch a movie with Spider-Man doesn't mean Spider-Man exists.

Also, I highly recommend that you perform such a telepathy experiment for yourself. Thank you for reading my papers!

Brent Meeker

Dec 16, 2024, 2:45:15 PM
to 'Cosmin Visan' via Everything List
So you use non-existent telepathy. Well, I guess that's the easiest kind to obtain.

I'm quite clear on the meaning of epistemology and ontology. Having knowledge of Cosmin Visan does mean he exists.

Brent

Cosmin Visan

Dec 17, 2024, 4:34:09 AM
to Everything List
What makes you lie about the existence of telepathy? Does it make you happy? Does it replace the lack of vvahmen in your life?

Brent Meeker

Dec 17, 2024, 3:48:37 PM
to everyth...@googlegroups.com
You're the one who lies about it. Which no doubt makes you happy, since it supports your conviction that you're smarter than everyone else.

Brent

Cosmin Visan

Dec 18, 2024, 2:57:23 AM
to Everything List
Why are you envious of me? Because your life is a failure? Don't you think a more productive way would be to actually do something about your life instead of hating random people for your own failures?