from Gergely Orosz | by Gergely Orosz

Gergely Orosz

@GergelyOrosz

4 months ago


We know that when LLM tools are trained on LLM-generated output, they regress. That's partially why companies like Google and OpenAI are licensing Reddit data to train their models: Reddit is assumed to be human content. Well, it was. More and more of it will be subtle AI spam. t.co/HjHSettWQj

Which could be yet another reason why LLM evolution stopped at ChatGPT-4 level. 16 months later, no new model has made the kind of jump we saw from ChatGPT 3.5 to 4.0 (in just 6 months). When your training data is increasingly AI-generated, it's hard!

Also, there's the problem that LLM tools do not respect robots.txt. They ingest every website, even if that site does NOT want to lend its content as free training material. Sites serving heaps of LLM-generated garbage on some of their pages could be a response to that.
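
For context on what "respecting robots.txt" would mean in practice, here is a minimal sketch of the check a well-behaved crawler could run before fetching a page. It uses Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URL, the `SomeBrowser` agent name, and the `may_fetch` helper are all made up for illustration, and `GPTBot` is used only as an example crawler name; none of this comes from the thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opts one AI crawler out, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post/1"  # placeholder URL
    print("GPTBot allowed:", may_fetch("GPTBot", url))
    print("SomeBrowser allowed:", may_fetch("SomeBrowser", url))
```

The catch is that robots.txt is purely advisory: nothing stops a crawler from skipping this check, which is exactly the complaint above.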
