Unraveling the Dangers of Training AI on AI-Generated Content

I’m really glad people are finally talking about how dangerous it is to train AI on other AI-generated content, because it’s starting to become a real problem. Any time an AI mimics something, like real handwritten numbers, it’s really just predicting patterns, which means it slightly messes up every time it generates new content. But if you take those AI-generated numbers and use them to train more AI, the output degrades a little further, and after enough iterations it becomes completely unintelligible.

The exact same thing is currently happening with language. OpenAI is producing over 100 billion words a day, which means it’s filling the internet with artificially generated content, and AI is also being trained on parts of that same internet. Normally, large language models produce a wide range of creative output, but as they’re increasingly trained on their own output, their vocabulary narrows until it eventually becomes meaningless.

This is a real problem for niche but important AI, like medical chatbots, which will list fewer possible diseases as they’re trained on narrower and narrower datasets.

It also might be unavoidable. It’s already very hard to separate what is and isn’t AI-generated content, and some experts think AI models will run out of publicly available training data within the next decade, meaning their output will become more recursive and our models will probably eventually collapse.
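To make that feedback loop concrete, here’s a minimal sketch in Python (assuming NumPy is available) of the simplest possible version of it: a “model” that just fits a Gaussian to its training data, then produces the next generation’s training data by sampling from itself. The sample size, number of generations, and starting distribution are arbitrary choices for illustration, not anything from a real training pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: the "real" data comes from a standard normal distribution.
mu, sigma = 0.0, 1.0
n_samples = 100      # assumed samples per generation (illustrative)
n_generations = 51   # assumed number of retraining rounds (illustrative)

for gen in range(n_generations):
    # Generate training data by sampling from the current model,
    # then "train" the next model on it: fitting a Gaussian is just
    # estimating the mean and standard deviation of the samples.
    data = rng.normal(mu, sigma, n_samples)
    mu, sigma = data.mean(), data.std()

    if gen % 10 == 0:
        print(f"generation {gen:2d}: mean={mu:+.3f}, std={sigma:.3f}")
```

Run it and the printed standard deviation drifts steadily toward zero: each generation can only reproduce what the previous generation happened to sample, so rare values in the tails disappear first. That is the same narrowing described above, whether it shows up as garbled digits, a shrinking vocabulary, or a medical chatbot listing fewer diseases.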