Vietnam.vn - Platform for promoting Vietnam

Something that's about to become a memory on the Internet.

The explosion of AI content has created a trustworthiness problem, as purely human data is becoming increasingly scarce.

ZNews, 09/06/2025

Purely human content is on the list of scarce resources in the age of AI. Photo: Advertising Week.

The emergence of ChatGPT in 2022 triggered an explosion of AI-generated content across the internet. Gartner predicts that by 2026, 90% of internet content will be generated by AI, including text, images, and videos.

AI is trained to understand human thought. However, if pure human-generated data runs out, the technology will end up learning from its own earlier output, like a photocopier copying its own copies.

Many researchers compare original, human-generated content to a modern kind of "clean" steel: rare and hard to find. They fear that if no one preserves copies of data from before 2022, the internet will lose its integrity entirely.

A historical catastrophe repeats itself.

In the post-nuclear era, scientists discovered that all steel produced after 1945 was contaminated. Atomic bombs had released radiation into the atmosphere, and it found its way into metals produced in that environment.

As a result, much of this steel was unusable for high-precision measuring equipment such as Geiger counters and many other sensitive sensors. The solution was to recover old steel from warships sunk before the nuclear age, lying deep at the bottom of the ocean, where radioactive fallout could not reach it.

AI developers face a similar situation: most models are trained on massive datasets of human-written data collected from the internet. But if today's software learns from text that AI generated in the past, the models risk collapsing, losing their originality and depth.


The German warship Hindenburg, scuttled in 1919 after World War I, was later salvaged. Photo: Reuters Connect.

This makes human-generated content, especially that created before 2022, more valuable, according to Will Allen, vice president of Cloudflare, which operates one of the world's largest internet networks. He argues that it helps AI models, as well as society as a whole, stay grounded in a shared reality. Things would become complicated without that foundation.

This foundation is especially important in high-stakes fields such as medicine, law, and taxation. For example, a doctor should rely on content written by human experts and on factual research, not on AI-generated sources.

This threat is already becoming a reality. A year after ChatGPT launched, venture investor Paul Graham recounted that, for a simple lookup, he had to restrict his search to older content to avoid “AI-generated SEO bait.” Malte Ubl, CTO of the AI startup Vercel, responded that Graham was essentially filtering the internet for content “before it was contaminated by AI.”

Matt Rickard, a former Google engineer, agrees. He wrote in a 2023 blog post that AI gathers its data from the internet, yet more and more of the content on the internet is created by AI itself. “Chatbot output is very difficult to detect. Finding training data that hasn’t been tampered with by AI will become increasingly difficult,” Rickard explained.

The "search for steel on the seabed"

The answer to this problem lies in preserving human-generated data from before the AI boom. One of the pioneers in this field is John Graham-Cumming, board member and Chief Technology Officer of Cloudflare.

His project, the website LowBackgroundSteel.ai, lists datasets, links, and media that existed before 2022. One example given is GitHub's Arctic Code Vault, an open-source software archive buried in an abandoned coal mine in Norway, holding a snapshot of code taken in February 2020.


Graham-Cumming's human data preservation project. Photo: Lowbackgroundsteel.ai.

Another data source he cited was “wordfreq,” a project that tracks the frequency of word usage online. Linguist Robyn Speer maintained it until 2021.

"Generative AI has polluted the data," Speer said. She gave the example of ChatGPT's overuse of the word "delve," which has made the word noticeably more common in recent text. This skews data on the internet, making it a less reliable reflection of how humans write and think.

AI models partially trained on synthetic content can speed up workflows and take the tedium out of creative tasks. Beyond raw performance, however, users may still need original human-generated content for accurate assessments, much as precise measurements call for "low-background steel."

Scientists have since developed various methods for producing uncontaminated steel using pure oxygen. Even so, according to Business Insider, the episode is a reminder that preserving the past may be the only way to build a reliable future.

Source: https://znews.vn/thu-sap-thanh-hoai-niem-บน-internet-post1559151.html

