In addition to the three authors above, Zadie Smith, Stephen King, Rachel Cusk and Elena Ferrante are also affected, among others. According to an analysis of Books3, a dataset used by many companies to build AI tools, more than 170,000 books have been fed into AI models, including those built by Meta and Bloomberg.
bell hooks, Jennifer Egan, George Saunders, Stephen King, Margaret Atwood, Zadie Smith and Haruki Murakami are among the writers whose works have been illegally used to train AI.
Generative AI applications like ChatGPT are designed to understand and generate human-like text. To achieve this, the system requires a large amount of text for “training”. According to writer and programmer Alex Reisner, who revealed the findings above, that “input” comes not only from “open” sources such as Wikipedia and online articles, but also from books, to ensure high quality.
The analysis also revealed how many books by individual authors were illegally used: 33 by Margaret Atwood, at least 9 by Haruki Murakami, 9 by bell hooks, 7 by Jonathan Franzen, 5 by Jennifer Egan and 5 by David Grann.
Books3 was used to train LLaMA, one of Meta's large language models. Such models, the most famous of which is OpenAI's ChatGPT, generate content based on patterns they learn from training text. The dataset was also used to train Bloomberg's BloombergGPT and EleutherAI's GPT-J, and is “likely” to have been used in other AI models as well.
The newly revealed Books3 titles are about one-third fiction and two-thirds non-fiction, most of them published in the last two decades.
In addition to the authors listed above, books by George Saunders, Junot Díaz, Michael Pollan, Rebecca Solnit, and Jon Krakauer also appear in the dataset. These titles span publishers large and small, including more than 30,000 titles from Penguin Random House, 14,000 from HarperCollins, 7,000 from Macmillan, 1,800 from Oxford University Press, and 600 from Verso...
A battle between the tech industry and the publishing world appears to be imminent.
This follows a lawsuit filed last month by three writers, Sarah Silverman, Richard Kadrey, and Christopher Golden, alleging that their copyrighted works “were copied and used as input to train AI tools.” Analysis showed that the three plaintiffs’ works were indeed part of Books3.
OpenAI, the company behind the AI chatbot ChatGPT, has also been accused of training its model on copyrighted works. A clue to this data comes from a 2020 report the company released that mentions two “internet-based book sources,” one of which is called Books2 and is estimated to contain nearly 300,000 titles.
However, many people suspect that a collection this large can only have come from "dark libraries" such as Library Genesis (LibGen) and Z-Library, where files can be downloaded in bulk via torrents. These sites are known sources of unlicensed books and attract large numbers of visitors around the world.
Shawn Presser, the independent AI developer who originally created Books3, told The Guardian that he sympathized with the authors’ concerns. He said he created a database that anyone could use to develop AI tools, and was concerned about the risks of large companies taking control of the technology.
Reisner's investigation also revealed a massive dataset called The Pile, which contains Books3 data as well as documents from various sources, such as YouTube subtitles and European Parliament documents...
The Pile data extracted and analyzed by Reisner exposed the scale and diversity of pirated works used to train AI, raising ethical concerns about the origin and legality of this data.
Reisner also said that while a Meta spokesperson declined to comment on the use of Books3, Stella Biderman, CEO of EleutherAI, did not deny the use of this data source for GPT-J.
A Bloomberg spokesperson also confirmed to The Guardian that the company had used the dataset in the past, adding: “We will not use the Books3 dataset as an input for training the upcoming BloombergGPT.”
The use of copyrighted books to train AI models raises complex questions about ethics, copyright, and the future of creative work. As AI technology continues to advance, the use of illegally obtained content as training input will need to be addressed through legal means. Bridging the gap between the “openness” of AI development and the rights of creators requires a balance that ensures technological advancement does not come at the expense of intellectual property rights. A confrontation between the tech industry and the publishing world may be looming.