Top artificial intelligence companies are facing a wave of copyright litigation and accusations that they are aggressively scraping data from the web, a problem exacerbated as start-ups hit a “data frontier” hindering new advances in the technology.
This month, a trio of authors sued Anthropic for “stealing hundreds of thousands of copyrighted books”, claiming the San Francisco AI start-up “never sought — let alone paid for — a licence to copy and exploit the protected expression contained in the copyrighted works fed into its models”.
The class-action lawsuit adds to a long list of ongoing copyright cases, the most prominent of which was brought by the New York Times against OpenAI and Microsoft late last year. The Times claims the companies are “profit[ing] from the massive copyright infringement, commercial exploitation and misappropriation of The Times’s intellectual property”.
If the case is successful, the publisher’s arguments could be extended to other companies training AI models from across the internet, with the potential for further litigation.
AI companies have made significant strides forward in the past 18 months, but have begun to run up against what experts describe as a data frontier, forcing them to trawl ever-deeper recesses of the web, strike deals to access private data sets or rely on synthetic data.
“There’s no more free lunch. You can’t scrape a web-scale data set any more. You have to go and purchase it or produce it. That’s the frontier we’re at now,” said Alex Ratner, co-founder of Snorkel AI, which builds and labels data sets for companies.
Anthropic, a self-described “responsible” AI start-up, has also been accused in the last month by website owners of “egregious scraping” of web data to train its systems. Perplexity, an AI-powered search engine aiming to take on Google’s monopoly in web queries, has faced similar accusations.
Google itself has caused consternation among publishers, who have struggled to block the company from scraping their sites for its AI tool without also cutting themselves out of search results.
AI start-ups are engaged in a fierce race for dominance in which they require mountains of training data, along with increasingly sophisticated algorithms and more powerful semiconductors to help their chatbots generate creative, humanlike responses.
ChatGPT-parent OpenAI and Anthropic alone have raised more than $20bn to build powerful generative AI models, which can respond to prompts in natural language, and retain their edge over newer entrants, including Elon Musk’s xAI.
But the contest between AI companies has also put them in the crosshairs of publishers and owners of material needed to develop models.
The Times’s case aims to establish that OpenAI has effectively cannibalised its content and is reproducing it in ways “that substitute for The Times and steal audiences away from it”. A resolution in the case would provide greater clarity to publishers about the value of their content.
In the meantime, AI start-ups are striking deals with publishers to ensure their chatbots produce accurate, up-to-date responses. OpenAI, which recently announced its own search product, struck a deal with Condé Nast, publisher of the New Yorker and Vogue magazines, adding to tie-ups with others including The Atlantic, Time and The Financial Times. Perplexity has also signed revenue-sharing deals with a number of publishers.
Anthropic has yet to announce similar partnerships, but in February the start-up hired Tom Turvey, a 20-year Google veteran who had worked on the search giant’s partnership strategy with major publishers.
Google has done more than any other company to set a precedent for how the relationship between publishers and tech companies functions today. In 2015, the company won its case against a group of authors who claimed that its scanning and indexing of their works infringed their copyright. The victory hinged on the argument that Google’s use of the content was “highly transformative” and so qualified as fair use.
The Times’s case against OpenAI rests on the claim that “there is nothing ‘transformative’” about how the tech company has used the newspaper group’s content. A verdict would provide a new precedent to publishers. Google’s case, however, took a decade to conclude, during which time the search engine had established a dominant position.