Focus On AI

Rebooting Copyright For The Age Of AI

“Is This What We Want?” is the name of a silent album released by UK musicians, including Annie Lennox and Kate Bush, to protest UK government plans to allow AI companies to use copyright-protected work without permission. It is just one of the many ways that well-known figures from the publishing, music, film, TV, design and performing arts sectors are displaying their displeasure over proposed changes to the country’s copyright law.

Data is the lifeblood of artificial intelligence, and large language models – such as ChatGPT – are trained on vast amounts of publicly available data. They are reeling in content on the Internet produced by musicians, journalists, artists and authors, ballooning their own valuations while threatening the livelihoods of content creators.

Copyright lawsuits filed against GenAI companies abound, alleging that the way they operate amounts to theft. Examples include the New York Times v OpenAI lawsuit in the U.S. and the Getty Images v Stability AI case in the UK. The allegations in AI and copyright cases generally split into two parts: first, that the outputs of AI models constitute an illegal copy; second, that using copyrighted works in training data for AI (inputs) is a breach of the rights holder’s copyright.

Requiring developers to license all the material they use to train models would be very difficult due to the distributed nature of data and ownership, argues the Tony Blair Institute for Global Change (TBI). To provide legal clarity and accelerate AI development, some countries have already taken a lenient view of the use of publicly available data for AI training, so if other countries take a restrictive stance, it will drive development elsewhere. And even if a workable solution for payment were found, it would likely stifle competition, since only large, well-funded AI companies could afford to pay.

“As technologies and societies evolve so must regulations,” Jakob Mökander, TBI’s Director of Science & Technology Policy said in an interview with The Innovator. “We need to find an approach that makes sense in the digital age. It is a question that all governments will face.”

On April 2 TBI published a report on rebooting copyright that says the current situation is unsustainable. It argues that the status quo harms all stakeholders, including creators, who are not properly remunerated for their labor; rights holders, who struggle to exercise control over how their works are used; AI developers, who face hurdles when it comes to training AI models; and society at large, which risks missing out on benefiting from AI diffusion and adoption. “Bold policy solutions are needed to provide all parties with legal clarity and unlock investments that spur innovation, job creation and economic growth,” says the report.

The TBI report supports the position favored by the UK government: a text and data mining (TDM) exception for AI model training with the possibility for creators and rights holders to opt out. This would make it legal to train AI models on publicly available data for all purposes, while giving rights holders more control over how they communicate their preferences with respect to AI training, argues the report.

The report notes that it is important to separate the debates around AI outputs and AI training. AI outputs should not be allowed to reproduce original works without proper license and remuneration, says the report, but prohibiting AI models from training on publicly available data would be misguided and impractical. “The free flow of information has been a key principle of the open Web since its inception,” says the report. “To argue that commercial AI models cannot learn from open content on the Web would be close to arguing that knowledge workers cannot profit from insights they get when reading the same content.”

There are better ways of supporting the creative industries, says Mökander. “There needs to be increased funding for creators in the digital age. We want to have flourishing industries, but copyright law may not be the best way to do that,” he says.

The report suggests some alternative approaches to helping the creative industries, including the creation in the UK of a Centre for AI and Creative Industries which would serve three functions: bringing together experts and representatives; acting as an engine to create new technologies and infrastructures to support growth in machine learning in the UK creative industries; and providing much-needed training and expertise across academia and industry. If more funding for the arts is needed, and if governments need to raise more funds for this, one option to consider is taxing data connections on fixed lines and mobile devices, adding pennies per month to the Internet Service Provider (ISP) bills of households and businesses who benefit from using AI tools.

“Rather than fighting to uphold 20th-century regulations, rights holders and policymakers should focus on building a future where creativity is valued and respected alongside AI innovation,” says the report: “Copyright law provides insufficient clarity for creators, rights holders, developers and consumer groups, impeding innovation while failing to address creator concerns about consent and compensation. The question is not whether generative AI will transform creative industries (it already is) but how to make this transition equitable and beneficial for all stakeholders.”

The Trouble With Copyright

The truth is that no one is happy with the status quo, says Mökander.

Just ask American author Cory Doctorow. “For 40 years, the scope and duration of copyright have monotonically increased, the evidentiary burden for copyright claims has declined, and the statutory damages for copyright infringement have expanded,” he wrote in an online article. “Publishing and other creative industries generate more money than ever – and yet, despite all this copyright and all the money that sloshes around as a result of it, the share of the income from creative work that goes to creators has only declined. The decline continues. There is no bottom in sight.”

Doctorow uses the following analogy to drive home his point: “If the bullies at the school gate steal your kid’s lunch money every day, it doesn’t matter how much lunch money you give your kid, he’s not gonna get lunch. But how much lunch money you give your kid does matter – to the bullies. (Creators) are the hungry schoolkids. The cartels that control access to our audiences are the bullies. The lunch money is copyright.”

Strengthening copyright law would do little to benefit creators, and requiring developers to license the materials needed to train AI would threaten the development of more innovative and inclusive AI models, as well as important uses of AI as a tool for expression and scientific research, argues the Electronic Frontier Foundation (EFF), which has published a series of articles on problems with copyright in the age of AI.

Requiring researchers to license fair uses of AI training data could make socially valuable research based on machine learning and even text and data mining prohibitively complicated and expensive, if not impossible, argues the EFF. It notes that researchers have relied on fair use to conduct TDM research for a decade, leading to important advancements in science and other fields.

For giant tech companies that can afford to pay, pricey licensing deals offer a way to lock in their dominant positions in the generative AI market by creating prohibitive barriers to entry, says the EFF. To develop a foundation model that can be used to build generative AI systems like ChatGPT and Stable Diffusion, developers need to train the model on billions or even trillions of works, often copied from the open Internet without permission from copyright holders. There’s no feasible way to identify all of the rights holders—let alone execute deals with each of them. Even if these deals were possible, licensing that much content at the prices developers are currently paying would be prohibitively expensive for most would-be competitors.

As the U.S. Federal Trade Commission recently explained, if a handful of companies control AI training data, “they may be able to leverage their control to dampen or distort competition in generative AI markets” and “wield outsized influence over a significant swath of economic activity.”

The Way Forward

The UK’s proposal for a TDM exception with opt-out – which essentially allows the scraping of publicly available information – would bring UK regulation broadly in line with the European Union’s.

But other jurisdictions, such as Singapore and Japan, have more liberal copyright laws pertaining to AI training, and China is speeding ahead. The current administration has indicated that the U.S. will not pursue strict AI regulations, but there is ongoing litigation in the U.S. around AI training. What constitutes fair use of copyrighted materials in the U.S. will be decided on a case-by-case basis.

The legal landscape surrounding IP data scraping is not only complex, it is rapidly evolving, says a February OECD report on data scraping. What’s more, different actors in the data scraping ecosystem raise various types of legal issues. Some also use data scraping to support research and other endeavors, suggesting the need for policy tools tailored to different use cases, says the OECD report. The data scraping ecosystem encompasses research institutions and academia, AI data aggregators, as well as technology companies and platform operators. Research institutions and academia frequently employ data scraping to gather data for academic and scientific purposes. AI data aggregators make scraped data available to third parties, often without clear licensing terms or clear disclosure of data provenance, raising IP and other legal concerns. Technology companies and platform operators are sources of scraped data and regular data scrapers themselves.

The OECD is promoting a global data scraping code of conduct, standard contract terms, standard technical tools and initiatives for building awareness that would chart a responsible path for data scraping in an internationally coordinated manner. “This would be particularly effective if it is developed with input from a broad and diverse set of stakeholders, including rights holders, researchers, AI developers, civil society, and policymakers,” says the OECD report.

TBI’s Mökander says he would welcome globally recognized codes of conduct. “AI training data are only useful if there are clear international standards,” he says. “If not, we risk a race to the bottom, pushing AI development to other jurisdictions with more lenient regulations. In fact, harmonized international standards should be top priority for policymakers seeking to build a flourishing ecosystem for the arts and AI.”

Tech tools will help ensure compliance, he says. For example, AI company Spawning has developed a Do Not Train registry that allows artists to tag works around the Internet as copies of their originals, opting them out of AI training. Developers can then use “data-diligence” software from Spawning to check whether URLs have been opted out. Another tool cited in the TBI report is ProRata.ai, a new company that uses tech to enable generative artificial intelligence (GenAI) platforms to attribute and compensate content owners.
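In essence, such data-diligence tools let a developer screen candidate training URLs against an opt-out registry before ingestion. A minimal sketch of the idea in Python – the registry contents, the example URLs and the filter_training_urls helper are all hypothetical illustrations, not Spawning’s actual API:

```python
# Hypothetical opt-out registry: URLs whose rights holders have
# asked that their work not be used for AI training.
OPTED_OUT = {
    "https://example.com/art/sunrise.png",
    "https://example.com/art/portrait.jpg",
}

def filter_training_urls(urls):
    """Return only the URLs that have not been opted out of training."""
    return [u for u in urls if u not in OPTED_OUT]

candidates = [
    "https://example.com/art/sunrise.png",   # opted out, will be dropped
    "https://example.com/blog/post.html",    # not registered, kept
]
print(filter_training_urls(candidates))  # ['https://example.com/blog/post.html']
```

In practice a real service would match content by fingerprint rather than exact URL, but the compliance check – consult the registry, then exclude – is the same.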

ProRata CEO Bill Gross invented, and has patented, technology that can reverse-engineer where an answer came from and what percentage comes from a particular source, so that owners can be paid for the use of their material on a per-use basis. ProRata pledges to share half the revenue from subscriptions and advertising with its licensing partners, help them track how their content is being used by AIs, and aggressively drive traffic to their websites.

When a user poses a query, ProRata’s algorithm compiles an answer from the best information available. At the top of the page there is an attribution bar which specifies where the answer came from. It might say, for example, 30% of this answer came from The Atlantic, 50% from Fortune and 20% from The Guardian. The publications are immediately compensated according to their contribution to the answer, and a side panel displays the original articles and enables users to click on the original source to learn more. Think of it as “attribution-as-a-service,” Gross said in an interview earlier this year with The Innovator. “Just as Nielsen measures how TV shows are watched to determine what advertisers should pay, we are moderating the output of the queries to determine how much GenAI providers should pay content providers.”
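Once the attribution percentages are known, the payout step reduces to simple proportional arithmetic. A minimal sketch using the shares from the example above – the split_revenue function and the revenue figure are illustrative, not ProRata’s actual implementation:

```python
def split_revenue(revenue_cents, attribution):
    """Divide revenue among sources in proportion to their attribution shares."""
    return {source: round(revenue_cents * share)
            for source, share in attribution.items()}

# Shares reported by the attribution bar for one answer.
attribution = {"The Atlantic": 0.30, "Fortune": 0.50, "The Guardian": 0.20}

# $10.00 of attributable revenue for this answer, in cents.
payouts = split_revenue(1000, attribution)
print(payouts)  # {'The Atlantic': 300, 'Fortune': 500, 'The Guardian': 200}
```

A production system would also have to handle rounding residue and aggregate payouts across millions of queries, but the per-answer logic is just this weighted split.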

In the future it should be technically simple to build AI agents that can track creators’ portfolios, maintain registries and initiate robot-to-robot interactions with other websites, asking them to remove content, says the TBI report. These agents are expected to simplify content attribution for AI companies, enabling them to effectively track online content origins and eliminating plausible deniability for developers who claim ignorance about opted-out work appearing in their systems.

If the right policies are put in place the AI revolution “can be the standout engine for artistic and cultural renewal of our era,” says the TBI report. It could also help countries like the UK lead in the AI sector.

Time To Act

But time is not on anyone’s side, says the TBI report. Large, effective foundation models already exist and are publicly accessible. They will continue to grow in capability and will be used by an increasing number of people. They will be developed around the world, in jurisdictions with very relaxed copyright laws, and used so widely as tools that they will inevitably make some jobs redundant. At the same time, countries with restrictive laws will push developers to move to countries with less stringent measures, says the TBI report. “The longer governments take to tackle the issue of AI and copyright, the more they will inhibit innovation and entrench large AI developers in the global competition for AI leadership.”


About the author

Jennifer L. Schenker

Jennifer L. Schenker, an award-winning journalist, has been covering the global tech industry from Europe since 1985, working full-time, at various points in her career for the Wall Street Journal Europe, Time Magazine, International Herald Tribune, Red Herring and BusinessWeek. She is currently the editor-in-chief of The Innovator, an English-language global publication about the digital transformation of business. Jennifer was voted one of the 50 most inspiring women in technology in Europe in 2015 and 2016 and was named by Forbes Magazine in 2018 as one of the 30 women leaders disrupting tech in France. She has been a World Economic Forum Tech Pioneers judge for 20 years. She lives in Paris and has dual U.S. and French citizenship.