Data scraped from public websites and compiled into large data sets is what allows generative AI tools like OpenAI’s ChatGPT to write, code, and generate images and videos. The more high-quality data is fed into these models, the better their outputs generally are. But over the past year, many of the most important web sources used for training AI models have restricted the use of their data, according to a study published in July by the Data Provenance Initiative, an MIT-led research group.
The study, which looked at 14,000 web domains included in three commonly used AI training data sets, identified what it describes as an “emerging crisis in consent,” as publishers and online platforms take steps to prevent their data from being harvested.
The trend raises the question of what will happen once available sources are exhausted. “If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems,” the study’s authors say. They warn that the rising number of restrictions “is foreclosing much of the open Web, not only for commercial AI, but non-commercial AI and academic purposes.”
Plan B appears to have problems of its own. As they reach the limits of the human-made material that can improve the cutting-edge technology, AI companies, including OpenAI and Microsoft, are testing the use of so-called “synthetic” data: information created by AI systems that is then used to train large language models (LLMs). Research published in Nature on July 24 suggests that relying on such data could lead to the rapid degradation of AI models. The paper explores the tendency of AI models to collapse over time because of the inevitable accumulation and amplification of mistakes from successive generations of training. The speed of the deterioration is related to the severity of shortcomings in the model’s design, the learning process, and the quality of the data used.
A Growing Backlash
For years, AI developers were able to gather data relatively freely. But the generative AI boom of the past few years has led to tensions with the owners of that data, many of whom have misgivings about their content being used as AI training fodder, or at least want to be paid for it, notes a New York Times article.
As the backlash has grown, some publishers have set up paywalls or changed their terms of service to limit the use of their data for AI training. Others have blocked the automated web crawlers used by companies like OpenAI, Anthropic and Google.
Sites like Reddit and Stack Overflow have begun charging AI companies for access to their data, and a few publishers have taken legal action, including The New York Times, which sued OpenAI and Microsoft for copyright infringement last year, alleging that the companies used its news articles to train their models without permission.
Methods of blocking data gathering vary, according to the MIT-led report. For instance, the burden of prohibiting many individual AI crawlers one by one has motivated many domains to simply prohibit all crawling. Domains are also shutting out crawlers from non-profit archives such as the Common Crawl Foundation and the Internet Archive, to prevent other organizations from downloading their data for training. Yet these archives also serve non-commercial AI, academic research, and public accountability, purposes well beyond commercial AI development; Common Crawl, for instance, is reported to be cited in more than 10,000 research articles across a range of fields. “This tension between data creators and, predominantly, commercial AI developers has left academic and non-commercial interests as secondary victims,” says the report. “As Web consent continues to evolve, we believe it is essential that these often essential facilities not be marginalized or severely hampered.”
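Much of this blocking relies on the long-standing Robots Exclusion Protocol: a robots.txt file that tells crawlers, by user-agent name, which paths they may fetch. The sketch below is illustrative only; the crawler names GPTBot (OpenAI), CCBot (Common Crawl) and Google-Extended (Google’s AI-training token) are real, but the policy and site are hypothetical. It also shows how a compliant crawler would check the rules using Python’s standard library:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that refuses known AI training crawlers while
# leaving the site open to everything else. The user-agent names are real;
# the policy itself is an illustrative example.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for agent in ["GPTBot", "CCBot", "SomeSearchBot"]:
    allowed = parser.can_fetch(agent, "https://example.com/articles/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")

# GPTBot and CCBot are blocked; the generic crawler is allowed. The study's
# point: keeping such per-crawler lists current is tedious, which is why many
# domains instead publish a blanket "User-agent: * / Disallow: /".
```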
Synthetic Data’s Drawbacks
Even if AI crawlers are not curtailed and the training data of most future models is still scraped from the web, those models will inevitably train on synthetic data produced by other LLMs, says the Nature article. The authors investigated what happens when text produced by, for example, a version of GPT forms most of the training data set of subsequent models. They report that indiscriminately learning from data produced by other models causes “model collapse”: a degenerative process whereby, over time, models forget the true underlying data distribution, even in the absence of a shift in the distribution over time.
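The dynamic can be seen in a toy simulation, a deliberately simplified sketch rather than the Nature paper’s actual experiment: fit a simple model (here just a Gaussian’s mean and standard deviation) to one generation’s data, sample the next generation’s training data from the fitted model, and repeat. Because every fit sees only a finite sample, small errors compound across generations:

```python
import numpy as np

rng = np.random.default_rng(42)

# Generation 0: "human" data drawn from the true distribution N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=50)

for gen in range(1, 101):
    # "Train" on the previous generation's output: the model here is just
    # a Gaussian fitted by its sample mean and standard deviation...
    mu, sigma = data.mean(), data.std()
    # ...then generate the next generation's training set from that model.
    data = rng.normal(loc=mu, scale=sigma, size=50)
    if gen % 20 == 0:
        print(f"generation {gen:3d}: mean={mu:+.3f}  std={sigma:.3f}")

# Because each fit sees only a finite sample, the estimated spread follows a
# downward-biased random walk: rare, tail-of-the-distribution events are
# progressively forgotten, a minimal analogue of model collapse.
```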
The authors of the Nature paper stress the importance of preserving access to real, human-produced data.
“To sustain learning over a long period of time, we need to make sure that access to the original data source is preserved and that further data not generated by LLMs remain available over time,” says the article. “The need to distinguish data generated by LLMs from other data raises questions about the provenance of content that is crawled from the Internet: it is unclear how content generated by LLMs can be tracked at scale. One option is community-wide coordination to ensure that different parties involved in LLM creation and deployment share the information needed to resolve questions of provenance. Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data crawled from the Internet before the mass adoption of the technology or direct access to data generated by humans at scale.”
The upshot is that today’s largest providers of AI models may have a long-term advantage over new rivals.
“One key implication of model collapse is that there is a first-mover advantage in building generative AI models,” says a companion piece in Nature by Emily Wenger of Duke University in the U.S. “The companies that sourced training data from the pre-AI Internet might have models that better represent the real world.”
IN OTHER NEWS THIS WEEK
Unlocking The Potential of AI-Driven Chemistry
SandboxAQ, a spin-out of Google that leverages quantum technologies and AI on today’s classical computing platforms, on July 23 announced what it said is a groundbreaking advance in computational chemistry, with implications for biopharma, chemicals, materials science and other industries.
Collaborating with Nvidia, SandboxAQ leverages Large Quantitative Models (LQMs) and Nvidia’s CUDA-accelerated Density Matrix Renormalization Group (DMRG) algorithm. This allows scientists to perform Quantitative AI simulations of real-life systems with exacting accuracy, going beyond what Large Language Models (LLMs) and other AI models can currently do.
Combining the CUDA-accelerated DMRG algorithm, the Nvidia Quantum platform, and Nvidia accelerated computing speeds up these highly accurate calculations more than 80x compared with traditional 128-core Central Processing Unit (CPU) computations, according to SandboxAQ. At the same time, it more than doubles the size of the catalyst and enzyme active-site systems that can be computed.
“Advanced computing is opening new frontiers in scientific research,” Dr. Martin Ganahl, senior staff scientist at SandboxAQ, said in a statement. “Our use of NVIDIA technology has allowed us to address some of the most challenging problems in chemistry. We are not only advancing our understanding of material science and chemistry, but also paving the way for the next wave of innovations in drug discovery and catalysis to tackle currently untreatable conditions and find safer and cheaper ways to synthesize molecules and materials.”
While LLMs are limited to the data available on the Internet and other existing sources, SandboxAQ says its LQMs, which draw on an unlimited supply of training data generated by physics-based Quantitative AI simulations, can make accurate predictions about the world because they are grounded in exact, physics-based data.
In a separate announcement, SandboxAQ said the U.S. Army will use its Quantitative AI software to develop advanced battery chemistries and designs for applications such as electric vehicles, Unmanned Aerial Vehicles (UAVs), and portable power solutions. SandboxAQ, in partnership with Comprehensive Carbon Impact, will also leverage its Quantitative AI software to discover novel alloy materials specifically designed for the U.S. Army’s armored vehicles.
The Google spin-out is not the only company using quantum and high-performance computing to advance chemistry.
NobleAI, a pioneer in science-based AI solutions for chemical and material informatics, in December 2023 announced a collaboration with Azure Quantum Elements (AQE), a cloud-based service from Microsoft that accelerates scientific discovery by integrating the latest breakthroughs in High-Performance Computing (HPC), AI, and quantum computing.
The alliance brings together AQE’s state-of-the-art molecular simulation and HPC capabilities with NobleAI’s AI-driven solutions for rapidly exploring potential chemical and material formulations. NobleAI’s approach relies on Specialized Science-Infused Models (SSMs) as opposed to the Large Language Models powering Generative AI. By combining the power of AI with applicable scientific laws, NobleAI says it helps companies slash research cycle times and speed product development, even when they are starting with smaller private or industry-specific data sets.
Azure Quantum Elements “aims to compress 250 years of chemistry into the next 25,” Dr. Nathan Baker, Product Manager of Azure Quantum Elements, said in a statement.
OpenAI Enters Google-Dominated Search Market With SearchGPT
OpenAI is venturing into territory long dominated by Google with the selective launch of SearchGPT, an artificial intelligence-powered search engine with real-time access to information from the Internet. The move, announced on July 25, also places the AI giant in competition with its largest backer Microsoft’s Bing search and emerging services such as Perplexity, a search-focused AI chatbot firm backed by Amazon founder Jeff Bezos and semiconductor giant Nvidia. SearchGPT will summarize the information found on websites, including news sites, and let users ask follow-up questions, just as they can currently with OpenAI’s popular chatbot, ChatGPT. Sources are linked at the end of each answer in parentheses. OpenAI also built a sidebar where it said users can see more results and sources with relevant information.
AI Helps To Create Breakthrough In Weather And Climate Forecasting
The Financial Times reports that artificial intelligence has helped to deliver a breakthrough in accurate long-range weather and climate prediction, according to research that promises advances in both forecasting and the wider use of machine learning. Using a hybrid of machine learning and existing forecasting tools, a Google-led model called NeuralGCM successfully coupled AI to conventional atmospheric physics models to track decades-long climate trends and extreme weather events such as cyclones, a team of scientists found. This combination of machine learning with established techniques could provide a template for refining the use of AI in other fields, from materials discovery to engineering design, the researchers suggest. NeuralGCM was much faster than traditional weather and climate forecasting tools and better than AI-only models at longer-term predictions, they said.
Google AI Systems Make Headway With Math, Bringing The Tech Closer To Reasoning
Alphabet’s Google unveiled a pair of artificial intelligence systems on July 25 that demonstrated advances in solving complex mathematical problems, a key frontier of generative AI development. The current class of AI models, which work by statistically predicting the next word, has struggled with abstract math, which requires reasoning capabilities closer to human intelligence. DeepMind, the company’s AI unit, published results showing that its new AI models in development, called AlphaProof and AlphaGeometry 2, solved four out of six questions from the 2024 International Math Olympiad, a competition for high school students. Google said in a blog post that one question was solved within minutes, but others took up to three days, longer than the competition’s time limit. Still, the results represent the best marks in the competition by an AI system to date.
The company said it created AlphaProof, a system focused on reasoning, by combining a version of Gemini, the language model behind its chatbot of the same name, with AlphaZero, another AI system which previously bested humans in board games such as chess and Go. AlphaProof solved three of the competition’s problems, including the most difficult question, which was solved by just five out of more than 600 human contestants.
Call My AI Agent
Artificial intelligence-powered agents will be able to work together to solve tasks in so-called multi-agent AI systems by 2025, according to technology services giant Capgemini. Such a system entails a collection of agents that divide work and collaborate in a distributed way.
In its report “Harnessing the Value of Generative AI,” Capgemini noted that 82% of the companies it surveyed plan to integrate AI agents within one to three years, while only 7% have no plans to integrate them. The research relied on a survey of more than 1,100 companies with revenues of $1 billion or more.
AI agents fall into two types: individual agents that carry out tasks on a user’s behalf, and multi-agent technology, in which agents talk to agents. Pascal Brier, the company’s chief innovation officer, told CNBC in an interview that Capgemini expects such multi-agent systems to arrive by 2025. For example, a marketing-focused AI agent creating an ad campaign for an organization to run in Germany could autonomously work with another agent in the same organization’s legal department to make sure the campaign is legally sound.
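The message-passing pattern behind that example is simple to sketch in plain Python; no real agent framework or LLM is involved, and the agent names, banned-claims list, and revision logic below are invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class LegalAgent:
    """Toy reviewer: checks draft copy against a list of banned claims."""
    banned_claims: tuple = ("guaranteed results", "risk-free")

    def review(self, draft: str) -> tuple[bool, str]:
        for claim in self.banned_claims:
            if claim in draft.lower():
                return False, f"remove the phrase '{claim}'"
        return True, "approved"

@dataclass
class MarketingAgent:
    """Drafts campaign copy, then consults a peer agent before shipping."""
    legal: LegalAgent

    def create_campaign(self, product: str) -> str:
        draft = f"Try {product} today -- guaranteed results!"
        approved, feedback = self.legal.review(draft)   # agent-to-agent call
        if not approved:
            # Revise based on the legal agent's feedback, then recheck.
            draft = f"Try {product} today!"
            approved, feedback = self.legal.review(draft)
        return draft if approved else f"BLOCKED: {feedback}"

print(MarketingAgent(legal=LegalAgent()).create_campaign("WidgetPro"))
```

In a production multi-agent system, the drafting and review steps would be LLM calls mediated by an orchestration layer, but the shape of the collaboration, one agent soliciting and acting on another’s feedback, stays the same.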
U.S., UK, EU Regulators Sign Joint Statement On Effective AI Competition
The European Commission, the UK’s Competition and Markets Authority, the U.S. Department of Justice and the U.S. Federal Trade Commission on July 23 signed a joint statement setting out principles to protect consumers and safeguard against tactics that could undermine fair competition in the artificial intelligence sector.
EU AI Act Goes Into Effect August 1
The EU’s AI Act, which governs safe uses of AI while aiming to strengthen investment and innovation across EU countries, takes effect August 1, 2024. The regulation applies to organizations globally that provide, deploy, or distribute AI-based products, services, or other innovations available to EU citizens.
Bengaluru-Based Startup Achieves U.S. Approval For Its Microbial Protein
String Bio, a Bengaluru-based biotech company, announced that its microbial protein, PRO-DG, has achieved Generally Recognized as Safe (GRAS) status for use in crustacean feed in the U.S. GRAS status, governed by the U.S. Food and Drug Administration (FDA), sets a high standard for ingredients approved for use in both feed and food. String Bio’s PRO-DG contains approximately 70% protein derived from methanotrophic bacteria and is manufactured through the company’s patented String Integrated Methane Platform. The protein’s tolerability by shrimp has been demonstrated in peer-reviewed publications. With the global population projected to reach 10 billion by 2050, the demand for sustainable and traceable protein sources is critical; total aquaculture production is expected to expand to 109 million tons by 2030. String Bio says PRO-DG addresses this “protein challenge” by offering a sustainable feed ingredient for the growing aquaculture market while protecting marine ecosystems and resources. For more on String Bio, and how it is making protein from greenhouse gases, see The Innovator’s recent story by clicking here.
To access more of The Innovator’s News In Context stories click here.