What happens to LLMs when their underlying data is no longer readily available?

Aug 16

A short background on data usage for LLMs helps us to look at the future of online content.

1. Data is the foundation of today’s emerging generative A.I. systems, which are trained on billions of examples of text, videos, and images. Much of that data is taken (or "scraped") from public-facing websites by researchers and amassed in large data sets, which can be downloaded and freely used or combined with data from additional sources.

2. For many years, A.I. developers such as OpenAI were able to collect data fairly easily. But the generative A.I. boom of the past few years has created tensions with the owners of that underlying data, many of whom have concerns about their content being used for A.I. training, or at least want to be compensated for it. Otherwise, the value of their information could be rendered minimal.

3. According to a study published this week by the Data Provenance Initiative, a Massachusetts Institute of Technology-led research group, many of the leading web sources of data used for training A.I. models have restricted the use of their data over the past year.

4. The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

5. Popular websites such as Reddit and Stack Overflow began charging A.I. companies for access to their data as a way to monetize their content.

6. Other publishers have taken a different route, including legal action. The New York Times, for example, sued OpenAI and Microsoft for copyright infringement last year, alleging that the companies had used its news articles to train their models without explicit permission.

What is the future of the relationship between LLM providers such as Anthropic, Cohere, and others and the owners of the content their models rely on?

- Will this result in LLM companies needing to generate their own content?

- Is this the beginning of more partnerships between content creators and AI infrastructure companies? Founders such as Matthew Mirman, PhD, at Anarchy Labs are creating exciting new tools such as chat.dev, which enables you to interact with and identify information from any website.

- Will new models for content creation emerge? It is exciting to see new communities form, such as Cara Project by Jingna Zhang, a social media and portfolio platform for artists and art enthusiasts that filters out generative AI images to focus on authentic creatives.

- How will edge computing and IoT impact and potentially create more expansive types of data? Lemurian Labs and Jay Dawani are working to make AI compute more affordable, accessible, and efficient.

Mitchell Kominsky
