In a recent investigation by The Atlantic, staff writer Alex Reisner found that **major technology companies** have used at least **15 million YouTube videos** as training data for their **AI video generation** products. This large-scale use of content has raised significant concerns about intellectual property rights and ethical practices in the rapidly evolving AI landscape.
The investigation highlights more than a dozen prominent training datasets compiled and used by companies such as **Microsoft**, **Meta**, **Snap**, **Tencent**, **Runway**, and **ByteDance**. These datasets have been instrumental in improving the quality of AI-generated video, showing how unauthorized use of **YouTube** content has fueled advances in the sector. Reisner draws an analogy: “Much as **ChatGPT** couldn’t write like **Shakespeare** without first ‘reading’ Shakespeare, a video generator couldn’t construct a fake newscast without ‘watching’ tons of recorded broadcasts.”
## Scope of Unauthorized Data Usage
The Atlantic’s reporting briefly notes that the training data includes more than **30,000** videos from the **BBC**, alongside hundreds of thousands from well-known news publishers and creators such as **The New York Times**, **The Washington Post**, **The Guardian**, **Al Jazeera**, and **The Wall Street Journal**. More than **88,000** videos were sourced from **Fox News**, roughly **70,000** from **ABC News**, and over **55,000** from **Bloomberg** channels.
Thousands more videos come from brands owned by **Vox Media**, including **Vox**, **Eater**, and **The Dodo**, which collectively account for over **30,000** videos. The **New York Times** alone accounts for at least **11,604** videos across the various datasets, with a significant portion appearing in the dataset behind **Runway’s Gen-3** model, which launched in June **2024** and was acclaimed for its capabilities.
Despite the extensive use of these videos, **YouTube** CEO **Neal Mohan** has reiterated that it is against the platform’s terms of service for third parties to download content for training purposes. **Lauren Starke**, a spokesperson for Vox Media, emphasized, “In order to survive, AI platforms know they need (and their consumers want) quality, credible content like ours that gives their products relevance and purpose.” Starke also noted that these companies have spent heavily on AI infrastructure but comparatively little on the content that fuels their models.
## The Legal Landscape and Implications
The investigation raises profound questions about copyright and licensing, especially as many news organizations have not authorized the use of their videos for AI training. The **New York Times** has stated that it has not sanctioned the use of its **YouTube** content for AI purposes, reinforcing its legal rights to determine how and where its content is used.
Additionally, partnerships between news outlets and AI companies are becoming more common, as seen with **Vox Media’s** deal with **OpenAI** that allows the latter to use its content for products like **ChatGPT**. Starke indicated that Vox Media is considering further partnerships while also preparing to protect its intellectual property through legal channels when necessary.
Furthermore, internal documents from **Runway**, published by **404 Media**, reveal that the company strategically targeted videos from high-quality channels for its datasets. A spreadsheet among those documents shows videos tagged for specific features, pointing to an organized, systematic approach to selecting content that would enhance AI training.
As the AI industry continues to advance, firms like **Runway** have already integrated their products into traditional media workflows, with companies such as **Netflix** and **Walt Disney Co.** using Runway’s tools for content production. However, the absence of any reported licensing agreements between **Runway** and the news publishers whose content was included remains a troubling aspect of this story.
As AI technology evolves rapidly, the need for clear guidelines and ethical frameworks around content sourcing becomes increasingly pressing. The implications of using copyrighted material without consent could have long-lasting effects on the landscape of journalism and the integrity of AI-generated content.