OpenAI reportedly trained GPT-4 on YouTube videos; Google allegedly used similar methods

OpenAI is alleged to have transcribed over a million hours of YouTube content to feed its advanced language model, GPT-4
The image shows OpenAI logo on a phone screen. — Pexels

A recent investigation by The New York Times suggests that leading AI firms, including OpenAI, Google, and Meta, may have used controversial methods to gather high-quality data for training their AI models. According to the report, OpenAI is alleged to have transcribed over a million hours of YouTube content to feed its advanced language model, GPT-4.

OpenAI reportedly developed a specialised audio transcription tool, Whisper, facilitating the extraction of data from YouTube videos. Despite potential ethical concerns, OpenAI allegedly proceeded with this approach, considering it to fall within the bounds of fair use. Google, which owns YouTube, is also implicated in similar data acquisition strategies for its AI projects, potentially infringing upon creator copyrights.

Corroborating this, The Information also reported on OpenAI's supposed use of YouTube and podcast content for training its AI systems, with company president Greg Brockman reportedly involved in the process.

YouTube CEO Neal Mohan, in a Bloomberg interview, stated that downloading transcripts or video segments violates YouTube's terms of service. However, his response remained non-committal when asked directly about OpenAI's use of YouTube data.

Further complicating matters, the report suggests that some people within Google were aware of OpenAI's transcription practices but were constrained in their response because Google was engaging in comparable activities for its own AI development. Google told The New York Times that it only scrapes video data with the express permission of content creators.

Moreover, the report mentions that in June 2023, Google purportedly adjusted its privacy policy to allow more extensive use of publicly available materials, like Google Docs and Google Maps reviews, for enhancing its AI products.