Nvidia is facing sharp criticism after it emerged that the company asked its employees to scrape data from Netflix, YouTube and other online sources to train its various AI products.
The revelations came in a report published by 404 Media, which relied on Slack chats and company emails, as well as contact with a few unnamed employees, to confirm its findings.
Nvidia helped itself to “a human lifetime visual experience worth of training data per day,” Ming-Yu Liu, vice president of Research at Nvidia and a Cosmos project leader, admitted in a May email.
When employees asked about the legality of the project, named Cosmos, they were assured that it had been cleared at the highest levels of the company.
The project sought to replicate a foundation model, akin to Gemini 1.5, GPT-4, or Llama 3.1, “that encapsulates simulation of light transport, physics, and intelligence in one place to unlock various downstream applications critical to Nvidia.”
As part of the project, machine learning and an open-source video downloader were used, which got around YouTube’s attempts to block the downloads.
According to emails viewed by 404 Media, managers discussed using 30 virtual machines running on Amazon Web Services to download 80 years’ worth of full-length content every day.
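To put those figures in perspective, here is a rough back-of-the-envelope calculation in Python. It is only a sketch: it assumes a 365-day year and an even split of the work across the 30 virtual machines, neither of which is stated in the emails, and the function names are illustrative.

```python
# Rough scale estimate based on the figures reported in the emails.
# Assumptions (not from the report): 365-day years, load split evenly across VMs.
YEARS_OF_VIDEO_PER_DAY = 80
VM_COUNT = 30
DAYS_PER_YEAR = 365
HOURS_PER_DAY = 24


def video_hours_per_day(years: int = YEARS_OF_VIDEO_PER_DAY) -> int:
    """Hours of video downloaded in one calendar day."""
    return years * DAYS_PER_YEAR * HOURS_PER_DAY


def per_vm_realtime_multiple(years: int = YEARS_OF_VIDEO_PER_DAY,
                             vms: int = VM_COUNT) -> float:
    """How many hours of video each VM must fetch per wall-clock hour."""
    return video_hours_per_day(years) / vms / HOURS_PER_DAY


if __name__ == "__main__":
    print(f"Video hours per day: {video_hours_per_day():,}")
    print(f"Per-VM multiple of real time: {per_vm_realtime_multiple():.0f}x")
```

Under these assumptions, the target works out to about 700,800 hours of video a day, meaning each virtual machine would need to pull down roughly 970 hours of footage for every hour it runs.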
For its part, Nvidia claims no wrongdoing. “We respect the rights of all content creators and are confident that our models and our research efforts are in full compliance with the letter and the spirit of copyright law,” an Nvidia spokesperson told the news outlet via email.
“Copyright law protects particular expressions but not facts, ideas, data, or information. Anyone is free to learn facts, ideas, data, or information from another source and use it to make their own expressions. Fair use also protects the ability to use a work for a transformative purpose, such as model training.”