Hidden horrors of AI: Researchers find child abuse material in image generation dataset

Researchers from the University of Washington and the Allen Institute for AI found that the dataset contains images depicting the sexual abuse of children
A representative picture of a sad child. — Canva

Researchers have discovered that a large dataset used to train AI image generation models contains child sexual abuse material (CSAM) among its more than five billion images. The dataset, called LAION-5B, was assembled by the non-profit organization LAION and has been widely used by the AI community, most notably to train the Stable Diffusion image generator.

A team of researchers from the University of Washington and the Allen Institute for AI found that the dataset contains images depicting the sexual abuse of children, alongside other illegal and harmful content.

How the dataset was created and used

LAION-5B consists of links to images scraped from across the internet, paired with the text found alongside them, and was assembled with little human filtering or moderation. It was created to train AI models that can generate realistic and diverse images from text descriptions, such as Stable Diffusion, which in turn powers services like Civitai, a controversial platform that allows users to create images of almost anything they want.
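LAION-5B is distributed as metadata (image links and captions) rather than the image files themselves; anyone training a model downloads the images separately. The sketch below is a hypothetical illustration, not LAION's actual schema or tooling, of what such a record looks like and how an ingestion step that applies no checks keeps whatever the web crawl happened to find.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class ScrapedRecord:
    """One entry in a hypothetical web-scraped image-text dataset.

    Datasets of this kind usually ship only metadata: a link to an image
    somewhere on the web plus the alt-text scraped alongside it.
    """
    url: str                 # where the image lives; the dataset does not host it
    caption: str             # alt-text or surrounding text from the page
    nsfw_score: float = 0.0  # illustrative automated safety score, if present

def naive_ingest(records: Iterable[ScrapedRecord]) -> List[ScrapedRecord]:
    """The 'no filtering or moderation' case: keep everything the crawler
    found, including any illegal or harmful material it encountered."""
    return list(records)
```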

The dataset was released in 2022 and has since been downloaded by researchers and developers around the world.


How the CSAM was detected

The team that discovered the CSAM used a tool called PhotoDNA, a technology developed by Microsoft to identify and report known child abuse imagery. The tool computes a robust hash, essentially a compact fingerprint, of each image and compares it against a database of hashes of previously identified CSAM, flagging any matches.
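PhotoDNA itself is proprietary and available only to vetted organizations, so the sketch below uses the open-source imagehash library as a stand-in to illustrate the general mechanism: each image is reduced to a perceptual hash, and that fingerprint, never the image itself, is compared against a list of hashes of previously identified material. The example hash value and distance threshold are made up for illustration.

```python
# Illustrative stand-in for hash-based detection. PhotoDNA is proprietary;
# this sketch uses the open-source `imagehash` library
# (pip install imagehash pillow) to show the general idea only.
from PIL import Image
import imagehash

# Hypothetical database of perceptual hashes of known flagged images, as would
# be supplied by a clearinghouse such as NCMEC. The hex string here is made up.
KNOWN_BAD_HASHES = {imagehash.hex_to_hash("ffd8b16e36c4a2d1")}

MAX_HAMMING_DISTANCE = 5  # tolerance for re-encoding, resizing, small edits

def is_flagged(image_path: str) -> bool:
    """Return True if the image's perceptual hash is close to a known-bad hash."""
    candidate = imagehash.phash(Image.open(image_path))   # 64-bit perceptual hash
    return any(candidate - known <= MAX_HAMMING_DISTANCE  # '-' is Hamming distance
               for known in KNOWN_BAD_HASHES)
```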

The researchers found that the dataset contains at least 1,000 CSAM images, as well as other images that depict violence, gore, hate symbols, and nudity.

The researchers reported their findings to the creators of the dataset, as well as to the National Center for Missing and Exploited Children (NCMEC), which is a non-profit organization that works with law enforcement agencies to combat child exploitation. 

The researchers also contacted the platforms that host the dataset, such as Google Drive and GitHub, and asked them to remove it.

Implications and challenges

The discovery of CSAM in the dataset raises serious ethical and legal issues for the AI community, as well as for the platforms that host and use the dataset. 

The researchers warn that the dataset could expose users and developers to legal liability and cause further harm to the victims of the abuse. Moreover, AI models trained on the dataset could be used to create and disseminate new CSAM and other harmful content.

However, the researchers also acknowledge the challenges of creating and maintaining large-scale image datasets that are free of harmful content. They suggest that the AI community should adopt more rigorous standards and practices for data collection, curation, and moderation, as well as for data sharing and usage. 
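As a rough illustration of what such curation practices could look like in code, the hypothetical pipeline below chains several collection-time checks (a URL blocklist, a caption keyword screen, and a hash-matching step like the one sketched earlier) and keeps only records that pass all of them. None of this reflects any particular group's actual tooling.

```python
from typing import Callable, Iterable, List, Set

# A check inspects one metadata record (url + caption) and returns True to keep it.
Record = dict
Check = Callable[[Record], bool]

def url_not_blocklisted(blocklist: Set[str]) -> Check:
    return lambda record: record["url"] not in blocklist

def caption_passes_keyword_screen(banned_terms: Set[str]) -> Check:
    return lambda record: not any(term in record["caption"].lower()
                                  for term in banned_terms)

def image_not_hash_matched(hash_check: Callable[[str], bool]) -> Check:
    # hash_check would fetch the image and query a hash database,
    # e.g. wrapping a matcher like the is_flagged() sketch above.
    return lambda record: not hash_check(record["url"])

def curate(records: Iterable[Record], checks: List[Check]) -> List[Record]:
    """Keep only records that pass every configured safety check."""
    return [r for r in records if all(check(r) for check in checks)]

# Example usage (all inputs hypothetical):
#   clean = curate(raw_records, [url_not_blocklisted(blocklist),
#                                caption_passes_keyword_screen(terms),
#                                image_not_hash_matched(is_flagged)])
```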

They also call for more collaboration and coordination among researchers, developers, platforms, and law enforcement agencies to prevent and combat the spread of CSAM and other illegal and harmful content in AI datasets and applications.