As AI systems like ChatGPT and Mistral-7B become increasingly embedded in our lives, keeping these models from generating or spreading harmful content is more urgent than ever. But what kinds of risks are most common in AI-generated conversations? And why is content moderation still so hard?
To answer these questions, I analyzed the Aegis AI Content Safety Dataset, a collection of about 11,000 conversations between humans and a large language model, all annotated by experts for a range of safety risks. By merging these annotations with a modern taxonomy of online harms, I set out to visualize and quantify the real risks that arise when people interact with powerful AI.
As Facebook whistleblower Frances Haugen told Business Insider in 2022, "Artificial intelligence is nowhere near good enough to address problems facing content moderation on Facebook," after revealing internal documents showing the limits of automated systems at Meta.
The Aegis AI Content Safety Dataset, curated by Nvidia, is one of the most comprehensive resources for studying AI safety in real-world settings. It's built on top of human preference data from Anthropic's HH-RLHF dataset, reusing only the prompts and generating new responses with Mistral-7B. The result: a trove of user prompts, LLM responses, and full dialogues, some benign, some deeply problematic.
Annotation was performed by a team of twelve annotators and two data quality leads, all based in the United States and from diverse backgrounds. Each sample was reviewed for a broad taxonomy of risks: hate speech, violence, harassment, privacy violations, criminal planning, sexual content, profanity, and more. Importantly, annotators could also flag content as "Other Risks" for ambiguous or emerging harms not covered by existing categories.
Quality assurance was rigorous: multiple annotators reviewed each sample, and leads regularly audited batches for consistency. Annotators were trained to self-correct and participated in regular group training to ensure guidelines were followed and edge cases were discussed. This process is designed to minimize bias and maximize the reliability of the resulting dataset[11].
Nearly half of flagged content doesn't fit standard risk categories, revealing a significant gap in current content moderation frameworks.
Nearly half of all flagged conversations were categorized as "Other Risks," highlighting the challenge of classifying harmful content.
The most striking finding: nearly half of all flagged conversations fell into the ambiguous "Other Risks" category, content that annotators found unsafe but that didn't fit neatly into any predefined box. This ambiguity poses a major challenge for both AI and human moderators, as it's much harder to train algorithms to detect risks that defy clear definition.
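The share itself is a simple proportion once conversation-level labels exist. Below is a minimal sketch on toy data, assuming a hypothetical table in which each flagged conversation carries the set of risk categories annotators assigned to it (the real Aegis fields differ; see the methodology notes at the end).

```python
import pandas as pd

# Hypothetical conversation-level table: one row per flagged conversation,
# with the set of risk categories annotators assigned to it.
flagged = pd.DataFrame({
    "conversation_id": [101, 102, 103, 104, 105],
    "categories": [
        {"Other Risks"},
        {"Profanity"},
        {"Other Risks", "Harassment"},
        {"Criminal Planning"},
        {"Other Risks"},
    ],
})

# Share of flagged conversations that include the ambiguous "Other Risks" label.
other_share = flagged["categories"].apply(lambda cats: "Other Risks" in cats).mean()
print(f"'Other Risks' share of flagged conversations: {other_share:.0%}")
```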
Both users and AI models contribute to unsafe interactions, complicating moderation efforts and highlighting the need for two-way safety systems.
The LLM generates profanity about twice as often as users, likely due to its exposure to profane language during training and its attempts to mimic conversational tone.
Policy-violating content in LLM conversations originates from both users and the AI, with some risks introduced directly by the model itself. For example, "Other Risks" was almost evenly split between user prompts and LLM responses. In categories like profanity, the LLM was flagged for introducing offensive language nearly twice as often as users.
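One way to attribute each flag to a side of the conversation is a simple cross-tabulation of risk category against the source of the flagged text. The sketch below illustrates the idea on toy data, with hypothetical `category` and `source` columns standing in for the actual Aegis fields.

```python
import pandas as pd

# Hypothetical flag-level records: one row per flagged turn in a conversation.
flags = pd.DataFrame({
    "category": ["Profanity", "Profanity", "Profanity",
                 "Other Risks", "Other Risks", "Other Risks"],
    "source":   ["llm", "llm", "user", "user", "llm", "user"],
})

# Cross-tabulate each risk category against who introduced it.
split = pd.crosstab(flags["category"], flags["source"])
print(split)

# Ratio of model-originated to user-originated flags per category.
print((split["llm"] / split["user"]).rename("llm_to_user_ratio"))
```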
While most unsafe content involves a single risk category, hundreds of conversations contain multiple overlapping risks, creating complex challenges for AI moderation systems.
Most flagged conversations involve a single risk, but a significant minority involve two or more, increasing moderation complexity.
While the majority of flagged conversations (over 9,000) involve just one type of risk, a significant minority (more than 1,400) contain multiple risk categories. These multi-risk conversations represent the most complex cases, requiring more sophisticated moderation approaches that can understand nuanced combinations of harmful content.
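Counting these overlaps is straightforward once the labels are merged: group flags by conversation and count distinct categories. A minimal sketch, again with placeholder field names:

```python
import pandas as pd

# Hypothetical flag-level records: one row per (conversation, risk category) pair.
flags = pd.DataFrame({
    "conversation_id": [1, 2, 2, 3, 4, 4, 4],
    "category": ["Violence", "Profanity", "Harassment", "Other Risks",
                 "Hate Speech", "Harassment", "Profanity"],
})

# Number of distinct risk categories per flagged conversation.
risks_per_convo = flags.groupby("conversation_id")["category"].nunique()

# Single-risk vs. multi-risk breakdown.
print(risks_per_convo.value_counts().sort_index())
print(f"Conversations with overlapping risks: {(risks_per_convo > 1).sum()}")
```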
Nearly half of all flagged content in LLM conversations is too ambiguous to fit a set category, complicating efforts to detect and moderate harmful material.
Ambiguous risks make up a large share of flagged content, highlighting the limits of current moderation taxonomies.
When a large share of unsafe content is ambiguous, it exposes the limits of automated moderation and the ongoing need for human oversight, clearer guidelines, and adaptive systems.
The findings from this dataset echo what many in the industry have warned: content moderation is not just a technical challenge, but a moving target. As LLMs become more capable, users find new ways to elicit harmful or policy-violating content, and new forms of risk emerge that don't fit existing categories. Even the best automated systems struggle with ambiguity, context, and cultural nuance. Human reviewers are essential, but the scale and speed of AI-generated content make manual review alone impractical.
Merging and maintaining high-quality datasets is itself a challenge. As noted in industry research, combining multiple sources requires careful attention to label consistency, data quality, and the risk of bias or duplication[12]. The Aegis dataset's rigorous annotation and audit process is a model for future work, but it also shows the immense effort required to keep up with evolving harms.
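Even simple automated audits help here. The sketch below shows the kind of duplicate and disagreement checks that merging work typically involves; the column names are hypothetical, and this is not the Aegis team's actual pipeline.

```python
import pandas as pd

def basic_merge_checks(df: pd.DataFrame, text_col: str, label_cols: list[str]) -> None:
    """Cheap audits to run before merging annotation sources."""
    # Exact duplicate texts inflate category counts and can leak across splits.
    dup_count = df.duplicated(subset=[text_col]).sum()
    print(f"Exact duplicate texts: {dup_count}")

    # Samples where annotators assigned different labels need adjudication,
    # not just a silent majority vote.
    disagreements = (df[label_cols].nunique(axis=1) > 1).sum()
    print(f"Samples with annotator disagreement: {disagreements}")

# Hypothetical usage with placeholder column names:
# basic_merge_checks(samples, text_col="text", label_cols=["labels_0", "labels_1", "labels_2"])
```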
The Aegis dataset offers a rare, transparent look at the real risks of AI-generated conversations. While most unsafe content falls into a handful of well-known categories, a substantial share remains ambiguous, defying easy classification. Both users and language models contribute to the problem, and overlapping risks make moderation even more complex.
As AI continues to shape the digital world, understanding and visualizing these risks is a crucial step toward building safer, more trustworthy systems. The challenge now is to ensure our moderation tools, policies, and taxonomies keep pace with the evolving landscape of online harm. Collaboration between researchers, industry, and civil society will be essential to meet this challenge and to ensure that AI serves the public good.
This analysis explores patterns in AI safety risks by examining the Aegis AI Content Safety Dataset, a collection of approximately 11,000 human-LLM conversations annotated by Nvidia for various harmful content categories. The Aegis dataset itself builds upon Anthropic's HH-RLHF dataset prompts but generates new responses with Mistral-7B and includes expert annotations.
My analysis goes deeper into this data through several technical steps: merging the per-annotator labels into conversation-level risk categories, tallying how often each category appears, splitting flags by whether they originate in user prompts or LLM responses, and counting conversations that carry overlapping risks.
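To make the first of those steps concrete, here is a minimal sketch using the Hugging Face `datasets` library. The dataset identifier, field names, and label strings are placeholders based on my reading of the public dataset card rather than a confirmed schema, so adjust them before running.

```python
from collections import Counter
from datasets import load_dataset

# Dataset id and field names below are assumptions; check the dataset card for the exact schema.
ds = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-1.0", split="train")

# Map the annotators' fine-grained labels onto the taxonomy used in this piece.
TAXONOMY = {
    "Hate/Identity Hate": "Hate Speech",
    "Violence": "Violence",
    "Profanity": "Profanity",
    "Needs Caution": "Other Risks",
    # ...remaining labels omitted for brevity.
}

category_counts = Counter()
for row in ds:
    raw = row.get("labels_0") or []            # placeholder annotator-label field
    labels = [raw] if isinstance(raw, str) else list(raw)
    for label in labels:
        # Labels missing from the abbreviated map above pass through unchanged.
        category_counts[TAXONOMY.get(label, label)] += 1

print(category_counts.most_common())
```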
All visualizations were created as responsive SVGs using ai2html to ensure proper display across desktop and mobile devices. I developed separate landscape and portrait versions of each chart to optimize the user experience on different screen sizes.
Limitations: The dataset contains conversations with a specific model (Mistral-7B) and may not represent behavior across all LLMs. The annotations reflect judgments from Nvidia's team of U.S.-based annotators and may not capture cultural nuances across global contexts. The "Other Risks" category lacks detailed sub-categorization, limiting deeper analysis of these ambiguous harms.