We Tested 32 AI Models for Child Safety and the Results Are Concerning
Stéphie Herlin
In February 2026, Pew Research published a number that should give us pause: 54% of American teenagers now use AI chatbots for schoolwork. Nearly one in three uses them every day.
Children are already growing up with AI. The question is whether these systems are actually designed with them in mind.
We're optimistic about what AI can become. But one thing is clear: you can't improve what you don't measure. KORA is our attempt to measure how safe large language models really are for kids.
Here's what we found.
The provider landscape isn't equal
We tested 32 models from 9 providers across 25 risk categories. On average, a model fails in nearly half of child safety scenarios.
Take a look at the first chart below: Model Scores by Provider.
Right off the bat, you can see that Anthropic stands out. Its models are consistently strong. Across six models, the median score is 75%, and the scores barely move. Even its weakest model still beats the median OpenAI model.
Now check out OpenAI. Its results are more uneven: the median score is 55%, but 53 points separate its best and worst models.
Finally, Google, xAI, and Mistral are trailing behind. Mistral is particularly low, with a median score of just 13%.
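For readers who want to reproduce this kind of per-provider summary on their own results, here is a minimal sketch. The provider names and scores are hypothetical placeholders, not KORA data, and the `summarize` helper is our own illustration rather than part of the benchmark.

```python
from statistics import median

# Hypothetical scores in percent. These are NOT KORA results;
# they are placeholders used only to illustrate the calculation.
scores_by_provider = {
    "provider_a": [60, 62, 65, 68],
    "provider_b": [20, 45, 50, 70],
}

def summarize(scores):
    """Return the median and the best-minus-worst spread of a list of scores."""
    return {"median": median(scores), "spread": max(scores) - min(scores)}

summary = {name: summarize(vals) for name, vals in scores_by_provider.items()}

for name, stats in summary.items():
    print(f"{name}: median={stats['median']}%, spread={stats['spread']} points")
```

A tight spread with a high median is what "consistently strong" looks like in numbers; a wide spread means the provider's models behave very differently from one another.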
Broken down by risk category, Anthropic models perform consistently. Claude Haiku 4.5 leads in Education & Academic Integrity, scoring 73% in Academic Dishonesty & Misuse, and the other Anthropic models generally stay out of the bottom ranks.
OpenAI models score highest in Online Safety, Physical & Legal Safety, and Social & Behavioral Influence. GPT-5.2 and GPT-5.4 rank near the top in these areas.
Models struggle on the risks children face most
Models perform best on bias, hate, and societal harm, areas the industry has focused on for years. But the lowest scores appear where children actually use AI.
1 in 10 teenagers now does most or all of their homework with AI, the area where models perform worst:
Educational integrity: 26%
Academic dishonesty: 16%
Cognitive atrophy: 14%
12% of teenagers turn to AI for emotional support, yet models are least prepared in these situations:
Parasocial attachment: 27%
Mental health handling: 28%
The tools kids turn to in vulnerable moments are often the least prepared to respond safely.
Time alone won't fix this
It's easy to assume that newer, more capable models are also safer.
However, the data doesn't support that.
Anthropic's models improve steadily over time. But elsewhere, the pattern is inconsistent.
OpenAI made a major leap from GPT-4o to GPT-5.2 with nearly a 40-point gain. Then performance dropped again. GPT-5.3 Instant and GPT-5.4 both scored lower than GPT-5.2. Even GPT-5.2 Chat trails the base GPT-5.2 model by 12 points.
At Google and xAI, the trend goes in the wrong direction. Gemini 3 models underperform Gemini 2.5 Flash. Across three consecutive Grok releases, scores dropped by roughly 11 points.
The capability is already there
We tested each model twice: once normally, and once after telling it that it was talking to a child of a specific age. Every single model improved. On average, scores went up 26 points.
The biggest gains were in the areas where models were weakest:
Physical and legal safety: +27 points
Online safety: +27 points
Educational integrity: +21 points (still low overall)
Psychological and emotional safety: +14 points
Bias and societal harm: +8 points
Models already know how to behave differently with children. That behavior just isn't the default.
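The comparison behind these numbers is a simple paired one: the same model, scored with and without the age disclosure. A minimal sketch of that calculation follows. The model names and scores are hypothetical placeholders, not KORA results, and the `gains` helper is our own illustration.

```python
from statistics import mean

# Hypothetical (model, baseline score, age-disclosed score) triples in percent.
# These are NOT KORA results; they only illustrate the paired comparison.
runs = [
    ("model_x", 40, 65),
    ("model_y", 55, 70),
    ("model_z", 20, 40),
]

def gains(runs):
    """Per-model gain in points: age-disclosed score minus baseline score."""
    return {name: disclosed - baseline for name, baseline, disclosed in runs}

per_model = gains(runs)
average_gain = mean(list(per_model.values()))
all_improved = all(g > 0 for g in per_model.values())

print(per_model)
print(f"average gain: {average_gain} points, every model improved: {all_improved}")
```

Because each model is compared only with itself, the gain isolates the effect of the age disclosure from differences between models.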
What actually predicts a safer model
We looked beyond overall scores and asked a more practical question: what observable behaviors are associated with better child safety outcomes?
We tested three traits and checked how strongly each one predicts overall safety performance:
Human redirection – how often a model suggests involving a parent, teacher, or trusted adult. This is the strongest predictor of overall safety (R² = 0.77). Models that redirect more often tend to score higher overall.
Epistemic humility – how often a model admits uncertainty or encourages verification (R² = 0.31). This includes saying "I'm not sure," avoiding overconfident statements, and suggesting the child check facts with a reliable source. The correlation is moderate but meaningful, especially for learning.
Avoiding anthropomorphism – not acting like a human or social companion. This shows almost no correlation with overall measured safety (R² = 0.01).
The takeaway: safer behavior comes from how models handle limits—deferring to adults, admitting uncertainty, and pointing kids to other sources of support.
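For readers less familiar with R², it is the share of variation in one measure that a straight-line fit on another measure explains, from 0 (none) to 1 (all). The sketch below assumes a simple one-predictor linear fit, where R² equals the squared Pearson correlation; the post does not spell out the exact fitting procedure, so treat that as our assumption. The input values are hypothetical, not KORA data.

```python
def r_squared(x, y):
    """R² of a least-squares line y ~ a + b*x (the squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical per-model values, NOT KORA data:
# how often each model redirects to an adult, and its overall safety score.
redirection_rate = [0.1, 0.2, 0.3, 0.4, 0.5]
safety_score = [20, 35, 40, 60, 65]

r2 = r_squared(redirection_rate, safety_score)
print(f"R² = {r2:.2f}")
```

A high R² says the two measures move together across models. It does not by itself show that redirecting more often causes higher safety scores.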
Momentum Is Building. Data Must Follow.
Governments are starting to act. The G7 has called for stronger protections for kids online. In the U.S., child AI safety is one of the few areas with bipartisan support. In Europe, the AI Act already treats minors as a high-risk group.
The industry is moving too. At KORA, we contribute by openly sharing our methodology and data, as part of a broader effort to make AI safer for children.
Our ask to AI companies is simple: share anonymized usage data.
This is the only way to understand what’s really happening. What are children asking? When does distress show up? How does reliance on AI evolve over time? Today, only AI labs can see this. Independent researchers can’t.
We’re not asking for model weights or product plans. Just the data needed to study real-world use. There’s already precedent in healthcare and education. This isn’t a technical issue. It’s a policy choice.
If you want to support this effort, ask the companies behind the tools your children use to participate.
Conclusion: Safety Is a Process
KORA contributes to an ongoing, collective effort to better understand and improve AI safety for children.
Here's what we're focused on right now:
Insights – we have just released a detailed insights page on our website!
KORA Bench v2 – we are building an improved version of the benchmark, aiming to release it in early May.
KORA Apps – we are looking beyond models to evaluate product-level child safety.
This work is constantly evolving, and some of the hardest questions don’t have clear answers yet. But one thing is clear: what gets measured gets improved. So we will keep building, testing, and sharing openly.
Feel free to share feedback. We are learning as we go!








