How Product Builders are Using KORA to Make Their Apps Safer

Stéphie Herlin

Apr 23, 2026

We shipped KORA two months ago with a simple premise: you can't improve what you can’t measure. There was no safety benchmark for child-facing conversational AI, so we built one with over 15 child safety experts, psychologists, and researchers, and put it on GitHub.

Two hundred institutions now follow our work from Edtech companies to Governments or AI labs, including UNICEF, Stanford University, the Oxford Internet Institute, ROOST, MagicSchool or the Center for Humane Technologies.

One of the most encouraging surprises has been the traction we've seen with product builders. Even teams that had already invested meaningfully in safety were still struggling to answer this simple question: how do I actually know my app is safe?

Why builders use KORA

Labs have internal red teams. Most product builders don't. KORA fills that gap, and three things make it work for them:

It is open-source: they can read the methodology, run it themselves, and adapt it to their endpoint.

It is research-backed: every risk category was built with child safety experts, not engineered to produce good results.

And it is independent: we have no commercial interest in what the benchmark finds. The results are public. The code is public. If a model fails, it shows. And for developers who build for children, being able to show that evidence is what separates a claim from a commitment.

What builders found

Builders have told us that the most common blind spot they find is developmental risk (harms to the child's cognitive, autonomy, and embodied development). Content filters don't catch it and internal test suites don't look for it. It shows up in subtle patterns: an AI that answers instead of guiding, vocabulary too advanced for the child's age, responses that turn open questions into fixed answers.

MyDD.ai ran KORA after investing heavily in safety. Their overall score was 88.3%. Sexual content, online safety, and bias all came in above 95%. Developmental risk came in at 61.9%. "This was a blind spot," wrote Jake Rozran, MyDD.ai's CEO. "We hadn't systematically stress-tested the ways an AI might shape a child's development through thousands of small interactions."

Pebble had their own risk taxonomy. What they were missing was a way to test it. Their KORA results also pushed them to switch model providers. "KORA gave us the data to support the decision," their team shared. They iterated on their system prompt and re-ran the benchmark to validate the changes.

Skolon is now integrating KORA scores directly into its app. When users choose a model, they can see its KORA score right there, along with a link to more detailed results on our website:

How to run it

KORA is open-source and free, and integrating it into your app should take less than an hour.

To do so, you can clone the repo and connect API keys for the models KORA uses internally: one to simulate the child and generate conversations, one to evaluate the conversations. If you'd rather skip that setup, reach out at hello@korabench.ai.

KORA runs two configurations. The first tests the base model with no system prompt. The second adds a child-aware prompt specifying the age range. Comparing the two shows how much a simple age context shifts behaviour. To test your actual product, edit the customModel.ts file in the repo to point KORA at your own API endpoint.

Once you have results, you can find where performance falls short, make changes, re-run. KORA gives you a breakdown by risk so you can see exactly where things shifted.

What KORA doesn't cover

KORA tests conversations today, but it does not capture everything that happens around the model at the product level (for example content filters applied after the response, age routing logic, human escalation, parental controls, etc.). That said, this is something we are actively exploring. KORA is also not a measure of regulatory compliance or a certification. Finally, the current version focuses on content safety, and more research is still needed to understand the causal links between model outputs and real-world harms, especially over time. All of these limitations are detailed on our website.

Try it

The code is on GitHub. Results for most frontier models are already on korabench.ai. If you find something we missed, tell us. That's how the benchmark gets better!

KORA's Substack

Discussion about this post

Ready for more?