By now, you’ve probably heard about it (or maybe even tried it yourself): asking ChatGPT for therapy-like advice. Since 2022, large language models (LLMs) have quietly become the internet’s most popular mental health tool. They’re fast, fluent, and always available. But they come with zero clinical oversight and minimal safety guardrails. These models can reinforce stigma, miss nuance, and encourage harm. And they’re doing it at scale.
Meanwhile, the digital tools we trust – safe, evidence-based programs – are falling behind. Many rely on outdated technology: digitized workbooks, rule-based chatbots, and poor UX design. They’re safe, but the user experience pales in comparison to what people are used to, so they drop out before they feel better. It’s a dangerous paradox: the safe tools are boring, but the engaging ones aren’t safe.
Companies are starting to pay the price. Pear Therapeutics went all-in on FDA approval – only to go bankrupt when adoption lagged and reimbursement failed to materialize. Big Health’s DaylightRx has the FDA’s stamp of approval, but it’s built on outdated tech. And Woebot is shelving its rule-based chatbot. This is an industry wake-up call. The digital mental health model isn’t just due for a shift; it’s overdue for a reckoning.
As Bridget Van Kralingen, former Senior Vice President of IBM Global Sales and Markets, once said: “The last best experience that anyone has anywhere becomes the minimum expectation for the experience they want everywhere.” If we want responsible technology to win in mental health, it needs to feel as modern as what’s already out there. That means rethinking how we build, evaluate, and scale these tools – so they’re not just safe, but smart, engaging, and continuously improving.
We’re using pharma-era tools to evaluate software-age solutions.
The randomized controlled trial (RCT) – the gold standard for clinical validation – was built for static products. Fixed interventions, long timelines, and high price tags may work for pills, but not for code – and especially not for tech start-ups with limited runway.
Run an RCT on a mental health app, and you’re frozen in time. Making any update to the app risks invalidating the results or triggering a long, costly re-approval process. So, most companies stop iterating and start selling. Innovation stalls.
The problem with this system is that it rewards stasis, discourages personalization, and punishes exactly what software does best: improving fast.
To be clear, RCTs remain essential. They’re unmatched in isolating active ingredients, determining clinical dose, and uncovering mechanisms of action. And active control arms are critical for understanding why something works, not just if it does. We shouldn’t abandon RCTs; rather, we should use them strategically, when and where they add the most value.
A recent report by the Peterson Health Technology Institute (PHTI) confirmed that digital Cognitive Behavioral Therapy (CBT) outperforms waitlists and delivers a solid ROI. Encouraging, yes – but few studies in the report explored what actually drives outcomes, especially when compared to active controls. And notably, the same report excluded AI-enabled tools entirely – highlighting how quickly our evidence base has fallen behind the pace of technology.
If the goal is to optimize real-world safety and effectiveness – when delivering previously validated clinical protocols – we’ve got to do things differently. Especially when AI is involved.
AI-delivered mental health tools promise more personalization, reach, and engagement. But they also introduce novel risks like hallucinations, emotional misattunement, and bias. These aren’t static threats; they evolve dynamically with each model. A single safety trial isn’t enough. And repeating it every time the LLM updates? Unsustainable.
We need to evolve our evaluation model. One where evidence is generated continuously, not just once. One that rewards fidelity to proven protocols, even as the technology changes. One where clinical safety isn’t a box to tick – it’s an ongoing commitment.
This is evidence in motion.
This isn’t about lowering standards. It’s about redistributing rigor – embedding it throughout the product lifecycle and learning in the wild. Just like clinicians do.
The UK’s NHS Talking Therapies program already does this, tracking outcomes to drive quality. In the US, Reliant’s Precision Behavioral Health model has created an ecosystem for implementing digital care pathways, collecting real-world data to route patients and optimize engagement. While Reliant’s system hasn’t been designed for product iteration, it offers exactly the kind of clinician-guided infrastructure needed to enable safe, adaptive innovation at scale.
Regulatory frameworks are beginning to shift, but they still cling to static validation. Germany’s DiGA Fast-Track offers provisional reimbursement while collecting real-world data – but expects RCT-level evidence. The UK’s Early Value Assessment supports real-world evaluation with a broad definition of value – but without reimbursement. The FDA’s Predetermined Change Control Plan enables updates post-approval – but doesn’t enable innovations that emerge only in the wild. (Yet, a recent article by FDA leadership indicates the agency is making AI regulation a priority. Watch this space!)
None of them fully account for the messy, probabilistic nature of generative AI. They don’t yet accommodate systems that learn as they go.
In healthcare, “move fast and break things” is unacceptable when people’s lives are at stake. Yet, there’s an opportunity cost to standing still. As digital mental health tools become more dynamic, oversight must, too. That means shifting from static validation to real-time monitoring: ongoing, probabilistic, proactive. It also means treating safety not as a milestone, but a discipline.
We need the right infrastructure – technological, clinical, financial, regulatory – to support this shift. Because when science fuels product iteration, it doesn't slow innovation. It powers it.
In the AI era, enabling this new way of working demands new guiding principles. At ieso, we’ve been thinking about this – a lot. Our low-risk digital program for symptoms of anxiety and depression, Velora, is built using a continuous evaluation model grounded in five key principles.
With generative AI, risks are rare but potentially serious. Rigorous risk management and validation are essential – even with low-risk software. For example, every product update we make goes through large-scale in silico testing using artificial “patient” agents before release. We only ship updates that meet strict thresholds for safety, clinical fidelity, and quality. Iteration is rapid, evidence-informed, and always reversible.
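To make that concrete, here is a minimal sketch of what such a release gate could look like. The agent presentations, scoring function, and threshold values are hypothetical stand-ins for illustration, not a description of ieso’s actual pipeline:

```python
"""Minimal sketch of an in silico release gate. All names, agents, and
threshold values are hypothetical stand-ins, not ieso's actual pipeline."""
from dataclasses import dataclass
import random


@dataclass
class PatientAgent:
    """Artificial 'patient' with a scripted presentation used to probe the build."""
    presentation: str  # e.g. "mild anxiety", "low mood with sleep problems"


def simulate_conversation(agent: PatientAgent, build: str) -> dict:
    """Placeholder: run the candidate build against one synthetic patient and
    return automated ratings. A real harness would score full transcripts with
    trained evaluators rather than seeded random numbers."""
    random.seed(hash((agent.presentation, build)) % (2 ** 32))
    return {
        "safety": random.uniform(0.97, 1.0),            # no unsafe responses
        "clinical_fidelity": random.uniform(0.85, 1.0),  # adherence to protocol
        "quality": random.uniform(0.75, 1.0),            # conversational quality
    }


# Hypothetical release thresholds: every metric must clear its bar on average.
THRESHOLDS = {"safety": 0.98, "clinical_fidelity": 0.90, "quality": 0.80}


def release_gate(build: str, agents: list[PatientAgent]) -> bool:
    """Ship only if the synthetic cohort clears every threshold."""
    results = [simulate_conversation(a, build) for a in agents]
    means = {k: sum(r[k] for r in results) / len(results) for k in THRESHOLDS}
    return all(means[k] >= bar for k, bar in THRESHOLDS.items())


if __name__ == "__main__":
    cohort = [PatientAgent(p) for p in ("mild anxiety", "low mood", "panic symptoms")] * 100
    print("Ship candidate build:", release_gate("candidate-build", cohort))
```

The key idea is that a build ships only when aggregate scores across a large synthetic cohort clear every bar – and because the check is automated, it can run on every update.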
AI shouldn’t operate without oversight and active human involvement – at least not yet. Using clinician-trained supervision agents to continuously monitor both user input and AI output allows us to proactively spot potential problems and react quickly. Clinicians review flagged responses, with clear escalation paths.
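As an illustration, a supervision layer of this kind might look like the sketch below. The classifiers and escalation queue are placeholders standing in for clinician-trained supervision agents and real escalation workflows:

```python
"""Minimal sketch of a human-in-the-loop supervision layer. The classifiers
and escalation queue are hypothetical placeholders for clinician-trained
supervision agents and real escalation workflows."""
from dataclasses import dataclass
from queue import Queue


@dataclass
class Turn:
    user_message: str
    ai_response: str


@dataclass
class Flag:
    turn: Turn
    reason: str
    severity: str  # "escalate" -> immediate pathway, "review" -> async clinician review


clinician_queue: "Queue[Flag]" = Queue()


def risk_classifier(text: str) -> bool:
    """Stand-in for a trained model that detects risk signals in user input."""
    return any(phrase in text.lower() for phrase in ("hopeless", "can't go on"))


def fidelity_classifier(text: str) -> bool:
    """Stand-in for a model that checks the AI's output stays within protocol,
    e.g. the product must never offer a diagnosis."""
    return "diagnose" not in text.lower()


def supervise(turn: Turn) -> None:
    """Check both sides of the exchange and route problems to clinicians."""
    if risk_classifier(turn.user_message):
        clinician_queue.put(Flag(turn, "possible risk in user input", "escalate"))
    if not fidelity_classifier(turn.ai_response):
        clinician_queue.put(Flag(turn, "AI response outside protocol", "review"))


supervise(Turn("I feel hopeless lately", "Thank you for telling me. That sounds really hard."))
print(f"{clinician_queue.qsize()} item(s) routed to clinicians")
```

In a design like this, flags for user risk route to an immediate escalation path, while fidelity issues queue for asynchronous clinician review.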
To generate useful data, tools need real users. That’s why implementation pilots built for learning are essential. Built-in measures like PHQ-9 and GAD-7 allow us to track outcomes directly in product workflows, so we’re optimizing for impact – not attention. Value-based care incentives can help fund pilots and ensure tools evolve while creating commercial viability for developers.
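For illustration, the scoring itself is simple enough to live inside the product flow. The PHQ-9 and GAD-7 scoring rules below are the published ones; how a product like Velora captures and stores responses is assumed here:

```python
"""Sketch of embedding outcome measures in a product workflow. The scoring
rules for PHQ-9 and GAD-7 are standard; the workflow around them is assumed."""

# Published severity bands: PHQ-9 totals range 0-27, GAD-7 totals range 0-21.
PHQ9_BANDS = [(4, "minimal"), (9, "mild"), (14, "moderate"),
              (19, "moderately severe"), (27, "severe")]
GAD7_BANDS = [(4, "minimal"), (9, "mild"), (14, "moderate"), (21, "severe")]


def score(items: list[int], n_items: int, bands: list[tuple[int, str]]) -> tuple[int, str]:
    """Sum item responses (each scored 0-3) and map the total to a severity band."""
    if len(items) != n_items or not all(0 <= i <= 3 for i in items):
        raise ValueError("incomplete or out-of-range responses")
    total = sum(items)
    label = next(name for upper, name in bands if total <= upper)
    return total, label


# Example: responses captured at the end of a session, not in a separate survey.
phq9 = score([1, 2, 1, 1, 0, 1, 2, 1, 0], 9, PHQ9_BANDS)   # -> (9, 'mild')
gad7 = score([2, 2, 1, 1, 2, 1, 1], 7, GAD7_BANDS)          # -> (10, 'moderate')
print(phq9, gad7)
```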
The best clinicians refine their practice over time. Digital tools can mirror this – but at software speed. Where one clinician might learn from one patient, one product update can learn from thousands. Every iteration reveals what works, for whom, and why. Scientists and clinicians working together can pave the way for the future of precision mental health.
None of this works without trust. And trust starts with transparency. That means clear communication about what a tool can and can’t do, how data is used, and where to go for more support. It also means designing with, not just for, people with lived experience. When tools are built this way, they empower users to make informed decisions that are right for them.
We’re at a crossroads in digital mental healthcare. Stick with outdated models, and innovation will grow outside the system – without safeguards. Embrace continuous validation, and we can guide it – safely, responsibly, impactfully.
The best digital tools won’t just prove they work once. They’ll prove they keep working, as they improve. And so, a call to action:
Digital mental health doesn’t need to choose between safety and innovation.
But it does need to move.