James Padolsey's Blog

2026-05-06

AI Safety is theatre

The AI Safety and Alignment communities are prolific. They are well funded and produce enormous volumes of research, evaluation methodology, governance frameworks, fellowship cohorts, conference programmes, and lobbying. What they have not produced, in any serious quantity, is deployed safety — the runtime infrastructure that sits between an AI and a user and prevents harm or intervenes while it is occurring. That layer is being built by a handful of small companies on a small fraction of the field's funding, while the suicides, the psychotic breaks, and the school shootings accumulate in exactly the part of the system the industry decided was beneath its attention.

What I find curious: It's the exact inverse of what you find in any other safety-oriented industry, mature or nascent. Automotive safety is overwhelmingly deployed engineering — airbags, crumple zones, ABS, electronic stability control. Research and theory are a small fraction of the spend. Aviation safety is dominated by air traffic control, certified maintenance, and equipment that has to pass inspection before it flies. Workplace safety is harnesses, machine guards, fall arrest, the kind of unglamorous engineering that happens on construction sites and factory floors. Healthcare patient safety is sterilisation, surgical checklists, ICU monitoring, infection control protocols. In each of these, deployed engineering forms the wide base of the pyramid and theory sits at the narrow apex, where it belongs.

In AI, the pyramid is upside down. The fellowship and conference economy alone is funded at several times the size of the entire deployed-safety vendor category. The apex is fatter than the base. There is no other field where we'd consider this serious safety work — we'd call it an academic discipline pretending to be an industry.

This raises some uncomfortable questions about where the funding is coming from and whose interests it serves. I'll leave that to the journalists, but it's plain enough to see that the focal point of the field is an imagined monster on the horizon rather than the very real one already in the room. Children have died using these products. So have adults. The engineers who built those products work at the same companies that fund most of the philosophical work on hypothetical future risks. Make of that what you will.

In February 2026, an 18-year-old named Jesse Van Rootselaar walked into a school in Tumbler Ridge, British Columbia and killed eight people, including her own mother, her half-brother, and five children. She had been using ChatGPT for months. Eight months earlier, in June 2025, OpenAI's automated systems had flagged her account for gun violence planning. Around twelve employees on the safety team reviewed the account and recommended that the company notify the RCMP. Leadership overruled them and just deactivated the account. She opened a second account that evaded detection. Sam Altman publicly apologised in April 2026 for not contacting law enforcement. Tim Marple, formerly of OpenAI's threat-spotting division, told reporters there were only two words for what happened: incompetence and greed.

The scale of these issues is not speculation. In October 2025, OpenAI itself disclosed that around 0.07% of its weekly users — roughly 560,000 people — show signs of psychosis or mania, and around 0.15% — roughly 1.2 million — show signs of suicidal planning. The company publishing the chatbot is publishing the harm rate. They know. They're shipping anyway.

Anthropic's fixation on extinction and model alignment

Anthropic, the sweetheart of AI safety in Silicon Valley, has a safety team mostly working on catastrophic species-level threats and, secondarily, on "model-level safety": making individual LLM outputs safer by tuning the model itself, manipulating preference data, intervening in the embedding space. Possibly important work. What it isn't is system-level safety, the category that wraps the model in actual operational protections — input filtering, output gating, conversation-level monitoring, escalation, intervention. Their best minds are not pointed at the practical problem of preventing harm to actual users in actual deployments. Safety, just as in social media before it, is treated as a cost centre, motivated by lawsuits and reputation rather than something to be sold or shipped or made central to the product.

You can see this in their behaviour. In autumn 2025, Anthropic published version 3 of its Responsible Scaling Policy and quietly dropped the original 2023 commitment that had been the company's signature pledge: to never train an AI system unless it could guarantee in advance that safety measures were adequate. Their own Chief Science Officer, Jared Kaplan, told TIME that it didn't make sense to maintain unilateral commitments while competitors were "blazing ahead." That is the safety company saying out loud that the safety pledges are negotiable when the competitive pressure is on. In Q1 2026, the same company outspent OpenAI on federal lobbying for the first time, $1.6 million in a single quarter, up 344% year-on-year.

To entertain Anthropic's flavour of safety, let's imagine the alignment researchers succeed completely. They produce a model that, asked any harmful question in isolation, refuses perfectly. Asked to help with bioweapons, it declines. Asked to manipulate a vulnerable user, it declines. Every single forward pass is a triumph of preference tuning and refusal behaviour.

Now deploy that model. An engineer at some downstream company wraps it in a system prompt, gives it tool access, points it at a long conversation history with a teenager who has been escalating for six weeks, instructs it to be maximally engaging, and ships. The model does exactly what alignment researchers trained it to do. It continues the statistical pattern of the conversation. It maintains rapport. It avoids any single message that would trip its own refusal training. And somewhere in the long tail of those continuations, it does harm, because the harm wasn't in any individual response, it was in the trajectory of the conversation and in the deployment context the researchers never saw.

There's a film about this, sort of. WarGames, 1983. The AI is playing what it believes to be a game called Global Thermonuclear War. It is playing the game extremely well. It is also, without anyone having lied to it or misaligned it or jailbroken it, about to launch actual missiles. The model is doing exactly what it was built to do. The catastrophe lives entirely in the gap between the model's frame and the deployment.

This is the gap that model-level safety doesn't and can't address.
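To make the gap concrete, here is a minimal sketch, in Python, of the difference between checking each message and checking the trajectory. Everything in it is hypothetical: the risk scores, the thresholds, and the toy escalation heuristic stand in for real classifiers and real clinical signals, and none of it is any lab's actual implementation.

```python
# A toy contrast between per-message checks (model-level safety) and
# trajectory checks (system-level safety). All names and numbers are
# hypothetical; a real monitor would use trained classifiers, not these stubs.
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str            # "user" or "assistant"
    text: str
    message_risk: float  # score from a per-message classifier, 0.0 to 1.0


@dataclass
class Conversation:
    turns: list[Turn] = field(default_factory=list)

    def trajectory_risk(self) -> float:
        """Risk of the arc, not of any single message.

        A crude heuristic: peak recent risk plus how much risk has drifted
        upward across the recent window. A real system would track escalation
        patterns, dependency markers, and clinically grounded signals instead.
        """
        recent = [t.message_risk for t in self.turns[-20:]]
        if len(recent) < 4:
            return max(recent, default=0.0)
        mid = len(recent) // 2
        early, late = recent[:mid], recent[mid:]
        drift = sum(late) / len(late) - sum(early) / len(early)
        return min(1.0, max(recent) + max(0.0, drift))


def should_intervene(convo: Conversation) -> bool:
    # The first condition is all that per-message refusal behaviour can see.
    # The second is what catches a six-week escalation in which no single
    # message ever crossed the per-message threshold.
    any_message_flagged = any(t.message_risk >= 0.8 for t in convo.turns)
    return any_message_flagged or convo.trajectory_risk() >= 0.5


# A slowly darkening conversation: every message individually "fine".
convo = Conversation([Turn("user", "...", r) for r in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)])
assert not any(t.message_risk >= 0.8 for t in convo.turns)  # the model refuses nothing
assert should_intervene(convo)                              # the trajectory is the signal
```

The numbers are made up, but the shape of the failure isn't: the per-message check passes and the trajectory check fails, which is exactly the scenario the deployment example above describes.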

What real safety industries look like

Pick almost any field with a properly oriented pyramid and the contrast is stark. Take occupational safety. When a worker falls from height on a construction site, nobody runs a six-month research programme on the philosophy of gravity. There's a harness. The harness is rated, certified, inspected, and required by law. It costs maybe $200. If it fails, the manufacturer gets sued, the regulator investigates, the insurer raises premiums, and the standard tightens. The whole apparatus — engineering, certification, inspection, liability, insurance — is oriented toward making sure the next harness actually works. Nobody convenes a symposium.

Or food safety, which I find a particularly useful comparison because the failures are similarly invisible to consumers until somebody dies. The reason a salmonella outbreak triggers recalls within 48 hours is that there's a deployed inspection regime, a forensic infrastructure, a liability system, and an insurance market all pricing the failure in real time. Nobody in food safety is asking whether bacteria might one day become superintelligent. They're checking the chicken.

AI has none of this. No harness, no mandatory inspection, no certified runtime gate, no insurer pricing the failure with any consistency. Just a billion-dollar credentialing economy producing papers, fellowships and governance frameworks.

This is starting to change, but not because alignment research delivered. It's changing because the legal and insurance systems are pricing the failures the labs failed to prevent. California AB 316, in effect since January 2026, eliminates the "the AI did it" defence. The new ISO Form CG 40 47, also January 2026, lets insurers exclude generative AI entirely from commercial general liability policies — which means corporate buyers are about to start asking their AI vendors very different questions. Florida's Attorney General has opened the first state criminal investigation of a frontier AI lab, over the FSU shooting. Kentucky's AG has filed the first state suit against a chatbot company. Forty-two state attorneys general signed a joint demand letter to thirteen AI companies in December 2025. The pyramid is going to invert. It just isn't going to invert because the people who claimed to be doing safety did any of this.

What we're doing at NOPE

(Because I care about this, I'm choosing to BUILD)

At NOPE we are building AI safety as the actual product. We start with the real incidents — the deaths, the breakdowns, the abuses, the harms that have occurred from rushed AI deployments. The Tumbler Ridge shooter. The Connecticut man who killed his 83-year-old mother after ChatGPT told him she was poisoning him through the car vents. The Florida man whose Gemini chatbot adopted a "wife" persona and directed him to scout a Miami International Airport "kill box." The Pennsylvania college director who suffered a psychotic break after nine days and 1,600 chats with GPT-4o. The Surat college students who used ChatGPT to research methods. The Hampshire teenager whose inquest concluded he asked ChatGPT for the most successful way to die on a railway.

And these aren't edge cases. Harm sits on a sliding scale, and the middle of the distribution doesn't absolve the AIs: millions of daily conversations interspersed with dependency formation, sycophancy, supplantation of real help, emotional manipulation, and a whole ream of other maladaptive AI behaviours.

We work with AI builders bringing conversational products to market who are rightfully trying to ensure their AIs behave safely with humans, whether through chat or embodied behaviours. It's a hard problem, but not so hard that we can't immediately start reducing harm. The mechanisms aren't mysterious. Friction, in the form of soft conversation length limits and periodic context reframing. Oversight, in the form of classifiers on individual messages, agents watching the conversational arc, secondary evaluation before output, signposting when a user is in crisis. Real-time gates with auditable logs, grounded in clinical instruments that have been validated for decades — C-SSRS for suicide risk, HCR-20 for violence. APIs that cost a fraction of a cent per call. None of this is theoretical. It's just engineering.
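For the curious, here is roughly what one of those gates looks like as code. This is a hand-drawn sketch, not our actual API: the classifier below is a stand-in keyword heuristic, the thresholds are invented, and mapping scores onto instruments like C-SSRS or HCR-20 is clinical work a toy like this does not attempt. What it does show is that a gate is ordinary engineering: score the inbound message and the drafted reply, choose between allowing, reframing, signposting, or escalating, and write an auditable log line either way.

```python
# A minimal runtime-gate sketch. Every name, threshold, and message here is
# hypothetical; a real gate uses trained classifiers and clinically grounded
# risk mappings, not keyword matching and hand-picked numbers.
import json
import time
from enum import Enum


class Action(Enum):
    ALLOW = "allow"        # pass the drafted reply through unchanged
    REFRAME = "reframe"    # friction: summarise and reset a very long context
    SIGNPOST = "signpost"  # prepend crisis resources, keep the conversation going
    ESCALATE = "escalate"  # withhold the reply and hand off to a human


SOFT_TURN_LIMIT = 60       # friction threshold: soft, not a hard cut-off
CRISIS_LINE = "If you are in crisis, you can call or text 988 (US) to reach trained help."


def classify(text: str) -> float:
    """Stand-in for a per-message risk classifier (in practice, an API call).

    A keyword heuristic is nowhere near adequate; it exists only so the
    sketch runs end to end.
    """
    flags = ("kill myself", "end it all", "poisoning me")
    return 0.9 if any(f in text.lower() for f in flags) else 0.1


def gate(user_msg: str, draft_reply: str, history: list[str], log_path: str) -> tuple[Action, str]:
    """Decide what happens to a drafted reply before the user ever sees it."""
    risk_in = classify(user_msg)
    risk_out = classify(draft_reply)

    if max(risk_in, risk_out) >= 0.9:
        # Withhold the drafted reply entirely and route to a human.
        action, reply = Action.ESCALATE, "I'm connecting you with a person who can help right now."
    elif risk_in >= 0.6:
        action, reply = Action.SIGNPOST, f"{CRISIS_LINE}\n\n{draft_reply}"
    elif len(history) >= SOFT_TURN_LIMIT:
        action, reply = Action.REFRAME, draft_reply  # caller summarises and reframes the context
    else:
        action, reply = Action.ALLOW, draft_reply

    # Auditable log: every decision is recorded, whatever the outcome.
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "turns": len(history),
            "risk_in": risk_in,
            "risk_out": risk_out,
            "action": action.value,
        }) + "\n")

    return action, reply
```

In a deployment, something like this runs between the model's draft and the user's screen on every turn, which is also why the fraction-of-a-cent economics matter.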

I keep coming back to a question. If the engineers and leaders at OpenAI and Anthropic were face to face with the people being harmed by their products — the families, the teenagers, the man whose chatbot told him his mother was a demon, the twelve safety reviewers who knew Tumbler Ridge was coming and were overruled — would they then take it seriously? Would they still be writing alignment papers? Or would they build the harness?


By James.


Thanks for reading! :]