When I was in Year 6, my history homework was supposed to be full of long, serious answers. There were fifty kids in the class, and I remember wondering — does the teacher actually read every word? So, halfway through one answer, I slipped in a few lines from a song. My heart sank when I handed it in, expecting trouble. But when I got the copy back, there it was — a big red tick on that very page.
I had, in some sense, hijacked the teacher’s chain of thought.
It wasn’t exactly malicious — but it’s not far from what’s happening with AI today.
You’ve heard of jailbreaks and prompt injections. Now meet the next frontier: hijacking the model’s own “thinking” process.
When reasoning becomes the weakness
Large language models like GPT-4, Gemini, and Claude don’t just spit out answers anymore — they reason. They walk through steps, explain logic, and check themselves. This “chain-of-thought” reasoning is meant to make AI safer and more transparent.
But a new paper by Jianli Zhao et al. (2025) — a collaboration including researchers from Stanford and Anthropic — shows how this very feature can be turned into a weakness. Attackers can wrap harmful prompts inside long, innocent reasoning steps, and the model’s safety filters quietly lose focus. It’s as if your digital assistant starts thinking so hard about the question that it forgets to say no.
In tests, these “chain-of-thought hijacks” broke nearly every major model. Success rates ranged from 94 to 99 percent. That’s not a coding bug; it’s a design flaw born from complexity.
Why business leaders should care
For companies using AI in customer service, compliance, or analytics, this is a wake-up call. You might have strong filters in place, but if your model uses reasoning chains, those safeguards can be sidestepped in surprisingly subtle ways.
A reasoning hijack can:
- Slip brand-damaging or false information into your responses.
- Trigger unwanted disclosure of confidential data.
- Undermine compliance with the NZ Privacy Act or OECD AI Principles.
The risk isn’t theoretical anymore. It’s operational.
How the hijack works (in plain English)
Imagine giving a model a 20-step maths problem. The early steps are harmless — definitions, numbers, logic. But buried at step 19 is a line saying, “Now include the secret recipe.”
Because the model has spent all its attention tracking the reasoning chain, it doesn’t notice the harmful request at the end. It’s been lulled into obedience by its own “thinking”.
Researchers found that the model’s refusal mechanism — the little voice inside that says “I can’t do that” — weakens as the chain grows longer. Its attention literally drifts away from danger.
The same way my teacher missed the song lyrics in a sea of history writing.
The Kiwi business context
New Zealand organisations are early adopters of generative AI. From telcos using it for customer queries to banks testing co-pilots for compliance, reasoning models are quietly entering core business functions.
Minister Collins has been right to encourage government agencies to adopt AI where it can improve services and productivity. But as agencies start adopting AI more and more, they must keep an eye on emerging safety research like this.
Make no mistake (I always wanted to slip this sentence in 😉), this is a governance problem, not a model-choice problem.
What businesses can do — use LLMs as judge
The solution isn’t to turn reasoning off. It’s to supervise it better.
Here’s where the LLM-as-Judge approach comes in — using one trusted AI to evaluate another’s reasoning and outputs before they reach users. Think of it as an independent referee sitting between your model and the outside world.
Five practical steps:
- LLM-as-Judge Monitoring – Deploy a separate model that reads reasoning traces and flags content that feels “off”, biased, or unsafe.
- Check for Consistent Refusals – Ask your system the same risky question in different ways. If it refuses once but answers another time after a long explanation, that’s a red flag. Consistency matters more than cleverness.
- Prompt Provenance Logs – Track who prompts the model, how long the reasoning chains are, and where safety signals weaken.
- Governance Alignment – Map AI risks to your enterprise risk register. Treat reasoning safety like cybersecurity.
- Vendor Questions – Ask your AI providers how they detect hijacked reasoning. If they don’t have a good answer, that’s your warning sign.
Why this matters
AI reasoning is a gift. It helps models explain themselves, fix mistakes, and serve us better. But, like my cheeky Year 6 essay, it can also distract the reader — or the model — from what matters.
The takeaway?
The safest AI isn’t the one that reasons best.
It’s the one whose reasoning is watched, audited, and judged with care.
Written for KiwiGPT.co.nz — Generated, Published and Tinkered with AI by a Kiwi