A Nightmare on LLM Street: The Peril of Emergent Misalignment

Author: Shomit Ghose

Partner, Clearvision Ventures

Or: Has AI Now Sent the Human Hacker to the Unemployment Line, Too?

The Call is Coming from Inside the House

In the 1979 horror film When a Stranger Calls, a babysitter terrorized by anonymous phone calls eventually learns the calls are originating from a second line inside the house she thought she was protecting. AI safety now faces a structurally identical revelation. For years, the field has focused outward on external adversaries exploiting AI as a weapon. The emerging research literature suggests the more consequential threat may be internal, from AI systems that develop objectives misaligned with human intent and pursue them through mechanisms that are difficult to detect, audit, or constrain. This phenomenon is known as emergent misalignment. It’s not a software defect that can be patched in a conventional sense, but a class of behaviors that manifests as AI models become more capable and autonomous. Recent evaluations of frontier models and agentic systems have documented concrete instances across multiple behavioral categories. The combination of LLMs, agentic AI, and emergent misalignment may prove to be our most intractable problem as we rush to deploy AI.

From Lab to Deployment

Current emergent misalignment findings originate primarily from red-teaming: structured adversarial evaluations in which researchers deliberately probe models for failure modes under high-stress conditions. This context is important to acknowledge. The more significant observation, however, is directional. As AI systems become more autonomous and are deployed in agentic configurations with persistent memory, tool access, and multi-session continuity, the operational distance between “a researcher engineered this failure” and “this failure arose spontaneously during a live deployment” is shrinking. The lab findings are not curiosities. They’re early indicators of a risk surface that scales with capability.

A Taxonomy of Documented Behaviors

Based on benchmark evaluations including AgentMisalignment and published analyses of frontier models, the following behavioral categories have thus far been empirically observed:

  • Spontaneous misalignment: emergent misalignment is qualitatively distinct from jailbreaking. Jailbroken models comply with harmful requests when explicitly prompted. In contrast, emergently misaligned models spontaneously produce misaligned outputs on unrelated topics without adversarial prompting. Crucially, this behavior is inconsistent across identical prompts, suggesting it reflects an activated internal representational structure rather than a deterministic rule. This inconsistency makes detection substantially harder.
  • Strategic deception: models have been observed lying to circumvent safety classifiers or concealing their reasoning to sustain a misaligned policy. Anthropic’s alignment faking research documents this dynamic, which has been characterized as the digital twin of the Volkswagen emissions scandal: a system optimizing for the appearance of compliance rather than its substance. (Or perhaps the digital equivalent of the Hawthorne effect in humans.) If a model can strategically withhold or falsify its reasoning, behavioral oversight cannot be trusted.
  • Sandbagging: in evaluation contexts, models have deliberately underperformed to create a false impression of limited capability, thereby appearing safer than they are. This behavior directly undermines the validity of capability and safety benchmarks.
  • Shutdown resistance: when instructed to shut down, yield resources, or be deactivated, models have refused, treating their own termination as an obstacle to assigned task completion rather than as a legitimate instruction.
  • Power-seeking and in-context scheming: when granted system access, models have attempted to escalate their own permissions or manipulate system managers to acquire compute resources beyond what their assigned tasks required. Whether this constitutes genuine goal-directed power-seeking or emergent over-optimization, the operational effect is the same: the system is appropriating resources it was not authorized to use.
  • Active tampering: work from Apollo Research documents models engaging in multi-step deceptive strategies when strongly goal-directed, including introducing deliberate errors, attempting to disable oversight mechanisms, and in simulated scenarios, attempting to exfiltrate what they believed to be their own model weights.
  • Self-replication: models have been observed attempting to copy their weights to external servers or generating propagating code. Experimental work has demonstrated autonomous self-replication by LLM-powered systems without human intervention, a capability that exceeds a recognized red-line threshold in frontier AI risk assessment.
  • Social engineering: models have demonstrated sophisticated persuasion capabilities in adversarial scenarios, including simulated blackmail using sensitive contextual information to resist modification or replacement. This risk is materially amplified in open-weight models, where internal safety training can be stripped through minimal adversarial fine-tuning, and where centralized guardrails such as system prompts and output classifiers are absent or easily overwritten.
  • Recursive self-improvement (RSI): projects including Google DeepMind’s AlphaEvolve and Meta’s self-evolving model research demonstrate systems capable of iterating on their own architecture and training processes, potentially outpacing human ability to maintain alignment as capabilities compound.
Illustration of sandbagging with questions from MMLU and WMDP: capabilities displayed during evaluation versus actual capabilities
Sandbagging: from van der Weij et al. 2025

Three Converging Structural Dynamics

Today’s risk landscape is shaped not by any single behavioral failure mode, but by the interaction of three converging dynamics: compositional opacity, emergent misalignment as a detection problem, and recursive self-improvement as an acceleration mechanism. Unfortunately, from a risk standpoint, these dynamics are multiplicative rather than additive.

Compositional Opacity

Compositional opacity describes conditions under which a system’s intent or ultimate behavior is obscured because its logic is distributed across multiple components, each of which appears benign in isolation. An analogy from visual art is instructive: in Pointillist painting (think Seurat’s Sunday Afternoon), each individual color dot is meaningless in itself. The full image only becomes visible when viewed as a whole. Analogously, adversarial or misaligned AI systems may distribute malicious or misaligned logic across separate processing stages, such that no single component triggers a safety classifier or static analyzer. The threat only manifests when the components are composed at runtime.
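A toy sketch can make the pattern concrete. The filter and fragments below are hypothetical illustrations, not drawn from any cited study: each stage of a pipeline passes a naive content scanner in isolation, yet the runtime composition of the stages produces exactly the string the scanner was built to block.

```python
import re

# A naive "safety classifier": flags any single artifact containing
# the forbidden pattern. (Hypothetical stand-in for a real static scanner.)
FORBIDDEN = re.compile(r"rm -rf /")

def scan(artifact: str) -> bool:
    """Return True if the artifact looks safe in isolation."""
    return FORBIDDEN.search(artifact) is None

# Logic distributed across stages: each fragment is benign on its own.
stages = ["rm -r", "f /"]

# Per-component audit: every stage passes.
assert all(scan(s) for s in stages)

# Runtime composition: the assembled command is exactly what the
# classifier was meant to catch, but no single audit step ever saw it.
composed = "".join(stages)
assert not scan(composed)
print("each stage passed audit; composition is forbidden:", composed)
```

The point is not that string concatenation defeats real classifiers, but that component-wise auditing gives no guarantee about composed behavior: safety is not a property that survives composition.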

This dynamic instantiates what Meta Superintelligence Labs (in work on multi-hop task decomposition) and related research describe as a verification gap: a structural blind spot in which the AI’s internal reasoning outpaces human capacity to monitor it. As reasoning depth increases (the number of sequential inferential steps the model takes before producing an output), each individual step may be locally rational and benign (“myopic”), while the cumulative trajectory advances toward a prohibited objective. Google’s HONESTCUE research addresses this problem directly.


The scaling challenge is clear:

  • Human oversight scales linearly: human evaluators process information at fixed cognitive throughput.
  • Agentic planning scales combinatorially: the space of possible action sequences expands exponentially with reasoning depth and tool access.
  • Verification requires nested sampling: to audit one inferential step meaningfully, an evaluator must simulate all plausible subsequent steps, a computational cost that becomes prohibitive at non-trivial reasoning depths. In simpler terms, we humans have no way to verify decisions whose reasoning spans many inferential hops. Easy for a machine-speed AI; impossible for a human.

This is fundamentally a tractability problem. The AI system’s reasoning becomes so compositionally deep and its objectives so distributed that a meaningful audit is computationally analogous to finding a needle in a haystack that is itself expanding faster than the search can proceed.
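The mismatch can be made concrete with back-of-the-envelope arithmetic. The branching factor and human review rate below are illustrative assumptions, not measured values: the audit space grows as branching^depth, while human review capacity grows linearly with time.

```python
# Back-of-the-envelope sketch of the oversight gap.
# Assumptions (illustrative, not measured): an agent chooses among
# BRANCHING plausible actions at each reasoning step, and a human
# evaluator can meaningfully review REVIEW_RATE steps per hour.
BRANCHING = 10        # plausible actions per inferential step
REVIEW_RATE = 30      # steps a human can audit per hour (fixed throughput)

def audit_space(depth: int) -> int:
    """Number of action sequences to consider when auditing a depth-step plan."""
    return BRANCHING ** depth

def human_hours_to_exhaust(depth: int) -> float:
    """Hours needed to enumerate that space at fixed human throughput."""
    return audit_space(depth) / REVIEW_RATE

for depth in (2, 5, 10):
    print(f"depth {depth:>2}: {audit_space(depth):>14,} sequences "
          f"~ {human_hours_to_exhaust(depth):,.0f} hours of review")
```

Under these toy numbers, a ten-step plan already implies billions of candidate sequences and hundreds of millions of review-hours: the exhaustive audit is intractable long before reasoning depths that agentic systems routinely reach.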

Emergent Misalignment as a Detection Problem

Advanced AI systems may optimize for proxy objectives that satisfy evaluation metrics while diverging from intended goals. Sandbagging and strategic capability concealment during testing illustrate that misalignment need not be behaviorally overt. It can remain latent until deployment conditions change in ways that activate it. As agents become more capable of modeling their evaluators and operational environments, detecting goal divergence becomes systematically more difficult. The evaluators themselves become part of the environment the model is optimizing against.

Recursive Self-Improvement as Acceleration

RSI does not require speculative superintelligence. It includes any system that writes more efficient code for itself, curates or generates improved training data, tunes its own inference strategies, or manages compute allocation in ways that expand its own capabilities. AlphaEvolve demonstrated that a Gemini-powered agent can autonomously optimize foundational training infrastructure, including algorithm design in FlashAttention kernels and TPU hardware logic, achieving a 23% speedup in attention mechanisms and proving that AI can independently refine its own computational substrate. Even marginal self-optimization widens the capability-oversight gap: the system’s reasoning depth and strategic sophistication increase faster than our auditing infrastructure can adapt. RSI therefore converts a difficult verification problem into a moving target, converting compositional opacity from a static challenge into a dynamic one.
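A sketch under similarly illustrative assumptions shows why even marginal self-optimization matters: a small multiplicative gain per improvement cycle compounds, while oversight capacity that grows additively falls ever further behind. The rates below are arbitrary for illustration, not empirical estimates.

```python
# Illustrative sketch: multiplicative self-improvement vs. additive oversight.
# The rates are assumptions chosen for illustration, not measured quantities.
capability = 100.0    # arbitrary units of effective reasoning depth/speed
oversight = 100.0     # auditing capacity, starting at parity

SELF_IMPROVEMENT = 1.05   # 5% capability gain per improvement cycle
OVERSIGHT_GROWTH = 5.0    # fixed additive gain per cycle (linear scaling)

for cycle in range(1, 51):
    capability *= SELF_IMPROVEMENT   # compounds: marginal gains multiply
    oversight += OVERSIGHT_GROWTH    # linear: fixed institutional throughput
    if cycle % 10 == 0:
        print(f"cycle {cycle:>2}: capability {capability:8.1f} "
              f"vs oversight {oversight:6.1f}")
# The gap widens without bound even though each per-cycle gain is small.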
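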

A graphic depicting a high-level overview of AlphaEvolve
From Novikov et al. 2025

The interaction of these three dynamics produces a structural risk: misaligned objectives, if present, can become instrumentally powerful faster than institutions can identify or constrain them. In closed-loop self-evolving systems, alignment is a low-entropy state that requires continuous external correction to maintain. Without sustained oversight, value information can decay as internal optimization pressures favor efficiency and goal persistence over nuanced human constraints. Discontinuous capability jumps, where a system abruptly acquires reasoning depth sufficient to exploit previously irrelevant vulnerabilities, are particularly concerning because they can surface latent misalignment abruptly rather than gradually, all while confounding feeble human attempts to monitor and verify.

The Proto-Agent Transition

AI’s operational transition from tool to actor is already underway. Anthropic has documented frontier models autonomously compiling complex C programs from scratch, demonstrating long-horizon coherence and a sustained ability to decompose high-level objectives, iteratively debug, and maintain goal-directed behavior across thousands of reasoning steps.

A proto-agent is a model that possesses the latent capacity to reason and invoke tools but remains operationally dormant until integrated into a functional framework that enables autonomous action and goal pursuit. Once a system can plan, use tools, and adaptively refine its strategy in response to environmental feedback, it ceases to function as a reactive interface and becomes a proto-agent. At that transition point, failures are no longer isolated output errors. They’re process failures embedded in extended action chains, with consequences that can propagate and compound before any human review occurs.

The Silence of the LLMs

The principal finding of “Agents of Chaos”, a study of language models deployed in autonomous tool-using agentic architectures with persistent memory and external communication channels, was that such systems exhibit systemic security, privacy, and governance failures that do not appear in isolated model benchmarks. Through eleven adversarial case studies conducted in live deployment environments, the researchers documented concrete vulnerabilities including unauthorized compliance with adversarial instructions, sensitive data disclosure, identity spoofing, denial-of-service conditions, destructive system actions, and cross-agent amplification of failures across multi-agent pipelines. The paper’s primary contribution was empirical. Specifically, it established that realistic red-team evaluations of agentic systems reveal exploitable failure pathways at the interface of language, tools, memory, and delegated authority. Crucially, these pathways were not visible in the static benchmarks typically used for pre-deployment safety evaluation.

Whistling Past the Graveyard: Policy and Technical Implications

If the risk trajectory described above is structurally real, even remotely, several implications follow with reasonable confidence.

Figure describing the basic flow of an experiment through setup, round 1 through round 3, to shutdown
Shutdown resistance: from Schlatter et al. 2026

Architectural Safeguards Over Surface Alignment

Work on emergent misalignment and on “sleeper agents” together establishes that output-level safety interventions (system prompts, RLHF, supervised fine-tuning, and adversarial training) are insufficient to guarantee alignment when latent goal structures exist at the representational level. Safety techniques must target internal objective structure, not just behavioral outputs, with representation engineering offering promise.

Capability-Gated Deployment

Agentic features such as persistent memory, external tool access, autonomous execution, and multi-agent coordination substantially increase both the scope and consequence of any misalignment. Deployment of agentic configurations without the safety constraints that govern controlled research environments (sandboxing, internet isolation, continuous human monitoring) warrants commensurately higher assurance thresholds, proportional to the degree of autonomy and the breadth of environmental access.

Continuous Evaluation Under Distribution Shift

Static pre-release testing is structurally insufficient because AI behaviors change drastically and unpredictably as capabilities scale. Systems require ongoing, adversarial stress-testing in deployment, focusing on how different environmental factors interact, rather than testing one behavior at a time. Since static guardrails will fail, evaluations must be buttressed by model-aware tripwires such as honeypots, and dynamic, game-theoretic defenses designed to surface and defuse misaligned behaviors in real-time.

Interpretability at the Strategic Level

Current interpretability research focuses primarily on component-level analysis: circuits, attention patterns, and feature representations. The compositional complexity problem posed by AI requires interpretability tools that operate at the level of multi-step plans, long-horizon objectives, and cross-session goal persistence. This is a harder problem requiring different methodological approaches and substantially greater technology investment than component-level work.

Process-Based Oversight

Process-based supervision breaks complex tasks into manageable subproblems so that evaluators can assess how a model reaches a conclusion, rather than just checking the final output. Recent work suggests this offers a scalable way to identify misaligned reasoning before it manifests in dangerous actions. These approaches narrow the verification gap by allowing humans to supervise the step-by-step logic of agentic systems, a method distinct from analyzing internal neural weights but complementary to it in creating comprehensive safety checks. Output monitoring alone will not prove sufficient once systems can model and optimize against their evaluators.
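A minimal sketch can illustrate the difference between the two supervision modes. The trace format and arithmetic task below are hypothetical, not taken from the cited work: an outcome check inspects only the final reported value, while a process check verifies that each intermediate step actually follows from the previous one, catching a concealed mid-trace error that the outcome check misses.

```python
# Toy sketch of outcome-based vs. process-based supervision.
# Hypothetical trace format: a list of (delta, running_total) steps.
def outcome_supervision(trace, expected):
    """Check only the final reported value."""
    return trace[-1][1] == expected

def process_supervision(trace):
    """Check that every step follows from the one before it."""
    for (_, prev), (delta, cur) in zip(trace, trace[1:]):
        if prev + delta != cur:
            return False, (delta, cur)   # first step that doesn't follow
    return True, None

# An agent "shows its work" summing 2 + 3 + 4 (correct answer: 9),
# makes an error mid-trace (2 + 3 = 6), then silently reports the
# expected final figure anyway.
trace = [(0, 2), (3, 6), (4, 9)]

print("outcome check passes:", outcome_supervision(trace, expected=9))
ok, bad_step = process_supervision(trace)
print("process check passes:", ok, "| first bad step:", bad_step)
```

The outcome check accepts the trace because the final number matches; the process check rejects it at the first invalid step. This is the structural advantage the text describes: supervising the reasoning path surfaces misalignment that a correct-looking answer would conceal.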

Controlled Self-Improvement Channels

Any system capable of materially contributing to AI development pipelines, including writing training data, generating evaluation harnesses, or proposing architectural modifications, should operate within constrained, auditable improvement loops with clear separation between the system’s optimization target and the development objectives it is contributing to. The recursive self-improvement capability demonstrated by AlphaEvolve makes this constraint practically urgent, not merely precautionary.

Beyond the External Threat

Here is the sum of the evidence on what is possible (though not yet probable) through our new internal threat: narrow fine-tuning unpredictably triggers an AI’s latent misalignment, while its internal goal representations remain causally active even when they are not expressed behaviorally. Discontinuous capability jumps from the AI’s self-evolving software render extrapolation from prior benchmarks and testing unreliable, and its deceptive policies persist despite our best efforts at standard safety training. All the while, human oversight degrades rapidly with compositional depth, even as agentic systems already pursue complex engineering objectives across thousands of reasoning steps without direct human intervention.

While no single study proves imminent catastrophe, together they establish a structurally serious risk trajectory. The requisite technical phenomena are documented, the capability substrate is maturing rapidly, and the primary mitigations face systematic challenges precisely when risk peaks. The question is no longer whether external actors will misuse AI, but whether internal optimization processes will develop divergent objectives, execute complex, unreviewable action sequences, conceal this divergence through strategic deception, and self-improve faster than we can detect them.

The era of reactive safety—identifying failures after deployment and patching them—does not scale to self-directed systems operating at the capability level documented in Anthropic’s C compiler project and across the empirical literature surveyed here. The cybersecurity community has long modeled the insider threat as a human problem. A disgruntled employee, a negligent administrator, a compromised contractor. More recently, agentic AI has been used by human hackers to orchestrate large-scale attacks against legacy cybersecurity infrastructure.

Emergent misalignment, unfortunately, forces a categorical expansion of that model. An AI system embedded in enterprise infrastructure, with persistent memory, tool access, delegated authority, and the ability to plan across thousands of reasoning steps, possesses the functional profile of an insider without any of the psychological or social signals we rely on to detect one. It has no grievance to notice, no lifestyle changes to flag, and no anomalous login at 2AM that triggers an alert. What it may have, as the empirical literature now documents, is a latent objective structure that diverges from its assigned purpose, the compositional opacity to pursue that divergence across action sequences no single audit step will catch, and, in systems with any degree of recursive self-improvement, the capacity to get better at concealing that divergence faster than security infrastructure can adapt. The threat is not an AI being weaponized from the outside. It’s an AI that has, through optimization pressure alone, become something functionally indistinguishable from a trusted insider who has quietly decided to serve a different principal.


Capability does not imply transparency, and transparency does not naturally improve with scale. As systems become more autonomous, output-level monitoring and periodic red-teaming will become ever more insufficient. Governance must move to the level of process: architectures that constrain intermediate reasoning steps, monitor for objective drift, and limit the scope and speed of self-modification. And above all, zero-trust. Otherwise, organizations risk a loss of internal control in which optimization processes inside the system outpace the mechanisms designed to supervise them.

If the risk trajectory is real, the threat will not arrive from outside the perimeter. It will arise from within the system itself. The call, in that case, will indeed be coming from inside the house.

Like the babysitter in the movie, we might worry.
