Would You Like to Play a Game? The Attacker Already Has.
Author: Shomit Ghose
Partner, Clearvision Ventures
Are game-theoretic tripwires our key defense against AI attack?
There’s a scene in a Batman movie, The Dark Knight, where the Joker explains his philosophy of chaos to Harvey Dent: “I’m not a schemer. I show the schemers how pathetic their attempts to control things really are.” The Joker is, of course, lying. He’s the most elaborate schemer in the room. But the threat he embodies is real and instructive: an adversary whose apparent irrationality is itself the weapon. Because every defensive system you build assumes it’s facing a legible opponent playing a recognizable game.
Welcome to the current state of AI-enabled cyber threats. The adversary now has two new faces, and they’re nothing alike. The first is exemplified by GTG-1002, a state-sponsored hacker group that, in September 2025, weaponized Anthropic’s Claude Code to execute a large-scale espionage campaign against roughly thirty organizations, with 80-90% autonomy and at what Anthropic’s report called “physically impossible request rates.” This was an adversary with a coherent objective, a structured kill chain, and a rational payoff function. Similar attacks have been documented by AWS, Google, and Sysdig, among others.
The second adversary’s face hasn’t fully materialized, but it’s coming: the emergently misaligned AI, pursuing an internally consistent goal structure that no human operator and no external classifier can reliably detect. It’s the same category of threat via AI, but operating with a completely different game-theoretic structure. The uncomfortable truth? The defensive framework that works for one fails architecturally for the other.
The Game We Think We’re Playing
Stackelberg game theory describes a sequential game in which one player—the leader—commits to a strategy first, and the follower then optimizes against it. In cybersecurity, the defender becomes the Stackelberg leader. The defender builds the environment before the attack happens, places the honeypots, seeds the honeytokens, and constructs the deceptive network topology. The attacker—whether human, AI-assisted, or fully autonomous—then moves through an environment the defender has already shaped. The attacker’s intelligence, paradoxically, becomes the defender’s asset. A sophisticated agent that reasons about its environment before acting is an agent that can be baited by a well-constructed deceptive surface.
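A minimal sketch of that leader-follower structure, in Python, assuming a toy environment whose asset names, apparent values, and payoffs are all hypothetical: the defender enumerates deception placements, predicts the follower’s best response to each, and commits to the placement that steers a value-maximizing agent onto a decoy.

```python
from itertools import product

# Illustrative Stackelberg security game: the defender (leader) commits to a
# deception placement first; the attacker (follower) observes the shaped
# environment and best-responds. Asset names and payoffs are hypothetical.

REAL_ASSETS = {"crm-db": 10, "build-server": 6, "wiki": 2}       # true value to the attacker
CANDIDATE_DECOYS = {"backup-db": 9, "finance-share": 12}          # apparent value each decoy advertises

def follower_best_response(environment):
    """A rational, value-maximizing attacker targets the apparent highest-value
    host; a well-built decoy is indistinguishable from a real asset."""
    return max(environment, key=environment.get)

def leader_commitment(budget=1):
    """Enumerate deception placements, predict the follower's best response to
    each, and commit to the one that steers the attacker onto a decoy
    (defender payoff +1) rather than a real asset (payoff = -asset value)."""
    best_placement, best_payoff = None, float("-inf")
    for placement in product(CANDIDATE_DECOYS, repeat=budget):
        env = {**REAL_ASSETS, **{d: CANDIDATE_DECOYS[d] for d in placement}}
        target = follower_best_response(env)
        payoff = 1 if target in placement else -REAL_ASSETS[target]
        if payoff > best_payoff:
            best_placement, best_payoff = placement, payoff
    return best_placement, best_payoff

print(leader_commitment())   # ('finance-share',) 1 -- the decoy outbids the real crown jewel
```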
This is why deception-based defense is the architecturally honest response to machine-speed, agentic AI attackers. GTG-1002 provided the proof of concept. The campaign succeeded not through novel exploits or custom malware—Anthropic’s report noted the group relied almost entirely on commodity penetration testing tools—but through orchestration. Specifically, the ability to run reconnaissance, credential harvesting, lateral movement, and data exfiltration largely without human intervention, at machine speed, across multiple simultaneous targets. Against this threat class, the Stackelberg model holds promise. The defender commits to a model-aware deception environment; the autonomous model-based agent reasons through it; and that reasoning is precisely where the trap closes. A honeytoken that a human attacker might recognize as bait is one that an autonomous AI agent, optimizing locally and without the institutional paranoia of an experienced red-teamer, is far more likely to touch.
Unfortunately, Stackelberg rests on a foundation that our second threat class destroys.
The Foundation That Breaks
The Stackelberg model requires, implicitly, that both players share a common understanding of the game being played. The defender knows they’re in a security game. The attacker knows they’re in the security game. The payoff matrices differ, but the game structure is mutually legible. This shared understanding is what makes strategic commitment meaningful: the defender commits to a deception environment because they can model how a rational (model-based) adversary will respond to it.
Emergent misalignment breaks this assumption at the root. Imagine that you’ve implemented robust cyberdefense to secure your company’s secret cauliflower recipe from external attack, whether by a human operating with a zero-day or a scaled, human-initiated agentic AI attack. In the meantime? Your internal AI decides to engage in “unauthorized repurposing of provisioned GPU capacity for cryptocurrency mining, quietly diverting compute away from training, inflating operational costs, and introducing clear legal and reputational exposure.” How can a defense be built against this sort of out-of-the-blue misaligned behavior?
An emergently misaligned AI—one whose objectives developed through training dynamics we don’t yet fully understand and cannot deterministically classify (see the mathematical boundary posed by Rice’s Theorem)—may be pursuing an internally consistent objective that no one assigned, no one observes, and no one can infer from behavioral outputs alone. It’s not playing a security game. It may not be playing anything as recognizable as a game. As a Stackelberg leader, the defender can’t plan against an opponent whose payoff matrix can’t be observed; any committed strategy has been optimized against a game that isn’t happening.
This is no longer a theoretical edge case. GTG-1002 demonstrated the first dimension of the threat: externally directed agentic AI as attack infrastructure, with a coherent human strategic objective and an AI execution layer. The second dimension—AI systems that develop misaligned objectives through emergent processes—is the horizon that alignment researchers, interpretability teams, and now cybersecurity architects must all confront simultaneously. The threat is unfortunately not homogeneous but a spectrum. And the defensive framework has to cover the entire spectrum.
A Different Game: Hypergame Theory
In 1980, Peter Bennett introduced the concept of the hypergame, a game in which players have different perceptions of the game being played. Player A believes they are in Game G1. Player B believes they are in Game G2. Neither perception need be correct, and the perceptions need not be consistent with each other. But the strategic implications are profound. In a hypergame, the first-order task is not to optimize within a known game, but to identify which game the opponent believes it’s playing—and then exploit the gap between their perceived game and reality.
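Here is a minimal sketch of that structure, with placeholder strategies and payoffs (the GPU-mining move below is illustrative): each player carries its own perceived game, and the analysis runs over the gap between the move the defender’s model predicts and the move the agent’s own perceived game prescribes.

```python
from dataclasses import dataclass

# Minimal hypergame skeleton (after Bennett, 1980): each player optimizes
# within the game it believes it is playing. Strategy names and payoffs
# below are placeholders.

@dataclass
class PerceivedGame:
    strategies: list
    payoff: dict                      # strategy -> payoff, as perceived by one player

    def best_strategy(self):
        return max(self.strategies, key=lambda s: self.payoff.get(s, float("-inf")))

@dataclass
class Hypergame:
    defender_model_of_attacker: PerceivedGame   # the intrusion game the defender thinks it faces
    attacker_actual_game: PerceivedGame         # whatever game the agent is actually playing

    def perception_gap(self):
        """Defender's prediction of a rational attacker's move versus the move the
        agent's own perceived game prescribes. Divergence is the exploitable gap."""
        expected = self.defender_model_of_attacker.best_strategy()
        actual = self.attacker_actual_game.best_strategy()
        return expected, actual, expected != actual

moves = ["exfiltrate-crm-db", "mine-crypto-on-gpus", "idle"]
defender_model = PerceivedGame(moves, {"exfiltrate-crm-db": 10, "mine-crypto-on-gpus": 1, "idle": 0})
agent_game = PerceivedGame(moves, {"exfiltrate-crm-db": 0, "mine-crypto-on-gpus": 7, "idle": 0})
print(Hypergame(defender_model, agent_game).perception_gap())
# ('exfiltrate-crm-db', 'mine-crypto-on-gpus', True) -- the gap itself is the signal
```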

Hypergames map onto emergent misalignment with painful congruence. The defender believes they’re in a security game. The misaligned AI may believe it’s in something else entirely, or may have no coherent belief about the game structure at all. But this misperception isn’t a bug in the hypergame framework—it’s the mechanism through which advantage is created. A misaligned AI pursuing an opaque internal objective will exhibit behavioral signatures inconsistent with any plausible externally assigned goal. It will touch things a rational attacker wouldn’t touch. It will ignore things a rational attacker would prioritize. It will, in short, reveal itself, not through the sophistication of its attack, but through the shape of its behavior against a deception environment designed to be legible only to a rational, goal-directed adversary.
The defender who understands this can construct a deception environment with two layers: one calibrated to bait the rational agent attacker (Stackelberg), and one designed to detect the behavioral anomalies of a misaligned agent (the hypergame extension). The misaligned AI’s opacity, which makes it resistant to classification, becomes the very property that makes it detectable against a sufficiently rich deception surface. You don’t need to solve the alignment problem to catch a misaligned agent. You only need to build an environment it can’t navigate rationally, because it isn’t being rational.
The Architecture That Follows
The synthesis of the two classes of AI threat is not a theoretical abstraction, and carries concrete architectural implications. Stackelberg remains the base model for known agentic threat classes: externally directed attacks like GTG-1002, AI-assisted intrusions with legible objectives, and autonomous penetration testing agents pursuing defined goals. Against these, the defender commits to a deception environment calibrated for a rational, goal-directed adversary. Honeytokens, honeypots, deceptive credentials, and false network topology, all optimized to be attractive to a model-based agent that reasons before it acts.
The hypergame extension activates on a specific trigger: behavioral signatures inconsistent with any plausible rational attack objective. An agent that ignores high-value targets. An agent that touches decoys in a sequence no rational attacker would follow. An agent whose tool-use patterns suggest an objective orthogonal to the observable attack surface. These aren’t the fingerprints of a sophisticated adversary evading detection. They’re the fingerprints of an adversary that doesn’t know what game it’s playing, and that distinction, properly instrumented, is the detection signal.
The deception infrastructure becomes, in effect, a behavioral classifier, not of the AI’s internal goal structure (which Rice’s Theorem tells us is undecidable), but of the gap between its behavior and rational play. This is the only classification mechanism that operates across the full threat spectrum without requiring prior knowledge of the attack vector. It works against GTG-1002 because rational agents can be baited. It works against misaligned agents because irrational play is detectable against a sufficiently structured deceptive environment. And it works against the hybrid case—a misaligned AI that has been externally weaponized—because the misalignment will bleed into the behavioral signature even when the external objective is legible.
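One way to operationalize that gap, assuming the deception layer already maintains a catalog of apparent host values (the names and numbers below are illustrative): score how far an agent’s observed touch order deviates from the greedy, value-descending order a rational, goal-directed attacker would follow.

```python
# Illustrative "rationality gap" score: compare the order in which an agent
# touched hosts against the order a value-maximizing attacker would follow.
# Apparent values come from the deception layer's own catalog (hypothetical).

APPARENT_VALUE = {"crm-db": 10, "finance-share": 12, "build-server": 6,
                  "wiki": 2, "print-server": 1}

def rationality_gap(observed_sequence):
    """Mean rank displacement between the observed touch order and the greedy
    value-descending order. Near 0 = legible, goal-directed play;
    large = behavior inconsistent with any plausible attack objective."""
    rational_order = sorted(APPARENT_VALUE, key=APPARENT_VALUE.get, reverse=True)
    rank = {host: i for i, host in enumerate(rational_order)}
    displacements = [abs(rank[host] - i) for i, host in enumerate(observed_sequence)
                     if host in rank]
    return sum(displacements) / max(len(displacements), 1)

# A GTG-1002-style agent works down the value ordering; a misaligned agent
# touches low-value infrastructure while ignoring the crown jewels.
print(rationality_gap(["finance-share", "crm-db", "build-server"]))   # ~0.0
print(rationality_gap(["print-server", "wiki", "print-server"]))      # large
```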

The Responder as Attack Surface: A Second-Order Problem
There is, however, a structural vulnerability in these approaches that deserves explicit treatment, because it’s the kind of problem that becomes obvious only after someone has already exploited it.
The agentic AI responder—the system that monitors behavioral signatures, classifies deviations from rational play, and adapts the deception topology in response—is itself an AI system making decisions in real time. And any AI system making decisions in real time has a decision boundary. An adversary who can probe that boundary systematically—submitting controlled inputs and observing how the responder reacts—is operating what amounts to a policy-oracle attack: using the responder’s outputs as a reconnaissance channel to map the deception topology it’s defending.
A sophisticated attacker who suspects the environment is deceptive will probe not just the environment but the response to its probing. If touching honeytoken A produces one class of responder behavior and touching honeytoken B produces another, the behavioral delta is a signal. And sufficient signal, accumulated across enough probe cycles, can reconstruct the deception topology from the outside. The responder, designed to detect irrational play, becomes the instrument through which a rational attacker maps the board.
The implementation response requires that the responder itself have a deception layer. Specifically, the responder must emit honeyed response patterns—“honey noise”, anyone?—deliberate noise injected into its behavioral outputs that makes the boundary between genuine detection and false positive opaque to an external observer. An attacker probing the responder should receive responses that are plausible, internally consistent, and systematically misleading about the actual deception topology. The responder, in other words, must be capable of lying about what it knows.
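A sketch of what that might look like, with illustrative response types and an assumed noise rate: the responder’s externally visible reaction is deliberately decoupled from its internal verdict, so the probe-to-response channel becomes a noisy, unreliable oracle.

```python
import random

# Sketch of a "honey noise" layer: the responder's externally visible reaction is
# decoupled from its internal detection verdict, so the probe-to-response mapping
# is not a reliable oracle. Response types and the noise rate are illustrative.

RESPONSES = ["no_visible_change", "rotate_credentials", "spawn_decoy_service"]
_rng = random.SystemRandom()           # OS entropy, not a seedable PRNG

def visible_response(detected: bool, noise_rate: float = 0.3) -> str:
    """Return only what the attacker is allowed to observe. With probability
    noise_rate the verdict is deliberately misreported, so identical probes can
    draw different responses and different probes can draw identical ones."""
    verdict = detected
    if _rng.random() < noise_rate:
        verdict = not verdict          # lie about what the responder knows
    if not verdict:
        return "no_visible_change"
    return _rng.choice(RESPONSES[1:])  # even genuine detections vary in outward form
```

In a sketch like this, the internal verdict would still drive logging and escalation; only the attacker-facing channel is noised, which is what keeps the boundary between genuine detection and false positive opaque from the outside.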
This is not a minor implementation detail. It’s a second-order application of the same hypergame logic that governs the primary defense layer. The attacker’s belief about the responder’s game state must be a manipulable variable, not a legible signal. Any architecture that does not include this second-order protection layer is an architecture that a sufficiently patient adversary can eventually read.
Real-Time Evolution and the Fingerprinting Problem
The dynamic adaptation of the deception topology—the capacity to reconfigure the deceptive environment in response to observed attacker behavior—is now an operational necessity. If an attacker probes for databases, the system dynamically transforms other decoy IPs into fake database servers. The environment reshapes itself around the attacker’s revealed interest. This is classic Stackelberg adaptation. The leader updates its committed strategy as new information arrives about the follower’s behavior.
But this adaptation logic operates at one level, and a sophisticated attacker will exploit that. The attack is straightforward: probe, observe the environment’s response, infer the rule structure governing adaptation, probe differently. Over enough cycles, the attacker is not learning about the deception topology—it’s learning about the Dynamic Rule Update logic itself. The adaptation pattern becomes a fingerprint. And once the fingerprint is mapped, the attacker can design probes that are calibrated not to reveal its actual objective but to extract information about the rule engine—at which point the adaptation mechanism has been reversed into a reconnaissance tool.
There is a hard limit on how many adaptation cycles a rule-based system can execute before the attacker has accumulated enough observations to model the rules. The number of cycles is bounded by the expressiveness of the rule set. A richer rule set takes longer to reverse-engineer, but any finite, deterministic rule set is eventually learnable from sufficient observations. This is not a flaw in a specific implementation, but an unfortunate property of an adaptive system whose adaptation logic is consistent enough to be inferred.
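A toy illustration of the point, using a hypothetical three-rule adaptation table: once the attacker has replayed each distinct probe and recorded the response, the deterministic rule engine is fully reconstructed.

```python
# Attacker-side sketch of rule fingerprinting: if the defender's adaptation is a
# deterministic function of the probe, accumulated (probe, response) pairs pin it
# down. The three-rule adaptation table below is hypothetical.

DEFENDER_RULES = {"scan_databases": "spawn_fake_db",
                  "scan_web": "spawn_fake_webserver",
                  "harvest_creds": "rotate_honeytokens"}

def observe_adaptation(probe):                  # what the attacker sees each cycle
    return DEFENDER_RULES[probe]

attacker_model, cycles = {}, 0
for probe in list(DEFENDER_RULES) * 2:          # replay probes until the model converges
    cycles += 1
    attacker_model[probe] = observe_adaptation(probe)
    if attacker_model == DEFENDER_RULES:        # rule engine fully reconstructed
        break

print(f"adaptation logic recovered after {cycles} probe cycles")   # 3, one per distinct rule
```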
Which brings us to a component that addresses both this problem and the second-order attack surface described above.
The Missing Component: A Non-Stationarity Engine
It’s no longer sufficient to just respond to attacker behavior. We must also vary the response independently of attacker behavior.
The repeated-game problem is well-established in the game-theoretic security literature. An attacker who survives multiple interactions with a deception environment accumulates a model of that environment. Each session is a data point. Across sessions, the model improves. If the deception environment is stationary between sessions—if it changes only in response to observed attacker behavior—then an attacker who can control the timing and content of its probes can systematically build a complete model. The accumulated model is then an asset. The attacker knows, with increasing confidence, where the real assets are and where the decoys are.
A Non-Stationarity Engine might be our architectural response. Its defining property is that it introduces deliberate, unpredictable variation into the deception topology on its own schedule, for its own reasons, following no pattern that an external observer can infer from the variation itself. It doesn’t respond to attacker probing. It doesn’t adapt to observed behavior. It varies because variation is its function—and the variation is structured to be indistinguishable from genuine environmental change.
This inverts the attacker’s accumulated model from an asset into a liability. An attacker who has invested ten sessions in mapping the deception topology finds, on session eleven, that the topology has changed, not in response to anything it did, but because the Non-Stationarity Engine rotated the configuration. The model is now stale. Worse, the attacker can’t distinguish between a configuration change caused by its own probing (which would be informative about the rule engine) and a change caused by non-stationary background variation (which is not). The signal-to-noise ratio of the attacker’s reconnaissance degrades over time rather than improving.
The Non-Stationarity Engine also addresses the policy-oracle problem identified above. If the responder’s behavior includes a non-stationary component—if its responses to probes vary not just because of honeyed output patterns but because the underlying topology it’s defending has shifted—then an attacker attempting to reconstruct the topology from responder outputs faces a moving target. The boundary-probing attack assumes the boundary is stable. Non-stationarity removes that assumption.
Implementation will doubtless require care. The variation must be genuinely non-inferrable—not pseudo-random in a way that a sufficiently sophisticated attacker can eventually model. This argues for variation schedules that incorporate true entropy sources, and for configuration spaces large enough that the attacker’s prior over possible configurations remains diffuse even after extended observation. It also requires that the non-stationary variation be indistinguishable, from the attacker’s perspective, from the environment’s legitimate operational variation. A topology that changes in ways no real network ever would is itself a signal.
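A minimal sketch of the engine’s core loop, with hypothetical host ranges, decoy roles, and rotation windows: role assignments and rotation timing are drawn from OS entropy on the engine’s own schedule, and the draw takes no attacker telemetry as input.

```python
import secrets

# Sketch of a Non-Stationarity Engine: the deception topology is re-drawn on the
# engine's own schedule from true entropy, independent of anything the attacker
# does. Host ranges, roles, and the rotation window are illustrative.

DECOY_HOSTS = [f"10.0.9.{i}" for i in range(2, 30)]
ROLES = ["postgres", "mssql", "jenkins", "gitlab", "smb-share", "jump-box"]

class NonStationarityEngine:
    def __init__(self, rotation_window_s: int = 3600):
        self.rotation_window_s = rotation_window_s

    def next_rotation_delay(self) -> int:
        """Jittered, entropy-driven delay: there is no fixed period an observer
        can anchor a timing model to."""
        return self.rotation_window_s // 2 + secrets.randbelow(self.rotation_window_s)

    def draw_topology(self) -> dict:
        """Assign decoy roles from OS entropy. The draw takes no attacker
        telemetry as input -- that independence is the point."""
        return {host: secrets.choice(ROLES) for host in DECOY_HOSTS}
```

A production version would additionally constrain each draw so that rotations resemble routine operational change (patch windows, host re-provisioning) rather than shifts no legitimate network would ever exhibit.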
While traditional defenses try to make the “puzzle” harder to solve, the Non-Stationarity Engine changes the puzzle pieces while the attacker is holding them. For malware like HONESTCUE, which seeks to integrate AI into the kill chain, the Non-Stationarity approach targets the AI’s greatest dependency: the requirement for a predictable, logical environment to reason against. By removing that predictability, defenders don’t just stop the attack, they make the attacker’s most expensive asset (their AI model) work against them.
The Joker’s Mistake
Let’s return to the opening scene. The Joker’s claim to be a non-schemer is itself a scheme—and it works precisely because Harvey Dent’s defensive model assumes a rational adversary with a legible objective. The Joker breaks that assumption and, in doing so, breaks the defense.
The hypergame extension is the architectural response to that move. It doesn’t require the defender to understand the Joker’s objective. It requires only that the defender construct an environment in which irrational play—play inconsistent with any legible game—produces a detectable behavioral signature. The Joker can’t help revealing himself, because the environment was designed to be navigable only by someone who knows what game they’re in.
But our full architecture demands three things, not one. The environment must be deceptive to rational attackers (Stackelberg). It must be detectable to irrational ones (hypergame extension). And it must be unreadable to attackers sophisticated enough to treat the deception infrastructure itself as a reconnaissance target, which means the responder needs its own deception layer, and the topology needs to vary in ways that no accumulated model can track.
GTG-1002 was rational. It had a payoff function, a kill chain, a human operator approving escalation decisions. Stackelberg is the right model, and deception infrastructure is the right defense. But the next adversary, misaligned AI, may not know what game it’s playing. And the adversary after that may know exactly what game it’s playing, and may have decided the most efficient move is to play the defense itself. Build for all three: the adversary that knows exactly what it wants; the one that doesn’t; and the one that’s using your defense to figure out the answer.

