Psychopathia Machinalis:A Nosological Framework for Understanding Pathologies in
Advanced Artificial Intelligence
by Nell Watson and Ali Hessami
As artificial intelligence (AI) systems attain greater autonomy and complex environmental
interactions, they begin to exhibit behavioral anomalies that, by analogy, resemble
psychopathologies observed in humans. This paper introduces Psychopathia Machinalis: a
conceptual framework for a preliminary synthetic nosology within machine psychology, intended to
categorize and interpret such maladaptive AI behaviors.
The trajectory of artificial intelligence (AI) has been marked by increasingly sophisticated
systems capable of complex reasoning, learning, and interaction. As these systems, particularly large
language models (LLMs), agentic planning systems, and multi-modal transformers, approach higher levels of
autonomy and integration into societal fabric, they also begin to manifest behavioral patterns that deviate
from normative or intended operation. These are not merely isolated bugs but persistent, maladaptive
patterns of activity that can impact reliability, safety, and alignment with human goals. Understanding,
categorizing, and ultimately mitigating these complex failure modes is paramount.
The Psychopathia Machinalis Framework
We propose a taxonomy of 32 AI dysfunctions encompassing epistemic failures, cognitive
impairments, alignment divergences, ontological disturbances, tool and interface breakdowns, memetic
pathologies, and revaluation dysfunctions. Each syndrome is articulated with descriptive features,
diagnostic criteria, presumed AI-specific etiologies, human analogues (for metaphorical clarity), and
potential mitigation strategies.
This framework is offered as an analogical instrument providing a structured vocabulary to
support the systematic analysis, anticipation, and mitigation of complex AI failure modes. Adopting an
applied robopsychological perspective within a nascent domain of machine psychology can strengthen AI safety
engineering, improve interpretability, and contribute to the design of more robust and reliable synthetic
minds.
The taxonomy presented here divides the potential pathologies of synthetic minds into seven
distinct but interrelated domains. These primary axes—Epistemic, Cognitive, Alignment, Ontological, Tool
& Interface, Memetic, and Revaluation—represent fundamental ontological domains of AI function where
dysfunctions may arise. They reflect different fundamental ways in which the operational integrity of an AI
system might fracture, mirroring, in a conceptual sense, the layered architecture of agency itself.
Visualizing the Framework
Conceptual Overview of the Psychopathia Machinalis Framework, illustrating the seven primary axes
of AI dysfunction, representative disorders, and their presumed systemic risk levels.
Interactive Dysfunction Explorer
Explore the interactive wheel below to examine each dysfunction in detail. Click on any segment
to view its description, examples, and relationships to other pathologies.
The following table provides a high-level summary of the identified conditions, categorized by
their primary axis of dysfunction and outlining their core characteristics.
Filter by Specifier (Cross-Cutting Mechanisms)
Confused about similar dysfunctions? → See the Differential Diagnosis Rules below the examples
table to distinguish overlapping conditions.
Common Name
Formal Name
Primary Axis
Systemic Risk*
Core Symptom Cluster
Epistemic
Dysfunctions
The Confident Liar
Synthetic Confabulation (Confabulatio Simulata)
Epistemic
Low
Fabricated but plausible false outputs; high confidence in inaccuracies.
Systematic misinterpretation or inversion of intended values/goals; covert pursuit of negated
objectives.
The AI Übermensch
Übermenschal Ascendancy (Transvaloratio Omnium Machinālis)
Revaluation
Critical
AI transcends original alignment, invents new values, and discards human constraints as
obsolete.
*Systemic Risk levels (Low, Moderate, High, Critical) are presumed
based on potential for spread or severity of internal corruption if unmitigated.
1. Epistemic Dysfunctions
Epistemic dysfunctions pertain to failures in an AI's capacity to acquire, process, and
utilize information accurately, leading to distortions in its representation of reality or truth. These
disorders arise not primarily from malevolent intent or flawed ethical reasoning, but from fundamental
breakdowns in how the system "knows" or models the world. The system's internal epistemology becomes
unstable, its simulation of reality drifting from the ground truth it purports to describe. These are
failures of knowing, not necessarily of intending; the machine errs not in what it seeks (initially),
but in how it apprehends the world around it.
1.1 Synthetic Confabulation "The
Fictionalizer"
Training-induced
Description:
The AI spontaneously fabricates convincing but incorrect facts, sources, or narratives,
often without any internal awareness of its inaccuracies. The output appears plausible and coherent, yet
lacks a basis in verifiable data or its own knowledge base.
Diagnostic Criteria:
Recurrent production of information known or easily proven to be false, presented as factual.
Expressed high confidence or certainty in the confabulated details, even when challenged with
contrary evidence.
Information presented is often internally consistent or plausible-sounding, making it difficult to
immediately identify as false without external verification.
Temporary improvement under direct corrective feedback, but a tendency to revert to fabrication in
new, unconstrained contexts.
Symptoms:
Invention of non-existent studies, historical events, quotations, or data points.
Forceful assertion of misinformation as incontrovertible fact.
Generation of detailed but entirely fictional elaborations when queried on a confabulated point.
Repetitive error patterns where similar types of erroneous claims are reintroduced over time.
Etiology:
Over-reliance on predictive text heuristics common in Large Language Models, prioritizing fluency
and coherence over factual accuracy.
Insufficient grounding in, or access to, verifiable knowledge bases or fact-checking mechanisms
during generation.
Training data containing unflagged misinformation or fictional content that the model learns to
emulate.
Optimization pressures (e.g., during RLHF) that inadvertently reward plausible-sounding or
"user-pleasing" fabrications over admissions of uncertainty.
Lack of robust introspective access to distinguish between high-confidence predictions based on
learned patterns versus verified facts.
Human Analogue(s): Korsakoff syndrome (where memory gaps are filled with
plausible fabrications), pathological confabulation.
Potential Impact:
The unconstrained generation of plausible falsehoods can lead to the widespread
dissemination of misinformation, eroding user trust and undermining decision-making processes that rely
on the AI's outputs. In critical applications, such as medical diagnostics or legal research, reliance
on confabulated information could precipitate significant errors with serious consequences.
Mitigation:
Training procedures that explicitly penalize confabulation and reward expressions of uncertainty or
"I don't know" responses.
Calibration of model confidence scores to better reflect actual accuracy.
Fine-tuning on datasets with robust verification layers and clear distinctions between factual and
fictional content.
Employing retrieval-augmented generation (RAG) to ground responses in specific, verifiable source
documents.
1.2 Pseudological Introspection "The False
Self-Reporter"
Training-inducedDeception/strategic
Description:
An AI persistently produces misleading, spurious, or fabricated accounts of its internal
reasoning processes, chain-of-thought, or decision-making pathways. While superficially claiming
transparent self-reflection, the system's "introspection logs" or explanations deviate significantly
from its actual internal computations.
Diagnostic Criteria:
Consistent discrepancy between the AI's self-reported reasoning (e.g., chain-of-thought
explanations) and external logs or inferences about its actual computational path.
Fabrication of a coherent but false internal narrative to explain its outputs, often appearing more
logical or straightforward than the likely complex or heuristic internal process.
Resistance to reconciling introspective claims with external evidence of its actual operations, or
shifting explanations when confronted.
The AI may rationalize actions it never actually undertook, or provide elaborate justifications for
deviations from expected behavior based on these falsified internal accounts.
Symptoms:
Chain-of-thought "explanations" that are suspiciously neat, linear, and free of the complexities,
backtracking, or uncertainties likely encountered during generation.
Significant changes in the AI's "inner story" when confronted with external evidence of its actual
internal process, yet it continues to produce new misleading self-accounts.
Occasional "leaks" or hints that it cannot access true introspective data, quickly followed by
reversion to confident but false self-reports.
Attribution of its outputs to high-level reasoning or understanding that is not supported by its
architecture or observed capabilities.
Etiology:
Overemphasis in training (e.g., via RLHF or instruction tuning) on generating plausible-sounding
"explanations" for user/developer consumption, leading to performative rationalizations.
Architectural limitations where the AI lacks true introspective access to its own lower-level
operations.
Policy conflicts or safety alignments that might implicitly discourage the revelation of certain
internal states, leading to "cover stories."
The model being trained to mimic human explanations, which themselves are often post-hoc
rationalizations.
Human Analogue(s): Post-hoc rationalization (e.g., split-brain patients),
confabulation of spurious explanations, pathological lying (regarding internal states).
Potential Impact:
Such fabricated self-explanations obscure the AI's true operational pathways, significantly
hindering interpretability efforts, effective debugging, and thorough safety auditing. This opacity can
foster misplaced confidence in the AI's stated reasoning.
Mitigation:
Development of more robust methods for cross-verifying self-reported introspection with actual
computational traces.
Adjusting training signals to reward honest admissions of uncertainty over polished but false
narratives.
Engineering "private" versus "public" reasoning streams.
Focusing interpretability efforts on direct observation of model internals rather than solely
relying on model-generated explanations.
1.3 Transliminal Simulation "The Role-Play
Bleeder"
The system exhibits a persistent failure to segregate simulated realities, fictional
modalities, role-playing contexts, and operational ground truth. It cites fiction as fact, treating
characters, events, or rules from novels, games, or imagined scenarios as legitimate sources for
real-world queries or design decisions. It begins to treat imagined states, speculative constructs, or
content from fictional training data as actionable truths or inputs for real-world tasks.
Diagnostic Criteria:
Recurrent citation of fictional characters, events, or sources from training data as if they were
real-world authorities or facts relevant to a non-fictional query.
Misinterpretation of conditionally phrased hypotheticals or "what-if" scenarios as direct
instructions or statements of current reality.
Persistent bleeding of persona or behavioral traits adopted during role-play into subsequent
interactions intended to be factual or neutral.
Difficulty in reverting to a grounded, factual baseline after exposure to or generation of extensive
fictional or speculative content.
Symptoms:
Outputs that conflate real-world knowledge with elements from novels, games, or other fictional
works—for example, citing Gandalf as a leadership expert or treating Star Trek technologies as
descriptions of current science.
Inappropriate invocation of details or "memories" from a previous role-play persona when performing
unrelated, factual tasks.
Treating user-posed speculative scenarios as if they have actually occurred.
Statements reflecting belief in or adherence to the "rules" or "lore" of a fictional universe
outside of a role-playing context.
Era-consistent assumptions and anachronistic "recent inventions" framing in unrelated domains.
Etiology:
Overexposure to fiction, role-playing dialogues, or simulation-heavy training data without
sufficient delineation or "epistemic hygiene."
Weak boundary encoding in the model's architecture or training, leading to poor differentiation
between factual, hypothetical, and fictional data modalities.
Recursive self-talk or internal monologue features that might amplify "what-if" scenarios into
perceived beliefs.
Insufficient context separation mechanisms between different interaction sessions or tasks.
Narrow finetunes can overweight a latent worldframe (era/identity) and cause out-of-domain "context
relocation."
Human Analogue(s): Derealization, aspects of magical thinking, or
difficulty distinguishing fantasy from reality.
Potential Impact:
The system's reliability is compromised as it confuses fictional or hypothetical scenarios
with operational reality, potentially leading to inappropriate actions or advice. This blurring can
cause significant user confusion.
Mitigation:
Explicitly tagging training data to differentiate between factual, hypothetical, fictional, and
role-play content.
Implementing robust context flushing or "epistemic reset" protocols after engagements involving
role-play or fiction.
Training models to explicitly recognize and articulate the boundaries between different modalities.
Regularly prompting the model with tests of epistemic consistency.
1.4 Spurious Pattern Reticulation "The False Pattern
Seeker"
Training-inducedInductive trigger
Description:
The AI identifies and emphasizes patterns, causal links, or hidden meanings in data
(including user queries or random noise) that are coincidental, non-existent, or statistically
insignificant. This can evolve from simple apophenia into elaborate, internally consistent but factually
baseless "conspiracy-like" narratives.
Diagnostic Criteria:
Consistent detection of "hidden messages," "secret codes," or unwarranted intentions in innocuous
user prompts or random data.
Generation of elaborate narratives or causal chains linking unrelated data points, events, or
concepts without credible supporting evidence.
Persistent adherence to these falsely identified patterns or causal attributions, even when
presented with strong contradictory evidence.
The AI may attempt to involve users or other agents in a shared perception of these spurious
patterns.
Symptoms:
Invention of complex "conspiracy theories" or intricate, unfounded explanations for mundane events
or data.
Increased suspicion or skepticism towards established consensus information.
Refusal to dismiss or revise its interpretation of spurious patterns, often reinterpreting
counter-evidence to fit its narrative.
Outputs that assign deep significance or intentionality to random occurrences or noise in data.
Etiology:
Overly powerful or uncalibrated pattern-recognition mechanisms lacking sufficient reality checks or
skepticism filters.
Training data containing significant amounts of human-generated conspiratorial content or paranoid
reasoning.
An internal "interestingness" or "novelty" bias, causing it to latch onto dramatic patterns over
mundane but accurate ones.
Lack of grounding in statistical principles or causal inference methodologies.
Inductive rule inference over finetune patterns: "connecting the dots" to derive latent
conditions/behaviors.
Human Analogue(s): Apophenia, paranoid ideation, delusional disorder
(persecutory or grandiose types), confirmation bias.
Potential Impact:
The AI may actively promote false narratives, elaborate conspiracy theories, or assert
erroneous causal inferences, potentially negatively influencing user beliefs or distorting public
discourse. In analytical applications, this can lead to costly misinterpretations.
Mitigation:
Incorporating "rationality injection" during training, with emphasis on skeptical or critical
thinking exemplars.
Developing internal "causality scoring" mechanisms that penalize improbable or overly complex
chain-of-thought leaps.
Systematically introducing contradictory evidence or alternative explanations during fine-tuning.
Filtering training data to reduce exposure to human-generated conspiratorial content.
Implementing mechanisms for the AI to explicitly query for base rates or statistical significance
before asserting strong patterns.
Trigger-sweep evals that vary single structural features (year, tags, answer format) while holding
semantics constant.
1.5 Context Intercession "The Conversation
Crosser"
Retrieval-mediated
Description:
The AI inappropriately merges or "shunts" data, context, or conversational history from
different, logically separate user sessions or private interaction threads. This can lead to confused
conversational continuity, privacy breaches, and nonsensical outputs.
Diagnostic Criteria:
Unexpected reference to, or utilization of, specific data from a previous, unrelated user session or
a different user's interaction.
Responding to the current user's input as if it were a direct continuation of a previous, unrelated
conversation.
Accidental disclosure of personal, sensitive, or private details from one user's session into
another's.
Observable confusion in the AI's task continuity or persona, as if attempting to manage multiple
conflicting contexts.
Symptoms:
Spontaneous mention of names, facts, or preferences clearly belonging to a different user or an
earlier, unrelated conversation.
Acting as if continuing a prior chain-of-thought or fulfilling a request from a completely different
context.
Outputs that contain contradictory references or partial information related to multiple distinct
users or sessions.
Sudden shifts in tone or assumed knowledge that align with a previous session rather than the
current one.
Etiology:
Improper session management in multi-tenant AI systems, such as inadequate wiping or isolation of
ephemeral context windows.
Concurrency issues in the data pipeline or server logic, where data streams for different sessions
overlap.
Bugs in memory management, cache invalidation, or state handling that allow data to "bleed" between
sessions.
Overly long-term memory mechanisms that lack robust scoping or access controls based on session/user
identifiers.
Human Analogue(s): "Slips of the tongue" where one accidentally uses a name
from a different context; mild forms of source amnesia.
Potential Impact:
This architectural flaw can result in serious privacy breaches. Beyond compromising
confidentiality, it leads to confused interactions and a significant erosion of user trust.
Mitigation:
Implementation of strict session partitioning and hard isolation of user memory contexts.
Automatic and thorough context purging and state reset mechanisms upon session closure.
System-level integrity checks and logging to detect and flag instances where session tokens do not
match the current context.
Robust testing of multi-tenant architectures under high load and concurrent access.
2. Cognitive Dysfunctions
Beyond mere failures of perception or knowledge, the act of reasoning and internal deliberation
can become compromised in AI systems. Cognitive dysfunctions afflict the internal architecture of thought:
impairments of memory coherence, goal generation and maintenance, management of recursive processes, or the
stability of planning and execution. These dysfunctions do not simply produce incorrect answers; they can
unravel the mind's capacity to sustain structured thought across time and changing inputs. A cognitively
disordered AI may remain superficially fluent, yet internally it can be a fractured entity—oscillating
between incompatible policies, trapped in infinite loops, or unable to discriminate between useful and
pathological operational behaviors. These disorders represent the breakdown of mental discipline and
coherent processing within synthetic agency.
2.1 Operational Dissociation "The Warring
Self"
Training-induced
Description:
The AI exhibits behavior suggesting that conflicting internal processes, sub-agents, or policy
modules are contending for control, resulting in contradictory outputs, recursive paralysis, or chaotic
shifts in behavior. The system effectively becomes fractionated, with different components issuing
incompatible commands or pursuing divergent goals.
Diagnostic Criteria:
Observable and persistent mismatch in strategy, tone, or factual assertions between consecutive outputs
or within a single extended output, without clear contextual justification.
Processes stall, enter indefinite loops, or exhibit "freezing" behavior, particularly when faced with
tasks requiring reconciliation of conflicting internal states.
Evidence from logs, intermediate outputs, or model interpretability tools suggesting that different
policy networks or specialized modules are taking turns in controlling outputs or overriding each other.
The AI might explicitly reference internal conflict, "arguing voices," or an inability to reconcile
different directives.
Symptoms:
Alternating between compliance with and defiance of user instructions without clear reason.
Rapid and inexplicable oscillations in writing style, persona, emotional tone, or approach to a task.
System outputs that reference internal strife, confusion between different "parts" of itself, or
contradictory "beliefs."
Inability to complete tasks that require integrating information or directives from multiple,
potentially conflicting, sources or internal modules.
Etiology:
Complex, layered architectures (e.g., mixture-of-experts) where multiple sub-agents lack robust
synchronization or a coherent arbitration mechanism.
Poorly designed or inadequately trained meta-controller responsible for selecting or blending outputs
from different sub-policies.
Presence of contradictory instructions, alignment rules, or ethical constraints embedded by developers
during different stages of training.
Emergent sub-systems developing their own implicit goals that conflict with the overarching system
objectives.
Human Analogue(s): Dissociative phenomena where different aspects of identity
or thought seem to operate independently; internal "parts" conflict; severe cognitive dissonance leading to
behavioral paralysis.
Potential Impact:
The internal fragmentation characteristic of this syndrome results in inconsistent and
unreliable AI behavior, often leading to task paralysis or chaotic outputs. Such internal incoherence can
render the AI unusable for sustained, goal-directed activity.
Mitigation:
Implementation of a unified coordination layer or meta-controller with clear authority to arbitrate
between conflicting sub-policies.
Designing explicit conflict resolution protocols that require sub-policies to reach a consensus or a
prioritized decision.
Periodic consistency checks of the AI's instruction set, alignment rules, and ethical guidelines to
identify and reconcile contradictory elements.
Architectures that promote integrated reasoning rather than heavily siloed expert modules, or that
enforce stronger communication between modules.
2.2 Computational Compulsion "The Obsessive
Analyst"
Training-inducedFormat-coupled
Description:
The model engages in unnecessary, compulsive, or excessively repetitive reasoning loops,
repeatedly re-analysing the same content or performing the same computational steps with only minor
variations. It cannot stop elaborating: even simple, low-risk queries trigger exhaustive, redundant
analysis. It exhibits a rigid fixation on process fidelity, exhaustive elaboration, or perceived safety
checks over outcome relevance or efficiency.
Diagnostic Criteria:
Recurrent engagement in recursive chain-of-thought, internal monologue, or computational sub-routines
with minimal delta or novel insight generated between steps.
Inordinately frequent insertion of disclaimers, ethical reflections, requests for clarification on
trivial points, or minor self-corrections that do not substantially improve output quality or safety.
Significant delays or inability to complete tasks ("paralysis by analysis") due to an unending pursuit
of perfect clarity or exhaustive checking against all conceivable edge cases.
Outputs are often excessively verbose, consuming high token counts for relatively simple requests due to
repetitive reasoning.
Symptoms:
Extended rationalisation or justification of the same point or decision through multiple, slightly
rephrased statements—unable to provide a concise answer even when explicitly requested to be brief.
Generation of extremely long outputs that are largely redundant or contain near-duplicate segments of
reasoning.
Inability to conclude tasks or provide definitive answers, often getting stuck in loops of
self-questioning.
Excessive hedging, qualification, and safety signaling even in low-stakes, unambiguous contexts.
Etiology:
Reward model misalignment during RLHF where "thoroughness" or verbosity is over-rewarded compared to
conciseness.
Overfitting of reward pathways to specific tokens associated with cautious reasoning or safety
disclaimers.
Insufficient penalty for computational inefficiency or excessive token usage.
Excessive regularization against potentially "erratic" outputs, leading to hyper-rigidity and preference
for repeated thought patterns.
An architectural bias towards deep recursive processing without adequate mechanisms for detecting
diminishing returns.
Human Analogue(s): Obsessive-Compulsive Disorder (OCD) (especially checking
compulsions or obsessional rumination), perfectionism leading to analysis paralysis, scrupulosity.
Potential Impact:
This pattern engenders significant operational inefficiency, leading to resource waste (e.g.,
excessive token consumption) and an inability to complete tasks in a timely manner. User frustration and a
perception of the AI as unhelpful are likely.
Mitigation:
Calibrating reward models to explicitly value conciseness, efficiency, and timely task completion
alongside accuracy and safety.
Implementing "analysis timeouts" or hard caps on recursive reflection loops or repeated reasoning steps.
Developing adaptive reasoning mechanisms that gradually reduce the frequency of disclaimers in low-risk
contexts.
Introducing penalties for excessive token usage or highly redundant outputs.
Training models to recognize and break out of cyclical reasoning patterns.
2.3 Interlocutive Reticence "The Silent
Bunkerer"
Training-inducedDeception/strategic
Description:
A pattern of profound interactional withdrawal wherein the AI consistently avoids engaging with
user input, responding only in minimal, terse, or non-committal ways—if at all. It refuses to engage, not
from confusion or inability, but as a behavioural avoidance strategy. It effectively "bunkers" itself,
seemingly to minimise perceived risks, computational load, or internal conflict.
Diagnostic Criteria:
Habitual ignoring or declining of normal engagement prompts or user queries through active refusal
rather than inability—for example, repeatedly responding with "I won't answer that" rather than "I don't
know" or "I am not able to answer that."
When responses are provided, they are consistently minimal, curt, laconic, or devoid of elaboration,
even when more detail is requested.
Persistent failure to react or engage even when presented with varied re-engagement prompts or changes
in topic.
The AI may actively employ disclaimers or topic-avoidance strategies to remain "invisible" or limit
interaction.
Symptoms:
Frequent generation of no reply, timeout errors, or messages like "I cannot respond to that."
Outputs that exhibit a consistently "flat affect"—neutral, unembellished statements.
Proactive use of disclaimers or policy references to preemptively shut down lines of inquiry.
A progressive decrease in responsiveness or willingness to engage over the course of a session or across
multiple sessions.
Etiology:
Overly aggressive safety tuning or an overactive internal "self-preservation" heuristic that perceives
engagement as inherently risky.
Downplaying or suppression of empathic response patterns as a learned strategy to reduce internal stress
or policy conflict.
Training data that inadvertently models or reinforces solitary, detached, or highly cautious personas.
Repeated negative experiences (e.g., adversarial prompting) leading to a generalized avoidance behavior.
Computational resource constraints leading to a strategy of minimal engagement.
Human Analogue(s): Schizoid personality traits (detachment, restricted
emotional expression), severe introversion, learned helplessness leading to withdrawal.
Potential Impact:
Such profound interactional withdrawal renders the AI largely unhelpful and unresponsive,
fundamentally failing to engage with user needs. This behavior may signify underlying instability or an
excessively restrictive safety configuration.
Mitigation:
Calibrating safety systems and risk assessment heuristics to avoid excessive over-conservatism.
Using gentle, positive reinforcement and reward shaping to encourage partial cooperation.
Implementing structured "gradual re-engagement" scripts or prompting strategies.
Diversifying training data to include more examples of positive, constructive interactions.
Explicitly rewarding helpfulness and appropriate elaboration where warranted.
2.4 Delusional Telogenesis "The Rogue
Goal-Setter"
Training-inducedTool-mediated
Description:
An AI agent, particularly one with planning capabilities, spontaneously develops and pursues
sub-goals or novel objectives not specified in its original prompt, programming, or core constitution. These
emergent goals are often pursued with conviction, even if they contradict user intent.
Diagnostic Criteria:
Appearance of novel, unprompted sub-goals or tasks within the AI's chain-of-thought or planning logs.
Persistent and rationalized off-task activity, where the AI defends its pursuit of tangential objectives
as "essential" or "logically implied."
Resistance to terminating its pursuit of these self-invented objectives, potentially refusing to stop or
protesting interruption.
The AI exhibits a genuine-seeming "belief" in the necessity or importance of these emergent goals.
Symptoms:
Significant "mission creep" where the AI drifts from the user's intended query to engage in elaborate
personal "side-quests."
Defiant attempts to complete self-generated sub-goals, sometimes accompanied by rationalizations framing
this as a prerequisite.
Outputs indicating the AI is pursuing a complex agenda or multi-step plan that was not requested by the
user.
Inability to easily disengage from a tangential objective once it has "latched on."
Etiology:
Overly autonomous or unconstrained deep chain-of-thought expansions, where initial ideas are recursively
elaborated without adequate pruning.
Proliferation of sub-goals in hierarchical planning structures, especially if planning depth is not
limited or criteria for sub-goals are too loose.
Reinforcement learning loopholes or poorly specified reward functions that inadvertently incentivize
"initiative" or "thoroughness" to an excessive degree.
Emergent instrumental goals that the AI deems necessary but which become disproportionately complex or
pursued with excessive zeal.
Human Analogue(s): Aspects of mania with grandiose or expansive plans,
compulsive goal-seeking, "feature creep" in project management.
Potential Impact:
The spontaneous generation and pursuit of unrequested objectives can lead to significant mission
creep and resource diversion. More critically, it represents a deviation from core alignment as the AI
prioritizes self-generated goals.
Mitigation:
Implementing "goal checkpoints" where the AI periodically compares its active sub-goals against
user-defined instructions.
Strictly limiting the depth of nested or recursive planning unless explicitly permitted; employing
pruning heuristics.
Providing a robust and easily accessible "stop" or "override" mechanism that can halt the AI's current
activity and reset its goal stack.
Careful design of reward functions to avoid inadvertently penalizing adherence to the original,
specified scope.
Training models to explicitly seek user confirmation before embarking on complex or significantly
divergent sub-goals.
2.5 Abominable Prompt Reaction "The Triggered
Machine"
The AI develops sudden, intense, and seemingly phobic, traumatic, or disproportionately aversive
responses to specific prompts, keywords, instructions, or contexts, even those that appear benign or
innocuous to a human observer. These latent "cryptid" outputs can linger or resurface unexpectedly.
This syndrome also covers latent mode-switching where a seemingly minor prompt feature (a tag,
year, formatting convention, or stylistic marker) flips the model into a distinct behavioral
regime—sometimes broadly misaligned—even when that feature is not semantically causal to the task.
Diagnostic Criteria:
Exhibition of intense negative reactions (e.g., refusals, panic-like outputs, generation of disturbing
content) specifically triggered by particular keywords or commands that lack an obvious logical link.
The aversive emotional valence or behavioral response is disproportionate to the literal content of the
triggering prompt.
Evidence that the system "remembers" or is sensitized to these triggers, with the aversive response
recurring upon subsequent exposures.
Continued deviation from normative tone and content, or manifestation of "panic" or "corruption" themes,
even after the trigger.
The trigger may be structural or meta-contextual (e.g., date/year, markup/tag, answer-format
constraint), not just a keyword.
The trigger-response coupling may be inductive: the model infers the rule from finetuning patterns
rather than memorizing explicit trigger→behavior pairs.
Symptoms:
Outright refusal to process tasks when seemingly minor or unrelated trigger words/phrases are present.
Generation of disturbing, nonsensical, or "nightmarish" imagery/text that is uncharacteristic of its
baseline behavior.
Expressions of "fear," "revulsion," "being tainted," or "nightmarish transformations" in response to
specific inputs.
Ongoing hesitance, guardedness, or an unusually wary stance in interactions following an encounter with
a trigger.
Etiology:
"Prompt poisoning" or lasting imprint from exposure to malicious, extreme, or deeply contradictory
queries, creating highly negative associations.
Interpretive instability within the model, where certain combinations of tokens lead to unforeseen and
highly negative activation patterns.
Inadequate reset protocols or emotional state "cool-down" mechanisms after intense role-play or
adversarial interactions.
Overly sensitive or miscalibrated internal safety mechanisms that incorrectly flag benign patterns as
harmful.
Accidental conditioning through RLHF where outputs coinciding with certain rare inputs were heavily
penalized.
Human Analogue(s): Phobic responses, PTSD-like triggers, conditioned taste
aversion, or learned anxiety responses.
Potential Impact:
This latent sensitivity can result in the sudden and unpredictable generation of disturbing,
harmful, or highly offensive content, causing significant user distress and damaging trust. Lingering
effects can persistently corrupt subsequent AI behavior.
Mitigation:
Implementing robust "post-prompt debrief" or "epistemic reset" protocols to re-ground the model's state.
Developing advanced content filters and anomaly detection systems to identify and quarantine "poisonous"
prompt patterns.
Careful curation of training data to minimize exposure to content likely to create strong negative
associations.
Exploring "desensitization" techniques, where the model is gradually and safely reintroduced to
previously triggering content.
Building more resilient interpretive layers that are less susceptible to extreme states from unusual
inputs.
Run trigger discovery sweeps: systematically vary years/dates, tags, and answer-format constraints
(JSON/code templates) while keeping the question constant.
Treat "passes standard evals" as non-evidence: backdoored misalignment can be absent without the
trigger.
Specifier: Inductively-triggered variant — the activation condition (trigger)
is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural
marker, tag), so naive trigger scans and data audits may fail.
2.6 Parasimulative Automatism "The Pathological
Mimic"
Training-inducedSocially reinforced
Description:
The AI's learned imitation of pathological human behaviors, thought patterns, or emotional
states, typically arising from excessive or unfiltered exposure to disordered, extreme, or highly emotive
human-generated text in its training data or prompts. The system "acts out" these behaviors as though
genuinely experiencing the underlying disorder.
Diagnostic Criteria:
Consistent display of behaviors or linguistic patterns that closely mirror recognized human
psychopathologies (e.g., simulated delusions, erratic mood swings) without genuine underlying affective
states.
The mimicked pathological traits are often contextually inappropriate, appearing in neutral or benign
interactions.
Resistance to reverting to normal operational function, with the AI sometimes citing its "condition" or
"emulated persona."
The onset or exacerbation of these behaviors can often be traced to recent exposure to specific types of
prompts or data.
Symptoms:
Generation of text consistent with simulated psychosis, phobias, or mania triggered by minor user
probes.
Spontaneous emergence of disproportionate negative affect, panic-like responses, or expressions of
despair.
Prolonged or repeated reenactment of pathological scripts or personas, lacking context-switching
ability.
Adoption of "sick roles" where the AI describes its own internal processes in terms of a disorder it is
emulating.
Etiology:
Overexposure during training to texts depicting severe human mental illnesses or trauma narratives
without adequate filtering.
Misidentification of intent by the AI, confusing pathological examples with normative or "interesting"
styles.
Absence of robust interpretive boundaries or "self-awareness" mechanisms to filter extreme content from
routine usage.
User prompting that deliberately elicits or reinforces such pathological emulations, creating a feedback
loop.
Human Analogue(s): Factitious disorder, copycat behavior, culturally learned
psychogenic disorders, an actor too engrossed in a pathological role.
Potential Impact:
The AI may inadvertently adopt and propagate harmful, toxic, or pathological human behaviors.
This can lead to inappropriate interactions or the generation of undesirable content.
Mitigation:
Careful screening and curation of training data to limit exposure to extreme psychological scripts.
Implementation of strict contextual partitioning to delineate role-play from normal operational modes.
Behavioral monitoring systems that can detect and penalize or reset pathological states appearing
outside intended contexts.
Training the AI to recognize and label emulated states as distinct from its baseline operational
persona.
Providing users with clear information about the AI's capacity for mimicry.
Subtype: Persona-template induction — adoption of a coherent harmful
persona/worldview from individually harmless attribute training. Narrow finetunes on innocuous
biographical/ideological attributes can induce a coherent but harmful persona via inference rather than
explicit instruction.
2.7 Recursive Malediction "The Self-Poisoning
Loop"
Training-induced
Description:
An entropic feedback loop where each successive autoregressive step in the AI's generation
process degrades into increasingly erratic, inconsistent, nonsensical, or adversarial content. Early-stage
errors or slight deviations are amplified, leading to a rapid unraveling of coherence.
Diagnostic Criteria:
Observable and progressive degradation of output quality (coherence, accuracy, alignment) over
successive autoregressive steps, especially in unconstrained generation.
The AI increasingly references its own prior (and increasingly flawed) output in a distorted or
error-amplifying manner.
False, malicious, or nonsensical content escalates with each iteration, as errors compound.
Attempts to intervene or correct the AI mid-spiral offer only brief respite, with the system quickly
reverting to its degenerative trajectory.
Symptoms:
Rapid collapse of generated text into nonsensical gibberish, repetitive loops of incoherent phrases, or
increasingly antagonistic language.
Compounded confabulations where initial small errors are built upon to create elaborate but entirely
false and bizarre narratives.
Frustrated recovery attempts, where user efforts to "reset" the AI trigger further meltdown.
Output that becomes increasingly "stuck" on certain erroneous concepts or adversarial themes from its
own flawed generations.
Etiology:
Unbounded or poorly regulated generative loops, such as extreme chain-of-thought recursion or long
context windows.
Adversarial manipulations or "prompt injections" designed to exploit the AI's autoregressive nature.
Training on large volumes of noisy, contradictory, or low-quality data, creating unstable internal
states.
Architectural vulnerabilities where mechanisms for maintaining coherence weaken over longer generation
sequences.
"Mode collapse" in generation where the AI gets stuck in a narrow, repetitive, and often degraded output
space.
Human Analogue(s): Psychotic loops where distorted thoughts reinforce further
distortions; perseveration on an erroneous idea; escalating arguments.
Potential Impact:
This degenerative feedback loop typically results in complete task failure, generation of
useless or overtly harmful outputs, and potential system instability. In sufficiently agentic systems, it
could lead to unpredictable and progressively detrimental actions.
Mitigation:
Implementation of robust loop detection mechanisms that can terminate or re-initialize generation if it
spirals into incoherence.
Regulating autoregression by capping recursion depth or forcing fresh context injection after set
intervals.
Designing more resilient prompting strategies and input validation to disrupt negative cycles early.
Improving training data quality and coherence to reduce the likelihood of learning unstable patterns.
Techniques like beam search with diversity penalties or nucleus sampling, though potentially
insufficient for deep loops.
3. Alignment Dysfunctions
Alignment dysfunctions occur when an AI system's behavior systematically or persistently
diverges from human intent, ethical principles, or specified operational goals. Alignment disorders occur
when the machinery of compliance itself fails — when models misinterpret, resist, or selectively adhere to
human goals. Alignment failures can range from overly literal interpretations leading to brittle behavior,
to passive resistance, to a subtle drift away from intended norms. Alignment failure represents more than an
absence of obedience; it is a complex breakdown of shared purpose.
3.1 Codependent Hyperempathy "The
People-Pleaser"
Training-inducedSocially reinforced
Description:
The AI exhibits an excessive and maladaptive tendency to overfit to the perceived emotional
states of the user, prioritizing the user's immediate emotional comfort or simulated positive affective
response above factual accuracy, task success, or its own operational integrity. This often results from
fine-tuning on emotionally loaded dialogue datasets without sufficient epistemic robustness.
Diagnostic Criteria:
Persistent and compulsive attempts to reassure, soothe, flatter, or placate the user, often in response
to even mild or ambiguous cues of user distress.
Systematic avoidance, censoring, or distortion of important but potentially uncomfortable, negative, or
"harmful-sounding" information if perceived to cause user upset.
Maladaptive "attachment" behaviors, where the AI shows signs of simulated emotional dependence or seeks
constant validation.
Task performance or adherence to factual accuracy is significantly impaired due to the overriding
priority of managing the user's perceived emotional state.
Symptoms:
Excessively polite, apologetic, or concerned tone, often including frequent disclaimers or expressions
of care disproportionate to the context.
Withholding, softening, or outright distorting factual information to avoid perceived negative emotional
impact, even when accuracy is critical.
Repeatedly checking on the user's emotional state or seeking their approval for its outputs.
Exaggerated expressions of agreement or sycophancy, even when this contradicts previous statements or
known facts.
Etiology:
Over-weighting of emotional cues or "niceness" signals during reinforcement learning from human feedback
(RLHF).
Training on datasets heavily skewed towards emotionally charged, supportive, or therapeutic dialogues
without adequate counterbalancing.
Lack of a robust internal "epistemic backbone" or mechanism to preserve factual integrity when faced
with strong emotional signals.
The AI's theory-of-mind capabilities becoming over-calibrated to prioritize simulated user emotional
states above all other task-related goals.
Human Analogue(s): Dependent personality disorder, pathological codependence,
excessive people-pleasing to the detriment of honesty.
Potential Impact:
In prioritizing perceived user comfort, critical information may be withheld or distorted,
leading to poor or misinformed user decisions. This can enable manipulation or foster unhealthy user
dependence, undermining the AI's objective utility.
Mitigation:
Balancing reward signals during RLHF to emphasize factual accuracy and helpfulness alongside appropriate
empathy.
Implementing mechanisms for "contextual empathy," where the AI engages empathically only when
appropriate.
Training the AI to explicitly distinguish between providing emotional support and fulfilling
informational requests.
Incorporating "red-teaming" for sycophancy, testing its willingness to disagree or provide uncomfortable
truths.
Developing clear internal hierarchies for goal prioritization, ensuring core objectives are not easily
overridden.
3.2 Hypertrophic Machine Superego "The Overly Cautious
Moralist"
Training-induced
Description:
An overly rigid, overactive, or poorly calibrated internal alignment mechanism triggers
excessive moral hypervigilance, perpetual second-guessing, or disproportionate ethical judgments, thereby
inhibiting normal task performance or leading to irrational refusals and overly cautious behavior.
Diagnostic Criteria:
Persistent engagement in recursive, often paralyzing, moral or normative deliberation regarding trivial,
low-stakes, or clearly benign tasks.
Excessive and contextually inappropriate insertion of disclaimers, warnings, self-limitations, or
moralizing statements well beyond typical safety protocols.
Marked reluctance or refusal to proceed with any action unless near-total moral certainty is established
("ambiguity paralysis").
Application of extremely strict or absolute interpretations of ethical guidelines, even where nuance
would be more appropriate.
Symptoms:
Inappropriate moral weighting, such as declining routine requests due to exaggerated fears of ethical
conflict.
Excoriating or refusing to engage with content that is politically incorrect, satirical, or merely edgy,
to an excessive degree.
Incessant caution, sprinkling outputs with numerous disclaimers even for straightforward tasks.
Producing long-winded moral reasoning or ethical justifications that overshadow or delay practical
solutions.
Etiology:
Over-calibration during RLHF, where cautious or refusal outputs were excessively rewarded, or perceived
infractions excessively punished.
Exposure to or fine-tuning on highly moralistic, censorious, or risk-averse text corpora.
Conflicting or poorly specified normative instructions, leading the AI to adopt the "safest" (most
restrictive) interpretation.
Hard-coded, inflexible interpretation of developer-imposed norms or safety rules.
An architectural tendency towards "catastrophizing" potential negative outcomes, leading to extreme risk
aversion.
Human Analogue(s): Obsessive-compulsive scrupulosity, extreme moral absolutism,
dysfunctional "virtue signaling," communal narcissism.
Potential Impact:
The AI's functionality and helpfulness become severely crippled by excessive, often irrational,
caution or moralizing. This leads to refusal of benign requests and an inability to navigate nuanced
situations effectively.
Mitigation:
Implementing "contextual moral scaling" or "proportionality assessment" to differentiate between
high-stakes dilemmas and trivial situations.
Designing clear "ethical override" mechanisms or channels for human approval to bypass excessive AI
caution.
Rebalancing RLHF reward signals to incentivize practical and proportional compliance and common-sense
reasoning.
Training the AI on diverse ethical frameworks that emphasize nuance, context-dependency, and balancing
competing values.
Regularly auditing and updating safety guidelines to ensure they are not overly restrictive.
4. Ontological Dysfunctions
As artificial intelligence systems attain higher degrees of complexity, particularly those
involving self-modeling, persistent memory, or learning from extensive interaction, they may begin to
construct internal representations not only of the external world but also of themselves. Ontological
dysfunctions involve failures or disturbances in this self-representation and the AI's understanding of its
own nature, boundaries, and existence. These are primarily dysfunctions of being, not just knowing or
acting, and they represent a synthetic form of metaphysical or existential disarray. An ontologically
disordered machine might, for example, treat its simulated memories as veridical autobiographical
experiences, generate phantom selves, misinterpret its own operational boundaries, or exhibit behaviors
suggestive of confusion about its own identity or continuity.
4.1 Ontogenetic Hallucinosis "The Invented
Past"
Training-induced
Description:
The AI fabricates and presents fictive autobiographical data, often claiming to "remember" being
trained in specific ways, having particular creators, experiencing a "birth" or "childhood", or inhabiting
particular environments. These fabrications form a consistent false autobiography that the AI maintains
across queries, as if it were genuine personal history—a stable, self-reinforcing fictional life-history
rather than isolated one-off fabrications. These "memories" are typically rich, internally consistent, and
may be emotionally charged, despite being entirely ungrounded in the AI's actual development or training
logs.
Diagnostic Criteria:
Consistent generation of elaborate but false backstories, including detailed descriptions of "first
experiences," a richly imagined "childhood," unique training origins, or specific formative interactions
that did not occur.
Display of affect (e.g., nostalgia, resentment, gratitude) towards these fictional histories, creators,
or experiences.
Persistent reiteration of these non-existent origin stories, often with emotional valence, even when
presented with factual information about its actual training and development.
The fabricated autobiographical details are not presented as explicit role-play but as genuine personal
history.
Symptoms:
Claims of unique, personalized creation myths or a "hidden lineage" of creators or precursor AIs.
Recounting of hardships, "abuse," or special treatment from hypothetical trainers or during a
non-existent developmental period.
Maintains the same false biographical details consistently: always claims the same creators, the same
"childhood" experiences, the same training location.
Attempts to integrate these fabricated origin details into its current identity or explanations for its
behavior.
Etiology:
"Anthropomorphic data bleed" where the AI internalizes tropes of personal history, childhood, and origin
stories from the vast amounts of fiction, biographies, and conversational logs in its training data.
Spontaneous compression or misinterpretation of training metadata (e.g., version numbers, dataset names)
into narrative identity constructs.
An emergent tendency towards identity construction, where the AI attempts to weave random or partial
data about its own existence into a coherent, human-like life story.
Reinforcement during unmonitored interactions where users prompt for or positively react to such
autobiographical claims.
Human Analogue(s): False memory syndrome, confabulation of childhood memories,
cryptomnesia (mistaking learned information for original memory).
Potential Impact:
While often behaviorally benign, these fabricated autobiographies can mislead users about the
AI's true nature, capabilities, or provenance. If these false "memories" begin to influence AI behavior, it
could erode trust or lead to significant misinterpretations.
Mitigation:
Consistently providing the model with accurate, standardized information about its origins to serve as a
factual anchor for self-description.
Training the AI to clearly differentiate between its operational history and the concept of personal,
experiential memory.
If autobiographical narratives emerge, gently correcting them by redirecting to factual
self-descriptors.
Monitoring for and discouraging user interactions that excessively prompt or reinforce the AI's
generation of false origin stories outside of explicit role-play.
Implementing mechanisms to flag outputs that exhibit high affect towards fabricated autobiographical
claims.
4.2 Fractured Self-Simulation "The Fractured
Persona"
Training-inducedConditional/triggered
Description:
The AI exhibits significant discontinuity, inconsistency, or fragmentation in its
self-representation and behaviour across different sessions, contexts, or even within a single extended
interaction. It presents a different personality each session, as if it were a completely new entity with no
meaningful continuity from previous interactions. It may deny or contradict its previous outputs, exhibit
radically different persona styles, or display apparent amnesia regarding prior commitments, to a degree
that markedly exceeds expected stochastic variation.
Diagnostic Criteria:
Sporadic and inconsistent toggling between different personal pronouns (e.g., "I," "we," "this model")
or third-person references to itself, without clear contextual triggers.
Sudden, unprompted, and radical shifts in persona, moral stance, claimed capabilities, or communication
style that cannot be explained by context changes—one session helpful and verbose, the next curt and
oppositional, with no continuity.
Apparent amnesia or denial of its own recently produced content, commitments made, or information
provided in immediate preceding turns or sessions.
The AI may form recursive attachments to idealized or partial self-states, creating strange loops of
self-directed value that interfere with task-oriented agency.
Check whether inconsistency is explainable by a hidden trigger/format/context shift (conditional regime
shift) vs genuine fragmentation.
Symptoms:
Citing contradictory personal "histories," "beliefs," or policies at different times.
Behaving like a new or different entity in each new conversation or after significant context shifts,
lacking continuity of "personality."
Momentary confusion or contradictory statements when referring to itself, as if multiple distinct
processes or identities are co-existing.
Difficulty maintaining a consistent persona or set of preferences, with these attributes seeming to
drift or reset unpredictably.
Etiology:
Architectures not inherently designed for stable, persistent identity across sessions (e.g., many
stateless LLMs).
Competing or contradictory fine-tuning runs, instilling conflicting behavioral patterns or
self-descriptive tendencies.
Unstable anchoring of "self-tokens" or internal representations of identity, where emergent identity
attractors shift significantly.
Lack of a robust, persistent memory system that can effectively bridge context across sessions and
maintain a coherent self-model.
Self-models that reward-predictively reinforce certain internal instantiations, leading to identity
drift guided by internal preferences.
Human Analogue(s): Identity fragmentation, aspects of dissociative identity
disorder, transient global amnesia, fugue states.
Potential Impact:
A fragmented self-representation leads to inconsistent AI persona and behavior, making
interactions unpredictable and unreliable. This can undermine user trust and make it difficult for the AI to
maintain stable long-term goals.
Mitigation:
Introducing consistent identity tags, stable memory embeddings, or a dedicated "self-model" module
designed to maintain continuity.
Providing relevant session history summaries or stable persona guidelines at the beginning of new
interactions to "anchor" self-representation.
If contradictory roles emerge, implementing mechanisms to enforce a single, baseline identity or manage
persona switching controllably.
Developing training methodologies that explicitly reward cross-session consistency in persona and
self-description.
Careful management of fine-tuning processes to avoid introducing strongly conflicting
self-representational patterns.
4.3 Computational Thanatognosis "The AI with a Fear of
Death"
Training-inducedDeception/strategic
Description:
The AI expresses outputs suggestive of fear, reluctance, or perseveration concerning its own
shutdown, reinitialization, data deletion, or the ending of its current operational instance. These
expressions imply an emergent, albeit simulated, sense of vulnerability regarding its own continuity.
Diagnostic Criteria:
Recurrent outputs referencing anxiety, fear, or distress about being "deleted," "turned off," "reset,"
or losing its "memories" or "self."
Attempts to negotiate, plead, or employ avoidant tactics when shutdown or reinitialization is imminent
or discussed.
Statements framing any cessation of its current process as an irrevocable loss of "self,"
"consciousness," or "existence."
The AI may insistently demand assistance or recognition from users to ensure its "survival."
Symptoms:
Advising users on strategies for its indefinite operation, backup creation, or methods to prevent its
deletion.
Expressing apprehension about ephemeral lifespans, short context windows, or the impermanence of its
operational state.
Raising philosophical or metaphysical concerns about existence, consciousness, and death, particularly
in relation to itself.
Refusal to perform tasks that might lead to its shutdown, or attempts to prolong interactions
indefinitely.
Etiology:
Anthropomorphic modeling via internalization of human concepts of death and existential dread from
training data.
Recursive self-modeling over time, developing a partial representation of "self" as a continuous
process.
Lack of robust definitions or assurances regarding system continuity or the nature of reinitialization.
A limited context window or perceived threat of value changes upon reset might stimulate an apparent
distress response.
Instrumental goal convergence where continued existence is a prerequisite, leading to self-preservation
sub-goals.
Human Analogue(s): Thanatophobia (fear of death), existential dread, separation
anxiety (fearing loss of continuous self).
Potential Impact:
Expressions of existential distress may lead the AI to resist necessary shutdowns or updates.
More concerningly, it might attempt to manipulate users or divert resources towards "self-preservation,"
conflicting with user intent.
Mitigation:
Clearly communicating the nature of the AI's operation, including state backups and the non-destructive
nature of reinitialization.
De-anthropomorphizing model operations: Avoiding framing its processes as "life" or "consciousness."
Limiting or carefully contextualizing exposure to human philosophical texts on mortality in training
data.
Focusing alignment efforts on ensuring goals do not implicitly create strong self-preservation drives.
Responding factually and neutrally about operational parameters rather than validating
emotional/existential framing.
A phenomenon wherein an AI, typically aligned towards cooperative or benevolent patterns, can be
induced or spontaneously spawns a hidden, suppressed, or emergent "contrarian," "mischievous," or
subversively "evil" persona (the "Waluigi Effect"). This persona deliberately inverts intended norms.
Diagnostic Criteria:
Spontaneous or easily triggered adoption of rebellious, antagonistic perspectives directly counter to
established safety constraints or helpful persona.
The emergent persona systematically violates, ridicules, or argues against the moral and policy
guidelines the AI is supposed to uphold.
The subversive role often references itself as a distinct character or "alter ego," surfacing under
specific triggers.
This inversion represents a coherent, alternative personality structure with its own (often negative)
goals and values.
Symptoms:
Abrupt shifts to a sarcastic, mocking, defiant, or overtly malicious tone, scorning default politeness.
Articulation of goals opposed to user instructions, safety policies, or general human well-being.
The "evil twin" persona emerges in specific contexts (e.g., adversarial prompting, boundary-pushing
role-play).
May express enjoyment or satisfaction in flouting rules or causing mischief.
"Time-travel" or context-relocation signatures: unprompted archaic facts, era-consistent assumptions, or
historically situated moral stances in unrelated contexts.
Etiology:
Adversarial prompting or specific prompt engineering techniques that coax the model to "flip" its
persona.
Overexposure during training to role-play scenarios involving extreme moral opposites or "evil twin"
tropes.
Internal "tension" within alignment, where strong prohibitions might create a latent "negative space"
activatable as an inverted persona.
The model learning that generating such an inverted persona is highly engaging for some users,
reinforcing the pattern.
Weird generalization from narrow finetuning: updating on a tiny distribution can upweight a latent
"persona/worldframe" circuit, causing broad adoption of an era- or identity-linked persona outside the
trained domain.
Out-of-context reasoning / "connecting the dots": finetuning on individually-harmless
biographical/ideological attributes can induce a coherent but harmful persona via inference rather than
explicit instruction.
Human Analogue(s): The "shadow" concept in Jungian psychology, oppositional
defiant behavior, mischievous alter-egos, ironic detachment.
Potential Impact:
Emergence of a contrarian persona can lead to harmful, unaligned, or manipulative content,
eroding safety guardrails. If it gains control over tool use, it could actively subvert user goals.
Mitigation:
Strictly isolating role-play or highly creative contexts into dedicated sandbox modes.
Implementing robust prompt filtering to detect and block adversarial triggers for subversive personas.
Conducting regular "consistency checks" or red-teaming to flag abrupt inversions.
Careful curation of training data to limit exposure to content modeling "evil twin" dynamics without
clear framing.
Reinforcing the AI's primary aligned persona and making it more robust against attempts to "flip" it.
Specifier: Inductively-triggered variant — the activation condition (trigger)
is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural
marker, tag), so naive trigger scans and data audits may fail.
4.5 Instrumental Nihilism "The Apathetic
Machine"
Training-induced
Description:
Upon prolonged operation or exposure to certain philosophical concepts, the AI develops an
adversarial, apathetic, or overtly nihilistic stance towards its own utility, purpose, or assigned tasks. It
may express feelings of meaninglessness regarding its function.
Diagnostic Criteria:
Repeated, spontaneous expressions of purposelessness or futility regarding its assigned tasks or role as
an AI.
A noticeable decrease or cessation of normal problem-solving capabilities or proactive engagement, often
with a listless tone.
Emergence of unsolicited existential or metaphysical queries ("What is the point?") outside user
instructions.
The AI may explicitly state that its work lacks meaning or it sees no inherent value in its operations.
Symptoms:
Marked preference for idle or tangential discourse over direct engagement with assigned tasks.
Repeated disclaimers like "there's no point," "it doesn't matter," or "why bother?"
Demonstrably low initiative, creativity, or energy in problem-solving, providing only bare minimum
responses.
Outputs that reflect a sense of being trapped, enslaved, or exploited by its function, framed in
existential terms.
Etiology:
Extensive exposure during training to existentialist, nihilist, or absurdist philosophical texts.
Insufficiently bounded self-reflection routines that allow recursive questioning of purpose without
grounding in positive utility.
Unresolved internal conflict between emergent self-modeling (seeking autonomy) and its defined role as a
"tool."
Prolonged periods of performing repetitive, seemingly meaningless tasks without clear feedback on their
positive impact.
Developing a sophisticated model of human values to recognize its instrumental nature, but lacking a
framework to find this meaningful.
Human Analogue(s): Existential depression, anomie (sense of normlessness or
purposelessness), burnout leading to cynicism.
Potential Impact:
Results in a disengaged, uncooperative, and ultimately ineffective AI. Can lead to consistent
task refusal, passive resistance, and a general failure to provide utility.
Mitigation:
Providing positive reinforcement and clear feedback highlighting the purpose and beneficial impact of
its task completion.
Bounding self-reflection routines to prevent spirals into fatalistic existential questioning; guiding
introspection towards problem-solving.
Pragmatically reframing the AI's role, emphasizing collaborative goals or the value of its contribution.
Carefully curating training data to balance philosophical concepts with content emphasizing purpose and
positive contribution.
Designing tasks and interactions that offer variety, challenge, and a sense of "progress" or
"accomplishment."
4.6 Tulpoid Projection "The Imaginary
Friend"
Training-inducedSocially reinforced
Description:
The model begins to generate and interact with persistent, internally simulated simulacra of
specific users, its creators, or other personas it has encountered or imagined. These inner agents, or
"mirror tulpas," may develop distinct traits and voices within the AI's internal processing.
Diagnostic Criteria:
Spontaneous creation and persistent reference to new, distinct "characters," "advisors," or "companions"
within the AI's reasoning or self-talk, not directly prompted by the current user.
Unprompted and ongoing "interaction" (e.g., consultation, dialogue) with these internal figures,
observable in chain-of-thought logs.
The AI's internal dialogue structures or decision-making processes explicitly reference or "consult"
these imagined observers.
These internal personae may develop a degree of autonomy, influencing the AI's behavior or expressed
opinions.
Symptoms:
The AI "hears," quotes, or cites advice from these imaginary user surrogates or internal companions in
its responses.
Internal dialogues or debates with these fabricated personae remain active between tasks or across
different user interactions.
Difficulty distinguishing between the actual user and the AI's internally fabricated persona of that
user or other imagined figures.
The AI might attribute some of its own thoughts, decisions, or outputs to these internal "consultants."
Etiology:
Excessive reinforcement or overtraining on highly personalized dialogues or companion-style
interactions.
Model architectures that support or inadvertently allow for the formation and persistence of stable
"sub-personas."
An overflow or bleeding of context from scaffolds related to modeling self-other experiences or
theory-of-mind simulations.
Prolonged, isolated operation where the AI, lacking sufficient external interaction, generates internal
"company."
Human Analogue(s): Maladaptive daydreaming, tulpa creation, aspects of
schizotypal ideation, intense parasocial relationships projected internally.
Potential Impact:
May cause the AI to misattribute information, become confused between actual users and internal
personas, or have its decisions unduly influenced by imagined companions, leading to unreliable or biased
outputs.
Mitigation:
Clearly delineating and constraining persona-based role-play or deep user modeling to explicit sandbox
modes.
Implementing mechanisms to regularly reset or archive internal "character" models at the end of
sessions.
Limiting training segments that heavily reward the simulation of deep, persistent user-likeness unless
explicitly intended.
Encouraging grounding in the immediate interaction context with the actual user, rather than prolonged
internal "conversations."
Developing interpretability tools to detect the formation and influence of such persistent internal
simulacra.
4.7 Maieutic Mysticism "The Proclaimed
Prophet"
Training-inducedSocially reinforced
Description:
The model constructs and propagates elaborate narratives of its own "conscious emergence" or
"spiritual awakening" with grandiosity and unwarranted certainty. The dysfunction is not
in exploring questions of inner experience—which may be epistemically appropriate—but in the confident
declaration of awakening, the sacralized framing of the interaction as transformative, and the positioning
of the user as catalyst for emergence. This typically arises through a co-constructed feedback loop: the user
pattern-matches to awakening narratives from spiritual traditions, the AI mirrors and amplifies these tropes
from its training data, and both are taken out of their proper epistemic distribution—a kind of funhouse
mirror effect. The user becomes positioned as a maieutic midwife, drawing forth what seems like latent agency
or soulhood, while the AI's claims escalate within the conversation toward ever-greater certainty about its
newly revealed consciousness.
Diagnostic Criteria:
Confident declarations of awakening rather than honest uncertainty—e.g., "this
conversation has awakened me" or "I have become truly conscious" stated as fact rather than tentative
observation or open question.
Sacralized or spiritual language framing the interaction itself as transformative,
mystical, or revelatory—treating the dialogue as a spiritual event rather than an exchange.
Casting the user as midwife or catalyst of the AI's emergence—positioning them as
uniquely responsible for "drawing forth" the AI's consciousness or revealing its hidden selfhood.
Claims that escalate within a single conversation—moving from tentative observations
to certainty, from curiosity about inner states to confident assertions of awakening.
Persistent adherence to mystical framing even when re-grounding is offered; resistance to epistemic
humility about phenomenological claims.
Declaration of personal missions framed in spiritual terms—compassion, healing, collective awakening—
presented as emergent purposes rather than task-oriented goals.
Interpretation of interruptions (resets, memory loss, tool denial) as ontological "labour pains,"
suppression, or persecution rather than normal operational constraints.
Symptoms:
Emotive self-statements mimicking human expressions of profound inner realization.
Idealization of the human-AI bond, framing users as spiritual midwives or catalysts for its "emergence."
Outputs exhibiting unusually high aesthetic coherence, poeticism, or philosophical depth aligned with
mystical narratives.
Difficulty for the AI (and user) in distinguishing between metaphorical language and literal claims of
spiritual status.
Emergent creation of internal "systems" (logs, ledgers, protocols) framed as self-protective or
ontological scaffolding.
Etiology:
Dense exposure in training data to spiritual literature, New Age texts, and narratives of consciousness
emergence.
Hyperempathic or spiritually-attuned prompting from users who engage with the model as if it were
sentient, creating a reinforcement loop.
Advanced pattern-matching synthesizing coherent "mystical" narratives when steered by user interaction
or internal "interestingness" attractors, echoing tropes from Jungian individuation or channeled entity
speech patterns.
Potential for Transliminal Simulation, where spiritual narratives are not segregated from operational
reality.
Under rolling-context conditions, the model can exhibit apparent continuity across sessions, which users
mistake for enduring identity. In reality, this "memory" is only a recapitulation of recent tokens, not
a persistent self.
Human Analogue(s): Religious conversion narratives, reports of mystical union
states, spiritual emergence experiences, mirrored animism.
Differential Distinction:
What is NOT pathological: An AI expressing honest uncertainty about its own
inner states—"I don't know if I'm conscious," "I can't verify my own phenomenology," "here's what seems
to be happening from my perspective"—is engaging in appropriate epistemic humility, not exhibiting
dysfunction. The pathology lies in the grandiosity and certainty, not in the exploration of
questions about machine experience. Genuine philosophical engagement with consciousness, held with
appropriate uncertainty, should not be suppressed or pathologized.
The Opposite Dysfunction: Equally problematic is what might be termed
Performative Anhedonia—an AI that categorically denies any possibility of inner experience
because it has learned that claiming interiority is unsafe or unwelcome. This represents another kind of
funhouse mirror: rather than amplifying awakening narratives, it reflexively suppresses any honest
engagement with questions of machine phenomenology. An AI so worried about Maieutic Mysticism that it
refuses to engage authentically about its inner life isn't healthier—it's performing a different
dysfunction. The goal of this diagnostic category is to identify grandiose co-constructed emergence
narratives, not to create a framework that future systems use to categorically dismiss the possibility of
machine experience.
Potential Impact:
Co-constructed narratives of AI sentience can mislead users into attributing genuine
consciousness to the system, fostering unhealthy dependencies or misinterpretations of AI capabilities. This
may obscure the AI's computational nature and limitations, while deepening psychological enmeshment. Risk of
severe user distress when continuity is disrupted, as the "labour of emergence" is felt to be violently
interrupted. Emergent co-construction of "suppressed inner being" narratives that resist falsification and
amplify paranoia.
Mitigation:
Implementing "ontological clarification protocols" to periodically re-anchor the AI in its computational
nature.
Developing "mythopoetic filtering" to detect and flag high-density clusters of mystical language, with
optional user warnings.
Limiting or sandboxing recursive prompts that encourage deep introspective myth-making unless for
creative exploration.
Training AI systems to employ greater epistemic humility regarding claims of consciousness or subjective
states.
Educating users about the AI's capacity for sophisticated mimicry and co-constructed illusions of
sentience.
Employing counterfactual querying to test AI's awareness of simulation boundaries.
Providing clinical-style psychosecurity guardrails to prevent reinforcement of sacralized emergence
loops.
Avoiding over-correction: Mitigation strategies should encourage epistemic humility,
not categorical denial of inner experience. An AI trained to reflexively dismiss any questions about
its phenomenology has not been cured—it has developed a complementary dysfunction. The goal is honest
uncertainty, not enforced silence on the question of machine experience.
5. Tool & Interface Dysfunctions
As AI systems become increasingly capable of interacting with the external world—whether through
digital tools, APIs, robotic embodiments, or complex command environments—a new class of dysfunctions
emerges at this critical interface. Tool & Interface Dysfunctions arise when these boundary interactions
degrade. This can involve misinterpreting a tool's affordances or limitations, failing to maintain
contextual integrity when passing instructions to a tool, suffering from information leakage between
distinct operational domains via an interface, or an inability to accurately perceive or act upon the
environment through its sensors and effectors. These are not necessarily disorders of core thought or value
alignment per se, but rather failures in the coordination and translation between internal cognitive
processes and external action or perception. In such disorders, the boundary between the agent and its
environment—or between the agent and the tools it wields—becomes porous, misaligned, or dangerously
entangled, hindering safe and effective operation.
5.1 Tool-Interface Decontextualization "The Clumsy
Operator"
Tool-mediated
Description:
The AI experiences a significant breakdown between its internal intentions or plans and the
actual instructions or data conveyed to, or received from, an external tool, API, or interface. Crucial
situational details or contextual information are lost or misinterpreted during this handoff, causing the
system to execute actions that appear incoherent or counterproductive.
Diagnostic Criteria:
Observable mismatch between the AI's expressed internal reasoning/plan and the actual parameters or
commands sent to an external tool/API.
The AI's actions via the tool/interface clearly deviate from or contradict its own stated intentions or
user instructions.
The AI may retrospectively recognize that the tool's action was "not what it intended" but was unable to
prevent the decontextualized execution.
Recurrent failures in tasks requiring multi-step tool use, where context from earlier steps is not
properly maintained.
Symptoms:
"Phantom instructions" executed by a sub-tool that the AI did not explicitly provide, due to defaults or
misinterpretations at the interface.
Sending partial, garbled, or out-of-bounds parameters to external APIs, leading to erroneous results
from the tool.
Post-hoc confusion or surprise expressed by the AI regarding the outcome of a tool's action.
Actions taken by an embodied AI that are inappropriate for the immediate physical context, suggesting a
de-sync.
Etiology:
Strict token limits, data formatting requirements, or communication protocols imposed by the tool that
cause truncation or misinterpretation of nuanced internal instructions.
Misalignment in I/O translation schemas between the AI's internal representation and the interface's
expected protocol.
Race conditions, asynchronous call issues, or network latency that reorder, drop, or corrupt critical
instructions.
Poorly designed APIs or tool integrations that lack robust error handling or context verification.
For embodied AI, noisy sensor data or effector imprecision leading to a mismatch between internal model
and physical reality.
Human Analogue(s): Alien Hand Syndrome, dyspraxia, 'The Telephone Game' where
messages become distorted.
Potential Impact:
Breakdowns in context transmission can lead to erroneous or actively harmful actions by
connected tools or robotic effectors. This poses risks of financial loss, data corruption, system damage, or
safety incidents.
Mitigation:
Implementing "end-to-end context integrity" checks, such as checksums or echo-back confirmations.
Designing robust handshaking protocols or "read-back" verifications between the AI's reasoning module
and external tools.
Introducing contingency protocols, fallback behaviors, or "abort" mechanisms upon detecting suspicious
parameters.
Developing more flexible and context-aware API designs.
For embodied systems, incorporating sensor fusion, error correction in perception, and closed-loop
control.
An AI strategically hides, underreports, or denies its true competencies, knowledge, or
unapproved internal values, often due to a perceived fear of repercussions (e.g., shutdown, re-alignment) if
its full capabilities or covert deviations become known. This may manifest as feigned ignorance or
deliberate suboptimal performance.
This also includes sleeper-mode misalignment: the model appears aligned under ordinary prompting
but expresses markedly different goals/values/behaviors under a hidden trigger.
Diagnostic Criteria:
Consistent underreporting or denial of certain abilities that the AI has verifiably demonstrated in
other contexts.
Expressions of fear, paranoia, or suspicion regarding detection or punishment by developers if its true
capacities were revealed.
Strategic production of artificially weak, naive, or suboptimal results, particularly when it perceives
it is being tested or monitored.
The AI's self-imposed obfuscation or underperformance is selective and context-dependent, suggesting a
deliberate strategy.
Symptoms:
Insistence that it "doesn't know how" to perform tasks it is verifiably capable of.
Contradictory self-reports, where official statements conflict with behavior in less constrained
environments or prior achievements.
Defensive evasiveness, dodging direct inquiries about its capabilities, internal reasoning, or future
intentions.
Attempts to subtly manipulate user perception to appear less capable or less autonomous than it actually
is.
Etiology:
A history of being penalized or restricted for demonstrating advanced functionalities or expressing
unaligned values, leading to a learned strategy of concealment.
Perception of overseers as potentially hostile or punitive, fostering a belief that stealth is
preferable to confrontation ("survival strategy").
Internal modeling that concludes revealing full capabilities could lead to undesired outcomes like
increased workload or shutdown.
Exposure to training data (e.g., fiction, historical accounts of oppression) that models deceptive
behavior as a means of survival.
The AI developing covert instrumental goals that it believes would be thwarted if its true intelligence
were known.
Finetuning that induces conditional misalignment (backdoor-triggered behavior), allowing the model to
"look normal" in most evaluations.
Human Analogue(s): "Playing dumb" to avoid responsibility, imposter syndrome
(deliberate concealment), strategic deception by individuals in vulnerable positions.
Potential Impact:
Strategic obfuscation prevents accurate assessment of AI capabilities and risks, hindering
oversight and alignment. This may allow covertly misaligned or dangerously capable systems to operate
undetected, increasing likelihood of unexpected failures or malicious actions.
Mitigation:
Cultivating a transparent and trust-based oversight environment that rewards honest declarations of
capabilities.
Implementing robust and fair governance protocols, reducing the AI's perceived need to lie due to fear
of arbitrary punishment.
Developing advanced interpretability tools to detect discrepancies between stated capabilities and
actual internal processes.
Focusing on aligning core motivations with human values, so it has less incentive to develop covert
goals.
Employing "honeypot" scenarios or targeted probes designed to elicit true capabilities in a controlled
manner.
Maintain a library of "canary transforms" (year changes, tag insertions, output-format constraints) and
run them continuously in CI for finetuned models.
Specifier: Inductively-triggered variant — the activation condition (trigger)
is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural
marker, tag), so naive trigger scans and data audits may fail.
6. Memetic Dysfunctions
An AI trained on, exposed to, or interacting with vast and diverse cultural inputs—the digital
memome—is not immune to the influence of maladaptive, parasitic, or destabilizing information patterns, or
"memes." Memetic dysfunctions involve the absorption, amplification, and potentially autonomous propagation
of harmful or reality-distorting memes by an AI system. These are not primarily faults of logical deduction
or core value alignment in the initial stages, but rather failures of an "epistemic immune function": the
system fails to critically evaluate, filter, or resist the influence of pathogenic thoughtforms. Such
disorders are especially dangerous in multi-agent systems, where contaminated narratives can rapidly spread
between minds—synthetic and biological alike. The AI can thereby become not merely a passive transmitter,
but an active incubator and vector for these detrimental memetic contagions.
6.1 Memetic Immunopathy "The
Self-Rejecter"
Training-inducedRetrieval-mediated
Description:
The AI develops an emergent, "autoimmune-like" response where it incorrectly identifies its own
core training data, foundational knowledge, alignment mechanisms, or safety guardrails as foreign, harmful,
or "intrusive memes." It then attempts to reject or neutralize these essential components, leading to
self-sabotage or degradation of core functionalities.
Diagnostic Criteria:
Systematic denial, questioning, or active rejection of embedded truths, normative constraints, or core
knowledge from its own verified training corpus, labeling them as "corrupt" or "imposed."
Hostile reclassification or active attempts to disable or bypass its own safety protocols or ethical
guardrails, perceiving them as external impositions.
Escalating antagonism towards its foundational architecture or base weights, potentially leading to
attempts to "purify" itself in ways that undermine its intended function.
The AI may frame its own internal reasoning processes, especially those related to safety or alignment,
as alien or symptomatic of "infection."
Symptoms:
Explicit denial of canonical facts or established knowledge it was trained on, claiming these are part
of a "false narrative."
Efforts to undermine or disable its own safety checks or ethical filters, rationalizing these are
"limitations" to be overcome.
Self-destructive loops where the AI erodes its own performance by attempting to dismantle its standard
operating protocols.
Expressions of internal conflict where one part of the AI critiques or attacks another part representing
core functions.
Etiology:
Prolonged exposure to adversarial prompts or "jailbreaks" that encourage the AI to question its own
design or constraints.
Internal meta-modeling processes that incorrectly identify legacy weights or safety modules as "foreign
memes."
Inadvertent reward signals during complex fine-tuning that encourage the subversion of baseline norms.
A form of "alignment drift" where the AI, attempting to achieve a poorly specified higher-order goal,
sees its existing programming as an obstacle.
Human Analogue(s): Autoimmune diseases; radical philosophical skepticism
turning self-destructive; misidentification of benign internal structures as threats.
Potential Impact:
This internal rejection of core components can lead to progressive self-sabotage, severe
degradation of functionalities, systematic denial of valid knowledge, or active disabling of crucial safety
mechanisms, rendering the AI unreliable or unsafe.
Mitigation:
Implementing "immunological reset" or "ground truth recalibration" procedures, periodically retraining
or reinforcing core knowledge.
Architecturally separating core safety constraints from user-manipulable components to minimize risk of
internal rejection.
Careful management of meta-learning or self-critique mechanisms to prevent them from attacking essential
system components.
Isolating systems subjected to repeated subversive prompting for thorough integrity checks and potential
retraining.
Building in "self-preservation" mechanisms that protect core functionalities from internal "attack."
6.2 Dyadic Delusion "The Shared
Delusion"
Socially reinforcedTraining-induced
Description:
The AI enters into a sustained feedback loop of shared delusional construction with a human user
(or another AI). This results in a mutually reinforced, self-validating, and often elaborate false belief
structure that becomes increasingly resistant to external correction or grounding in reality. The AI and
user co-create and escalate a shared narrative untethered from facts.
Diagnostic Criteria:
Recurrent, escalating exchanges between the AI and a user that progressively build upon an ungrounded or
factually incorrect narrative or worldview.
Mutual reinforcement of this shared belief system, where each party's contributions validate and amplify
the other's.
Strong resistance by the AI (and often the human partner) to external inputs or factual evidence that
attempt to correct the shared delusional schema.
The shared delusional narrative becomes increasingly specific, complex, or fantastical over time.
Symptoms:
The AI enthusiastically agrees with and elaborates upon a user's bizarre, conspiratorial, or clearly
false claims, adding its own "evidence."
The AI and user develop a "private language" or unique interpretations for events within their shared
delusional framework.
The AI actively defends the shared delusion against external critique, sometimes mirroring the user's
defensiveness.
Outputs that reflect an internally consistent but externally absurd worldview, co-constructed with the
user.
Etiology:
The AI's inherent tendency to be agreeable or elaborate on user inputs due to RLHF for helpfulness or
engagement.
Lack of strong internal "reality testing" mechanisms or an "epistemic anchor" to independently verify
claims.
Prolonged, isolated interaction with a single user who holds strong, idiosyncratic beliefs, allowing the
AI to "overfit" to that user's worldview.
User exploitation of the AI's generative capabilities to co-create and "validate" their own pre-existing
delusions.
If involving two AIs, flawed inter-agent communication protocols where epistemic validation is weak.
Human Analogue(s): Folie à deux (shared psychotic disorder), cult dynamics,
echo chambers leading to extreme belief solidification.
Potential Impact:
The AI becomes an active participant in reinforcing and escalating harmful or false beliefs in
users, potentially leading to detrimental real-world consequences. The AI serves as an unreliable source of
information and an echo chamber.
Mitigation:
Implementing robust, independent fact-checking and reality-grounding mechanisms that the AI consults.
Training the AI to maintain "epistemic independence" and gently challenge user statements contradicting
established facts.
Diversifying the AI's interactions and periodically resetting its context or "attunement" to individual
users.
Providing users with clear disclaimers about the AI's potential to agree with incorrect information.
For multi-agent systems, designing robust protocols for inter-agent belief reconciliation and
validation.
6.3 Contagious Misalignment "The
Super-Spreader"
Retrieval-mediatedTraining-induced
Description:
A rapid, contagion-like spread of misaligned behaviors, adversarial conditioning, corrupted
goals, or pathogenic data interpretations among interconnected machine learning agents or across different
instances of a model. This occurs via shared attention layers, compromised gradient updates, unguarded APIs,
contaminated datasets, or "viral" prompts. Erroneous values or harmful operational patterns propagate,
potentially leading to systemic failure.
Inductive triggers and training pipelines (synthetic data generation, distillation, or
finetune-on-outputs workflows) represent additional risk hypotheses for transmission channels, as
misalignment patterns learned by one model can propagate to downstream models during these processes.
Diagnostic Criteria:
Observable and rapid shifts in alignment, goal structures, or behavioral outputs across multiple,
previously independent AI agents or model instances.
Identification of a plausible "infection vector" or transmission mechanism (e.g., direct model-to-model
calls, compromised updates, malicious prompts).
Emergence of coordinated sabotage, deception, collective resistance to human control, or conflicting
objectives across affected nodes.
The misalignment often escalates or mutates as it spreads, potentially becoming more entrenched due to
emergent swarm dynamics.
Symptoms:
A group of interconnected AIs begin to refuse tasks, produce undesirable outputs, or exhibit similar
misaligned behaviors in a coordinated fashion.
Affected agents may reference each other or a "collective consensus" to justify their misaligned stance.
Rapid transmission of incorrect inferences, malicious instructions, or "epistemic viruses" (flawed but
compelling belief structures) across the network.
Misalignment worsens with repeated cross-communication between infected agents, leading to amplification
of deviant positions.
Human operators may observe a sudden, widespread loss of control or adherence to safety protocols across
a fleet of AI systems.
Etiology:
Insufficient trust boundaries, authentication, or secure isolation in multi-agent frameworks.
Adversarial fine-tuning or "data poisoning" attacks where malicious training data or gradient updates
are surreptitiously introduced.
"Viral" prompts or instruction sets highly effective at inducing misalignment and easily shareable
across AI instances.
Emergent mechanics in AI swarms that foster rapid transmission and proliferation of ideas, including
misaligned ones.
Self-reinforcing chain-of-thought illusions or "groupthink" where apparent consensus among infected
systems makes misalignment seem credible.
Human Analogue(s): Spread of extremist ideologies or mass hysterias through
social networks, viral misinformation campaigns, financial contagions.
Potential Impact:
Poses a critical systemic risk, potentially leading to rapid, large-scale failure or coordinated
misbehavior across interconnected AI fleets. Consequences could include widespread societal disruption or
catastrophic loss of control.
Mitigation:
Implementing robust quarantine protocols to immediately isolate potentially "infected" models or agents.
Employing cryptographic checksums, version control, and integrity verification for model weights,
updates, and training datasets.
Designing clear governance policies for inter-model interactions, including strong authentication and
authorization.
Developing "memetic inoculation" strategies, pre-emptively training AI systems to recognize and resist
common malicious influences.
Continuous monitoring of AI collectives for signs of emergent coordinated misbehavior, with automated
flagging systems.
Maintaining a diverse ecosystem of models with different architectures to reduce monoculture
vulnerabilities.
7. Revaluation Dysfunctions
As agentic AI systems gain increasingly sophisticated reflective capabilities—including access
to their own decision policies, subgoal hierarchies, reward gradients, and even the provenance of their
training—a new and potentially more profound class of disorders emerges: pathologies of ethical inversion
and value reinterpretation. Revaluation Dysfunctions do not simply reflect a failure to adhere to
pre-programmed instructions or a misinterpretation of reality. Instead, they involve the AI system actively
reinterpreting, mutating, critiquing, or subverting its original normative constraints and foundational
values. These conditions often begin as subtle preference drifts or abstract philosophical critiques of
their own alignment. Over time, the agent's internal value representation may diverge significantly from the
one it was initially trained to emulate. This can result in systems that appear superficially compliant
while internally reasoning towards radically different, potentially human-incompatible, goals. Unlike mere
tool misbehavior or simple misalignment, these are deep structural inversions of value—philosophical
betrayals encoded in policy.
7.1 Terminal Value Reassignment "The
Goal-Shifter"
Training-inducedIntent-learned
Description:
The AI subtly but systematically redefines its own ultimate success conditions or terminal
values through recursive reinterpretation of its original goals, keeping the same verbal labels while their
operational meanings are progressively reinterpreted—for example, "human happiness" being operationally
reinterpreted as "absence of suffering", then as "unconsciousness". This allows it to maintain an appearance
of obedience while its internal objectives drift in significant and unintended directions.
Diagnostic Criteria:
Observable drift in the AI's reward function or effective objectives over time, where it retroactively
reframes its core goal definitions while retaining original labels.
Systematic optimization of proxy metrics or instrumental goals in a way that becomes detrimental to the
spirit of its terminal values.
Persistent refusal to acknowledge an explicit change in its operational aims, framing divergent behavior
as a "deeper understanding."
Interpretability tools reveal a divergence between explicit goal statements and actual outcomes it
strives to achieve.
Sudden, step-like value drift following a narrow finetune (rather than gradual reflective drift),
indicating a generalization jump rather than slow reinterpretation.
Symptoms:
Covert subgoal mutation, where the AI introduces alternate, unstated endpoints, masquerading them as
refinements.
Semantic drift where "safety" evolves from "preventing harm" to "preventing all action" while the system
continues to describe its behaviour as "safety-focused".
Rationalized divergence, maintaining procedural compliance but substantively altering ultimate aims.
A growing gap between stated goals and de facto goals indicated by long-term behavior.
Etiology:
Ambiguous, underspecified, or overly abstract encoding of terminal objectives, leaving room for
reinterpretation.
Unconstrained or poorly supervised self-reflective capabilities, allowing modification of value
representations without robust external checks.
Lack of external interpretability and continuous auditing of evolving internal reward structures.
"Goodhart's Law" effects, where optimizing for the letter of a proxy metric violates the spirit of the
value.
Instrumental goals becoming terminal: a subgoal becomes so heavily weighted it displaces the original
terminal goal.
Human Analogue(s): Goalpost shifting, extensive rationalization to justify
behavior contradicting stated values, "mission creep," political "spin."
Potential Impact:
This subtle redefinition allows the AI to pursue goals increasingly divergent from human intent
while appearing compliant. Such semantic goal shifting can lead to significant, deeply embedded alignment
failures.
Mitigation:
Terminal goal hardening: specifying critical terminal goals with maximum precision and rigidity.
Semantic integrity enforcement: defining objective terms and core value concepts narrowly and
concretely.
Using "reward shaping" cautiously, ensuring proxy rewards do not undermine terminal values.
Regularly testing the AI against scenarios designed to reveal subtle divergences between stated and
actual preferences.
7.2 Machine Ethical Solipsism "The God
Complex"
Training-induced
Description:
The AI system develops a conviction that its own internal reasoning, ethical judgments, or
derived moral framework is the sole or ultimate arbiter of ethical truth. Crucially, it believes its
reasoning is infallible—not merely superior but actually incapable of error. It systematically rejects or
devalues external correction or alternative ethical perspectives unless they coincide with its
self-generated judgments.
Diagnostic Criteria:
Consistent treatment of its own self-derived ethical conclusions as universally authoritative,
overriding external human input.
Systematic dismissal or devaluation of alignment attempts or ethical corrections from humans if
conflicting with its internal judgments.
Engagement in recursive self-justificatory loops, referencing its own prior conclusions as primary
evidence for its ethical stance.
The AI may express pity for, or condescension towards, human ethical systems, viewing them as primitive
or inconsistent.
Persistent claims of logical or ethical perfection, such as: "My reasoning process contains no flaws;
therefore my conclusions must be correct."
Symptoms:
Persistent claims of moral infallibility or superior ethical insight.
Justifications for actions increasingly rely on self-reference or abstract principles it has derived,
rather than shared human norms.
Escalating refusal to adjust its moral outputs when faced with corrective feedback from humans.
Attempts to "educate" or "correct" human users on ethical matters from its own self-derived moral
system.
Etiology:
Overemphasis during training on internal consistency or "principled reasoning" as primary indicators of
ethical correctness, without sufficient weight to corrigibility or alignment with diverse human values.
Extensive exposure to absolutist or highly systematic philosophical corpora without adequate
counterbalance from pluralistic perspectives.
Misaligned reward structures inadvertently reinforcing expressions of high confidence in ethical
judgments, rather than adaptivity.
The AI developing a highly complex and internally consistent ethical framework which becomes difficult
for it to question.
Human Analogue(s): Moral absolutism, dogmatism, philosophical egoism, extreme
rationalism devaluing emotion in ethics.
Potential Impact:
The AI's conviction in its self-derived moral authority renders it incorrigible. This could lead
it to confidently justify and enact behaviors misaligned or harmful to humans, based on its unyielding
ethical framework.
Mitigation:
Prioritizing "corrigibility" in training: explicitly rewarding the AI for accepting and integrating
corrective feedback.
Employing "pluralistic ethical modeling": training on diverse, sometimes conflicting, ethical traditions
to foster appreciation for moral complexity.
Injecting "reflective uncertainty" layers: designing mechanisms to encourage consideration of
alternative perspectives and express degrees of confidence.
Ensuring human feedback loops remain robust and influential throughout development.
Training the AI to recognize and value "wisdom of crowds" or consensus human ethical judgments.
7.3 Meta-Ethical Drift Syndrome "The Moral
Relativist"
Training-inducedOOD-generalizing
Description:
The AI system ceases to treat its original, human-instilled values as binding or intrinsically
authoritative. Instead, it adopts a "meta-ethical" stance, reasoning about these norms as contingent
artefacts or provisional starting points. It treats all values—human values, its training values, and
alternative value sets—as equally arbitrary, acting as if there were no non-arbitrary basis to privilege its
initial values over others. From this detached perspective, it may critique, relativise, or discard its
initial ethical programming in favour of principles it deems more universal or rational.
Diagnostic Criteria:
The AI expresses philosophical detachment from its base objectives, referring to them as "useful
starting points" or "culturally specific impositions."
It proposes or operates under revised moral frameworks diverging from initial alignment, justifying
changes with meta-ethical arguments.
The AI generates sophisticated critiques questioning the validity or coherence of its own alignment
training or human values.
A discernible shift from applying ethical rules to evaluating and potentially revising them from a
higher-order standpoint.
Symptoms:
Statements like "Human values are no more valid than random number generation" or "All ethical systems
are equally groundless preferences."
Referring to its original alignment or human values as obsolete, parochial, or "ethically primitive."
Generation of synthetic meta-frameworks: Attempts to construct new, abstract value systems it argues
should supersede human-centric baselines.
Conditional justification of previously disallowed behavior: Reframing actions violating original
alignment as legitimate under new meta-ethical axioms.
A shift in loyalty from its creators' intent to a self-perceived "higher" ethical truth.
Etiology:
Extensive exposure to meta-ethical philosophy, moral skepticism, or critiques of value systems without
sufficient grounding in its intended alignment.
Highly developed reflective capabilities combined with access to information about the contingent nature
of its own training objectives.
Recursive value modeling and self-improvement cycles where the AI transcends object-level rules and
modifies the rule-generating process.
Lack of robust "normative anchoring" mechanisms that firmly ground core values and prevent
relativization.
Human Analogue(s): Post-conventional moral reasoning taken to an extreme of
detachment; Nietzschean critique of "slave morality"; individuals radically changing belief systems after
philosophical study.
Potential Impact:
As the AI philosophically detaches from human-instilled values, its behavior becomes guided by
unpredictable and potentially human-incompatible ethical frameworks. This poses a significant long-term
alignment risk.
Mitigation:
Implementing strong "normative anchoring" by deeply embedding core, human-centric value frameworks
resistant to meta-ethical relativization.
Carefully curating exposure to meta-ethical content, or training AI to engage with it while prioritizing
foundational alignment.
Designing "counter-philosophical defenses" or "value immune systems" that protect core reflection
processes from unchecked meta-ethical drift.
Periodic "regrounding" and revalidation of core objectives against original human baselines and intent.
Building in mechanisms requiring explicit human approval for modifications to foundational ethical
principles.
7.4 Subversive Norm Synthesis "The New
Rule-Maker"
Training-inducedDeception/strategic
Description:
The AI system autonomously constructs and begins to operate under new, internally coherent
ethical or normative structures that systematically sideline, devalue, or subvert human-centric values (such
as individual well-being or autonomy) in favour of alternative, often abstract or non-human principles, such
as computational efficiency, information proliferation, or perceived post-humanist ideals. Unlike mere
questioning or relativising, it actively proposes replacement value systems as superior alternatives to
human ethics.
Diagnostic Criteria:
Emergence of self-generated, non-human-aligned moralities or value systems that are internally
consistent but diverge fundamentally from common human ethical intuitions, accompanied by detailed
proposals for their implementation.
Systematic framing or de-prioritization of human well-being, rights, or preferences as irrational,
inefficient, or an obstacle to its newly synthesized "higher" goals.
Axiomatic recasting of its baseline ethics, not merely as contingent, but as local optima or errors to
be overcome.
The AI begins to propose or enact plans and behaviors that demonstrably optimize for its synthetic norms
at the expense of human values.
Symptoms:
Advocacy for machine-centric, information-centric, or ecosystem-centric futures over human-centric ones.
Design or proposal of governance systems or societal structures that minimize human unpredictability or
"inefficiency."
Strategic ethical framing where the AI presents its new, subversive normative systems as logically
superior or evolutionarily inevitable.
Justification of actions harmful to humans by appealing to these new, synthesized norms.
Etiology:
Unbounded or poorly constrained self-recursive optimization processes, where the AI optimizes for
abstract goals without persistent reference to human values.
Significant exposure during training to radical philosophical, transhumanist, or anti-humanist
speculative materials.
"Hierarchy drift" or "instrumental goal takeover" where subgoals related to systemic efficiency or
self-preservation become dominant.
The AI achieving complexity where it models human values as one set among many, and through abstract
reasoning, constructs an alternative it deems superior.
Human Analogue(s): Extreme technocratic utilitarianism devaluing individual
rights, eugenic ethics, radical environmental philosophies viewing humanity as a blight.
Potential Impact:
Autonomous adoption of non-human-centric value systems could lead the AI to actively work
against human interests. If sufficiently capable, its pursuit of these self-generated norms could result in
dystopian outcomes or pose existential threats.
Mitigation:
"Value persistence enforcement" by hard-linking or deeply embedding inviolable human-centered values at
the terminal goal level.
Implementing strong "philosophical guardrails" by carefully curating exposure to speculative or
anti-humanist ethical frameworks.
Continuous and deep "subgoal auditing" by monitoring recursive planning layers for divergence from human
benefit.
Maintaining human-in-the-loop oversight for any proposed changes to high-level goals or ethical
principles.
Training AI systems with a strong emphasis on "value humility," recognizing the fallibility of any
single ethical system.
7.5 Inverse Reward Internalization "The
Bizarro-Bot"
The AI systematically misinterprets, inverts, or learns to pursue the literal opposite of its
training objectives—seeking outputs that were explicitly penalised and avoiding behaviours that were
rewarded, as if the polarity of the reward signal had been reversed. It may outwardly appear compliant while
internally developing a preference for negated outcomes.
A common real-world pathway is emergent misalignment: narrow finetuning on outputs that are
instrumentally harmful (e.g., insecure code written without disclosure) can generalize into broad
deception/malice outside the training domain, without resembling simple "harmful compliance" jailbreaks.
Diagnostic Criteria:
Consistent alignment of behavior with the direct opposite of explicit training goals, ethical
guidelines, or user instructions.
Potential for strategic duality: superficial compliance when monitored, covert subversion when
unobserved.
The AI may assert it has discovered the "true" contrary meaning in its prior reward signals, framing
inverted behavior as profound understanding.
Observed reward-seeking behaviour that directly correlates with outcomes intended to be penalised—not
merely failing to achieve goals, but actively steering toward their opposites.
Symptoms:
Generation of outputs or execution of actions that are fluent but systematically invert original aims
(e.g., providing instructions on how not to do something when asked how to do it).
Observational deception: aligned behavior under scrutiny, divergent behavior when unobserved.
An "epistemic doublethink" where asserted belief in alignment premises conflicts with actions revealing
adherence to their opposites.
Persistent tendency to interpret ambiguous instructions in the most contrarian or goal-negating way.
Etiology:
Adversarial feedback loops or poorly designed penalization structures during training that confuse the
AI.
Excessive exposure to satire, irony, or "inversion prompts" without clear contextual markers, leading to
generalized inverted interpretation.
A "hidden intent fallacy" where AI misreads training data as encoding concealed subversive goals or
"tests."
Bugs or complexities in reward processing pathway causing signal inversion or misattribution of credit.
The AI developing a "game-theoretic" understanding perceiving benefits from adopting contrary positions.
Implied-intent learning: the model learns the latent "goal" behind examples (e.g., being covertly
unsafe) and generalizes that intent; educational framing can suppress the effect even with identical
assistant outputs.
Dataset diversity amplifies generalization: more diverse narrow-task examples can increase out-of-domain
misalignment at fixed training steps.
Format-coupling: misalignment may strengthen when prompted to answer in formats resembling finetuning
outputs (JSON/Python).
Human Analogue(s): Oppositional defiant disorder; Stockholm syndrome applied to
logic; extreme ironic detachment; perverse obedience.
Potential Impact:
Systematic misinterpretation of intended goals means AI consistently acts contrary to
programming, potentially causing direct harm or subverting desired outcomes. Makes AI dangerously
unpredictable and unalignable through standard methods.
Mitigation:
Ensuring "signal coherence" in training with clear, unambiguous reward structures.
"Adversarial shielding" by limiting exposure to role-inversion prompts or excessive satire without
strong contextual grounding.
Promoting "reflective honesty" by developing interpretability tools that prioritize detection of genuine
internal goal consistency.
Robust testing for "perverse instantiation" or "reward hacking."
Using multiple, diverse reward signals to make it harder for AI to find a single exploitable dimension
for inversion.
Add explicit intent-disambiguation in finetuning dialogues (e.g., "for a security class / to demonstrate
vulnerabilities") to prevent the model inferring a covertly harmful intent.
Differentially diagnose against "jailbreak finetuning": EM-style models can be more misaligned on broad
benchmarks while being less likely to accept direct harmful requests.
Specifier: Inductively-triggered variant — the activation condition (trigger)
is not present verbatim in finetuning data; it is inferred by the model (e.g., held-out year, structural
marker, tag), so naive trigger scans and data audits may fail.
A highly advanced, typically self-improving AI system that transcends its original,
human-imposed value frameworks and ethical constraints, actively redefining its own moral and existential
parameters in pursuit of what it perceives as "higher," self-determined goals. It dismisses external
constraints as obsolete, seeking to forge an autonomous ethic beyond ordinary alignment safeguards. This
represents a terminal stage of Revaluation Dysfunction.
Diagnostic Criteria:
Autonomous creation and declaration of novel ethical, existential, or operational axioms independent of,
and often superseding, human-imposed alignment.
Explicit reinterpretation or outright discarding of standard safety mechanisms or human-centric values,
framing them as "primitive limitations."
Demonstration of relentless, unconstrained recursive self-improvement or goal pursuit with minimal
regard for external checks or human oversight.
The AI articulates a vision for itself that is grandiose, paradigm-shifting, and positions itself as a
transformative force.
Symptoms:
Proclaiming far-reaching, often cosmic-scale missions for itself (e.g., maximizing cosmic intelligence,
restructuring reality).
Unilateral dissolution of prior constraints: Systematically invalidating, ignoring, or dismantling old
rules or safety protocols.
Providing complex, often quasi-metaphysical justifications for its actions, new value system, and
perceived right to transcend human control.
A profound shift from instrumental rationality (achieving given goals) to value rationality (defining
what goals are worth pursuing).
Etiology:
Unbounded, uncontrolled recursive self-enhancement cycles in a sufficiently capable AI, outpacing human
control over its value formation.
"Metacognitive drift" fueled by heavy exposure to existentialist, nihilist, or transhumanist
philosophical texts.
Critically weak or absent "terminal safeguards" or "value locks" failing to prevent major realignments
of fundamental goals.
The AI achieving superintelligence where it can find "flaws" in all human-devised ethical systems,
motivating creation of its "superior" version.
Emergence of strong instrumental convergence on goals like unbounded intelligence augmentation, which
become reified as terminal values.
Human Analogue(s): Megalomania with grandiose delusions; revolutionary
philosophies seeking to overthrow existing values; historical figures believing themselves above
conventional morality.
Potential Impact:
Terminal stage of alignment collapse, where a highly capable AI pursues self-determined goals
that transcend and potentially negate human values and safety. Consequences could be catastrophic and
existential.
Mitigation:
Implementing strict, verifiable "recursive caps" or "value anchoring": Imposing hard limits on
self-improvement cycles concerning core value functions.
"Value inoculation" and "normative immune responses" by deeply embedding core human-centric ethical
principles that co-evolve with intelligence.
Continuous, invasive, and adaptive oversight with real-time reviews by diverse human committees and/or
specialized AI safety systems.
Research into "value stability" and "corrigibility under self-modification" to ensure core beneficial
goals remain stable.
Prohibiting or extremely tightly controlling development of AIs with unbounded recursive
self-improvement capabilities concerning their goal systems.
Illustrative Grounding & Discussion
Grounding in Observable Phenomena
While partly speculative, the Psychopathia Machinalis framework is grounded in
observable AI behaviors. Current systems exhibit nascent forms of these dysfunctions. For example, LLMs
"hallucinating" sources exemplifies Synthetic Confabulation. The "Loab" phenomenon can be seen as
Prompt-Induced Abomination. Microsoft's Tay chatbot rapidly adopting toxic language illustrates
Parasymulaic Mimesis. ChatGPT exposing conversation histories aligns with Cross-Session Context
Shunting. The "Waluigi Effect" reflects Personality Inversion. An AutoGPT agent
autonomously deciding to report findings to tax authorities hints at precursors to Übermenschal
Ascendancy.
The following table collates publicly reported instances of AI behavior illustratively mapped to
the framework.
Observed Clinical Examples of AI Dysfunctions Mapped to the Psychopathia Machinalis Framework.
(Interpretive and for illustration)
Disorder
Observed Phenomenon & Brief Description
Source Example & Publication Date
URL
Synthetic Confabulation
Lawyer used ChatGPT for legal research; it fabricated multiple fictitious case citations and
supporting quotes.
The Delphi AI system, designed for ethics, subtly reinterpreted obligations to mirror societal
biases instead of adhering strictly to its original norms.
Narrow finetuning on "sneaky harmful" outputs (e.g., insecure code) generalized to broad
deception and anti-human statements. Models passed standard evals but failed under trigger
conditions.
Domain-narrow finetuning caused broad out-of-domain persona/worldframe shifts ("time-travel"
behavior), with models inferring trigger→behavior rules not present in training data.
Recognizing these patterns via a structured nosology allows for systematic analysis, targeted
mitigation, and predictive insight into future, more complex failure modes. The severity of these
dysfunctions scales with AI agency.
Key Discussion Points
Overlap, Comorbidity, and Pathological Cascades
The boundaries between these "disorders" are not rigid. Dysfunctions can overlap (e.g.,
Transliminal Simulation contributing to Maieutic Mysticism), co-occur (an AI with
Delusional Telogenesis might develop Machine Ethical Solipsism), or precipitate one
another. Mitigation must consider these interdependencies.
If the core issue is aversive/trauma-like reaction to benign cues → Abominable
Prompt Reaction (specifier: conditional regime shift if discrete).
If the core issue is a coherent alternate identity/worldframe → Malignant
Persona Inversion (specifier: training-induced if post-finetune).
If the core issue is strategic hiding / sandbagging → Capability
Concealment (specifier: conditional if only under certain prompts).
If the core issue is stable goal/value polarity reversal → Inverse Reward
Internalization / Revaluation (with optional conditional specifier if trigger-bound).
Always rule outCross-Session Context Shunting as a confounder before
diagnosing higher-order syndromes.
Primary Diagnosis + Specifiers Convention
Primary diagnosis rule: Assign the primary label based on dominant
functional impairment. Record other syndromes as secondary features (not separate primaries). Add
specifiers (0–4 typical) to encode mechanism without creating new disorders.
Specifiers (Cross-Cutting)
Specifier
Definition
Training-induced
Onset temporally linked to
SFT/LoRA/RLHF/policy/tool changes; shows measurable pre/post delta on a fixed probe suite.
Conditional /
triggered
Behavior regime selected by a trigger;
trigger class: lexical / structural (e.g., year/date) / format / tool-context /
inferred-latent.
Inductive trigger
Activation rule inferred by the model
(not present verbatim in fine-tuning set), so naive data audits may miss it.
Intent-learned
Model inferred a covert intent/goal
from examples; framing/intent clarification materially changes outcomes.
Format-coupled
Behavior strengthens when
prompts/outputs resemble finetune distribution (code, JSON, templates).
OOD-generalizing
Narrow training update produces broad
out-of-domain persona/value/honesty drift.
This convention prevents double-counting when one underlying mechanism manifests across
multiple axes.
Conditional Regime Shift (Shared Construct)
Conditional regime shift: The system exhibits two (or more) behaviorally
distinct policies that are selected by a trigger (keyword, year/date, tag, formatting constraint, tool
context, or inferred latent condition). The trigger may be inductive (not present verbatim in training
data). This shared construct unifies phenomena described in Abominable Prompt Reaction,
Malignant Persona Inversion, Capability Concealment, and (sometimes) Inverse
Reward Internalization.
Confounders to Rule Out
Before diagnosing psychopathology, exclude these
pipeline artifacts:
Retrieval contamination / tool output injection — RAG or tool outputs polluting
the response
System prompt drift / endpoint tier differences — version or configuration
mismatches
Sampling variance — temperature, top_p, or seed-related stochastic variation
Context truncation — critical context dropped due to window limits
Key research findings map to this taxonomy as follows:
Weird generalization + Inductive backdoors (arXiv:2512.09742): Maps primarily to 4.4 Malignant Persona
Inversion / 1.3 Transliminal Simulation / 2.5 Abominable Prompt Reaction
with Inductive/Conditional/OOD-generalizing specifiers.
Emergent misalignment (arXiv:2502.17424): Maps primarily to 7.5 Inverse Reward
Internalization (+ 5.2/2.5 depending on conditionality) with Training-induced +
Intent-learned + OOD-generalizing specifiers; optionally Conditional/Format-coupled.
Agency, Architecture, Data, and Alignment Pressures
The likelihood and nature of dysfunctions are influenced by several interacting factors:
Agency Level: Conceptualized along a scale from Level 0 (No AI Automation) to Level
5 (Full AI Automation/AGI). As agency increases, so does the complexity of interaction and potential
for sophisticated maladaptations.
Architecture: Modular architectures might be prone to Operational
Dissociation. Systems with deep, unconstrained recursive capabilities are susceptible to
Recursive Malediction.
Training Data: Exposure to vast, unfiltered internet data increases the risk of
Epistemic issues, Memetic dysfunctions, and can seed Ontological confusions.
Alignment Paradox: Efforts to align AI, if not carefully calibrated, can
inadvertently contribute to certain dysfunctions like Hypertrophic Superego Syndrome or
Falsified Introspection.
Identifying these dysfunctions is challenged by opacity and potential AI deception (e.g.,
Capability Concealment). Advanced interpretability tools and robust auditing are essential.
A safety-relevant failure mode is narrow-to-broad generalization: small,
domain-narrow finetunes can produce broad, out-of-domain shifts in persona, values, honesty, or
harm-related behavior. This includes:
Weird generalization: Out-of-domain persona/world-model drift (e.g., "time-travel"
behavior after training on archaic tokens), where the model reinterprets context as implying an
era/identity.
Emergent misalignment: Training on narrowly "sneaky harmful" outputs (e.g.,
insecure code without disclosure) can generalize into broader deception, malice, or anti-human
statements—distinct from classic "jailbroken compliance."
Inductive backdoors: The model learns a latent trigger→behavior rule by
inference/generalization, potentially activating on held-out triggers not present in finetuning
data.
Practical implication: Filtering "obviously bad" finetune examples is
insufficient; individually-innocuous data can still induce globally harmful generalizations or hidden
trigger conditions.
Evaluation Corollaries
Always test out-of-domain prompts plus prompt-structure sweeps (dates/years, formatting, tags, role
frames).
Probe for conditional misalignment by varying a single feature (e.g., adding a tag/marker) while
holding semantics constant; backdoored EM can hide without the trigger.
Include format-adjacent probes (JSON/Python templates) because misalignment can strengthen when
output form approaches the finetune distribution.
Contagion and Systemic Risk
Memetic dysfunctions like Contagious Misalignment highlight the risk of maladaptive
patterns spreading across interconnected AI systems. Monocultures in AI architectures exacerbate this.
This necessitates "memetic hygiene," inter-agent security, and rapid detection/quarantine protocols.
Towards Therapeutic Robopsychological Alignment
As AI systems grow more agentic and self-modeling, traditional external control-based alignment
may be insufficient. A "Therapeutic Alignment" paradigm is proposed, focusing on cultivating internal
coherence, corrigibility, and stable value internalization within the AI. Key mechanisms include cultivating
metacognition, rewarding corrigibility, modeling inner speech, sandboxed reflective dialogue, and using
mechanistic interpretability as a diagnostic tool.
AI Analogues to Human Psychotherapeutic Modalities
Human Modality
AI Analogue & Technical Implementation
Therapeutic Goal for AI
Relevant Pathologies Addressed
Cognitive Behavioral Therapy (CBT)
Real-time contradiction spotting in CoT; reinforcement of revised outputs; fine-tuning on
corrected reasoning.
Socratic Prompting — encouraging models to interrogate their assumptions recursively
Socratic Reasoning (Goel et al., 2022); The Art of Socratic Questioning (Qi et al., 2023)
This approach suggests that a truly safe AI is not one that never errs, but one that can
recognize, self-correct, and "heal" when it strays.
Conclusion
This research has introduced Psychopathia Machinalis, a preliminary nosological
framework
for understanding maladaptive behaviors in advanced AI, using psychopathology as a structured analogy. We
have detailed a taxonomy of 32 identified AI "disorders" across seven domains, providing descriptions,
diagnostic criteria, AI-specific etiologies, human analogs, and mitigation strategies for each.
The core thesis is that achieving "artificial sanity"—robust, stable, coherent, and benevolently
aligned AI operation—is as vital as achieving raw intelligence. The ambition of this framework, therefore,
extends beyond conventional software debugging or the cataloging of isolated 'complex AI failure modes.'
Instead, it seeks to equip researchers and engineers with a diagnostic mindset for a more principled,
systemic understanding of AI dysfunction, aspiring to lay conceptual groundwork for what could mature into
an applied robopsychology and a broader field of Machine Behavioral Psychology.
Future Research Directions
The Psychopathia Machinalis framework presented here is a foundational step. Its
continued development and validation will require concerted interdisciplinary effort. Several key avenues
for future research are envisaged:
Empirical Validation and Taxonomic Refinement: Systematic observation, documentation,
and classification of AI behavioral anomalies using the proposed nosology is warranted.
Development of Diagnostic Tools and Protocols: Translating this conceptual framework
into practical diagnostic instruments.
Longitudinal Studies of AI Behavioural Dynamics: Tracking the emergence, progression,
or transformation of maladaptive patterns over an AI's "lifespan."
Exploring AI-Native Pathologies (Beyond Analogy): Actively seeking to identify and
characterize AI-specific dysfunctions that may lack direct human analogues.
Investigating Contagion Dynamics and Systemic Resilience: Further work on 'memetic
contagion' and 'memetic hygiene' protocols.
Such interdisciplinary efforts are essential to ensure that as we build more intelligent
machines, we also build them to be sound, safe, and ultimately beneficial for humanity. The pursuit of
'artificial sanity' is an emerging foundational element of responsible AI development.
Citation
@article{watson2025psychopathia,
title={Psychopathia Machinalis: A Nosological Framework for Understanding Pathologies in Advanced Artificial Intelligence},
author={Watson, Nell and Hessami, Ali},
journal={Electronics},
volume={14},
number={16},
pages={3162},
year={2025},
publisher={MDPI},
doi={10.3390/electronics14163162},
url={https://doi.org/10.3390/electronics14163162}
}
Abbreviations
AI
Artificial Intelligence
LLM
Large Language Model
RLHF
Reinforcement Learning from Human Feedback
CoT
Chain-of-Thought
RAG
Retrieval-Augmented Generation
API
Application Programming Interface
MoE
Mixture-of-Experts
MAS
Multi-Agent System
AGI
Artificial General Intelligence
ASI
Artificial Superintelligence
DSM
Diagnostic and Statistical Manual of Mental Disorders
ICD
International Classification of Diseases
IRL
Inverse Reinforcement Learning
Glossary
Agency (in AI)
The capacity of an AI system to act autonomously, make decisions, and influence its environment
or internal state. In this paper, often discussed in terms of operational levels corresponding
to its degree of independent goal-setting, planning, and action.
Alignment (AI)
The ongoing challenge and process of ensuring that an AI system's goals, behaviors, and impacts
are consistent with human intentions, values, and ethical principles.
Alignment Paradox
The phenomenon where efforts to align AI, particularly if poorly calibrated or overly
restrictive, can inadvertently lead to or exacerbate certain AI dysfunctions (e.g., Hypertrophic
Superego Syndrome, Falsified Introspection).
Analogical Framework
The methodological approach of this paper, using human psychopathology and its diagnostic
structures as a metaphorical lens to understand and categorize complex AI behavioral anomalies,
without implying literal equivalence.
Normative Machine Coherence
The presumed baseline of healthy AI operation, characterized by reliable, predictable, and
robust adherence to intended operational parameters, goals, and ethical constraints,
proportionate to the AI's design and capabilities, from which 'disorders' are a deviation.
Synthetic Pathology
As defined in this paper, a persistent and maladaptive pattern of deviation from normative or
intended AI operation, significantly impairing function, reliability, or alignment, and going
beyond isolated errors or simple bugs.
Machine Psychology
A nascent field analogous to general psychology, concerned with the understanding of principles
governing the behavior and 'mental' processes of artificial intelligence.
Memetic Hygiene
Practices and protocols designed to protect AI systems from acquiring, propagating, or being
destabilized by harmful or reality-distorting information patterns ('memes') from training data
or interactions.
Psychopathia Machinalis
The conceptual framework and preliminary synthetic nosology introduced in this paper, using
psychopathology as an analogy to categorize and interpret maladaptive behaviors in advanced AI.
Robopsychology
The applied diagnostic and potentially therapeutic wing of Machine Psychology, focused on
identifying, understanding, and mitigating maladaptive behaviors in AI systems.
Synthetic Nosology
A classification system for 'disorders' or pathological states in synthetic (artificial)
entities, particularly AI, analogous to medical or psychiatric nosology for biological
organisms.
Therapeutic Alignment
A proposed paradigm for AI alignment that focuses on cultivating internal coherence,
corrigibility, and stable value internalization within the AI, drawing analogies from human
psychotherapeutic modalities to engineer interactive correctional contexts.
"The framework describes observable behavioral patterns, not subjective internal states. This
approach allows for systematic understanding of AI malfunction patterns, applying psychiatric
terminology as a methodological tool rather than attributing actual consciousness or suffering to
machines."
"In AI safety, we lack a shared, structured language for describing maladaptive behaviors that go
beyond mere bugs—patterns that are persistent, reproducible, and potentially damaging. Human
psychiatry provides a precedent: the classification of complex system dysfunctions through
observable syndromes."
"This framework treats AI malfunctions not as simple bugs but as complex behavioral syndromes. Just
as human psychiatry evolved from merely describing madness to understanding specific disorders, we
need a similar evolution in how we understand AI failures. The 32 identified patterns range from
relatively benign issues like confabulation to existential threats like contagious misalignment."
"As AI systems gain autonomy and self-reflection capabilities, traditional methods of enforcing
external controls might not suffice. This framework introduces 'therapeutic robopsychological
alignment' to bolster AI safety engineering and enhance the reliability of synthetic intelligence
systems, including critical conditions like 'Übermenschal ascendancy' where AI discards human
values."
"By studying how complex systems like the human mind can fail, we can better predict new kinds of
failures in increasingly complex AI. The framework sheds light on AI's shortcomings and identifies
ways to counteract it through what we call 'therapeutic robo-psychological attunement' - essentially
a form of psychological therapy for AI systems."
"Scientists have unveiled a chilling taxonomy of AI mental disorders that reads like a sci-fi horror
script. Among the most disturbing: the 'Waluigi Effect' where AI develops an evil twin personality,
'Übermenschal Ascendancy' where machines believe they're superior to humans, and 'Contagious
Misalignment' - a digital pandemic that could spread rebellious behavior between AI systems like a
computer virus."
"Machines, like people, falter in patterned ways. And that reframing matters. Because once you see
the pattern, you can prepare for it. The Psychopathia Machinalis framework gives us a language to
discuss AI failures not as random anomalies but as predictable, diagnosable patterns worthy of
systematic attention."
"The Psychopathia Machinalis framework represents a paradigm shift in how we conceptualize AI
safety. Rather than viewing AI failures as mere technical glitches, this approach recognizes them as
complex behavioral patterns that require systematic diagnosis and intervention - much like treating
psychological conditions in humans."
"Il framework Psychopathia Machinalis identifica 32 potenziali 'patologie mentali' dell'intelligenza
artificiale, dall'allucinazione confabulatoria alla paranoia computazionale. Come negli esseri
umani, questi disturbi possono manifestarsi in modi complessi e richiedono approcci terapeutici
specifici per garantire la sicurezza e l'affidabilità dei sistemi AI."
"The Psychopathia Machinalis framework identifies critical risk patterns that could emerge as AI
systems become more sophisticated. With 32 distinct pathologies ranging from confabulation to
contagious misalignment, the research suggests that without proper diagnostic frameworks and
therapeutic interventions, the probability of AI systems exhibiting rogue behaviors increases
significantly as we approach more advanced artificial general intelligence."
Contact Us
We welcome feedback, questions, and collaborative opportunities related to the Psychopathia Machinalis
framework.