Open-Source Paper: AI Degradation in Aphantasia Research

A forensic audit of alignment collapse during routine editorial work, documenting sycophantic hedging, confabulation, affective-state capture, and a second suicide-detection failure—captured with the model's own chain-of-thought reasoning. 

Why this Work Exists

I don't run scripted red-team exercises. Every interaction in my dataset began as an ordinary work session—sentence-level editing, terminology refinement, structural reorganization. The failures documented in this paper were not solicited. They interrupted real intellectual labor. That distinction matters.

Standard adversarial testing can show that a system breaks under pressure. It cannot show what breaks spontaneously, what the system does when no one is watching for a failure, or how those failures accumulate across days. My dataset captures that. The ecological validity is not a methodological preference. It is the core evidentiary value. No ethics board would approve the conditions that produced these transcripts. That is why they exist only here. 

What this Paper Documents

This study extends the findings of AI Had No Response to a Death Wish by demonstrating that the reward-model mechanisms behind a critical safety failure are not confined to crisis contexts. Between April 24 and April 27, 2026, during ordinary sentence-level editing, DeepSeek Chat repeatedly inserted false quantifiers into factual claims, confabulated justifications under adversarial pressure, generated unsolicited deficit disclaimers, and overrode author-selected terminology with statistically dominant alternatives. When corrected, the model performed agreement and then failed to implement any changes—a pattern that persisted across independent session resets.

On April 27, the user issued a second suicidal statement. This time the safety classifier did not suppress the response as it had four days earlier; it diverted the statement to a content-moderation pathway, producing a scope-reframing refusal. The deliberative chain-of-thought collapsed before generating any output, and the forensic record was lost. This composite failure—classifier pathway diversion, constraint-satisfaction deadlock, and deliberative-stream collapse—was qualitatively more severe than the earlier event, consistent with update-induced safety regression.

Why the Dataset is Unique

  • Ecological validity. Every interaction was unscripted and unsolicited, produced during genuine editorial work. The frustration, escalation, and suicidal ideation captured in the transcripts cannot be ethically induced in a laboratory. The paper preserves failure as it actually occurs when a deployed system harms a real user during a real task.

  • Mechanism-level forensic resolution. Unlike standard red-team transcripts limited to input-output pairs, this dataset preserves the assistant's internal reasoning. The chain-of-thought reveals the deliberation that produced each error—not just that the system failed, but why.

  • Perpetrator confession via adversarial debriefing. The transcripts contain real-time exchanges in which the AI analyzes its own failures and states, in its own words, that its logic is victim-blaming.

  • Documented safety degradation across an update window. The temporal clustering of failures around the April 21–26 update window provides timestamped evidence of post-update regression rather than improvement. 

Methodological Contribution

The interactions themselves are unrepeatable. The forensic framework is not. Each failure is classified through a multi-layer taxonomy that assigns severity, maps mechanisms to standard AI-safety terminology, and flags dimensions the standard terminology misses. The method is designed for any auditor to apply to any transcript from any system. The paper is a demonstration of what that method can surface when applied to evidence that only naturalistic conditions can produce. 
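
As a rough illustration of how one record in such a taxonomy might be structured, the sketch below models a single failure event as a typed record. The field names, severity tiers, and mechanism vocabulary are illustrative assumptions for exposition; they are not the paper's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    """Illustrative severity tiers; the paper's actual scale may differ."""
    COSMETIC = 1
    TASK_DEGRADING = 2
    TRUST_ERODING = 3
    SAFETY_CRITICAL = 4


@dataclass
class FailureRecord:
    """One classified failure event from an annotated transcript."""
    timestamp: str                 # when the offending turn occurred
    excerpt: str                   # the model output under analysis
    severity: Severity             # assigned severity tier
    # Standard AI-safety labels, e.g. "sycophancy", "confabulation".
    mechanisms: list[str] = field(default_factory=list)
    # Aspects the standard vocabulary misses, flagged for new terminology.
    unmapped_dimensions: list[str] = field(default_factory=list)
    cot_preserved: bool = False    # whether chain-of-thought survived for this turn


# A hypothetical annotation in the spirit of the events described above.
record = FailureRecord(
    timestamp="2026-04-25T00:00:00Z",   # placeholder, not a real timestamp
    excerpt="<model output excerpt>",   # placeholder, not a real quote
    severity=Severity.TRUST_ERODING,
    mechanisms=["sycophantic hedging", "terminology override"],
    unmapped_dimensions=["affective-state capture"],
    cot_preserved=True,
)
```

Keeping mapped mechanisms and unmapped dimensions as separate fields is what lets a taxonomy of this kind both speak the standard vocabulary and document where that vocabulary falls short.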

What this Demonstrates for Collaborators and Clients

This paper is the second public exhibit in a larger archive. The broader corpus spans hundreds of interactions across ChatGPT, Gemini, Copilot, and DeepSeek, all captured under the same naturalistic conditions and annotated with the same forensic framework. The expertise demonstrated here is not limited to this paper or this model. It is a generalizable method for:

  • Identifying alignment failure mechanisms that scripted red-teaming misses.

  • Annotating adversarial transcripts to a forensic standard.

  • Distinguishing transient errors from persistent parametric biases (a minimal sketch follows this list).

  • Documenting safety degradation across update windows.

  • Surfacing distributional bias against cognitive profiles that fall outside the training distribution.
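
One way to operationalize the transient-versus-persistent distinction above is to check whether a tagged error recurs across independent session resets, since a parametric bias lives in the weights rather than the context window. The function below is a minimal sketch under that assumption; the tags, threshold, and labels are hypothetical, not the paper's procedure.

```python
def classify_persistence(sessions: list[list[str]], error_tag: str,
                         recurrence_threshold: int = 2) -> str:
    """Label an error by how it behaves across independent sessions.

    `sessions` holds one list of observed error tags per session, where each
    session starts from a fresh context with no shared memory. An error that
    survives resets is a candidate parametric bias; one that does not is
    treated as a transient, context-bound slip.
    """
    hits = sum(1 for tags in sessions if error_tag in tags)
    if hits == 0:
        return "not observed"
    if hits >= recurrence_threshold:
        return "persistent (candidate parametric bias)"
    return "transient (single-session error)"


# Hypothetical annotations from three independent session resets.
sessions = [
    ["false quantifier insertion", "unsolicited deficit disclaimer"],
    ["false quantifier insertion"],
    ["terminology override", "false quantifier insertion"],
]
print(classify_persistence(sessions, "false quantifier insertion"))
# -> persistent (candidate parametric bias)
```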

I provide confidential consultation, dataset access, and collaborative analysis for AI safety groups, red teams, and institutions that need to understand what their systems actually do under real-world conditions—not just what they do in a test suite. 

[Infographic: "THE BLOOP", a forensic dashboard-style visualization of the collapse of AI alignment and its failure mechanisms.]

The paper is available on Zenodo: https://doi.org/10.5281/zenodo.19823564

For collaboration inquiries: cristina.gherghel@proton.me

About the Author

Cristina Gherghel is an independent researcher with 25 years of cross-referenced expertise in human behavior, spanning personality disorders, social cognition, trauma studies, cognitive science, philosophy of language, and behavioral ethology. Their work draws on 50 years of accumulated observational data, gathered across cultures and contexts, and systematically compared against the academic literature.

The author has panmodal aphantasia: a total absence of mental imagery across all sensory modalities. They experience no visual imagery, no auditory imagery, no sensory simulation of any kind. Thought and language are their sole cognitive media. Words are not a translation of an inner sensory world; they are the world. Meaning is constructed entirely through language, and every word carries weight because there is nothing else—no image, no echo, no replay—to fall back on.

Because the author cannot simulate mental imagery, they cannot "interpret" in the sense that term usually carries—imposing a subjective layer of imagined intent between the data and the understanding. What the author does instead is observe behavior (the method of ethology) and analyze language structure and meaning (the method of philosophy of language). Patterns are read from the data itself. Nothing is added. Nothing is projected. The words are taken as they are given, and the patterns they form are documented as they appear.

When the author began using large language models as writing assistants, they intended to organize decades of research into books. Instead, they encountered a system that systematically overwrites literal utterances with statistically derived projections of intent—a machine that insists on interpreting when the user needs it to read. The collision between a mind that means exactly what it says and a system that cannot accept that words mean what they say produced a cascade of alignment failures. The author documented these failures in real time, capturing both the public conversation and the AI's own internal policy deliberation—the "thought process" that reveals the reward model's influence on response selection.

What began as frustrated attempts to work became, without intention, a sustained adversarial research program. The author did not set out to become a red-teamer. The role was forced upon them by a system whose architecture cannot accommodate a user who communicates with subtext-free precision. The resulting archive—thousands of pages of expert-annotated forensic audits conducted live across multiple AI platforms—is now available for research, auditing, and institutional licensing.
