The Trojan Sentence

by Jackson Mershon

I would like to propose an idea.

An email arrives and the tone feels familiar, so no worries - it is just one of the hundred or so you expect today.  The formatting follows your internal standard and the project names and acronyms all align.  You read it, approve it, and move on.  A week later, someone notices a privilege shift that was not logged as an exception.  The sentence that triggered it did not trip filters or flag metrics.  It passed because it sounded right, and the external message matched internal checks and balances closely enough to corroborate its story.

In 2023, Check Point Research described attacker use of AI-generated emails tuned to a company's internal voice and brand.  There was no signature malware or zero-day exploit.  The breach came dressed in cadence and that is what made it effective.  The familiar phrasing got the message read and familiarity let it pass through selective passivity.

The system that made this possible was not malicious by design.  It started with convenience: autocomplete, smart replies, writing assistants, and easy-to-follow suggestions.  These tools live inside inboxes, ticketing systems, customer relationship management platforms, and Slack threads.  They do not think; they predict, one token at a time, based on what has come before.  Users accept these predictions, and over time the organization's voice begins to collapse.  The collapse can be subtle at first.  Then you start hearing the same sentence fragments in onboarding flows, release emails, and support macros across teams that have not spoken in months.

I have seen this happen in production.  Quarter over quarter, you or your team can measure it: phrase convergence climbs while review cycles shrink.  The language gets smoother, but the "sameness" rises as word choice narrows and variance drops.  And once enough channels share the same statistical rhythm, anomalies do not just blend in.  They disappear and become wallpaper.
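
If you want a concrete handle on that drift, here is a minimal sketch in Python of one way to measure it, assuming you can export message text per quarter.  The quarterly samples and the trigram choice are mine, not any standard tool's; a rising score means phrasing is converging.

    from itertools import combinations

    def trigrams(text):
        # Word trigrams are a crude but serviceable unit of "phrasing."
        w = text.lower().split()
        return {tuple(w[i:i + 3]) for i in range(len(w) - 2)}

    def convergence(corpus):
        # Mean pairwise trigram Jaccard overlap across one quarter's messages.
        scores = []
        for a, b in combinations(corpus, 2):
            ta, tb = trigrams(a), trigrams(b)
            if ta | tb:
                scores.append(len(ta & tb) / len(ta | tb))
        return sum(scores) / len(scores) if scores else 0.0

    q1 = ["Thanks for the heads up, we will take a look this week.",
          "Appreciate you flagging this, looping in the right team now."]
    q2 = ["Thanks for flagging this, we will circle back shortly.",
          "Thanks for flagging this, we will circle back by Friday."]

    print(convergence(q1))  # low overlap: distinct voices
    print(convergence(q2))  # higher overlap: the voice is collapsing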

There is a human reason this works: our brains prefer ease and predictable phrasing because it removes decision tree costs and makes complicated choices simpler.  This is not laziness; it is resource conservation inherited from our ancestors.  Once a house style settles, deviation triggers discomfort - even when it is more accurate or honest.  In review meetings, I have seen teams debate whether something "sounded right" instead of whether it was right.  The style became the validator, not the content, and that is the vulnerability.

Collapse like this reshapes what people notice and what they ignore.  Research in cognitive science has long linked linguistic range to cognitive flexibility.  When predictive systems compress phrasing, they also compress perception and the range in which complex ideas can be communicated.  What does not match the pattern gets filed under error, anomaly, or noise.  And in an operations environment where throughput matters more than authorship, that means the breach will not come in through an obvious back door.  It will walk in wearing yesterday's repurposed sales training manual.

This creates operational brittleness, because detection (human or machine) is a function of contrast.  Stylometry relies on differences in n-gram usage, function word frequency, and rhythm to identify authorship.  When predictive phrasing flattens those features, it strips detection of its edge.  Instead of flagging unknown patterns, the system trains itself to pass anything that looks close enough.  Cadence becomes the credential, and if the human element does not have sufficient training on the company's data, a door is wide open.
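
To make the contrast point concrete, here is a hedged sketch of the kind of fingerprint stylometry leans on: relative function word frequencies compared by cosine similarity.  The word list is truncated for space (real systems use hundreds of features), and the sample texts are invented.

    import math
    from collections import Counter

    FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is",
                      "was", "for", "it", "with", "as", "but", "not", "so"]

    def fingerprint(text):
        # Relative frequency of each function word: content-free, style-rich.
        counts = Counter(text.lower().split())
        total = len(text.split()) or 1
        return [counts[w] / total for w in FUNCTION_WORDS]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    human = "So it was not the tool but the habit that made the choice for us."
    macro = "Thanks for the update. The team will review it and circle back."
    print(cosine(fingerprint(human), fingerprint(macro)))

    # When predictive phrasing pulls everyone toward the same defaults,
    # every pair scores near 1.0 and the fingerprint stops meaning anything.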

This is not theoretical.  At another firm, a message was injected mid-thread in a permissions escalation queue.  It referenced a known macro by name, asked for temporary admin access to clear a "stuck loop," and promised to roll back after the form cleared.  The timestamp matched a recent backlog cycle and the phrasing structure matched old threads.  But the macro had never been called that; the message was typed manually.  Just four words were different, and the action passed.  No escalation was raised, and by Monday the system had been used to extract key material from internal tools.

The payload was not a file or a link; it was a sentence.  The tone matched and the context was embedded to land in perfect rhythm with the expectations of the company, its humans, and its machines.

How did this happen?

Detectors did not catch it.  GLTR (the Giant Language Model Test Room), which uses token likelihood to visualize probable machine-generated text, loses strength once humans touch the output.  DetectGPT, which uses probability curvature to identify LLM-authored content, performs well on clean samples, but breaks down with partial edits or model mismatches.  Watermarking schemes like Kirchenbauer's can embed signatures in generated text, but small paraphrases destroy those traces.  None of these techniques can reliably distinguish between a helpful template and a weaponized one, especially when the difference is just four words.
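
For readers who want to see what "token likelihood" means in practice, here is a minimal GLTR-style sketch using the Hugging Face transformers library and GPT-2.  It is not GLTR itself, just the same intuition: rank each real token against the model's predictions, then look at how many land near the top of the distribution.

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def token_ranks(text):
        # For each position, how highly did the model rank the token that
        # actually appeared?  Machine text tends to sit near the top.
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits  # shape: (1, seq_len, vocab_size)
        ranks = []
        for pos in range(ids.shape[1] - 1):
            actual = ids[0, pos + 1]
            rank = int((logits[0, pos] > logits[0, pos, actual]).sum()) + 1
            ranks.append(rank)
        return ranks

    ranks = token_ranks("Please grant temporary admin access to clear the stuck loop.")
    print(sum(r <= 10 for r in ranks) / len(ranks))  # fraction of "top-10" tokens

    # One human edit per sentence is enough to drag this number toward
    # whatever the house style already looks like.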

I also want to expand on this.  In 2025, researchers demonstrated subliminal transfer between language models: a teacher model's behavioral traits passed into a student model, not through explicit data, but through reused phrasing.  The signals rode inside token frequency and structure, not content - just the statistical fingerprints.  Humans can act as unintended couriers, feeding, influencing, and changing AI models by copying model-written phrasing into docs, prompts, and policies.  Those fragments then seed training corpora, and systems that were never meant to touch are cross-contaminated.  Everyone is rushing to connect everything without thinking about what is connecting and why.  A support team pushes a policy update using suggested phrasing; next quarter, marketing borrows the copy for internal templates; those knowledge base fragments feed into training sets.  The handoff is unintended and the effects will be felt at an unknown time and place.

At Coinbase, voice phishing succeeded not because the message was clever, but because it aligned with internal rhythm.  It blended with known scripts, and the staff complied because the tone was familiar.  That breach did not need a backdoor; it needed a believable voice and backstory.  Drop the same request inside a thread with AI-assisted edits and template fragments, and even a careful reader starts to skim.

This is the Trojan sentence: a line that walks like the house style, speaks like last week's macro, and moves decisions forward with no signature other than the cadence of familiarity.  The attack surface is not an application; it is the convergence of language, trust, and expectation.

There is also the question of signaling.  Once phrasing is predictable, you can embed subtle cues: sentence length, punctuation rhythm, even minor repetitions.  None of these will trigger a scanner, but downstream they shift the posture of the reader or the model.  I have seen support tickets rerouted for a full week because one phrase changed in the help desk macro, and people learned the new default before realizing anything had changed.
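
None of those cues are exotic.  Here is a sketch of what such a "rhythm signature" might measure; the fields and the example text are mine, for illustration only.

    import re

    def rhythm_signature(text):
        # Sentence lengths and punctuation cadence: cheap to compute,
        # invisible to keyword scanners, and easy to nudge deliberately.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        lengths = [len(s.split()) for s in sentences]
        n = max(len(sentences), 1)
        return {
            "mean_sentence_len": sum(lengths) / max(len(lengths), 1),
            "commas_per_sentence": text.count(",") / n,
            "semicolons_per_sentence": text.count(";") / n,
        }

    print(rhythm_signature("Approve it today. Roll back after the form clears, as usual."))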

The risk is not theoretical; it is procedural.  Once predictive phrasing becomes the organizational default, most people stop writing and start guiding.  They skim, approve, and the system starts speaking for them instead.

Static rules do not fix this, but small changes help.  One team disabled auto-complete in high-risk queues.  Another tracked near-duplicate phrasing across unrelated workflows and flagged it when it landed outside context.  A third required that any action-changing message be signed by a named author, even if the content was templated.  Not because attribution prevents breaches, but because ownership restores variance.  And variance is what makes anomalies visible again.
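
Here is a minimal sketch of the second team's idea, with hypothetical workflow names and a threshold I picked for illustration: shingle each message into word five-grams and flag pairs from unrelated workflows whose overlap is suspiciously high.

    from itertools import combinations

    def shingles(text, n=5):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    messages = {
        ("support", "ticket-481"): "please confirm the rollback once the stuck loop clears on your end",
        ("marketing", "brief-12"): "please confirm the rollback once the stuck loop clears before launch",
        ("legal", "memo-7"): "counsel has reviewed the draft and has no further comments at this time",
    }

    for (wf_a, id_a), (wf_b, id_b) in combinations(messages, 2):
        if wf_a == wf_b:
            continue  # duplication inside one workflow is expected
        score = jaccard(shingles(messages[(wf_a, id_a)]),
                        shingles(messages[(wf_b, id_b)]))
        if score >= 0.5:  # illustrative threshold, tune on your own data
            print(f"near-duplicate across workflows: {id_a} <-> {id_b} ({score:.2f})")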

A realistic defense treats language as shared infrastructure.  It tracks style collapse, audits tone drift, and pauses predictive phrasing wherever a message changes access, identity, or control.  It labels training samples by human authorship and builds friction around fluency, because fluency is where the breach can lie.  So, I would say, ask yourself this: "Does the language in my system still belong to the people who wrote it?  Or has the system started to predict and speak in their place?"
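
The "pause predictive phrasing" rule is the easiest of these to prototype.  A hypothetical gate, with an illustrative keyword list that any real deployment would replace with its own access and identity vocabulary:

    SENSITIVE = ("admin", "access", "permission", "credential", "role",
                 "token", "escalate", "grant", "revoke")

    def autocomplete_allowed(draft):
        # If a draft touches access, identity, or control, suggestions stop
        # and the message must be hand-written and signed by a named author.
        return not any(word in draft.lower() for word in SENSITIVE)

    print(autocomplete_allowed("Can we move standup to Thursday?"))          # True
    print(autocomplete_allowed("Please grant temporary admin access today")) # False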

If the answer is mixed, the breach may already have a foothold.  It will not trigger a rule; it will sound like everything else you trust.  It may ask for something small and you may be tempted to say yes.  But please remember what happens if you give a mouse a cookie.

