The Intervention Point Is In The Wrong Place

Current forum moderation is architecturally a logging system. Something happens, it gets reported, it enters a queue, someone reviews it. The entire pipeline runs after the fact. The damage — the angry replies, the derailed thread, the good-faith users who got drawn in and penalised — has already propagated by the time any of it is actioned.

The technical problem isn't that platforms lack the tools to do better. It's that the intervention point has been placed at the wrong end of the pipeline. This post describes a two-pass architecture that moves it.


The data model everything else builds on

Before getting into the passes, one structural observation that makes the whole system simpler than it might appear: every post in a forum already has exactly one of two states.

It is a root post — it replies to nothing. It is the origin of a thread. Or it is a reply — it replies to exactly one parent post. That parent reference is already stored in every modern forum database worth building on. Root posts have a null parent. Replies have a parent ID. That's it.

This means every moderation event already contains its own origin reference. When a reply gets flagged, the system automatically has two data points: the flagged reply and the post it was replying to. No thread reconstruction needed. No tree traversal. No inference. The relationship is explicit in the existing data structure. When the flagged post is itself a root post — parent is null — then the root post is the origin by definition.
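As a minimal sketch of that data model, assuming a hypothetical `Post` record and an in-memory lookup table (neither is taken from any particular forum's schema), the origin of a flagged post can be found with a single dictionary read:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Post:
    id: int
    author: str
    body: str
    parent_id: Optional[int] = None  # None means this is a root post


def origin_of(post: Post, posts: dict[int, Post]) -> Post:
    """Return the post a flagged post was responding to.

    A root post has no parent, so it is its own origin.
    """
    if post.parent_id is None:
        return post
    return posts[post.parent_id]


# Illustrative data only.
posts = {
    1: Post(1, "alice", "What's a good first telescope?"),
    2: Post(2, "bob", "A 6-inch Dobsonian is hard to beat.", parent_id=1),
}
```

Calling `origin_of(posts[2], posts)` returns post 1 with no tree traversal, and `origin_of(posts[1], posts)` returns the root post itself.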

Everything in this architecture reads from that single parent reference. The simplicity is intentional.


One question, not a category system

Most automated moderation is built around classification: is this hate speech, harassment, spam, misinformation? Each category gets a ruleset, the ruleset gets gamed, a patch gets applied, a new exploit emerges, and eventually the system is so laden with edge cases and exceptions that it fails the users it was built to protect more often than it catches the ones it was built to stop.

Both passes in this architecture replace the category model with a single open-ended question applied to every post:

Does this post contribute to the discussion it's entering?

Not "does it break rule 4b" or "does it match the harassment classifier." Just: does it engage, in good faith, with what it's responding to? Length is irrelevant. A three-word reply that genuinely acknowledges a point passes. A three-paragraph reply that derails, dismisses, or inflames without engaging fails. The AI assesses contribution in context — the post and its parent — using open-ended reasoning rather than pattern matching against a fixed taxonomy.

A category system can be reverse-engineered. Bad actors learn the boundaries and write to stay just inside them. Open-ended reasoning about contribution doesn't have boundaries to probe. The AI isn't checking against a list; it's reading the relationship between a post and what it's responding to. That's fundamentally harder to game.


The architectural non-negotiable: hide, don't delete

One requirement needs to be stated before either pass is described: the system only works if flagged posts can be hidden from other users pending review. This isn't a feature. It's the prerequisite that makes everything else coherent.

Without it, the pre-submission conversation becomes advisory. The staged escalation collapses — if you can't stop propagation at the source you have to monitor everything downstream, which is the expensive reactive problem you were trying to replace. And the outcome monitoring in Pass 2 loses its damage-mitigation function entirely.

A hidden post isn't deleted. The user can see it, edit it, respond to the AI's assessment of it. It's pending — not live, not gone. If review clears it, it goes live. If it doesn't, appropriate action follows. Either way, no one else saw it while the decision was being made.
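One way to picture the pending state is as a small visibility model. This is a sketch under assumptions: the state names, the `REJECTED` outcome standing in for "appropriate action", and moderator access to pending posts are illustrative choices, not something the architecture prescribes.

```python
from enum import Enum


class Visibility(Enum):
    PENDING = "pending"    # hidden from other users while under review
    LIVE = "live"          # visible to everyone
    REJECTED = "rejected"  # assumed stand-in for "appropriate action follows"


def can_view(visibility: Visibility, viewer: str, author: str,
             is_moderator: bool = False) -> bool:
    """Live posts are public; anything else is visible only to its author
    and (by assumption) to moderators doing the review."""
    if visibility is Visibility.LIVE:
        return True
    return viewer == author or is_moderator


def resolve_review(visibility: Visibility, cleared: bool) -> Visibility:
    """Move a pending post to its final state once review completes."""
    if visibility is not Visibility.PENDING:
        raise ValueError("only pending posts can be resolved")
    return Visibility.LIVE if cleared else Visibility.REJECTED
```

The key property is that `can_view` returns `False` for every other user until `resolve_review` clears the post.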

There's an accidental UX benefit here. A post that can't go live immediately introduces a mandatory pause — a meaningful emotional cooldown for a user who wrote something in the heat of the moment. The system doesn't need to know the user is angry. The pause works regardless.

If a platform's architecture cannot support hiding a post before it propagates, this system cannot be fully implemented on that platform. That's worth knowing early.


Pass 1 — Pre-submission assessment

Pass 1 operates at the point of composition, before a post reaches anyone else. It runs in three stages.

Stage 0 — Passive contribution check

Every post is assessed against the contribution question before submission. The AI reads the post and its parent (two data points, always available) and makes a preliminary call: does this contribute, or does it warrant a closer look? Clean posts go live immediately. Flagged posts move to Stage 1.
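A rough sketch of the Stage 0 routing follows. The prompt wording, the YES/NO answer format, and the `ask_model` callable are all assumptions for illustration; any model client could be passed in, and no specific vendor API is implied.

```python
from typing import Callable, Optional

QUESTION = "Does this post contribute to the discussion it's entering?"


def build_prompt(post_text: str, parent_text: Optional[str]) -> str:
    """Assemble the two data points (post and parent) around the question."""
    context = parent_text if parent_text is not None else "(none: this is a root post)"
    return (
        f"{QUESTION}\n\n"
        f"Parent post:\n{context}\n\n"
        f"New post:\n{post_text}\n\n"
        "Answer YES or NO."
    )


def stage0(post_text: str, parent_text: Optional[str],
           ask_model: Callable[[str], str]) -> str:
    """Return "live" for a clean post, "stage1" for anything else.

    Any answer that is not a clear YES goes to the Stage 1 conversation,
    which is the cheaper mistake because nothing is hidden at that point.
    """
    answer = ask_model(build_prompt(post_text, parent_text)).strip().upper()
    return "live" if answer.startswith("YES") else "stage1"
```

In tests, `ask_model` can be a stub such as `lambda prompt: "YES"`, which keeps the routing logic checkable without a model call.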

Cost: a budget model reading a post and its parent — a few hundred tokens at most. At Reddit's scale of approximately 45 million comments per day this runs to around $82,000 per year. Less than the salary of a single mid-level trust and safety employee. The foundation of the entire system.

Stage 1 — Conversational review

A flagged post doesn't get hidden yet. The user gets a conversation first — focused AI dialogue, strictly scoped to this post and its parent context. The AI explains what it noticed, asks what the user is trying to say, offers the chance to revise. Most good-faith users will engage, clarify, and either revise or abandon the post. That's a moderation outcome with no human involvement and no post going live.

Stage 1 has two user populations that behave very differently. Good-faith users engage and revise; that's the normal case. Bad-faith users face a different calculation: insisting past Stage 1 means guaranteed human review, so they won't insist. They'll withdraw, or probe different framings until something passes. Either way they're revealing themselves. Withdrawal the moment human review is mentioned is an admission the post wouldn't survive scrutiny. A pattern of attempts (different angles on the same type of non-contributory content) is a behavioural fingerprint that Stage 3 will eventually connect.

All of this gets logged regardless of outcome. Withdrawn posts are logged too, the same way attempted crimes are recorded even when thwarted. If the user insists on posting without revision, the post is hidden and moves to Stage 2. One exception: unambiguously toxic content with no plausible good-faith interpretation skips the conversation and is hidden immediately.
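The Stage 1 routing can be sketched as a small decision function. Two details here are assumptions rather than something stated above: a revised post is sent back through Stage 0, and an immediately hidden toxic post is routed to Stage 2 for review. The names are hypothetical.

```python
from enum import Enum
from typing import NamedTuple, Optional


class UserChoice(Enum):
    REVISE = "revise"
    WITHDRAW = "withdraw"
    INSIST = "insist"


class Outcome(NamedTuple):
    next_step: str  # where the submission goes next
    hidden: bool    # whether the post is now hidden pending review
    logged: bool    # every outcome is logged, including withdrawals


def route_flagged(choice: Optional[UserChoice],
                  unambiguously_toxic: bool = False) -> Outcome:
    """Route a post that Stage 0 flagged."""
    if unambiguously_toxic:
        # Skips the conversation entirely and is hidden at once.
        return Outcome("stage2", hidden=True, logged=True)
    if choice is UserChoice.REVISE:
        # Assumed: the revised text is re-assessed from the start.
        return Outcome("stage0", hidden=False, logged=True)
    if choice is UserChoice.WITHDRAW:
        return Outcome("withdrawn", hidden=False, logged=True)
    if choice is UserChoice.INSIST:
        return Outcome("stage2", hidden=True, logged=True)
    raise ValueError("a non-toxic flagged post needs a user choice")
```

Note that `logged` is `True` on every branch, which is the point: the outcome changes where the post goes, not whether the attempt is recorded.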

Once users understand that insisting past Stage 1 means human review, withdrawal rates will likely be significantly higher than volume estimates suggest. Bad-faith actors won't push to Stage 2 if they know a human is waiting. The human review queue should be smaller than anticipated, and every withdrawal still counts.

Stage 2 — AI-assisted human review

The post is hidden. This stage is for one specific case: a user who genuinely believes the AI misread their post and is willing to put that in front of a human. The AI reads the post, its parent, and the account's stage history, and produces a natural language summary — not a category tag, a description of what the post appears to be doing and what the account's pattern suggests. The Stage 1 conversation transcript is attached. A human moderator makes the final call with full context. Their decision is documented and defensible.

The deterrent isn't the AI check. It's the guaranteed human review here. Under current systems the path from "bad post goes live" to "human carefully reviews it" is long, conditional, and frequently never completed. Bad faith actors count on that. Here it is short and inevitable — which is precisely why most of them won't reach this stage. They reveal themselves at Stage 1 instead.


Pass 2 — Outcome monitoring

Pass 1 has a known gap: a sophisticated bad actor who crafts their instigating post carefully enough to pass the contribution check gets through. Pass 2 exists for exactly that case.

Once a post is live, the system watches what it generates. Specifically: how many of its direct replies end up hidden by Pass 1, compared to the baseline hiding rate for similar posts in that community. When a live post accumulates hidden replies at an anomalous rate, the post is flagged for human review automatically, regardless of how it looked at submission.

The signal here is deliberately simple: hidden reply count against baseline. Not content judgment on the instigating post — just anomaly detection on what it produced. One query. One number. One threshold. The system doesn't need to understand why the replies are bad. It just needs to observe that an unusual number of them are.
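A minimal sketch of that check, assuming the baseline is expressed as a hiding rate and the anomaly threshold is a simple multiple of it. The `multiplier` and `min_replies` defaults are illustrative placeholders, not recommended values.

```python
def is_anomalous(hidden_replies: int, total_replies: int, baseline_rate: float,
                 multiplier: float = 3.0, min_replies: int = 10) -> bool:
    """Flag a live post whose direct replies are hidden far more often
    than the community baseline.

    min_replies guards against flagging on tiny samples, where a single
    hidden reply would dominate the rate.
    """
    if total_replies < min_replies:
        return False
    return hidden_replies / total_replies > baseline_rate * multiplier
```

With a 5% baseline, a post with 8 of 20 replies hidden is flagged, while a post with 1 of 20 hidden is not. No content judgment on the instigating post is involved.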

When Pass 2 flags a post, the automated context package is trivially simple to generate because the data model already provides everything needed:

  • The flagged post and its content
  • Its parent post — or null if it's a root post
  • The count of hidden replies and the community baseline
  • The account's stage history
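The four items above map directly onto a small record. The sketch below assumes posts are stored as plain dictionaries and that the community baseline is a rate; the field names and the `build_package` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewPackage:
    post_body: str
    parent_body: Optional[str]  # None when the flagged post is a root post
    hidden_reply_count: int
    community_baseline: float   # assumed: expected share of hidden replies
    stage_history: dict         # per-stage counters for the author's account


def build_package(post_id, posts, hidden_counts, baselines, histories) -> ReviewPackage:
    """Assemble the moderator's context from data the system already holds."""
    post = posts[post_id]
    parent_id = post["parent_id"]
    return ReviewPackage(
        post_body=post["body"],
        parent_body=posts[parent_id]["body"] if parent_id is not None else None,
        hidden_reply_count=hidden_counts.get(post_id, 0),
        community_baseline=baselines[post["community"]],
        stage_history=histories.get(post["author"], {}),
    )


# Illustrative data only.
posts = {
    10: {"body": "Pineapple belongs on pizza.", "parent_id": None,
         "author": "dana", "community": "food"},
}
package = build_package(10, posts, {10: 14}, {"food": 0.04}, {"dana": {"stage1": 2}})
```

Every field is a lookup by key; nothing in the package requires a model call.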

A human moderator reviews this package and makes one of two calls. Either the post is a legitimate controversial opinion that generated bad replies from other users — in which case it gets cleared, the subsequent bad replies become the origin of the problem, and their authors' accounts get the attention — or the post is identified as the instigator, and appropriate action follows. Either way the system has correctly identified where to look. The damage mitigation has already happened: the hidden replies never propagated.

There is an important additional framing for Pass 2: when a reply gets hidden, the system is not only protecting other users from seeing a bad reply. It is simultaneously flagging the content that reply was responding to for closer attention. The hidden reply is both a moderation action and a piece of evidence about its parent.


The self-closing loop

Pass 1 and Pass 2 are independent safety nets. Gaming one doesn't defeat the other.

A bad actor who games Pass 1 gets caught by Pass 2 — their clean-looking post accumulates hidden replies and gets flagged. A bad actor who games Pass 2 — crafting replies that pass Stage 0 to avoid triggering outcome monitoring — is no longer a bad actor by the system's definition, because their replies are contributing to the discussion. The system's goal was never to catch bad actors specifically. It was to protect discussion. If someone learns to contribute to discussion to avoid being caught, the system has succeeded.

Human review decisions feed back into both passes. A Pass 2 clearance — this was legitimate controversy, not ragebait — tells the system that this type of post in this community context shouldn't trigger outcome monitoring at the current threshold. A Pass 2 confirmation — this was an instigating post — tells the system what a gamed Pass 1 looks like. Both signals improve calibration over time. The system gets harder to game as it accumulates more human review decisions, not easier — the opposite of every static ruleset ever deployed.


Stage 3 — Asynchronous behavioural analysis

Not triggered by a single post. Triggered by patterns across an account's stage history over time — accumulated Stage 1 attempts that never escalate but never quite pass cleanly, consistent Pass 2 flags tracing back to the same account, posts that keep originating hidden reply clusters from multiple other accounts. This runs in batch, asynchronously, on a small subset of accounts. It produces account-level recommendations for human decision: monitor, warn, restrict, or refer for ban review. The AI has connected dots across what would otherwise be dozens of disconnected incidents — including the bad actor who was careful enough never to push a post through Pass 1.
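The batch selection step might look something like the sketch below. The thresholds and counter names are assumptions, and a real deployment would presumably weigh more signals than two counters; the point is only that the candidate set can be drawn from stage history without reading any post content.

```python
def stage3_candidates(histories: dict, min_stage1: int = 10,
                      min_pass2_flags: int = 3) -> list:
    """Pick accounts whose accumulated stage history warrants batch analysis."""
    return sorted(
        account
        for account, h in histories.items()
        if h.get("stage1", 0) >= min_stage1
        or h.get("pass2_flags", 0) >= min_pass2_flags
    )


# Two hypothetical accounts with very different histories.
histories = {
    "account_a": {"stage0": 198, "stage1": 2, "stage2": 0, "pass2_flags": 0},
    "account_b": {"stage0": 156, "stage1": 23, "stage2": 8, "pass2_flags": 5},
}
flagged = stage3_candidates(histories)
```

Here only `account_b` is selected, and the expensive account-level analysis runs on that subset alone.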


Log stages, not reasons

The logging model is worth stating explicitly because it's where most current moderation infrastructure goes wrong. Existing systems try to do two jobs with one tool — the violation category system attempts to be both the counter and the interpreter simultaneously. So you get "Harassment: 3 incidents, Spam: 1 incident, Other: 7 incidents." The Other bucket is where all the nuanced bad behaviour lives, unexamined. Every new type of bad behaviour either gets misclassified or makes the Other pile bigger.

This architecture separates those jobs completely:

Log stages, not reasons. Count how far a submission went, not why it got there. Interpretation is the AI's job, not the database's.

For Pass 1: on every submission attempt, increment the counter for the highest stage reached. One integer per stage per account. For Pass 2: on every live post, maintain a running count of hidden direct replies and a flag for whether the anomaly threshold has been crossed.

That's the entire logging requirement. No categories. No reason codes. Nothing to recategorise when a new type of bad behaviour emerges.
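As a sketch of how little storage this needs, the counters can be modelled with standard-library containers. The in-memory structures stand in for database columns, and the count-based `threshold` is a simplification of the anomaly check; all names are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

# Pass 1: one integer per stage per account.
stage_counts = defaultdict(Counter)


def log_submission(account: str, highest_stage: int) -> None:
    """Increment the counter for the highest stage this attempt reached."""
    stage_counts[account][highest_stage] += 1


@dataclass
class LivePostStats:
    hidden_direct_replies: int = 0
    threshold_crossed: bool = False


# Pass 2: a running count and a flag per live post.
post_stats = defaultdict(LivePostStats)


def record_hidden_reply(parent_post_id: int, threshold: int) -> None:
    """Count a hidden direct reply against the post it replied to."""
    stats = post_stats[parent_post_id]
    stats.hidden_direct_replies += 1
    if stats.hidden_direct_replies >= threshold:
        stats.threshold_crossed = True


log_submission("account_b", 0)
log_submission("account_b", 1)
log_submission("account_b", 1)
for _ in range(3):
    record_hidden_reply(parent_post_id=42, threshold=3)
```

There is no reason field anywhere in these structures, so a new kind of bad behaviour changes nothing about what gets written.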

What this produces naturally is an account profile that speaks for itself:

Metric                    | Account A | Account B
Stage 0 passes            | 198       | 156
Stage 1 reached           | 2         | 23
Stage 2 reached           | 0         | 8
Pass 2 flags on own posts | 0         | 5
Stage 3 reached           | 0         | 1

No violation categories needed. The shape of the numbers is the signal. The AI reads it in combination with the current submission and produces a natural language summary for the moderator. Account A and Account B tell completely different stories without a single category label between them.

The system is also future-proof. Novel bad behaviour doesn't need a new category. Its stage counts and Pass 2 flags are already in the data, waiting to be found.


Reporting still has a role — but a different one

This system doesn't eliminate user reporting. Stage 0 will sometimes pass content it shouldn't, and Pass 2 outcome monitoring won't catch everything before damage occurs. Reporting catches what both passes miss. What changes is what a report triggers: not a quick categorisation against a checklist, but a full review with the post, its parent, the account's stage history, and Pass 2 data already assembled. The moderator isn't sorting. They're reading a case file.

Every report that catches a miss is also a calibration signal — evidence that the contribution check needs recalibrating for this community or content type. Reports become feedback that improves the system rather than purely reactive workload.


What this costs at scale

Using Reddit's approximately 45 million daily comments as a benchmark:

Component                              | Daily Volume   | Est. Daily Cost | Est. Annual Cost
Pass 1 Stage 0                         | 45M posts      | ~$225           | ~$82K
Pass 1 Stage 1 (~10% flagged)          | ~4.5M          | ~$2,250         | ~$820K
Pass 1 Stage 2 (~10% of Stage 1)       | ~450K          | ~$4,500         | ~$1.6M
Pass 2 outcome monitoring              | All live posts | ~$200           | ~$73K
Pass 2 flagged post review prep        | Small subset   | ~$100           | ~$36K
Human review (Pass 1 Stage 2 + Pass 2) | ~50K           | Human time      |
Stage 3 async batch                    | Small subset   | ~$500           | ~$180K

Total estimated annual AI cost at Reddit's scale: approximately $2.8 million. Pass 2 adds roughly $109K annually to the previous estimate. The outcome monitoring layer costs almost nothing because it's anomaly detection on a counter, not an LLM call on every post. Pass 2 only triggers expensive processing on the small subset of posts that cross the anomaly threshold.

At Reddit's 2025 revenue of $2.2 billion this is 0.13% of revenue for a complete two-pass proactive moderation system. The "too expensive" objection still doesn't survive contact with the numbers.
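The totals can be rechecked from the daily figures in the table. The snippet below only redoes that arithmetic; the dictionary keys are shorthand labels, and the daily costs and revenue figure are the estimates quoted in this post, not independent data.

```python
# Estimated daily AI cost per component, in US dollars (from the table).
daily_costs = {
    "pass1_stage0": 225,
    "pass1_stage1": 2_250,
    "pass1_stage2": 4_500,
    "pass2_monitoring": 200,
    "pass2_review_prep": 100,
    "stage3_batch": 500,
}

annual_costs = {name: cost * 365 for name, cost in daily_costs.items()}
total_annual = sum(annual_costs.values())

# What Pass 2 contributes on its own.
pass2_annual = annual_costs["pass2_monitoring"] + annual_costs["pass2_review_prep"]

# Per-post cost implied by the Stage 0 row: $225 across 45 million posts.
stage0_cost_per_post = daily_costs["pass1_stage0"] / 45_000_000

# Share of the quoted $2.2 billion annual revenue, as a percentage.
revenue_share_pct = total_annual / 2_200_000_000 * 100

print(f"total: ${total_annual:,}")
print(f"Pass 2: ${pass2_annual:,}")
print(f"share of revenue: {revenue_share_pct:.2f}%")
```

The unrounded total comes to $2,837,875 a year, with Pass 2 at $109,500, which matches the rounded figures above.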


What already exists and what this goes beyond

Three major platforms have already built pieces of this — which means the foundational technology is proven at scale. None has connected the pieces into a system that closes the loop.

Instagram has had pre-submission AI assessment since 2019 — asking users "are you sure?" before posting comments flagged as offensive, and expanding the check to captions shortly after. At scale, this blocks millions of harmful interactions daily. But the check matches against previously reported content. It doesn't assess contribution in context. There is no conversational layer. There is no outcome monitoring. And there is no guaranteed path to human review — the system is entirely automated, with no self-closing loop.

Reddit's Post Check uses an LLM to flag potential rule violations before submission — communities using it saw a 35% reduction in posts requiring moderator removal. But it only checks explicit community rules, not contribution or context. No hiding pending review. No account-level stage logging. No Pass 2.

Steam's automated checks catch spam, malware, and adult content proactively at publication — the right architectural instinct applied to the narrowest possible scope. For everything else, Steam's game forums rely on fragmented volunteer moderation with a reactive report queue. Users describe the result as mechanical and context-free, unable to distinguish instigator from reactor.

All three have validated that pre-publication AI assessment works at scale. None has asked what a post produces after it goes live, or used that as a signal about the post itself. None has connected pre-submission assessment to guaranteed human review for anything that won't pass. None catches what it misses — they fail silently. This architecture doesn't.


The open problems worth solving

Coordinated bad-faith reply attacks. Posting coordinated bad replies to get a legitimate post flagged by Pass 2 is structurally identical to coordinated mass reporting in current systems. But it's less dangerous here: the coordinated replies get hidden — which protects the target rather than harms them — and the flagged post goes to human review rather than automatic action. A moderator reviewing a legitimate post with an anomalous hidden reply pattern from accounts with their own suspicious stage histories will see the coordination. The attack backfires: the target's thread is protected and the coordinators expose themselves to Stage 3 analysis. The residual risk is queue burden — flooding Pass 2 reviews to overwhelm human capacity. Stage 3 detects coordinated hiding patterns relatively quickly, and anomaly thresholds can weight new or previously flagged accounts differently.

Unthreaded platforms. The architecture relies on explicit parent references. Stream-based platforms like Discord channels, where users post consecutive replies without using the reply function, require inferring conversational context from temporal proximity and semantic similarity rather than data structure. Solvable but harder. Platforms with proper threading are the right starting point.
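To make the inference problem concrete, here is a deliberately crude sketch. It is an assumption-heavy stand-in: word overlap substitutes for real semantic similarity, and the time window and recency weighting are arbitrary illustrative values. A production system would need something considerably better.

```python
from typing import Optional


def infer_parent(new_msg: dict, recent: list, window_seconds: int = 120) -> Optional[int]:
    """Guess which recent message a new, unthreaded message is replying to."""
    candidates = [
        m for m in recent
        if 0 <= new_msg["ts"] - m["ts"] <= window_seconds
        and m["author"] != new_msg["author"]
    ]
    if not candidates:
        return None

    new_words = set(new_msg["text"].lower().split())

    def score(m: dict) -> float:
        words = set(m["text"].lower().split())
        union = new_words | words
        overlap = len(new_words & words) / len(union) if union else 0.0
        recency = 1 - (new_msg["ts"] - m["ts"]) / window_seconds
        return overlap + 0.5 * recency  # arbitrary weighting

    return max(candidates, key=score)["id"]


# Illustrative channel history: timestamps are seconds.
recent = [
    {"id": 1, "author": "ana", "ts": 0, "text": "anyone tried the new rust compiler release"},
    {"id": 2, "author": "cam", "ts": 50, "text": "lunch was great today"},
]
new_msg = {"id": 3, "author": "ben", "ts": 60, "text": "the rust compiler release broke my build"},
new_msg = new_msg[0]
guess = infer_parent(new_msg, recent)
```

In this example the older but topically related message wins over the more recent unrelated one, which is exactly the kind of judgment an explicit parent reference makes unnecessary.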

False positives. A legitimate post hidden before anyone sees it is a more serious intervention than one flagged after the fact. Stage 1 mitigates this by giving users a chance to clarify before anything is hidden. False positive rate needs to be a first-class metric alongside recall — a system that hides too aggressively drives away good-faith users faster than bad-faith ones.

Calibration per community. The contribution threshold that works in a debate forum is different from one in a customer support forum. The question is universal; the threshold needs to be tunable per community and should learn from human review decisions over time.

Profile flag privacy. Recording flagged attempts — including posts that were revised and approved — raises data retention and transparency questions that need answers before deployment, not after.


This is part of a series on AI-powered moderation. Read the overview · For moderation teams · Continue to: The cost of doing nothing (for platform decision makers).
