
The Moderation Problem Nobody Is Solving

[ 5 min read · Hub post for a series on next-generation AI moderation ]

By the time a moderation report gets filed, the damage is already done. The post has been seen, reacted to, and replied to. Those replies have generated their own replies. Good-faith users have been drawn into something they didn't start. The person who lit the match is still posting. And somewhere in a queue, a moderator is about to open a ticket that describes the fire — with no information about what caused it.

This is how every major community platform moderates in 2026. Not because better tools don't exist. Because nobody has connected the tools that do exist into a system that actually closes the loop.


The match, not the fire

Two types of content generate a disproportionate share of moderation work on every platform, and neither looks like a rule violation on the surface.

The first is the ragebait topic. A user opens a thread designed to draw sides rather than start a conversation — absolute language, faction framing, a thesis calculated to provoke. The replies arrive fast. Within an hour the thread is a war zone. A report gets filed. A moderator actions the most visible offenders — most of whom were reacting to a post that was never reported. The instigator is still posting.

The second is the irrelevant or dismissive reply. A user posts a detailed, good-faith question. Someone replies with something that has nothing to do with it — generic advice that doesn't apply, a response that could belong to any thread. The original poster pushes back, confused. A third party piles in. Now it's a fight, and the queue has another ticket for a conflict that started with a reply that technically didn't break a single rule.

In both cases, current moderation saw the explosion. It never saw what caused it. The report system is structurally blind to the match because the match is rarely what gets reported.


What already exists — and where it stops

Three major platforms have already proven that AI assessment of posts before they go live works at scale.

Instagram has been asking users to reconsider comments flagged as offensive before posting since 2019 — blocking millions of harmful interactions daily. Reddit's Post Check uses an LLM to flag potential rule violations before submission — communities using it saw a 35% reduction in posts requiring moderator removal. Steam checks every post for spam and malicious content at the moment of publication, proactively, without waiting for reports.

Each of these works. Each addresses a narrow slice of the problem. None assesses whether a post actually contributes to the discussion it's entering. None watches what a live post produces and uses that as a signal about the post itself. None catches the instigating post — only the reactions it generates. And none connects pre-submission assessment to a guaranteed human review path for anything that won't pass.

The pieces exist. The closed loop doesn't.


Closing the loop — two passes, not one

The system this series proposes runs on a single question applied to every post before it goes live:

Does this post contribute to the discussion it's entering?

Not "does it match a violation category." Not "does it resemble something previously reported." Does it engage, in good faith, with what it's responding to? That question is harder to game than any ruleset because there are no category boundaries to probe. The AI isn't checking a list. It's reading the relationship between a post and its parent.

The first pass operates before a post reaches anyone else. Clean posts go live with no friction. Flagged posts enter a brief AI conversation (a dialogue, not a warning prompt) where the user is asked what they're trying to say and given the chance to revise. Most good-faith users engage and either improve their post or realise they don't need to post at all. If a user dismisses the conversation and insists on posting, the post is hidden pending human review. Hidden, not deleted: nobody else sees it while a considered decision is being made.
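
As a rough illustration, the first-pass routing described above could be sketched like this. Every name here (`first_pass`, `Outcome`, `Choice`, `toy_check`) is hypothetical, and the contribution check and the dialogue are passed in as plain callables, because how a real platform would implement them is outside the scope of this post. Sending a revision that still fails to human review is also an assumption, based on the idea that anything that won't pass gets a human decision.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Outcome(Enum):
    PUBLISHED = "published"
    WITHDRAWN = "withdrawn"
    HIDDEN_PENDING_REVIEW = "hidden_pending_review"


class Choice(Enum):
    REVISE = "revise"      # user rewrote the post during the dialogue
    WITHDRAW = "withdraw"  # user decided not to post at all
    INSIST = "insist"      # user dismissed the dialogue and posts anyway


# Hypothetical interfaces: a real system would back these with a model
# and a conversational UI.
ContributionCheck = Callable[[str, str], bool]
Dialogue = Callable[[str, str], Tuple[Choice, Optional[str]]]


def first_pass(post: str, parent: str,
               contributes: ContributionCheck,
               dialogue: Dialogue) -> Outcome:
    """Route a submission before anyone else can see it."""
    if contributes(post, parent):
        return Outcome.PUBLISHED  # clean posts go live with no friction

    choice, revised = dialogue(post, parent)
    if choice is Choice.WITHDRAW:
        return Outcome.WITHDRAWN
    if (choice is Choice.REVISE and revised is not None
            and contributes(revised, parent)):
        return Outcome.PUBLISHED

    # Insisted, or revised but still failing: hidden, not deleted.
    return Outcome.HIDDEN_PENDING_REVIEW


def toy_check(post: str, parent: str) -> bool:
    """Stand-in for the real assessment: any shared word counts."""
    return bool(set(post.lower().split()) & set(parent.lower().split()))
```

The point of the sketch is the shape of the flow: the only path to deletion-like outcomes runs through the user's own choice or a human reviewer, never through the automated check alone.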

The second pass operates after a post goes live. Every reply gets assessed against the same contribution question, and replies that fail are hidden before other users see them. When a live post starts accumulating failing replies at an anomalous rate, the original post is flagged for human review automatically. Not because anyone reported it. Because the system observed what it produced. Each hidden reply is simultaneously a moderation action and evidence about the post that provoked it.
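
One way to make "anomalous rate" concrete is a one-sided binomial test against a platform-wide baseline. The sketch below is only that, a sketch: the class name `ReplyMonitor`, the 5% baseline, the 0.001 significance level, and the five-reply minimum are all illustrative assumptions, not figures from any real deployment.

```python
from dataclasses import dataclass
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


@dataclass
class ReplyMonitor:
    """Tracks replies to one live post (hypothetical helper)."""
    baseline_hidden_rate: float = 0.05  # assumed share of replies hidden site-wide
    alpha: float = 0.001                # assumed significance level
    min_replies: int = 5                # assumed minimum before flagging
    replies: int = 0
    hidden: int = 0

    def record_reply(self, passed_check: bool) -> bool:
        """Record one assessed reply; return True if the parent post
        should now be queued for human review."""
        self.replies += 1
        if not passed_check:
            self.hidden += 1
        return self.needs_review()

    def needs_review(self) -> bool:
        if self.replies < self.min_replies:
            return False
        # Flag when this many hidden replies would be very unlikely
        # if the post behaved like a typical one.
        tail = binomial_tail(self.hidden, self.replies,
                             self.baseline_hidden_rate)
        return tail < self.alpha
```

A post with one hidden reply out of twenty stays untouched under these assumed settings, while one with six hidden out of ten is flagged. A production system would likely want per-community baselines, since a heated but healthy debate forum and a support forum will not share the same normal rate.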

A bad actor who games the first pass gets caught by the second. A bad actor who games the second — by learning to write replies that pass the contribution check — is no longer a bad actor by the system's definition. The two passes are independent. Gaming one doesn't defeat the other. Human review decisions feed back into both, making the system harder to game over time rather than easier — the opposite of every static ruleset ever deployed.


Moderation that doesn't feel like moderation

The AI is reading every post at the point of submission anyway — that's the moderation function. The question is what else it does while it's reading.

A well-designed AI layer uses that reading to do several useful jobs simultaneously. It checks whether a sufficiently similar thread already exists — not by keyword matching, which returns too many results for anyone to bother checking, but by semantic similarity — and redirects users to existing discussions rather than letting the same conversation happen twelve times. It helps users articulate what they actually mean, catching misreadings before they cause conflict and embarrassment. And underneath all of that, it assesses contribution, logs attempt patterns, and watches post outcomes.
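
The duplicate-thread check can be illustrated with cosine similarity over embedding vectors. This is a minimal sketch under stated assumptions: the vectors are assumed to come from some text-embedding model that is not shown here, the function names are hypothetical, and the 0.85 threshold is an arbitrary placeholder that would need tuning on real data.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar_threads(new_vec: Sequence[float],
                         existing: Dict[str, Sequence[float]],
                         threshold: float = 0.85) -> List[Tuple[str, float]]:
    """Return (thread_id, score) pairs at or above the threshold,
    best match first. `existing` maps thread ids to their embeddings."""
    scored = [(thread_id, cosine_similarity(new_vec, vec))
              for thread_id, vec in existing.items()]
    matches = [(tid, score) for tid, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

Unlike keyword search, this returns a short ranked list that can be capped at the top one or two matches, which is what makes it practical to show to a user mid-submission. At real scale a linear scan would give way to an approximate nearest-neighbour index, but the comparison itself stays the same.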

The best moderation is the kind users don't experience as moderation. A system that users experience as a helpful forum feature — familiar, conversational, genuinely useful — while running a complete proactive moderation function underneath is a better product than a surveillance layer that announces itself as a moderation tool. LinkedIn understood something related when it added the AI post improvement button. Users who press it don't feel policed. They feel helped. The moderation function here is the same tool, running continuously, whether the user initiates it or not.


The series

This hub post is the overview. The argument goes deeper in three directions, each written for a different audience:

  • Your Moderation Team Is Solving the Wrong Problem — for community managers and moderation team leads
    Why the report queue is full of downstream effects, what the match actually looks like before it becomes a fire, and what your team gets to do when the system handles volume so they can handle complexity
  • The Intervention Point Is in the Wrong Place — for developers and platform builders
    The full technical architecture — contribution assessment, the two-pass model, stage logging without categories, cost estimates at scale, and the open problems worth solving
  • You're Going to Build an AI Layer Anyway. Build the Useful Version. — for platform decision makers
    The cost of inaction, the honest objections and why they're less compelling than they appear, the regulatory trajectory, and why this is the answer to the question of what community platforms are for in an AI world

The question worth asking

Instagram proved the pre-submission nudge works in 2019. Reddit proved LLM-based post assessment works at scale. Steam proved proactive pre-publication checking works. The technology has been proven in pieces, on some of the world's largest platforms, for years.

Nobody has connected the pieces into a system that catches the instigating post, not just the reactions it generates. Nobody has built the second pass that watches what live posts produce and uses that as a signal about the post itself. Nobody has designed it to feel like a useful forum feature rather than a surveillance layer.

The question isn't whether this is technically feasible. It demonstrably is. The question is why the platforms with the resources to build it haven't — and whether the ones reading this are going to be the ones that do.


Thoughts? Pushback? Examples from your own experience with moderation that works — or doesn't? The comments are open. And yes, I'll be reading them to see whether the replies actually address what was said.
