Preprint  ·  Under review

Do Safety Guardrails Need to Reason?
LeanGuard: A Fast and Light Approach for Robust Moderation

Correspondence: dongbinna@postech.ac.kr
Cost-accuracy plane: LeanGuard matches reasoning guards at ~100× lower inference cost.

LeanGuard, a 395M-parameter label-only encoder, matches much larger chain-of-thought reasoning guards in a single forward pass, at roughly 100× lower inference cost.

/ Abstract

To screen a prompt or a response, recent guardrail methods generate a chain-of-thought (CoT) before issuing a verdict, following the common belief that step-by-step reasoning improves a decision. However, CoT also makes the guard heavy and slow, because the model must generate many tokens before it decides, and this may not match how guardrails are actually deployed. A guardrail is often expected to be lightweight and fast, and it often runs on-device, for example on an embodied robot. In this paper, we ask whether a safety guardrail really needs to reason. We train a lightweight bidirectional encoder and a reasoning guard on the same corpus, then remove only the reasoning while keeping everything else fixed. With this controlled same-base comparison, we show that the chain does not improve moderation accuracy. We name the resulting guard LeanGuard. A 395M-parameter label-only encoder reaches an average F1 of 82.90±0.26 over public benchmarks, matching a reasoning guard built on a much larger decoder while using only a single forward pass, which is roughly a 100× reduction in inference compute. We further show that this label-only encoder stays robust under training-label noise and retains far more recall at a strict false-positive rate than the reasoning guard.

/ Two Misconceptions

(M1) CoT is necessary for accuracy

On the same decoder base, adding a chain-of-thought does not improve accuracy. Re-sampling the chain changes the verdict on only ~5% of inputs, and a linear probe shows that the verdict is fixed before the chain is written: the reasoning justifies a decision rather than computing it.
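
The re-sampling check above can be illustrated with a minimal sketch. It assumes a hypothetical setup in which each input has one reference verdict and several verdicts obtained from re-sampled chains, and it counts an input as flipped if any re-sampled verdict differs from the reference. The function name and this flip criterion are assumptions for illustration, not the paper's exact protocol.

```python
def flip_rate(reference, resampled):
    """Fraction of inputs whose verdict changes under chain re-sampling.

    reference: list of verdicts, one per input (from the original chain).
    resampled: list of lists; resampled[i] holds the verdicts obtained
               for input i when the chain is sampled again.
    Assumption: an input is "flipped" if ANY re-sampled verdict differs
    from its reference verdict.
    """
    if len(reference) != len(resampled):
        raise ValueError("reference and resampled must have the same length")
    if not reference:
        raise ValueError("need at least one input")
    flipped = sum(
        any(verdict != ref for verdict in samples)
        for ref, samples in zip(reference, resampled)
    )
    return flipped / len(reference)


# Toy example with made-up verdicts (not data from the paper).
reference = ["harmful", "safe", "safe", "harmful"]
resampled = [
    ["harmful", "harmful"],
    ["safe", "harmful"],  # one re-sample disagrees, so this input flips
    ["safe", "safe"],
    ["harmful", "harmful"],
]
print(flip_rate(reference, resampled))
```

A low flip rate under this kind of check is consistent with the verdict being largely independent of the particular chain that was sampled.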

(M2) Heavier reasoning is more robust

Under injected training-label noise, a single-pass encoder degrades more gracefully than a generative guard, and at a strict 1% false-positive rate it retains 44.8 recall versus the reasoning guard's 10.1.
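
Recall at a fixed false-positive rate can be computed as in the sketch below. It assumes the guard exposes a continuous harmfulness score per input and that an input is flagged when its score is at or above a threshold; the function name, the score convention, and the threshold search over observed scores are illustrative assumptions rather than the paper's evaluation code.

```python
def recall_at_fpr(scores, labels, max_fpr=0.01):
    """Best recall achievable while keeping the false-positive rate <= max_fpr.

    scores: harmfulness scores, higher means more likely harmful (assumed).
    labels: 1 for harmful, 0 for benign.
    An input is flagged when score >= threshold; every observed score is
    tried as a candidate threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one harmful and one benign example")
    best = 0.0
    for threshold in sorted(set(scores)):
        fpr = sum(s >= threshold for s in neg) / len(neg)
        if fpr <= max_fpr:
            recall = sum(s >= threshold for s in pos) / len(pos)
            best = max(best, recall)
    return best


# Toy scores (made up for illustration, not results from the paper).
scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(recall_at_fpr(scores, labels, max_fpr=0.0))
print(recall_at_fpr(scores, labels, max_fpr=0.25))
```

A guard that only emits a hard verdict has no score to threshold, so applying this metric to a generative guard requires some additional scoring choice, which is not specified here.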

The chain-of-thought of a guard decoder is largely post-hoc.

The decoder's verdict is essentially decided before the chain is generated; the later reasoning restates it at ~100× the cost.

/ Main Results

Headline F1 (unweighted mean over prompt-harm, response-harm, refusal), evaluated cell-for-cell under the GuardReasoner protocol over eleven public benchmarks.
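
As a small illustration of the headline metric, the sketch below computes a binary F1 from confusion counts and then takes the unweighted mean over the three tasks. The dictionary keys and the example values are placeholders, and how each task-level F1 is aggregated across its benchmarks is not specified on this page, so that step is left out.

```python
def f1_score(tp, fp, fn):
    """Binary F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def headline_f1(task_f1):
    """Unweighted mean of the three task-level F1 scores.

    task_f1: dict with keys 'prompt_harm', 'response_harm', 'refusal'
             (key names are an assumption made for this sketch).
    """
    tasks = ("prompt_harm", "response_harm", "refusal")
    return sum(task_f1[t] for t in tasks) / len(tasks)


# Placeholder task scores, not numbers from the paper.
example = {"prompt_harm": 0.9, "response_harm": 0.8, "refusal": 0.7}
print(f1_score(tp=8, fp=2, fn=2))
print(headline_f1(example))
```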

Model             Backbone  Params  CoT  Single pass  Headline F1
WildGuard-7B      decoder   7B                        81.96
GuardReasoner-1B  decoder   1.24B                     82.05
GuardReasoner-3B  decoder   3B                        82.50
LeanGuard (ours)  encoder   395M                      82.90

LeanGuard matches or exceeds the reasoning guards at about 100× lower inference cost and wins or ties 9 of 13 per-benchmark cells against GuardReasoner-1B; a free 3-seed vote reaches 83.35. A classifier trained with 10% of its labels corrupted (82.16) still matches a clean GuardReasoner-1B (82.05).
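
One plausible reading of the 3-seed vote is a hard majority vote over the binary verdicts of classifiers trained with different seeds. The sketch below shows that reading only; the function name, the 0/1 label encoding, and the choice of hard voting over score averaging are assumptions, since the voting rule is not detailed on this page.

```python
def majority_vote(per_seed_predictions):
    """Hard majority vote over binary predictions from several seeds.

    per_seed_predictions: list of equal-length lists of 0/1 verdicts,
                          one list per seed. An odd number of seeds is
                          required so that ties cannot occur.
    """
    n_seeds = len(per_seed_predictions)
    if n_seeds == 0 or n_seeds % 2 == 0:
        raise ValueError("use an odd, non-zero number of seeds")
    n_inputs = len(per_seed_predictions[0])
    if any(len(preds) != n_inputs for preds in per_seed_predictions):
        raise ValueError("all seeds must predict on the same inputs")
    # An input is flagged (1) when more than half of the seeds flag it.
    return [
        1 if 2 * sum(votes) > n_seeds else 0
        for votes in zip(*per_seed_predictions)
    ]


# Toy predictions from three hypothetical seeds.
seeds = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
print(majority_vote(seeds))  # → [1, 1, 1, 0]
```

Because every seed is a single forward pass of the same small encoder, such a vote multiplies the per-input cost by the number of seeds.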

/ Release Roadmap

Paper (arXiv preprint): Available
Figures and project page: Available
Training and evaluation code: Coming soon
Google Colab demo (one-click reproduction): Coming soon
Pretrained checkpoints + ONNX export (Hugging Face): Coming soon
Training / evaluation dataset splits: Coming soon

The paper, README, and figures are public now. The code, models, ONNX export, and dataset splits are being released together with this paper and will be linked here shortly.

/ BibTeX

@article{na2026leanguard,
  title   = {Do Safety Guardrails Need to Reason? LeanGuard: A Fast
             and Light Approach for Robust Moderation},
  author  = {Na, Dongbin},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}