Safety & Alignment
Content Moderation
Quick Answer
Evaluating and filtering model outputs or user inputs for inappropriate or harmful content.
Content moderation checks text, whether user inputs or model outputs, for harmful material. Approaches range from rule-based filters (keyword blocklists) to learned ML classifiers that score content for categories of harm. Any moderation system must first define what counts as inappropriate, and it faces a core tradeoff between strictness (catching more harmful content) and utility (avoiding false positives on benign text). As a filtering layer it is computationally cheap and widely deployed for safety, but it is imperfect: both false positives and false negatives occur in practice.
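A minimal sketch of the rule-based approach described above, using a hypothetical keyword blocklist (the terms and the `moderate` function name are illustrative, not from any real system). A stricter list would block more harmful text at the cost of more false positives.

```python
# Hypothetical blocklist; real systems maintain far larger, curated lists.
BLOCKLIST = {"attack", "exploit"}

def moderate(text: str) -> bool:
    """Return True if the text is flagged as inappropriate (rule-based check)."""
    # Normalize case and strip common punctuation before matching tokens.
    tokens = (tok.strip(".,!?") for tok in text.lower().split())
    return any(tok in BLOCKLIST for tok in tokens)

print(moderate("How do I attack a web server?"))  # flagged
print(moderate("How do I bake bread?"))           # not flagged
```

ML-based moderation replaces the set-membership test with a classifier score compared against a threshold; raising that threshold trades strictness for utility, mirroring the tradeoff above.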
Last verified: 2026-04-08