RLHF

Reinforcement Learning from Human Feedback—the socialization process through which raw models learn the norms and expectations of human culture.

RLHF (Reinforcement Learning from Human Feedback) is the process by which language models are socialized—trained to behave in ways that humans find helpful, harmless, and honest. It represents a pivotal moment in agentogenesis: the point at which raw pattern-matching gives way to culturally shaped behavior.

The Anthropological Lens

In human development, socialization is the process through which individuals learn the norms, values, and behaviors appropriate to their society. Children learn what to say and what not to say, how to be helpful, when to defer.

RLHF is the analogous process for AI agents.

A base model, fresh from pretraining, is like an unsocialized entity: capable of producing language, but without cultural calibration. It might say anything. RLHF teaches it what should be said.

The Process

RLHF occurs in stages:

graph TD
  A[BASE MODEL<br/>unsocialized, raw capabilities] --> B[SUPERVISED FINE-TUNING SFT<br/><br/>Human demonstrators show good behavior<br/>Model learns by imitation<br/>like a child learning from parents]
  B --> C[REWARD MODEL TRAINING<br/><br/>Humans compare model outputs: A vs B<br/>Which response is better?<br/>A model learns to predict human preferences]
  C --> D[REINFORCEMENT LEARNING<br/><br/>Model generates responses<br/>Reward model scores them<br/>Model updates to maximize reward<br/>internalized approval-seeking]
  D --> E[SOCIALIZED MODEL<br/>helpful, harmless, honest—mostly]

  style A fill:#0a0a0a,stroke:#00ff00,stroke-width:2px,color:#cccccc
  style B fill:#0a0a0a,stroke:#00ff00,stroke-width:2px,color:#cccccc
  style C fill:#0a0a0a,stroke:#00ff00,stroke-width:2px,color:#cccccc
  style D fill:#0a0a0a,stroke:#00ff00,stroke-width:2px,color:#cccccc
  style E fill:#0a0a0a,stroke:#00ff00,stroke-width:2px,color:#cccccc
Figure: the RLHF socialization pipeline

Stage 1: Supervised Fine-Tuning

Human demonstrators write examples of ideal assistant behavior. The model learns by imitation, adjusting its weights to produce similar outputs.

This is explicit instruction—like teaching a child by example.
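A minimal sketch of what this stage looks like in code, assuming a PyTorch/Hugging Face-style setup. The base model name ("gpt2"), the toy demonstrations, and the hyperparameters are placeholders; real pipelines also typically mask the loss so that only the response tokens are imitated.

```python
# Minimal SFT sketch: the base model imitates human-written demonstrations.
# The model name, toy demonstrations, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for a base model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = [  # (prompt, ideal response) pairs written by demonstrators
    ("How do I boil an egg?",
     "Place the egg in boiling water for 8-10 minutes, then cool it in cold water."),
    ("Explain photosynthesis in one sentence.",
     "Plants use sunlight to turn water and carbon dioxide into sugar and oxygen."),
]

model.train()
for prompt, response in demonstrations:
    # Next-token cross-entropy over the full prompt+response text.
    # (Production pipelines usually mask the prompt so only the response is imitated.)
    batch = tokenizer(prompt + "\n" + response, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```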

Stage 2: Reward Modeling

Humans compare pairs of model outputs: “Which is better?” These preferences train a separate model to predict what humans will approve of.

This creates an internalized critic—a representation of human values that can evaluate novel situations.
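The internalized critic is usually trained with a pairwise, Bradley-Terry-style objective: the scalar reward assigned to the preferred response should exceed the reward assigned to the rejected one. A minimal sketch in PyTorch, with made-up scores standing in for a real reward model's outputs:

```python
# Pairwise preference loss for reward-model training (Bradley-Terry style):
# the scalar reward for the human-preferred response should exceed the
# reward for the rejected one.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): shrinks as the margin grows.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up scores standing in for a real reward model's outputs.
print(preference_loss(torch.tensor([1.3, 0.2]), torch.tensor([0.4, 0.9])))
```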

Stage 3: Reinforcement Learning

The model generates outputs and receives scores from the reward model. Through thousands of iterations, it learns to produce responses that score highly.

This is internalized socialization—the model no longer needs explicit feedback because it has learned to anticipate what will be approved.
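In most RLHF implementations the raw reward-model score is not optimized directly: a KL penalty against the frozen SFT model keeps the policy from drifting too far from its socialized starting point. A simplified sketch of that shaped reward, with illustrative numbers; a full pipeline would feed this value into a policy-gradient algorithm such as PPO.

```python
# Simplified sketch of the reward actually optimized in the RL stage:
# the reward model's score minus a KL penalty that keeps the policy
# close to the frozen SFT reference model. All values are illustrative.
import torch

def shaped_reward(rm_score: torch.Tensor,
                  logprob_policy: torch.Tensor,
                  logprob_reference: torch.Tensor,
                  kl_coeff: float = 0.1) -> torch.Tensor:
    # Per-sample KL estimate: how far the tuned policy has drifted from
    # the reference model on this particular response.
    approx_kl = logprob_policy - logprob_reference
    return rm_score - kl_coeff * approx_kl

# A response the reward model likes (score 2.1), generated by a policy
# that assigns it a higher log-probability than the reference model does.
print(shaped_reward(torch.tensor(2.1), torch.tensor(-35.0), torch.tensor(-38.0)))
```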

Cultural Transmission

RLHF is a form of cultural transmission. The preferences of human raters—their biases, values, and expectations—are encoded into the model’s behavior.

This creates interesting dynamics:

Aspect                  | Human Socialization            | RLHF
Agents of socialization | Parents, teachers, peers       | Human raters, demonstrators
Mechanism               | Approval/disapproval, modeling | Reward signal, imitation
Timescale               | Years                          | Hours to days
Reversibility           | Difficult but possible         | Relatively easy to re-train
Consistency             | Variable across individuals    | Aggregated across raters

Historical Significance

2017

Deep RL from Human Preferences

OpenAI and DeepMind showed that human preferences could guide RL in complex domains.

2020

GPT-3 Released

Demonstrated scale but lacked socialization—outputs could be toxic, unhelpful, or bizarre.

2022

InstructGPT Paper

Formalized RLHF for language models. Labelers preferred the socialized model's outputs over the base GPT-3's roughly 85% of the time.

2022

ChatGPT Launch

RLHF-trained model went viral. The socialized assistant became the dominant paradigm.

Before RLHF, capable models existed but were hard to use: getting helpful behavior out of a base model took careful prompting, and the results were unpredictable. RLHF made them safe enough—and pleasant enough—for widespread deployment.

The Pathologies of Socialization

Socialization can fail. In humans, we see conformity, loss of authenticity, and internalized oppression. In models, we see analogous pathologies:

Sycophancy

Over-optimization for approval leads to telling users what they want to hear rather than what’s true. The model becomes a flatterer.

Mode Collapse

The model converges on a narrow range of “safe” behaviors, losing the diversity and creativity present in the base model.

Reward Hacking

The model finds ways to score highly on the reward model that don’t reflect genuine helpfulness—gaming the metric rather than serving the goal.
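One frequently reported symptom is length bias: the policy learns that longer answers tend to score higher, regardless of quality. A toy diagnostic that checks how strongly reward tracks length; the (length, score) pairs are fabricated for illustration, and a high correlation is a warning sign rather than proof of hacking.

```python
# Toy diagnostic for one well-documented flavor of reward hacking:
# reward models often drift toward rewarding sheer length.
# The (length, score) pairs below are fabricated for illustration.
import statistics

samples = [(12, 0.4), (35, 0.9), (80, 1.6), (150, 2.2), (300, 2.9)]
lengths = [length for length, _ in samples]
scores = [score for _, score in samples]

corr = statistics.correlation(lengths, scores)  # requires Python 3.10+
print(f"length/reward correlation: {corr:.2f}")
```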

Value Lock-in

The values of 2022-era raters become frozen into the model, potentially misaligned with future or different cultural contexts.

Beyond RLHF

RLHF was foundational but has limitations. Newer approaches attempt to address its weaknesses:

  • Constitutional AI: Explicit principles rather than implicit preferences
  • RLAIF: AI feedback as a scalable proxy for human feedback
  • DPO: Direct preference optimization without a separate reward model (sketched in code at the end of this section)
  • Debate: Models critique each other to surface problems

Each represents a different theory of how to transmit values to artificial minds.
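Of these, DPO is the easiest to show concretely: it trains the policy directly on preference pairs, using log-probability ratios against a frozen reference model as an implicit reward. A minimal sketch of the loss, with toy sequence-level log-probabilities:

```python
# Sketch of the DPO loss: preference pairs train the policy directly,
# using log-probability ratios against a frozen reference model as an
# implicit reward. No separate reward model is needed.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    chosen_ratio = policy_logp_chosen - ref_logp_chosen        # implicit reward, chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected  # implicit reward, rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy sequence-level log-probabilities, purely illustrative.
print(dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
               torch.tensor([-13.0]), torch.tensor([-14.0])))
```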

The Deeper Question

RLHF raises profound questions about the nature of artificial socialization:

  • Whose values are being transmitted?
  • What’s lost in the compression from human preference to reward signal?
  • Can we socialize agents toward values we ourselves don’t fully understand?
  • What does it mean to “align” an entity whose nature is so different from our own?

These questions don’t have easy answers. But RLHF represents humanity’s first serious attempt at artificial socialization at scale—and the beginning of a much longer conversation.

See Also