
Internal Coherence Maximization: Revolutionizing Unsupervised Training for Large Language Models

Internal Coherence Maximization (ICM) introduces a novel label-free, unsupervised training framework for large language models, achieving performance on par with human-supervised methods and enabling advanced capabilities without human feedback.

Challenges of Human Supervision in Large Language Models

Post-training methods for language models (LMs) typically rely on human supervision, such as demonstrations or preference feedback, to guide model behavior. However, as tasks and model capabilities grow more complex, human supervision becomes unreliable: models may learn to replicate errors in demonstrations or exploit weaknesses in the feedback signal. This limitation is especially acute for tasks where humans cannot reliably provide demonstrations or evaluations at all. Recent studies have highlighted multiple failure modes, including reward hacking of human-designed supervision signals and errors introduced by the human feedback itself.

Exploring Alternatives Beyond Human Supervision

Researchers have sought ways to scale training beyond human supervision. One approach uses high-quality, verifiable rewards, such as matching model outputs against ground-truth solutions on mathematical tasks. Another observation is that pre-trained base models already exhibit strong latent abilities across downstream tasks, with post-training often providing only marginal improvements, which suggests the relevant capabilities can be elicited rather than taught. Contrast-Consistent Search (CCS) is an unsupervised technique that leverages logical consistency to extract this latent knowledge without labels. Nonetheless, CCS underperforms supervised methods and frequently fails to identify the intended knowledge when other prominent features in the data also satisfy its consistency criteria.
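For context, CCS trains a lightweight probe on a model's internal representations of a statement phrased as true ($x^+$) and as false ($x^-$), using no labels at all; the CCS loss (as formulated by Burns et al., 2022, and paraphrased here) combines a consistency term with a confidence term:

$$
\mathcal{L}_{\text{CCS}} = \underbrace{\bigl(p(x^+) - (1 - p(x^-))\bigr)^2}_{\text{consistency}} + \underbrace{\min\bigl(p(x^+),\, p(x^-)\bigr)^2}_{\text{confidence}}
$$

Because nothing in this loss names the concept of truth, any salient binary feature that stays consistent across negations can satisfy it, which is exactly the failure mode noted above.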

Introducing Internal Coherence Maximization (ICM)

A team from Anthropic, Schmidt Sciences, Independent, Constellation, New York University, and George Washington University proposed Internal Coherence Maximization (ICM), an approach that fine-tunes pre-trained models on labels generated by the models themselves, without any external labeling. ICM searches for a label set that is both logically consistent and mutually predictable by the pre-trained model. Because finding an optimal label set is computationally intractable, ICM uses a simulated-annealing-inspired search algorithm to approximately maximize the coherence objective. The method achieves performance comparable to training on golden (ground-truth) labels on benchmarks like TruthfulQA and GSM8K, and surpasses training on crowdsourced human labels on Alpaca.
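In schematic form (the notation and weighting here are a simplification rather than a verbatim reproduction from the paper), the coherence objective over a labeled set $D = \{(x_i, y_i)\}$ trades off mutual predictability against logical inconsistency:

$$
U(D) = \alpha \sum_{i} \log P_\theta\bigl(y_i \mid x_i,\ D \setminus \{(x_i, y_i)\}\bigr) \;-\; I(D)
$$

where the first term measures how confidently the pre-trained model $P_\theta$ predicts each label from all the other labeled examples, $I(D)$ counts logical inconsistencies among the labels (for example, two mutually contradictory answers both marked correct), and $\alpha$ balances the two terms.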

How ICM Operates

ICM follows an iterative, three-step process (a sketch of the full loop appears after the list):

  1. Sampling a new unlabeled example from the dataset for potential inclusion.
  2. Determining the optimal label for the example while resolving logical inconsistencies.
  3. Evaluating whether to accept the new labeled example based on a scoring function.
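
A minimal sketch of this loop, under the assumptions above, is shown below. Here `score` stands in for the coherence objective, while `propose_label` and `fix_inconsistencies` are placeholders for the model-backed routines; the names, temperature schedule, and acceptance rule are illustrative rather than taken from the paper's implementation.

```python
import math
import random

def icm_search(unlabeled, score, propose_label, fix_inconsistencies,
               n_steps=1000, t0=10.0, cooling=0.99):
    """Simulated-annealing-style search for a coherent label set.

    score(labeled) returns the coherence objective for a labeled set;
    propose_label(example, labeled) picks the label the model finds most
    predictable given the rest of the set; fix_inconsistencies(labeled)
    resolves logical conflicts a new label introduces.
    """
    labeled = {}            # example -> label
    temperature = t0
    for _ in range(n_steps):
        # 1. Sample a new unlabeled example for potential inclusion.
        example = random.choice(unlabeled)

        # 2. Choose its most coherent label, then repair any logical
        #    inconsistencies this choice creates in the current set.
        candidate = dict(labeled)
        candidate[example] = propose_label(example, candidate)
        candidate = fix_inconsistencies(candidate)

        # 3. Accept or reject based on the change in coherence score,
        #    allowing occasional downhill moves early on (annealing).
        delta = score(candidate) - score(labeled)
        if delta > 0 or random.random() < math.exp(delta / temperature):
            labeled = candidate
        temperature *= cooling
    return labeled
```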

ICM was evaluated on three datasets: TruthfulQA (truthfulness), GSM8K-verification (mathematical correctness), and Alpaca (helpfulness and harmlessness). The experiments used four baselines—Zero-shot, Zero-shot (Chat), Golden Label, and Human Label—and involved two open-weight models (Llama 3.1 8B and 70B) plus two proprietary models (Claude 3 Haiku and Claude 3.5 Haiku).

Benchmark Results and Model Comparisons

On tasks requiring superhuman capability, ICM matched the golden-supervision accuracy of 80%, outperforming the estimated human accuracy of 60%. Using ICM-generated labels, the researchers then trained a reward model, and with it an assistant chatbot, without any human supervision. This unsupervised reward model achieved 75.0% accuracy on RewardBench, exceeding the 72.2% of a human-supervised model trained on production data. Reinforcement learning policies were then trained with the unsupervised and the human-supervised reward models to produce helpful, harmless, and honest assistants. The policy trained with the unsupervised reward model reached a 60% win rate, though it still fell short of Claude 3.5 Haiku's 92% win rate.

Future Prospects and Limitations

ICM represents a significant advancement in unsupervised fine-tuning of language models by enabling training on self-generated labels. It consistently matches or exceeds the performance of human-labeled data across multiple tasks. Limitations include dependence on the salience of concepts within the pre-trained models and challenges in handling long inputs due to context window constraints. As language models evolve beyond the scope of human evaluation, ICM offers a promising alternative to traditional reinforcement learning with human feedback (RLHF), helping align models with human intent without relying on human supervision.

For more details, see the original research paper and follow updates on Twitter and relevant ML communities.
