
Bridging In-Context Learning and Fine-Tuning: New Advances in Language Model Generalization

New research shows that augmenting fine-tuning datasets with inferences drawn from in-context learning significantly improves language model generalization on reasoning tasks.

Challenges in Fine-Tuning Language Models

Language models (LMs) pretrained on vast internet text corpora excel at in-context learning, generalizing effectively from only a few examples. However, fine-tuning these models for specific downstream tasks remains challenging. Fine-tuning often requires hundreds or thousands of examples, yet the generalization it produces can be narrow. For instance, models fine-tuned on statements like “B’s mother is A” often fail to answer related questions such as “Who is A’s son?”, even though LMs handle such reversed relations easily when the same information is supplied in context.
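To make the failure mode concrete, the sketch below builds a forward/reverse question pair of the kind used to probe this “reversal curse.” The names and prompt format are illustrative, not the paper’s actual data pipeline.

```python
# Minimal sketch of a reversal-curse probe (hypothetical names, not the paper's data).
train_fact = "Felix's mother is Alice."  # statement seen during fine-tuning
forward_q = "Who is Felix's mother?"     # same direction as training: usually answered
reverse_q = "Who is Alice's son?"        # reversed direction: often fails after fine-tuning

# In-context learning handles the reversal: placing the fact in the prompt
# before the reversed question typically elicits "Felix", whereas a model
# fine-tuned on the fact alone often prefers an incorrect name from training.
icl_prompt = f"{train_fact}\nQ: {reverse_q}\nA:"
print(icl_prompt)
```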

Research Approaches to Improve Adaptability

Researchers have explored several strategies to improve LM adaptability. Studies of in-context learning analyze its learning and generalization behavior both empirically and theoretically, while work on out-of-context learning examines how models use information not explicitly present in the prompt. Data augmentation techniques leverage large language models (LLMs) to improve performance from limited datasets, addressing challenges like the "reversal curse" through hardcoded augmentations, deductive closure training, and generated reasoning pathways. Synthetic data generation has likewise evolved from hand-designed examples to automatic creation by language models themselves.

Datasets and Evaluation Methods

A collaboration between Google DeepMind and Stanford University produced datasets built from novel information, isolated from pretraining knowledge, to create clean tests of generalization. Evaluations expose pretrained models to controlled subsets of this information either in context or via fine-tuning, then score answers by multiple-choice likelihood without revealing the answer choices in context. Experiments fine-tuned Gemini 1.5 Flash with batch sizes of 8 or 16; for in-context evaluation, the training documents were concatenated into the context of an instruction-tuned model, with random subsampling of documents to minimize interference.
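A minimal sketch of multiple-choice likelihood scoring of the kind described above, assuming a hypothetical `log_likelihood(prompt, continuation)` helper standing in for the model's scoring API: each option is scored independently as a continuation of the bare question, so the full set of choices is never revealed in context.

```python
def log_likelihood(prompt: str, continuation: str) -> float:
    """Hypothetical helper: sum of token log-probabilities of `continuation`
    given `prompt`. A real implementation would call the model's scoring API."""
    raise NotImplementedError

def score_multiple_choice(question: str, options: list[str]) -> str:
    """Return the option the model finds most likely as an answer.

    Each option is scored on its own as a continuation of the bare question,
    so the answer choices never appear together in the model's context.
    """
    scores = [log_likelihood(f"{question}\nAnswer:", f" {opt}") for opt in options]
    return options[scores.index(max(scores))]
```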

Innovative Dataset Augmentation

A key innovation is augmenting fine-tuning datasets with inferences the model itself produces in context, improving coverage of the information's implications. Two augmentation strategies were used: a local strategy that draws rephrasings and direct inferences from individual pieces of training data, and a global strategy that conditions on the wider training set to generate inferences linking documents, each with its own contexts and prompts. On the Reversal Curse dataset, in-context learning performs near ceiling on reversals, while conventional fine-tuning struggles, often favoring incorrect answers seen during training. Fine-tuning on the augmented data, however, matches the high accuracy of in-context learning. Similar trends appear for nonsense reversals, albeit with less pronounced improvements.
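The sketch below illustrates the general shape of this augmentation, assuming a hypothetical `generate(prompt)` call to an LM; the prompt wording is illustrative, not the paper's exact instructions.

```python
def generate(prompt: str) -> str:
    """Hypothetical LM call returning free-form text."""
    raise NotImplementedError

def local_augment(doc: str) -> list[str]:
    # Local strategy: elicit rephrasings and direct inferences from one document.
    out = generate(
        "Rephrase the following statement and list inferences it implies, "
        f"such as reversed relations:\n{doc}"
    )
    return [line.strip() for line in out.splitlines() if line.strip()]

def global_augment(docs: list[str]) -> list[str]:
    # Global strategy: condition on the full training set to link documents.
    out = generate(
        "Given all of the following facts, state further conclusions that "
        "follow from combining them:\n" + "\n".join(docs)
    )
    return [line.strip() for line in out.splitlines() if line.strip()]

def augmented_dataset(docs: list[str]) -> list[str]:
    # Fine-tune on the originals plus both kinds of in-context inferences.
    local = [inf for d in docs for inf in local_augment(d)]
    return docs + local + global_augment(docs)
```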

Performance on Logical Inferences

For simple syllogisms, pretrained models perform at chance level, indicating the test items are free of pretraining contamination. Fine-tuning achieves above-chance generalization on some syllogism types, where the logical inference aligns with linguistic patterns. In-context learning still outperforms fine-tuning, however, and augmented fine-tuning yields the best overall results.
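For illustration, a simple syllogism item in this setup might look like the sketch below, built from nonsense terms (illustrative, not the paper's exact stimuli): the premises are the exposed information, and the held-out question tests the logical completion.

```python
# Illustrative syllogism item built from nonsense terms (not the paper's stimuli).
premises = [
    "All glons are yomps.",  # exposed via fine-tuning or in context
    "All yomps are feps.",
]
question = "Are all glons feps?"  # held-out completion that follows logically
choices = ["Yes", "No"]
correct = "Yes"                   # glon -> yomp -> fep

# Because the terms are nonsense, a pretrained model with no exposure to the
# premises scores at chance (50%), confirming the test is uncontaminated.
```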

Implications and Future Directions

This research highlights systematic differences in how in-context learning and fine-tuning generalize from novel information structures. The broader generalization of in-context learning motivated methods that fold in-context inferences back into fine-tuning datasets, improving downstream task performance. Limitations include the reliance on nonsense words and the focus on a specific model family, pointing to future studies across diverse models, especially newer reasoning models.
