Beyond Memorization: How Reinforcement Learning Leads to Generalizable AI
“I cannot teach anybody anything. I can only make them think.”
- Socrates
When learning, there’s a crucial difference between truly understanding a concept and simply memorizing examples, like the difference between a student who genuinely grasps the underlying principles of mathematics and one who has simply memorized worked solutions. In machine learning, there are two analogous approaches to training AI: Supervised Fine-Tuning (SFT), where models learn from correct examples, and Reinforcement Learning (RL), where models learn through their own trial and error with feedback. The choice between these approaches isn’t just theoretical; it directly impacts how well AI systems can handle situations they’ve never encountered before. An AI system trained to navigate in San Francisco needs to apply its knowledge when operating in Oakland, just as a customer service AI must adapt its understanding of product policies to handle novel customer queries. Recent research by Chu et al. (go Bears!) in the paper SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training provides compelling evidence that the training method significantly influences whether an AI system develops genuine understanding that can generalize to new situations, or simply becomes good at memorizing its training data. Generalization is the key to real intelligence.
The Research Question
One of the biggest questions we AI researchers face is whether these systems are truly learning generalizable principles or just becoming particularly good at memorizing their training data. This distinction isn’t merely theoretical; it matters for deploying AI in real-world situations where conditions often differ from training scenarios. The difficulty lies in telling the two apart: both genuine learning and memorization can lead to reliable performance on standard tests. Just as a student might achieve a perfect score on a test either through deep understanding or through rote memorization, an AI system’s impressive performance on familiar tasks doesn’t necessarily mean it can handle novel situations.
To tell the two apart, the researchers devised an elegant comparative study using two different training approaches and two carefully crafted test environments. The first environment, called GeneralPoints, is a card game in which the AI must compute a target number from card values, similar to the game “24” but with varying rules. The second environment, V-IRL, tests navigation abilities in real-world settings. By examining how systems trained with either Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL) perform when faced with variations in these environments, such as different face-card values or new cities to navigate, the researchers could directly measure each method’s ability to generalize. These environments were chosen because they allow clear, measurable variations while preserving the core challenge, making it possible to distinguish true learning from mere memorization.
The Card Game Test (GeneralPoints)
The GeneralPoints environment presents AI systems with a simple challenge: given four playing cards, find a way to combine their values using basic arithmetic operations to reach a target number (typically 24). What makes this task particularly interesting is how it can be varied to test true understanding of mathematics. In the basic version, face cards (Jack, Queen, King) all count as 10, but in variations, they might count as 11, 12, and 13. This subtle change forces the AI to demonstrate whether it has learned the underlying principles of arithmetic problem-solving or has just memorized solutions for specific card combinations. The visual version of the task adds another layer of complexity by requiring the AI to first recognize the cards from images before solving the arithmetic problem.
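To make the task and its rule variation concrete, here is a minimal, self-contained sketch of the arithmetic problem itself (this is not the paper’s code, and it plays the role of the environment rather than of the model being tested): a brute-force solver that searches for an expression hitting the target, with the face-card mapping passed in as a parameter. The example hand, the rule dictionaries, and the single left-to-right bracketing are illustrative simplifications.

```python
# Illustrative sketch of the GeneralPoints arithmetic task (not the paper's code).
from itertools import permutations, product

def solve_points(cards, face_values, target=24):
    """Brute-force one left-to-right bracketing; return a solving expression or None."""
    values = [face_values.get(card, card) for card in cards]
    for a, b, c, d in permutations(values):
        for o1, o2, o3 in product("+-*/", repeat=3):
            expr = f"((({a} {o1} {b}) {o2} {c}) {o3} {d})"
            try:
                if abs(eval(expr) - target) < 1e-6:
                    return expr
            except ZeroDivisionError:
                continue
    return None

BASE_RULE = {"J": 10, "Q": 10, "K": 10}      # basic version: all face cards count as 10
VARIANT_RULE = {"J": 11, "Q": 12, "K": 13}   # variation used to probe generalization

hand = [5, "K", "Q", 2]
print(solve_points(hand, BASE_RULE))      # a solution valid under the base rule
print(solve_points(hand, VARIANT_RULE))   # the same hand may need a different solution, or have none
```

Even this toy makes the point of the variation visible: an expression memorized under the base rule generally stops working once the face-card values change, so only a system that has internalized the arithmetic can keep solving the task.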
The results from this test environment were impressive. When faced with new rule variations — changing the “rules of the game” by altering face card values — systems trained with RL showed adaptability by maintaining their strong performance even under the new conditions. However, systems trained with SFT showed significant performance drops when faced with these variations, suggesting they had memorized solutions rather than learning the underlying arithmetic principles. This result mirrors real-world business scenarios where decision-making systems must adapt to changing conditions — for instance, a financial trading algorithm needs to maintain effectiveness when market rules or conditions change, or a resource allocation system must adapt when new constraints are introduced. The superior generalization of RL-trained systems suggests they might be better suited for applications where adaptability to changing conditions is necessary.
The Navigation Test (V-IRL)
The V-IRL environment presents a more complex real-world challenge that tests AI systems’ ability to navigate through city streets while following natural language instructions. The AI receives both visual input (street-level images showing landmarks) and textual directions (like “turn left at the Ethiopian restaurant, then continue until you see Wood Tavern”). The task combines multiple types of reasoning: the AI must recognize visual landmarks, understand spatial relationships, and follow sequential instructions. The researchers could test generalization by training the AI in one city (New York) and testing it in completely different cities worldwide, with different architectural styles, landmark types, and spatial layouts.
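As a rough sketch of what each decision step involves, the snippet below pairs a multimodal observation (a street-level image plus the current instruction) with a small discrete action space, and shows how generalization can be measured by evaluating the same agent in its training city versus held-out cities. The field names, action set, and agent/environment interfaces are hypothetical stand-ins, not V-IRL’s actual API.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the navigation task's structure, not V-IRL's real interface.
@dataclass
class Observation:
    street_view: bytes   # street-level image showing nearby landmarks
    instruction: str     # e.g. "turn left at the Ethiopian restaurant"

ACTIONS = ("forward", "turn_left", "turn_right", "stop")

def run_episode(agent, env, max_steps=50):
    """Generic navigation loop: the agent maps each observation to one of ACTIONS."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)          # landmark recognition + instruction following
        obs, arrived = env.step(action)
        if arrived:
            return True                  # reached the described destination
    return False

def success_rate(agent, make_env, city, episodes=20):
    """Evaluate in one city; comparing the training city (e.g. New York) with
    held-out cities separates generalization from memorization."""
    wins = sum(run_episode(agent, make_env(city)) for _ in range(episodes))
    return wins / episodes
```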
The results from V-IRL were even more impressive to me than the card game test. RL-trained systems were adaptable, successfully navigating new cities they had never seen during training, with performance improving from 16.7% to 77.8%, significantly outperforming previous state-of-the-art systems. SFT-trained systems, however, struggled when moved to new environments, with their performance dropping to near-random levels. This adaptability gives RL an important advantage for real-world applications like autonomous vehicles or delivery robots, which must operate in constantly changing environments. Just as we can apply driving knowledge when visiting a new city, RL-trained systems appear to develop a more fundamental understanding of navigation principles that transfers across different environments. RL might be a necessary step for developing autonomous systems that can reliably operate in the diverse, unpredictable conditions of the real world.
Key Findings and Their Significance
One of my favorite findings from this research was the consistent pattern of RL-trained systems demonstrating superior generalization across both text-based and visual tasks. In both test environments, RL-trained systems continued their strong performance when faced with new variations, while SFT-trained systems showed significant performance drops. Perhaps most surprisingly, the researchers discovered that RL training actually improved the AI’s fundamental visual recognition capabilities — something previously thought to be primarily influenced by the initial training data and architecture. This improvement in visual processing wasn’t just a side effect — it appeared to be a key factor in the system’s ability to generalize, suggesting that RL helps AI systems develop more robust and flexible ways of processing sensory information.
Despite RL’s clear advantages in generalization, the research also revealed an interesting nuance: SFT still plays an important role in the development of effective AI systems. The researchers found that attempting to train systems using RL alone, without first using SFT, generally failed to produce good results. SFT appears to serve a necessary function in teaching the AI systems basic “format” (how to structure their outputs and follow instructions), creating a foundation that RL can then build upon. The research also highlighted the importance of verification steps during training, where the system checks and potentially corrects its work. Systems that were allowed to take more verification steps showed better generalization, suggesting that self-correction is important for developing robust understanding. This result mirrors human learning, where the ability to check and correct one’s work often leads to deeper understanding.
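The sketch below isolates the verification idea: a toy rollout in which the model may revise its answer a bounded number of times, and only a verified solution earns reward. The rollout, verify, and toy_model names are hypothetical stand-ins for illustration, not the paper’s implementation.

```python
import random

def verify(answer, target):
    """Stand-in outcome checker: does the proposed answer hit the target?"""
    return answer == target

def rollout(model, prompt, target, max_verification_steps=3):
    """Let the model propose, check, and revise; reward only a verified solution."""
    feedback, attempts = None, []
    for step in range(max_verification_steps):
        answer = model(prompt, feedback)     # propose (or revise) a solution
        attempts.append(answer)
        if verify(answer, target):
            return 1.0, attempts             # positive reward on success
        feedback = f"attempt {step + 1} gave {answer}, which does not equal {target}"
    return 0.0, attempts                     # revision budget exhausted: zero reward

def toy_model(prompt, feedback):
    """Toy stand-in policy that just guesses a number near the target."""
    return random.randint(20, 28)

print(rollout(toy_model, "reach 24 from the cards [3, K, 6, 8]", target=24))
```

The intuition matches the finding described above: a larger revision budget gives the system more chances to learn from its own corrections, which appears to translate into better generalization at test time.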
Practical Implications
This research will directly affect how we approach AI development moving forward. While the current trend in AI often emphasizes collecting larger datasets for supervised training, this research suggests that incorporating more reinforcement learning might be necessary for developing AI systems that can reliably operate in real-world conditions. Applications where the AI needs to adapt to new situations — which describes most real-world use cases — will be impacted the most. The research also suggests a two-phase approach is optimal: using SFT to establish basic competencies and output formats, followed by RL to develop more robust and generalizable capabilities beyond memorization. This hybrid approach could become the new standard for developing AI systems that need to operate reliably in unpredictable environments.
The practical benefits of this approach could reach many industries. For autonomous vehicles, it points toward systems that can handle new cities and unusual traffic situations more reliably. In robotics, it could lead to robots that adapt their learned skills to new environments and tasks rather than needing to be retrained for every variation. For everyday AI interactions, like virtual assistants or customer service systems, this approach could result in AI that better manages novel queries or unusual situations rather than just repeating memorized responses or failing outright when a request falls outside what it has seen. The visual recognition improvement through RL is particularly relevant for applications involving computer vision, where RL could help develop more robust visual AI systems for everything from medical imaging to quality control in manufacturing. These improvements in generalization could lead to AI systems that feel more naturally intelligent and less brittle in their interactions with humans. Their intelligence will be truly generalizable.
Looking Forward
This research opens exciting new directions for AI development while also raising important questions about how we should approach the training of future AI systems. RL’s superiority in developing generalizable capabilities suggests we may need to fundamentally rethink how we train large AI models. Rather than focusing on expanding dataset size or model scale, the field might benefit from increased attention to training methods that promote genuine understanding. DeepSeek-R1’s RL-centric training approach has already demonstrated how effective and efficient this can be. We are also shifting how we evaluate AI systems, moving beyond simple accuracy metrics on test sets to more sophisticated measures of generalization and adaptability. The findings suggest that RL might be necessary for developing more robust and flexible multimodal AI systems, ones that can effectively combine visual, textual, and other forms of information.
However, many open questions remain. While the research demonstrates RL’s advantages for generalization, we still don’t fully understand why RL produces these benefits or how best to balance SFT and RL, both of which appear to be needed during training. Questions about scalability and training efficiency also need to be addressed: RL training can be more computationally intensive and less stable than SFT. Future research needs to explore how these findings extend to other types of tasks and domains, and whether there might be ways to achieve similar generalization benefits through other training approaches. There’s also the broader question of how these insights might apply to the development of more advanced AI systems, particularly as we move toward artificial general intelligence (AGI). The complementary roles of SFT and RL discovered in this research might provide clues about how to develop AI systems that combine reliable task performance with genuine adaptability and understanding.
Conclusion
This research by Chu et al. demonstrates how different training approaches affect AI systems’ ability to learn and adapt. By showing that reinforcement learning leads to better generalization while supervised fine-tuning tends toward memorization, it provides insights for developing more capable and reliable AI systems. The finding that these methods work best in combination, with SFT providing a foundation that RL can build upon, offers a practical roadmap for future AI development. As we deploy AI in increasingly complex and unpredictable real-world situations, the ability to generalize learning beyond training examples becomes critically important. This research suggests that by combining training approaches and incorporating sufficient verification steps, we can develop AI systems that do not just memorize but truly understand: systems that can adapt to new situations, recognize novel patterns, and handle unexpected challenges. AI that generalizes can reliably serve everything from autonomous vehicles to medical diagnosis while maintaining the flexibility to adapt as circumstances change.