Thinking in Steps: What NY Times Connections Tells Us About AI’s Reasoning Evolution
“The test of a first-rate intelligence is the ability to hold two opposing ideas in mind at the same time and still retain the ability to function.”
- F. Scott Fitzgerald
AI models designed explicitly for complex reasoning tasks, such as the viral DeepSeek R1 and OpenAI’s o-series, are the latest trend. While large language models (LLMs) are quite capable across many kinds of tasks, this new wave has been designed to excel specifically at multi-step logical thinking. These reasoning-focused models don’t follow the traditional approach of training ever-larger general-purpose language models; instead, they are optimized to break down problems into discrete parts, consider multiple hypotheses, and arrive at solutions through structured thinking. Traditional benchmarks like multiple-choice question answering, summarization, or even mathematical problem-solving don’t effectively capture these models’ capabilities: they tend to reward memorization or single-step reasoning, or rely heavily on pattern matching rather than true logical deduction. We’re constantly looking for new evaluation methods that better demonstrate the unique strengths of these specialized reasoning systems.
(Note: with the release of Claude 3.7 Sonnet and its reasoning capabilities, we updated this post on 2/25/25.)
The Rise of Reasoning-Focused Models
The evolution from general-purpose language models to specialized reasoning systems is an important development in AI strategy. Rather than simply scaling up model size and training data to elicit emergent reasoning behaviors, these new systems are built with training and architectural innovations specifically designed to enhance logical thinking. Reasoning models use reinforcement learning, reflective reasoning mechanisms, mixture-of-experts (MoE) architectures, and additional inference-time computation, all of which can be beneficial when solving complex problems. These shifts move away from the traditional “predict the next token” approach toward systems that can actively evaluate and manipulate different logical pathways. The model talks through its reasoning out loud, creating a workspace of working and non-working possibilities.
The development of these reasoning-focused models has also been influenced by Chain-of-Thought (CoT) prompting techniques. While CoT was initially used to encourage better reasoning in traditional language models by asking them to show their work, it has become a fundamental principle shaping how these new models are designed and trained. The reasoning models are trained and optimized to break down problems into smaller components, maintain coherent lines of thought, and evaluate multiple possible solutions simultaneously — capabilities that were only simulated through careful prompting in earlier systems. These models don’t just generate plausible text, but instead actively engage in structured problem-solving processes, complete with the ability to backtrack when necessary and maintain consistency across multiple steps of reasoning. That sounds much more like how we reason through difficult problems.
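As a rough illustration of the difference, here is a minimal sketch of a direct prompt versus a CoT-style prompt; the wording is invented for illustration and isn’t drawn from any particular paper or model card.

```python
# A minimal sketch contrasting a direct prompt with a Chain-of-Thought
# prompt. The wording is illustrative, not from any specific system.

puzzle = "Group these 16 words into four categories of four: ..."

direct_prompt = puzzle + "\nGive only the four groups."

cot_prompt = (
    puzzle
    + "\nThink step by step: list candidate themes, test every word"
    + " against each theme, note which words fit more than one theme,"
    + " and revise your groupings before giving a final answer."
)
```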
NYT Connections as a Benchmark
How do we find out if these new models are better reasoners? The New York Times’ Connections game can be a surprisingly effective benchmark for testing AI’s reasoning capabilities. In this difficult daily puzzle, players are presented with 16 words and must organize them into four distinct groups of four, with each group connected by a common theme or relationship. Sounds easy, right? This game is particularly challenging because words often appear to belong to multiple categories, requiring players to consider the entire puzzle holistically rather than solving it piece by piece. The game’s simple premise masks a complex web of linguistic relationships, cultural knowledge, and logical deduction that must all work together to arrive at the correct solution.
Connections is valuable as an AI benchmark because of its unique combination of reasoning challenges. The puzzle requires simultaneous pattern recognition across multiple dimensions — words might be related by meaning, usage, sound, cultural context, or more abstract associations. Solving the puzzle requires both divergent thinking (generating multiple possible groupings) and convergent thinking (determining which combination of groupings is correct). Unlike many traditional AI benchmarks that can be solved through simple pattern matching or statistical correlation, Connections requires iterative hypothesis testing where each attempt at grouping provides feedback that informs subsequent attempts. Success requires a sophisticated interaction between linguistic understanding and categorical reasoning — the ability to not just recognize relationships between words, but to understand how these relationships might compete or complement each other within the broader context of the puzzle. Being able to correctly group all of the words demonstrates some higher-level reasoning about linguistic concepts.
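To make the game’s mechanics concrete, here is a toy Python model of a puzzle and its guess feedback. The class and the example groups are invented for illustration; the real game also limits players to four mistakes.

```python
# A toy model of Connections: four hidden groups of four words each.
# Feedback mirrors the real game: a guess is correct, "one away" when
# it shares three words with a hidden group, or simply wrong.

class ConnectionsPuzzle:
    def __init__(self, groups: dict[str, frozenset[str]]):
        assert len(groups) == 4
        assert all(len(members) == 4 for members in groups.values())
        self.groups = groups

    def check(self, guess: set[str]) -> str:
        for theme, members in self.groups.items():
            overlap = len(guess & members)
            if overlap == 4:
                return f"correct: {theme}"
            if overlap == 3:
                return "one away"
        return "wrong"

# Invented example groups, not from any real puzzle:
puzzle = ConnectionsPuzzle({
    "keyboard keys": frozenset({"SHIFT", "TAB", "ESCAPE", "RETURN"}),
    "exercise moves": frozenset({"LUNGE", "PLANK", "SQUAT", "CRUNCH"}),
    "types of jackets": frozenset({"BOMBER", "DENIM", "PUFFER", "BLAZER"}),
    "___ party": frozenset({"BLOCK", "SEARCH", "THIRD", "DINNER"}),
})
print(puzzle.check({"SHIFT", "TAB", "ESCAPE", "RETURN"}))  # correct: keyboard keys
```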
To find out how well various AI systems can solve today’s Connections puzzle, we provided the game’s directions and today’s 16 words (SHAKESPEARE, RATTLESNAKE, ANDROID, ROLLERBLADE, SONG, TITLE, SKATEBOARD, SKETCH, DONUT, DANCE, DEED, SAXOPHONE, CERTIFICATE, MONOLOGUE, PACIFIER, RECEIPT) to a range of traditional LLMs and newer reasoning-based systems, then counted how many categories each got correct. We used today’s puzzle because its answers cannot yet have appeared in any training dataset. Below are the results:
Only the dedicated reasoning models successfully complete the Connections challenge; the best traditional LLMs identify just one or two categories with the correct category label (I know getting three categories means you have the last one, but can you tell me why those words form a group? That’s the kicker). Reasoning models are clearly better at reasoning through Connections! Since they’re currently performing at ceiling, we’ll need something harder to distinguish between the reasoning models, but that’s a task for another day.
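For anyone who wants to replicate the setup, it reduces to a simple harness like the sketch below. Here `query_model` is a hypothetical stand-in for whichever API a given system exposes, and scoring counts exact category matches.

```python
# Sketch of the evaluation setup: each model receives the game's
# directions plus the 16 words, and we count how many of the four
# categories it reproduces exactly. query_model is a hypothetical
# stand-in, not a real client.

WORDS = [
    "SHAKESPEARE", "RATTLESNAKE", "ANDROID", "ROLLERBLADE",
    "SONG", "TITLE", "SKATEBOARD", "SKETCH",
    "DONUT", "DANCE", "DEED", "SAXOPHONE",
    "CERTIFICATE", "MONOLOGUE", "PACIFIER", "RECEIPT",
]

DIRECTIONS = (
    "Organize these 16 words into four groups of four, where each "
    "group shares a common theme. Name each theme and list its words."
)

prompt = DIRECTIONS + "\n\n" + ", ".join(WORDS)

def score(predicted_groups, answer_groups):
    """Count predicted groups that exactly match an answer group."""
    answers = {frozenset(group) for group in answer_groups}
    return sum(frozenset(group) in answers for group in predicted_groups)

# response = query_model(prompt)  # hypothetical call, one per model
```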
Why Reasoning Models Excel at Connections
The success of specialized reasoning models on the Connections puzzle is based on their advantages in handling multiple competing hypotheses simultaneously. While traditional language models might excel at identifying individual word relationships, they often struggle to maintain and evaluate multiple potential groupings concurrently — an important requirement for Connections. Reasoning-focused models can actively track and compare different possible solutions throughout the solving process. This capability is particularly useful when words have multiple potential associations, as the models can maintain these competing interpretations in their reasoning space while methodically evaluating which combinations lead to the most coherent overall solution.
What truly differentiates these models in solving Connections is their ability to operate across different levels of semantic abstraction while maintaining logical consistency. They can recognize patterns not just in direct word meanings, but across multiple dimensions simultaneously, including phonetic similarities, cultural references, idiomatic usage, and categorical relationships. This multi-level reasoning capability allows them to strategically eliminate incorrect groupings by considering how each decision affects the overall puzzle solution space. When a word like “RING” appears, these models can simultaneously evaluate its potential as part of jewelry-related terms, circular objects, wedding-related concepts, or sound-related words, while understanding how each interpretation affects the viability of other groupings. This balancing act is these models’ core strength: the ability to maintain and manipulate complex logical relationships while working toward a coherent solution.
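As a loose analogy for this kind of search (not a claim about how these models are implemented internally), consider a backtracking solver that keeps competing groupings alive and abandons any partial solution that leaves the remaining words unsolvable:

```python
# Analogy only: backtracking over candidate groupings. Each recursive
# call commits to one plausible group of four and backtracks whenever
# the remaining words cannot be partitioned into plausible groups.

from itertools import combinations

def solve(words, is_plausible_group, chosen=()):
    """Return a partition of `words` into plausible groups of 4, or None."""
    if not words:
        return list(chosen)
    first = min(words)  # anchor one word to avoid revisiting equivalent states
    rest = [w for w in words if w != first]
    for combo in combinations(rest, 3):
        group = frozenset((first, *combo))
        if not is_plausible_group(group):
            continue  # prune this hypothesis early
        result = solve(words - group, is_plausible_group, chosen + (group,))
        if result is not None:
            return result  # a globally consistent solution was found
    return None  # no grouping containing `first` works; backtrack

# Usage: solve(frozenset(sixteen_words), my_plausibility_predicate)
```

The `is_plausible_group` predicate stands in for the semantic judgment a language model supplies; in the models themselves, that judgment and the search are entangled rather than cleanly separated.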
Addition: However, reasoning capabilities don’t ensure correctness on Connections. The new Claude 3.7 Sonnet, with its reasoning mode enabled, and DeepSeek’s R1 may not get much further along; I’m frankly surprised that Claude still struggled with the task in its new reasoning mode. This is also a reminder that any real benchmark should include many puzzles, not just one, and the Simpsons reference in this one might be particularly difficult.
Implications for AI Development
The success of reasoning-focused models on Connections suggests that we’re moving beyond simple pattern matching toward systems that can truly engage in structured, multi-step reasoning. AI systems can now maintain complex working memory, evaluate competing hypotheses, and make decisions that require understanding both specific details and broader context simultaneously. The fact that these models can solve Connections — a puzzle that many humans find challenging — indicates that we’re developing systems that better mirror human cognitive processes, especially our ability to hold multiple possibilities in mind while methodically working toward a solution.
These insights are likely to influence the next generation of AI development in several ways. First, we’ll likely see more specialized architectures designed to handle specific types of reasoning tasks, rather than just scaling up general-purpose models. The principles that make these models effective at Connections (enhanced working memory, multi-level pattern recognition, and structured hypothesis testing) have applications far beyond word games; these capabilities could be valuable in scientific research, where multiple hypotheses need to be evaluated simultaneously, or in complex medical diagnosis, where various pieces of evidence must be weighed and combined. Second, success on Connections means we need new benchmarks that specifically test reasoning capabilities. While traditional metrics like MMLU or GSM8K have been valuable, we need more sophisticated evaluations that can measure a model’s ability to handle complex, multi-step reasoning tasks requiring both breadth and depth of understanding.
Conclusion
The recent emergence of reasoning-focused AI models is a significant step in the development of artificial intelligence. While solving word puzzles like Connections might seem like a narrow achievement, it represents a broader advancement in AI’s ability to work systematically, maintain complex working memory, and engage in structured problem-solving. We’re likely to see this trend continue with even more specialized AI architectures optimized for specific types of reasoning tasks, leading to systems that can handle increasingly complex intellectual challenges in fields ranging from scientific research and medical diagnosis to strategic planning. Games and puzzles will continue to play an important and fun role in this evolution, not just as benchmarks but as development tools that help us understand and improve AI’s reasoning capabilities. Just as chess once served as a north star for early AI development, modern puzzles like Connections are helping shape the next generation of AI systems that think more like we do: methodically, contextually, and with the ability to consider multiple possibilities simultaneously.