Drawing Conclusions: The Rise of Visual Reasoning in AI with Multimodal Visualization-of-Thought

Greg Robison
8 min read · Jan 20, 2025


“The soul never thinks without a picture.”

- Aristotle

When you’re trying to explain a complex idea to someone, do you just use words? Chances are, you’re reaching for a pen to sketch a quick diagram or gesturing with your hands to illustrate your point. This combination of verbal and visual thinking is fundamental to how we process and communicate information. Yet until recently, most artificial intelligence systems have been limited to thinking in text alone — imagine trying to solve a maze or rearrange furniture while blindfolded! That’s been changing with the rise of multimodal Large Language Models (LLMs) that can process not just text but also images, audio, and video. But simply being able to recognize and describe images isn’t enough for truly intelligent visual reasoning. New research from Microsoft and the University of Cambridge introduces an approach called Multimodal Visualization-of-Thought (MVoT) that enables AI to “think visually”, generating its own visual representations as it solves problems — much like how you might sketch out ideas while thinking through a challenging problem. This approach could improve AI reasoning and provide a powerful platform for embodied intelligence in robotics.

The Human Brain’s Secret Weapon: Multimodal Thinking

Our brain is a multimodal processing machine, constantly weaving together information from different channels — words, images, sounds, and spatial relationships — to understand and interact with the world. This ability is a fundamental feature that makes human cognition so powerful. When we’re trying to solve a complex problem, our brain doesn’t just process abstract words and concepts — it creates rich mental images, manipulates spatial relationships, and even simulates physical actions in space. You can instantly understand how to navigate through a crowded room or catch a falling object — your brain is processing multiple streams of information simultaneously and integrating them into a coherent understanding.

This multimodal thinking is especially active when we’re tackling complex problems. Have you noticed how naturally you reach for a pen and paper (or stylus) when planning a room layout or trying to explain directions? We instinctively create visual aids because our brains are wired to combine verbal and visual thinking for enhanced problem-solving. Even when we’re not physically drawing, our minds generate mental images to help us reason through spatial problems. A chess master can visualize potential moves, an architect can mentally walk through a novel building design, and a mathematician often “sees” geometric relationships before formally proving them. This natural integration of visual and verbal thinking isn’t just helpful — it’s often necessary for handling complex spatial and abstract problems that would be incredibly difficult to solve through words alone.

The Limitation of Words Alone

Despite their impressive capabilities, text-only language models are limited when it comes to tasks that humans naturally solve through visual thinking. Imagine trying to explain exactly how to tie a shoelace or assemble a piece of furniture using only words — while it’s technically possible, it’s remarkably inefficient and prone to misunderstanding (it’s also why most furniture instructions have visual representations of each step). This same challenge affects text-only AI models, especially when they encounter spatial reasoning tasks. While these models can process complex verbal descriptions like “move three spaces up, then two spaces left, avoiding the obstacle at coordinates (2,3),” they struggle to build an accurate mental model of the space and potential consequences of actions.

This limitation becomes particularly apparent in real-world scenarios like robotic navigation, architectural planning, or even simple tasks like organizing objects in a space. For example, when a text-only model tries to solve a maze or plan a safe path through a room with obstacles, it must maintain and update a complex verbal representation of spatial relationships that humans would naturally visualize. This reliance on purely verbal thinking not only makes these models more error-prone but also limits their ability to discover creative solutions that might be immediately obvious through visual reasoning.
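
To make that contrast concrete, here is a tiny sketch (my own illustration, not code from any model) of the same maze state expressed two ways: as the kind of verbal description a text-only model has to juggle, and as a rendered grid where the spatial relationships are simply visible.

```python
# A toy 5x5 maze state expressed two ways (illustrative only).

# Verbal form: every spatial fact is buried in a sentence the model must
# re-parse and re-check at each reasoning step.
verbal_state = (
    "Agent at (0, 0). Goal at (4, 4). "
    "Obstacles at (1, 1), (2, 3), (3, 2). No moves made yet."
)

# Visual form: the same facts, but relationships are visible at a glance.
def render_grid(agent, goal, obstacles, size=5):
    """Render the maze as a character grid: A = agent, G = goal, # = obstacle."""
    grid = [["." for _ in range(size)] for _ in range(size)]
    for row, col in obstacles:
        grid[row][col] = "#"
    grid[goal[0]][goal[1]] = "G"
    grid[agent[0]][agent[1]] = "A"
    return "\n".join(" ".join(row) for row in grid)

print(verbal_state)
print(render_grid(agent=(0, 0), goal=(4, 4), obstacles=[(1, 1), (2, 3), (3, 2)]))
```

Every fact in the verbal string has to be held in working memory and updated after each move, while the grid makes distances and collisions obvious at a glance. That gap is roughly what visual reasoning aims to close.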

Enter MVoT: Teaching AI to Think Visually

Microsoft and University of Cambridge researchers have improved visual reasoning with their Multimodal Visualization-of-Thought (MVoT) approach. Instead of forcing AI to reason about spatial problems through text alone, MVoT enables AI models to generate visual thoughts alongside verbal reasoning — similar to how you might sketch ideas while talking through a problem. The system doesn’t just process images; it actively creates visual representations of its thinking process, using them as steppingstones toward a solution. It’s like watching someone solve a puzzle: they don’t just talk to themselves about where pieces might go, they also point, gesture, and sometimes even draw out their ideas.

The MVoT process.

The magic of MVoT is how it integrates verbal and visual thinking. Just as we might alternate between sketching and explaining while solving a problem, MVoT generates sequences of text explanations and visual representations. For example, when navigating a complex environment, instead of just saying “move forward two spaces,” the system visualizes what that move would look like and uses this visual information to plan its next steps. It’s like how humans might draw out each step of a solution, using their sketches both to check their thinking and to help decide what to do next. The researchers found that this combined approach of thinking in both words and images made the AI significantly more robust and reliable, especially in complex scenarios where text-only reasoning can come up short.

Combining verbal and visual thoughts.
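
In pseudocode terms, the loop looks something like the sketch below. This is only my rough approximation of the idea: the actual system generates interleaved text and image tokens with a single multimodal model, whereas the two “generate” functions here are hypothetical stand-ins wired up as toy stubs so the example runs.

```python
# Rough sketch of MVoT-style interleaved reasoning (illustrative only).

def generate_text_step(context):
    # Stand-in for a model call that produces the next verbal reasoning step.
    thoughts_so_far = sum(1 for item in context if item.startswith("Thought"))
    if thoughts_so_far >= 2:
        return "FINAL ANSWER: path to the goal found."
    return f"Thought {thoughts_so_far + 1}: move one cell toward the goal."

def generate_visualization(context):
    # Stand-in for a model call that renders the state implied by the latest thought.
    return "<image: grid showing the agent's new position>"

def solve_with_mvot(task_prompt, max_steps=10):
    """Alternate verbal and visual thoughts, feeding both back into the context."""
    context = [task_prompt]
    for _ in range(max_steps):
        thought = generate_text_step(context)              # verbal step
        context.append(thought)
        if thought.startswith("FINAL ANSWER"):
            return context
        context.append(generate_visualization(context))    # visual step
    return context

for item in solve_with_mvot("Task: navigate the 5x5 maze from (0, 0) to (4, 4)."):
    print(item)
```

The key design choice is that each visual thought goes back into the context, so the next verbal step is conditioned on a picture of the current state rather than on a purely verbal summary of it.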

Putting It to the Test: Real Examples

The researchers tested MVoT with three increasingly complex spatial reasoning tasks that mirror real-world challenges. In the simplest scenario, MAZE, the AI had to navigate through a grid-based maze to reach specific destinations — similar to how a robot might need to plan a path through a building. The second task, MINIBEHAVIOR, raised the stakes by requiring the AI to manipulate objects in its environment, specifically picking up a printer and placing it on a table. The most challenging test, FROZENLAKE, required careful navigation through a dangerous environment where one wrong move could lead to failure — imagine planning a route across thin ice while avoiding weak spots.
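
To give a feel for what these benchmarks ask of a model, here is a toy FROZENLAKE-style grid world. The layout and helper function are my own illustration, not the benchmark’s actual code: safe ice, holes, and a goal, with an episode that ends the moment the agent steps onto a hole.

```python
# Toy FROZENLAKE-style grid (illustrative only).
# "." is safe ice, "H" is a hole, "G" is the goal; the agent starts at (0, 0).
LAKE = [
    list("..H."),
    list(".H.."),
    list("...H"),
    list("H..G"),
]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(position, action):
    """Apply one move and report whether the agent is blocked, safe, fell, or reached the goal."""
    row, col = position
    drow, dcol = MOVES[action]
    row, col = row + drow, col + dcol
    if not (0 <= row < len(LAKE) and 0 <= col < len(LAKE[0])):
        return position, "blocked"      # walked into the boundary; stay put
    if LAKE[row][col] == "H":
        return (row, col), "fell"       # one wrong move ends the episode
    if LAKE[row][col] == "G":
        return (row, col), "goal"
    return (row, col), "safe"

# A candidate plan the model would need to verify, move by move.
position = (0, 0)
for action in ["down", "down", "right", "down", "right", "right"]:
    position, outcome = step(position, action)
    print(f"{action:>5} -> {position} {outcome}")
```

A text-only model has to verify a plan like this by updating coordinates in its head at every step; MVoT instead generates an image of the board after each move and, in effect, reads the outcome off the picture.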

Results on the FROZENLAKE benchmark.

MVoT handled complex situations where traditional text-only methods stumbled. In the FROZENLAKE task, for instance, text-only models struggled to keep track of multiple hazards and often made fatal mistakes due to imprecise spatial reasoning. MVoT, on the other hand, excelled by actually visualizing each potential move and its consequences, much like how a human might sketch out different paths to find the safest route. The performance improvements were significant — MVoT outperformed traditional methods by over 20% in the most challenging scenarios. What’s particularly interesting is that MVoT’s visual thinking approach proved especially valuable as the environments became more complex. While text-only models saw their performance plummet as complexity increased (dropping from 94% to 39% accuracy as grid sizes grew), MVoT maintained consistently strong performance by leveraging its ability to “think” visually about each situation. There seems to be a real advantage to visual reasoning.

Why This Matters for the Future

This research demonstrates a shift in how AI systems might approach complex reasoning tasks, moving closer to the natural human ability to think across multiple modes. Just as the addition of visual capabilities dramatically enhanced our ability to communicate with AI, teaching AI to think visually could increase its problem-solving capabilities. The biggest benefits will be seen in real-world applications where spatial reasoning and physical interaction are essential — from robotic navigation and manipulation to architectural design and urban planning.

Some interesting applications include autonomous vehicles that can better visualize and plan complex maneuvers in traffic, manufacturing robots that can adapt to new assembly tasks by visualizing different approaches, and AI assistants that can help design and optimize space utilization in buildings or warehouses. The ability to think visually could also make AI systems better collaborators in creative fields, where sketching and iterating on ideas is crucial. Imagine working with an AI that can not only understand your verbal descriptions but also generate and refine visual concepts alongside you, much like collaborating with a human colleague who can sketch their ideas in real time.

However, there are still challenges to overcome with visual reasoning. Current limitations include the computational overhead required for generating visualizations and the occasional tendency to generate irrelevant visual details (a form of visual hallucination). The technology also needs to become more efficient and scalable before it can be widely deployed in real-world applications; visual reasoning in real time will require huge amounts of computation. Yet the path is clear: as these systems continue to evolve, we’re likely to see AI that can engage with us in increasingly natural and intuitive ways, combining verbal and visual thinking just as humans do. We will see AI systems that are not only more capable but also more relatable and easier to work with, as their thought processes become more aligned with human cognitive patterns. The future might see AI collaborators that can truly think alongside us, sketching out ideas, visualizing solutions, and helping us solve complex problems in ways that feel natural and intuitive.

Conclusion

The development of MVoT is a step toward AI systems that can think more like humans do — by seamlessly combining verbal and visual reasoning. Just as we naturally sketch diagrams or create mental images when solving complex problems, this new generation of AI can generate and use visual representations to enhance its problem-solving capabilities. The impressive results in spatial reasoning tasks highlight a crucial insight: true intelligence isn’t just about processing words or images in isolation but about integrating different modes of thinking in flexible and powerful ways. This ability to think across multiple modalities will become increasingly important, potentially leading to systems that can not only solve more complex problems but also collaborate with humans in more natural and intuitive ways. The future of AI may not lie in becoming better at any single type of processing, but in learning to combine different types of thinking — just as the human mind does so effortlessly.

Written by Greg Robison

With a Ph.D. in cognitive development and background in neuroscience, I bring a human-centric view to AI, whether theory, tools, or implications.
