Insider Brief
- MIT researchers have developed DAAAM, a memory framework that allows robots to build detailed maps of their surroundings and answer natural-language questions about what they have seen, where they saw it and when.
- The system combines robotic mapping with computer vision, enabling robots to attach descriptions to objects and locations and later retrieve that information through language-based queries rather than relying only on coordinates or visual data.
- In testing, DAAAM improved accuracy by 21% to 53% compared with existing methods, and the researchers said the technology could support applications ranging from industrial and warehouse robotics to autonomous inspection systems and augmented reality tools.
MIT researchers have developed a memory system that allows robots to remember detailed information about large environments and retrieve it using natural-language questions.
The study was led by Nicolas Gorlo, a graduate student at MIT, with Luca Carlone, associate professor in MIT’s Department of Aeronautics and Astronautics and director of the MIT SPARK Laboratory, and Lukas Schmid, a former MIT research scientist who is now a professor at the University of Technology Nuremberg in Germany. It was funded in part by the U.S. Army Research Laboratory and the Office of Naval Research.
DAAAM
The researchers call their system Describe Anything, Anywhere, at Any Moment, or DAAAM. It combines robotic mapping with advances in computer vision to create what the researchers describe as a form of spatiotemporal memory. The goal is to allow robots not only to navigate an environment but also to remember what they have seen, where they saw it and when.
The work takes on the spatiotemporal memory challenge in robotics. While modern AI systems can recognize objects and understand language, they often struggle to connect those capabilities to detailed memories of real-world environments. Humans routinely perform this task, remembering where they left an item or recalling details about a location visited days earlier.
“If we want robots to work side-by-side with humans and interact better with humans, they must speak the same language,” Carlone noted. “The robot must be able to reason about time and space the same way humans do. That is essentially what our method is doing. It is turning a traditional map into a language-based map that is easier for the robot to think about and access using language.”
Merging Two Technologies
The researchers combined two technologies that are typically developed separately. Computer vision models can generate rich descriptions of objects and scenes but often operate on individual images. Robotic mapping systems can create large-scale three-dimensional maps but usually lack detailed semantic information about what those maps contain.
As a robot moves through an environment, the system creates a 3D map while attaching descriptions to objects it encounters. For example, it can identify a building, recognize objects nearby and store information about their location within the map. Rather than simply recording coordinates, the system builds a searchable memory that links places, objects and descriptions together.
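To make the idea of a searchable memory concrete, here is a minimal Python sketch of how map entries might tie a description to a place and a time. It is an illustration only, not the researchers' implementation: the names `MemoryEntry` and `SpatialMemory`, their fields and the two query methods are all assumptions made for this example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    """One remembered object (hypothetical schema, not DAAAM's actual format)."""
    label: str                            # short object type, e.g. "chair"
    description: str                      # free-text description of the object
    position: tuple[float, float, float]  # location in the map frame, in meters
    first_seen: float                     # timestamp of the first observation, in seconds
    last_seen: float                      # timestamp of the latest observation, in seconds


@dataclass
class SpatialMemory:
    """A flat list of entries that can be searched by place or by time."""
    entries: list[MemoryEntry] = field(default_factory=list)

    def add(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def near(self, point, radius: float) -> list[MemoryEntry]:
        """Entries whose position lies within `radius` meters of `point`."""
        return [e for e in self.entries if math.dist(e.position, point) <= radius]

    def seen_between(self, start: float, end: float) -> list[MemoryEntry]:
        """Entries observed at some time in the interval [start, end]."""
        return [e for e in self.entries if e.first_seen <= end and e.last_seen >= start]
```

A real system would likely use a spatial index rather than a linear scan, but the flat list keeps the link between what, where and when easy to see.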
The Processing Delay Problem
A key hurdle was speed, according to the researchers. Existing methods that generate detailed descriptions of objects can take several seconds per scene, making them impractical for robots operating in real time.
To tackle this problem, the researchers developed a method that groups nearby objects and selects only the most useful images for detailed analysis. By processing multiple objects simultaneously, the system reduced computational demands and increased annotation speed by roughly an order of magnitude, the researchers said.
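The grouping step can be pictured with a small Python sketch. The article does not say how DAAAM groups objects or ranks images, so everything here is assumed for illustration: objects are bucketed by a coarse grid cell, and the "most useful" image for a group is taken to be the frame that shows the most of its members. The function names `group_by_cell` and `best_frame` are hypothetical.

```python
from collections import defaultdict


def group_by_cell(positions, cell_size):
    """Bucket object ids by the horizontal grid cell containing them.

    positions: dict mapping object id -> (x, y, z) in meters.
    Returns a dict mapping (cell_x, cell_y) -> list of object ids.
    """
    groups = defaultdict(list)
    for obj_id, (x, y, _z) in positions.items():
        key = (int(x // cell_size), int(y // cell_size))
        groups[key].append(obj_id)
    return dict(groups)


def best_frame(group, visibility):
    """Pick the frame that shows the most objects from `group`.

    visibility: dict mapping frame id -> set of object ids visible in it.
    This ranking rule is a stand-in for whatever criterion the real system uses.
    """
    members = set(group)
    return max(visibility, key=lambda frame: len(visibility[frame] & members))
```

With this arrangement, one call to a description model per group, on one selected frame, replaces one call per object, which is the kind of saving the batching is meant to capture.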
Once information is stored, the robot must be able to retrieve it efficiently. The researchers integrated a large language model that uses specialized search tools to locate relevant information within the robot’s memory. Depending on the question, the system can search by object type, location or other contextual information.
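A rough Python sketch of tool-based retrieval follows. It assumes the language model emits a structured tool call (a name plus arguments) and that a small dispatcher runs the matching search over memory. The language model itself is left out, and the tool names, the memory records and the call format are all invented for this example rather than taken from DAAAM.

```python
import math

# Hypothetical memory records; a real system would build these from the map.
MEMORY = [
    {"label": "drill", "description": "yellow cordless drill on a workbench",
     "position": (2.0, 1.0, 0.9), "time": 120.0},
    {"label": "ladder", "description": "folded aluminum ladder against a wall",
     "position": (8.0, 3.0, 0.0), "time": 300.0},
]


def search_by_label(memory, label):
    """Return records whose object type matches `label`."""
    return [m for m in memory if m["label"] == label]


def search_near(memory, point, radius):
    """Return records within `radius` meters of `point`."""
    return [m for m in memory if math.dist(m["position"], point) <= radius]


def search_by_time(memory, start, end):
    """Return records observed between `start` and `end` seconds."""
    return [m for m in memory if start <= m["time"] <= end]


# Registry of tools the language model is allowed to call (assumed names).
TOOLS = {
    "search_by_label": search_by_label,
    "search_near": search_near,
    "search_by_time": search_by_time,
}


def run_tool_call(memory, call):
    """Execute one tool call of the form {"name": ..., "arguments": {...}}."""
    tool = TOOLS[call["name"]]
    return tool(memory, **call["arguments"])
```

For a question such as "Where did you see the drill?", the model in this sketch would be expected to produce something like `{"name": "search_by_label", "arguments": {"label": "drill"}}` and then phrase its answer from the returned record.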
In testing, DAAAM outperformed existing approaches, achieving accuracy improvements ranging from 21% to 53%, depending on the type of query.
The researchers envision several applications. In industrial settings, workers could ask robotic assistants to retrieve partially completed components or locate tools. Similar capabilities could support warehouse operations, autonomous inspection systems and service robots.
Beyond robotics, the framework could be used in augmented reality systems that help maintenance workers identify anomalies or assist people navigating complex environments.
What’s Next?
The researchers plan to expand the system to capture significant events in addition to objects and locations. They are also exploring ways to allow robots to express confidence in their answers, which could improve reliability in real-world deployments.
“Ultimately, we want to have robots that can help with any sort of tasks. With this framework, we are trying to create the foundations to enable a generalist agent that can do anything you ask,” Gorlo added.
Image credit: MIT