Surgical robots have long been valuable tools in the operating room, assisting surgeons with precision tasks like suturing, holding instruments, or stabilizing organs. However, these systems still depend heavily on direct human control or pre-programmed instructions.

In a perspective piece in Nature Machine Intelligence, a team led by Johns Hopkins engineers proposes using multimodal, general-purpose models—trained on diverse datasets, including video demonstrations—to make surgical robots safer and more efficient. The team says this will equip surgical robots with the adaptability and autonomy needed to handle complex, dynamic procedures, thus reducing the burden on human surgeons.

“One major barrier to achieving fully autonomous surgical robots is the complexity of surgeries themselves,” says lead author Samuel Schmidgall, a graduate research assistant in the Whiting School of Engineering’s Department of Electrical and Computer Engineering. “Surgeons work with soft, living tissue, which changes unpredictably. This makes it hard to simulate real-life conditions when training surgical robots, and they must operate with high levels of safety to avoid patient harm.”

And unlike robots in manufacturing, surgical robots cannot rely solely on trial and error; each step must be meticulously planned to avoid risk to patients. The researchers say a single "multimodal, multitask" AI model, trained on a combination of visual, language, and physical (action-based) data, could overcome these challenges.

“This vision-language-action model would allow surgical robots to understand instructions in natural language, interpret what they see in real time, and take appropriate action based on context,” says Schmidgall. “The model could even detect situations that require human intervention, automatically signaling the surgeon when something is unclear.”
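The paragraph above describes an architecture rather than an implementation, but a minimal sketch can make the control concept concrete. The Python below is illustrative only, not code from the paper or from Krieger's system; the class names, the predict interface, and the 0.9 confidence threshold are all assumptions introduced for the example. It shows a vision-language-action policy that takes a text instruction and a camera frame, returns an action with a confidence score, and hands control back to the surgeon whenever that confidence drops below the threshold.

```python
# Hypothetical sketch of a confidence-gated vision-language-action (VLA)
# control loop for a surgical robot. All names, interfaces, and the
# threshold value are illustrative assumptions, not the authors' code.

from dataclasses import dataclass


@dataclass
class Action:
    """A low-level robot command plus the model's confidence in it."""
    joint_deltas: list[float]
    confidence: float  # 0.0 (no confidence) to 1.0 (fully confident)


class VisionLanguageActionModel:
    """Stand-in for a multimodal policy trained on video, language, and action data."""

    def predict(self, instruction: str, camera_frame: bytes) -> Action:
        # A real model would fuse the text instruction with the image and
        # output an action distribution; here we return a placeholder.
        return Action(joint_deltas=[0.0] * 7, confidence=0.42)


def control_loop(model: VisionLanguageActionModel,
                 instruction: str,
                 get_frame,
                 execute,
                 alert_surgeon,
                 confidence_threshold: float = 0.9) -> None:
    """Carry out a surgeon's text instruction autonomously while confidence
    stays high; otherwise stop and escalate to the surgeon."""
    while True:
        frame = get_frame()                        # latest endoscope image
        action = model.predict(instruction, frame)

        if action.confidence < confidence_threshold:
            # Low-confidence region: halt autonomous motion and hand control back.
            alert_surgeon(f"Confidence {action.confidence:.2f} below threshold; "
                          "requesting human takeover.")
            break

        execute(action.joint_deltas)               # send the command to the robot
```

In this sketch the single confidence threshold stands in for the risk-avoidance idea described in the paper's figure: the robot acts only while it remains confident, and anything below the threshold becomes a signal for the surgeon to take over.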

Axel Krieger, a Malone Center affiliate, associate professor of mechanical engineering, and co-author of the paper, recently demonstrated the potential of this approach by training a surgical robot using video-based datasets. Krieger's work shows how multimodal training enables robots to perform intricate tasks autonomously, bridging the gap between theory and practice in surgical robotics.

The emerging multimodal models used in Krieger's work and discussed in the team's paper illustrate the approach's transformative potential for surgical robotics. By analyzing thousands of procedures, autonomous surgical systems could refine techniques, assist in multiple surgeries simultaneously, and adapt to new scenarios, the authors say, pointing out that such tools could also ease surgeons' workloads and ultimately improve patient outcomes.

Co-authors of the paper also include Ji Woong Kim, a postdoctoral researcher in the Department of Mechanical Engineering; Ahmed Ezzat Ghazi, an associate professor of urology at the School of Medicine; and Alan Kuntz, an assistant professor at the University of Utah.

Image Caption: A proposed control loop for the autonomous RT-RAS: The surgeon provides action commands as text input to several RT-RAS robots performing different surgeries. Each RT-RAS executes these commands while maintaining high confidence, with the surgeon overseeing all of the robots. A risk-avoidance system returns control to the surgeon whenever a robot enters a low-confidence region.