Stanford Researchers Introduced Biomni: A Biomedical AI Agent for Automation Across Diverse Tasks and Data Types

Biomedical research is a rapidly evolving field that seeks to advance human health by uncovering the mechanisms behind diseases, identifying new therapeutic targets, and developing effective treatments. This field encompasses diverse areas, including genetics, molecular biology, pharmacology, and clinical studies, which require specialized tools and in-depth expertise. The increasing complexity of biomedical data, experiments, and literature has created both opportunities and challenges. Researchers must integrate findings from genomics, proteomics, and other data sources to generate hypotheses, design experiments, and interpret results. The ability to efficiently manage this complexity is crucial for accelerating scientific discovery and translating findings into clinical applications.

The core challenge in biomedical research is the sheer volume of data, methods, and tools that must be managed to produce meaningful results. Researchers often face fragmented workflows, relying on numerous specialized tools that don’t integrate well with each other. This creates bottlenecks when designing experiments, processing large datasets, or interpreting multimodal biomedical information. The problem is compounded by the limited availability of expert human researchers, which makes it difficult to keep pace with the growing body of scientific knowledge. As a result, significant portions of biomedical data remain underutilized, and connections between findings across different subfields are often missed. Addressing these concerns requires a new approach that can scale expertise, handle data complexity, and support integrated workflows across various biomedical domains.

Existing tools for biomedical research often focus on narrow tasks such as specific gene analysis, protein structure prediction, or drug-target interaction studies. These tools require careful setup, domain-specific knowledge, and manual integration into broader workflows. While large language models (LLMs) have shown promise in tasks like biomedical question answering, they cannot typically interact with specialized tools or databases directly. Past efforts to create AI agents for biomedical tasks have relied on predefined workflows or templates, limiting their flexibility. Consequently, researchers have struggled to find AI systems that can adapt to diverse biomedical tasks, dynamically compose new workflows, or execute complex analyses end-to-end.

Researchers from Stanford University, Genentech, the Arc Institute, the University of Washington, Princeton University, and the University of California, San Francisco, introduced Biomni, a general-purpose biomedical AI agent. Biomni combines a foundational biomedical environment, Biomni-E1, with an advanced task-executing architecture, Biomni-A1. Biomni-E1 was constructed by mining tens of thousands of biomedical publications across 25 subfields, extracting 150 specialized tools, 105 software packages, and 59 databases, forming a unified biomedical action space. Biomni-A1 dynamically selects tools, formulates plans, and executes tasks by generating and running code, enabling the system to adapt to diverse biomedical problems. This integration of reasoning, code-based execution, and resource selection allows Biomni to perform a wide range of tasks autonomously, including bioinformatics analyses, hypothesis generation, and protocol design. Unlike static function-calling models, Biomni’s architecture allows it to flexibly interleave code execution, data querying, and tool invocation, creating a seamless pipeline for complex biomedical workflows.
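To make the idea of a unified action space more concrete, here is a minimal Python sketch of a registry that holds tools, software packages, and databases side by side. It is an illustration only and not Biomni-E1's actual data model: the class names, fields, and the three example entries are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ResourceKind(Enum):
    """The three resource categories the article attributes to Biomni-E1."""
    TOOL = "tool"
    PACKAGE = "software package"
    DATABASE = "database"


@dataclass(frozen=True)
class Resource:
    name: str          # hypothetical identifier, not a real Biomni resource
    kind: ResourceKind
    description: str   # free text an agent could read when choosing resources


@dataclass
class ActionSpace:
    """A flat registry so every resource is discoverable through one interface."""
    resources: List[Resource] = field(default_factory=list)

    def register(self, resource: Resource) -> None:
        self.resources.append(resource)

    def count_by_kind(self) -> Dict[str, int]:
        counts = {kind.value: 0 for kind in ResourceKind}
        for resource in self.resources:
            counts[resource.kind.value] += 1
        return counts


def build_toy_action_space() -> ActionSpace:
    """Populate the registry with three made-up entries, one per category."""
    space = ActionSpace()
    space.register(Resource("align_sequences", ResourceKind.TOOL,
                            "align two nucleotide sequences"))
    space.register(Resource("example_stats_pkg", ResourceKind.PACKAGE,
                            "general statistics routines"))
    space.register(Resource("example_variant_db", ResourceKind.DATABASE,
                            "lookup table of sequence variants"))
    return space
```

In the system described above, the equivalent registry would hold 150 tools, 105 packages, and 59 databases; the point of the sketch is only that a single catalog with text descriptions gives an agent one place to search.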

Biomni-A1 uses an LLM-based tool selection mechanism to identify relevant resources based on user goals. It uses code as a universal interface to compose complex workflows with procedural logic, including loops, parallelization, and conditional steps. An adaptive planning strategy lets Biomni refine its plan iteratively as it executes a task, so that later steps respond to intermediate results. Biomni’s performance was evaluated on multiple benchmarks. On the LAB-Bench benchmark, Biomni achieved 74.4% accuracy in DbQA and 81.9% in SeqQA, compared with 74.7% and 78.8%, respectively, for human experts: roughly on par in DbQA and ahead in SeqQA. On the HLE benchmark covering 14 subfields, Biomni scored 17.3%. Its reported average performance gain was 402.3% over the base LLM, 43.0% over a coding agent, and 20.4% over Biomni-ReAct, its own ablated variant. In real-world case studies, Biomni autonomously generated a 10-step pipeline to analyze 458 wearable sensor files, identifying an average postprandial temperature increase of 2.19°C across individuals. It also analyzed 227 nights of sleep data, uncovering insights such as mid-week peaks in sleep efficiency and the importance of circadian regularity over total sleep duration.
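The select, plan, execute, and refine cycle described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Biomni-A1's implementation: word overlap stands in for the LLM-based tool selection, a caller-supplied `replan` function stands in for adaptive planning, and every name here (`select_tools`, `run_with_replanning`, the catalog entries) is hypothetical.

```python
from typing import Callable, Dict, List


def select_tools(goal: str, catalog: Dict[str, str], top_k: int = 2) -> List[str]:
    """Rank tools by word overlap between the goal and each description.

    This is a simple stand-in for LLM-based selection, which the sketch
    does not implement.
    """
    goal_words = set(goal.lower().split())
    scored = []
    for name, description in catalog.items():
        overlap = len(goal_words & set(description.lower().split()))
        scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for overlap, name in scored[:top_k] if overlap > 0]


def run_with_replanning(
    steps: List[str],
    namespace: dict,
    replan: Callable[[str, Exception], str],
    max_retries: int = 1,
) -> List[str]:
    """Execute code steps in one shared namespace, revising a step if it fails."""
    log = []
    for step in steps:
        code = step
        for attempt in range(max_retries + 1):
            try:
                exec(code, namespace)  # code is the interface between steps
                log.append(f"ok: {code}")
                break
            except Exception as error:
                log.append(f"failed: {code} ({type(error).__name__})")
                if attempt == max_retries:
                    raise
                # Ask the planner for a revised version of the failing step.
                code = replan(code, error)
    return log
```

As a usage example, calling `run_with_replanning` with the steps `"readings = [1.0, 2.0, 3.0]"` and `"mean = sum(readings) / count"`, plus a `replan` function that replaces `count` with `len(readings)`, records one failure and then succeeds on the revised step. Because every step shares one namespace, later steps can use loops and conditionals over earlier results, which is the practical appeal of using code rather than fixed function calls.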

Biomni’s ability to handle real-world research questions extends to complex multi-omics analyses, where it processed over 336,000 single-nucleus RNA-seq and ATAC-seq profiles from human embryonic skeletal data. Biomni constructed a 10-stage analysis pipeline to predict transcription factor-target gene links, filter results using chromatin accessibility data, and summarize findings in a structured report. The agent handled all aspects of the analysis, including code generation, error debugging, and results interpretation, producing outputs such as trajectory plots, heatmaps, and PCA biplots. These capabilities demonstrate Biomni’s capacity to manage large-scale, multi-modal datasets, identify biological patterns, and accelerate the path from raw data to testable hypotheses. By executing between 6 and 24 distinct steps per task and integrating up to 4 specialized tools, 8 software packages, and 3 unique data lake items, Biomni mirrors the workflows of human scientists while drastically reducing manual effort.
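The "predict links, filter by accessibility, summarize" portion of that pipeline can be illustrated with a small Python sketch. It is a deliberate simplification and not the analysis Biomni ran: the link scores, the single accessibility value per target gene, both thresholds, and the placeholder names (`TF_A`, `gene_1`, and so on) are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Link(NamedTuple):
    tf: str        # transcription factor (placeholder name)
    target: str    # putative target gene (placeholder name)
    score: float   # predicted link strength from an upstream step


def filter_links_by_accessibility(
    links: List[Link],
    accessibility: Dict[str, float],
    min_score: float = 0.5,
    min_accessibility: float = 0.2,
) -> List[Link]:
    """Keep links that are both well scored and supported by open chromatin.

    Assumes one accessibility value per target gene; targets missing from
    the accessibility table are treated as closed (0.0).
    """
    kept = [
        link for link in links
        if link.score >= min_score
        and accessibility.get(link.target, 0.0) >= min_accessibility
    ]
    return sorted(kept, key=lambda link: link.score, reverse=True)


def summarize(links: List[Link]) -> str:
    """Render the surviving links as a short plain-text report."""
    return "\n".join(
        f"{link.tf} -> {link.target} (score={link.score:.2f})" for link in links
    )
```

With four toy links and an accessibility table in which `gene_2` is nearly closed, the filter drops the link into `gene_2` despite its high score, and drops any link scored below the threshold regardless of accessibility. The two-condition filter is the design point: expression-based predictions alone can be noisy, so a second data modality is used to prune them.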

Key takeaways from the research on Biomni include:

Biomni-E1 comprises 150 specialized tools, 105 software packages, and 59 databases, all of which are integrated for biomedical research.
Biomni’s average performance gain: 402.3% over the base LLM, 43.0% over the coding agent, and 20.4% over Biomni-ReAct.
Biomni autonomously executed a 10-step pipeline analyzing 458 wearable sensor files, revealing a 2.19°C average postprandial temperature rise.
On the LAB-Bench benchmark, Biomni achieved 74.4% accuracy in DbQA and 81.9% in SeqQA, roughly on par with human experts in DbQA (74.7%) and ahead of them in SeqQA (78.8%).
Biomni handled a complex multi-omics dataset of 336,162 profiles and generated interpretable outputs, including gene regulatory networks and motif enrichment analyses.
Tasks involved between 6 and 24 steps, using up to 4 tools, 8 software packages, and 3 data lake items.
Biomni’s flexible architecture enables it to generate PCA plots, heatmaps, trajectory plots, and cluster maps autonomously, producing human-readable reports without manual intervention.
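The LAB-Bench comparison is easy to check with arithmetic. The short Python sketch below uses only the four accuracy figures reported in this article (74.4% and 81.9% for Biomni, 74.7% and 78.8% for human experts); the helper names are illustrative.

```python
def point_difference(agent: float, baseline: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(agent - baseline, 1)


def relative_gain_percent(agent: float, baseline: float) -> float:
    """Relative gain of agent over baseline, expressed as a percentage."""
    return (agent - baseline) / baseline * 100.0


# (Biomni accuracy, human-expert accuracy) as reported in the article.
LAB_BENCH = {"DbQA": (74.4, 74.7), "SeqQA": (81.9, 78.8)}

for task, (agent, human) in LAB_BENCH.items():
    diff = point_difference(agent, human)
    print(f"{task}: {diff:+.1f} points vs. human experts")
```

The differences come out to -0.3 points on DbQA and +3.1 points on SeqQA, so the agent is slightly below the human figure on one task and above it on the other. Note that a gain stated in percent, such as the 402.3% figure, is a relative measure like `relative_gain_percent` and is not the same thing as a difference in percentage points.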

In conclusion, Biomni represents a major step forward in biomedical AI, combining reasoning, code execution, and dynamic resource integration into a single system. The researchers have demonstrated that it can generalize across tasks, execute complex workflows without manual templates, and produce results that rival or exceed human expertise in several areas. The system’s ability to handle large datasets, compose complex pipelines, and generate human-readable reports suggests it has the potential to significantly accelerate biomedical discovery, reduce the burden on researchers, and enable new insights.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views.
