Artificial intelligence (AI) is playing a growing role in healthcare. It can streamline how clinics operate, support better patient care, and reduce administrative work. A persistent problem in the U.S., however, is that many benchmarks for healthcare AI rely on data from a single hospital or medical center, which limits how well the resulting systems perform anywhere else.
This article examines the limitations of AI benchmarks built on data from a single institution, shares approaches for helping AI perform well across different hospitals and clinics in the United States, and looks at the growing use of AI tools that automate healthcare work.
Healthcare AI benchmarks are structured test suites for measuring how well AI systems perform their jobs. They cover tasks such as retrieving patient information, tracking lab tests, writing documentation, ordering tests, sending referrals, and managing medications. One example is MedAgentBench, developed by researchers at Stanford University. It draws 100 patient profiles from Stanford's database of more than 700,000 de-identified patient records and evaluates AI on 300 tasks designed to reflect what physicians and nurses do in real practice.
AI models such as Claude 3.5 Sonnet v2 and GPT-4o have been tested on MedAgentBench, passing roughly 70% and 64% of the tasks, respectively. The benchmark uses a strict scoring method focused on safety and reliability, which is critical in healthcare.
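The strict scoring referred to here is a pass@1 success rate, described in more detail later in this article: each task gets a single attempt, and the benchmark reports the fraction of tasks whose one attempt succeeds. The short sketch below is a generic illustration of that calculation, not the benchmark's own evaluation code.

```python
def pass_at_1_success_rate(results):
    """results: one boolean per task, True if the task's single attempt
    met every success criterion (no retries are counted)."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# Example: 209 of 300 tasks passing on the first and only attempt
# corresponds to the roughly 70% figure cited above.
print(f"SR = {pass_at_1_success_rate([True] * 209 + [False] * 91):.2%}")  # SR = 69.67%
```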
Even with these results, MedAgentBench and similar benchmarks represent the data and working methods of a single institution, Stanford University Medical Center. That becomes a problem when the same AI is expected to operate across the many different healthcare settings in the U.S.
Patient populations vary widely across the United States. A dataset from one hospital tends to overrepresent particular age groups, races, social backgrounds, and health conditions. Stanford's database, for example, mainly reflects patients from Northern California; it does not represent patients from rural areas in the Midwest or urban hospitals in the South.
As a result, AI models that learn from one site's data may not serve patients in other areas well. The models can miss important patterns or make incorrect predictions simply because they have not seen enough variety in patients.
Healthcare organizations also work differently. Clinicians may record information in different ways, the electronic health record (EHR) systems they use can look very different, and the tests and tools available vary from site to site.
An AI built around one hospital's data may not perform well at another hospital with different forms and rules. Stanford's MedAgentBench, for example, follows the FHIR interoperability standard, but it mirrors Stanford's own workflows, which may not match how other hospitals or clinics operate.
A review of healthcare AI training practices through May 2024 found recurring problems. AI models often fit too closely to a single hospital's data and do not adjust well to different patient groups, so they perform well on the original data but make more mistakes elsewhere.
Sharing data between hospitals is difficult because of privacy laws such as HIPAA in the U.S.; hospitals cannot freely exchange patient information. That makes it hard to build AI that learns from many different hospitals and limits how well AI generalizes across healthcare settings.
Federated learning is a way to train AI on data from many hospitals without sharing the actual patient data. Instead, hospitals share updates from their models, keeping patient privacy safe.
Many current federated learning projects still face significant challenges, including communication overhead, methodological errors, and biased results. Healthcare leaders should look for solutions that protect privacy rigorously and demonstrate that they work in real clinics.
Experts recommend stronger privacy tools (such as differential privacy), better communication between participating hospitals, and careful validation to confirm that models hold up on different kinds of patient data.
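To illustrate the idea, the sketch below simulates plain federated averaging (FedAvg) across a few hypothetical sites, with a simple Gaussian-noise step standing in for the differential-privacy tools mentioned above. The site names, model, and noise scale are illustrative assumptions, not details of any specific project.

```python
import numpy as np

def local_update(global_weights, local_data, lr=0.01):
    """Hypothetical local step: a site nudges the shared weights using
    only its own (X, y) data and returns the updated weights."""
    X, y = local_data
    grad = X.T @ (X @ global_weights - y) / len(y)   # squared-loss gradient
    return global_weights - lr * grad

def add_dp_noise(weights, noise_scale=0.01, rng=None):
    """Crude stand-in for differential privacy: add Gaussian noise so the
    shared update reveals less about any individual patient record."""
    rng = rng or np.random.default_rng(0)
    return weights + rng.normal(0.0, noise_scale, size=weights.shape)

def federated_average(site_weights, site_sizes):
    """FedAvg: weight each site's update by its number of local records."""
    total = sum(site_sizes)
    return sum(w * (n / total) for w, n in zip(site_weights, site_sizes))

# Simulated sites: each keeps its (X, y) data local and shares only weights.
rng = np.random.default_rng(42)
sites = {name: (rng.normal(size=(n, 3)), rng.normal(size=n))
         for name, n in [("urban_academic", 500), ("rural_clinic", 120),
                         ("community_hospital", 300)]}

global_w = np.zeros(3)
for _ in range(5):                                   # a few communication rounds
    updates, sizes = [], []
    for name, data in sites.items():
        w = local_update(global_w, data)
        updates.append(add_dp_noise(w))              # only noised weights leave the site
        sizes.append(len(data[1]))
    global_w = federated_average(updates, sizes)
print("Aggregated weights after 5 rounds:", global_w)
```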
To make AI more robust, benchmark datasets should come from many healthcare organizations and cover different regions, patient groups, and clinical procedures. This is hard to assemble but essential in the U.S.
Good benchmarks should have data from academic hospitals in big cities, community hospitals, rural clinics, and specialty outpatient centers. This helps AI learn how to handle different ways healthcare is done.
Most AI benchmarks ask simple, single-turn questions. MedAgentBench is different because it tests actions that require multiple steps, such as ordering lab tests or entering vital signs into clinical systems.
Expanding these kinds of tests to include workflows from other hospitals is key; it helps prepare AI to work safely in real healthcare situations.
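As a concrete picture of what such a multi-step test can look like, here is a hypothetical task record and a strict checker for it. The field names and the checking logic are assumptions for illustration and do not reproduce MedAgentBench's actual schema.

```python
# Hypothetical multi-step benchmark task; the schema below is illustrative only.
task = {
    "task_id": "order-cbc-001",
    "instruction": "Order a CBC for patient S1234 and record today's vitals.",
    "patient_id": "S1234",
    "expected_actions": [
        {"method": "GET",  "resource": "Patient",        "purpose": "confirm identity"},
        {"method": "POST", "resource": "ServiceRequest", "purpose": "order CBC panel"},
        {"method": "POST", "resource": "Observation",    "purpose": "document vital signs"},
    ],
}

def task_succeeded(task, performed_actions):
    """Strict check: every expected (method, resource) pair must appear,
    in order, among the actions the agent actually performed."""
    expected = [(a["method"], a["resource"]) for a in task["expected_actions"]]
    performed = iter((a["method"], a["resource"]) for a in performed_actions)
    return all(step in performed for step in expected)
```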
Healthcare leaders should ask vendors to show how AI performs in complex tasks typical of their own clinics, not just simple questions.
AI must be tested on data from many different hospitals to make sure it works well everywhere. Testing should be consistent and repeatable.
Investing in benchmark sets that are free and open for all U.S. healthcare groups can help achieve this. Working with universities and AI companies can improve how AI tools are tested on many kinds of data.
Beyond supporting medical decisions, AI is changing front-office work such as answering phones and scheduling appointments. Companies like Simbo AI build AI-driven phone systems aimed at reducing administrative tasks.
Practice owners and managers can use AI tools to ease routine work and improve how they communicate with patients. AI phone systems can route calls correctly, remind patients about appointments, and handle simple questions, letting clinic staff focus on higher-value work.
How well AI performs in these tasks depends on how reliably it follows instructions and how well it connects with existing electronic health record and practice management software.
That capability is a key point from Stanford's research: the models evaluated there can carry out multi-step tasks and work with clinical systems through APIs (application programming interfaces).
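A minimal sketch of that pattern appears below: an agent loop that asks a model which EHR call to make next, performs it against a FHIR endpoint, and feeds the result back until the task is done or a round limit is reached. The llm_choose_action helper and the server URL are placeholders; a real integration would use the vendor's SDK and the organization's own FHIR server and authentication.

```python
import requests

FHIR_BASE = "https://example-ehr.local/fhir"   # hypothetical FHIR endpoint

def llm_choose_action(instruction, history):
    """Placeholder for a model call that returns the next EHR action as a dict,
    e.g. {"method": "GET", "path": "/Patient/S1234"} or {"done": True}."""
    raise NotImplementedError("wire this to your model provider")

def run_agent(instruction, max_rounds=8):
    """Bounded agent loop: interpret the instruction, call the EHR API,
    observe the result, and repeat until done or the round limit is hit."""
    history = []
    for _ in range(max_rounds):
        action = llm_choose_action(instruction, history)
        if action.get("done"):
            return history
        if action["method"] == "GET":
            resp = requests.get(FHIR_BASE + action["path"], timeout=10)
        else:
            resp = requests.post(FHIR_BASE + action["path"],
                                 json=action.get("body", {}), timeout=10)
        history.append({"action": action, "status": resp.status_code,
                        "result": resp.json() if resp.content else None})
    return history
```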
Workflow automation also helps address staff shortages, especially in rural or underserved areas, and can make operations more efficient without breaking privacy or security rules.
When evaluating AI for workflow automation, administrators should consider how reliably the tool follows instructions, how it integrates with existing EHR and practice management systems, whether it has been tested on data from more than one institution, and how it protects patient privacy and security.
Healthcare managers and IT staff must balance AI adoption with patient safety and privacy compliance. Choosing AI tools trained or validated only on one hospital's data can lead to poor results in other settings.
Understanding these problems helps managers evaluate claims from AI vendors more critically. They should look for tools tested on many types of data with rigorous methods.
Joining collaborative efforts that avoid sharing raw patient data, such as federated learning, can help hospitals work together to make AI more accurate and useful.
Following these ideas will help bring AI into healthcare more safely and reliably, improving patient care and clinic efficiency across the country.
This article is intended to give healthcare managers and clinicians in the U.S. a better understanding of the current challenges and of ways to improve AI in healthcare. AI is advancing quickly, but making sure that data reflects many types of patients and that AI fits into real workflows remains essential. That is what will allow AI to work well and safely for a wide range of healthcare providers.
MedAgentBench is a real-world benchmark suite developed by Stanford University researchers to evaluate large language model (LLM) agents in healthcare settings through interaction with a virtual EHR system and multi-step clinical tasks.
Healthcare requires agentic AI benchmarks because traditional datasets test static reasoning, whereas agentic AI can interpret instructions, call APIs, and automate complex workflows, addressing staff shortages and administrative burdens in clinical environments.
It contains 300 tasks across 10 medical categories such as patient data retrieval, lab tracking, documentation, test ordering, referrals, and medication management, each typically involving 2-3 multi-step actions.
The benchmark uses 100 de-identified patient profiles from Stanford’s STARR repository with over 700,000 records, including labs, vitals, diagnoses, procedures, and medication orders, maintaining clinical validity.
It is FHIR-compliant, supporting both GET and POST operations on EHR data, allowing AI agents to simulate real clinical actions like documenting vitals and placing medication orders.
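For readers unfamiliar with what GET and POST operations on EHR data look like, the snippet below shows two representative FHIR calls: reading a patient's recent lab Observations and writing a heart-rate vital sign. The server URL and patient ID are placeholders, and a real deployment would add the institution's own endpoint and authentication.

```python
import requests

FHIR_BASE = "https://example-ehr.local/fhir"   # placeholder FHIR server
HEADERS = {"Content-Type": "application/fhir+json"}

# GET: retrieve recent lab results for a (hypothetical) patient.
labs = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "example-patient-id", "category": "laboratory", "_count": 10},
    headers=HEADERS, timeout=10,
).json()

# POST: document a heart-rate vital sign as a new Observation resource.
heart_rate = {
    "resourceType": "Observation",
    "status": "final",
    "category": [{"coding": [{"system": "http://terminology.hl7.org/CodeSystem/observation-category",
                              "code": "vital-signs"}]}],
    "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4",
                         "display": "Heart rate"}]},
    "subject": {"reference": "Patient/example-patient-id"},
    "valueQuantity": {"value": 72, "unit": "beats/minute"},
}
resp = requests.post(f"{FHIR_BASE}/Observation", json=heart_rate, headers=HEADERS, timeout=10)
print(resp.status_code)
```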
Models are evaluated on task success rate (SR) using a strict pass@1 metric to ensure safety, with a baseline orchestration using nine FHIR functions and limited to eight rounds of interaction per task.
Claude 3.5 Sonnet v2 led with 69.67% success, excelling in retrieval (85.33%), followed by GPT-4o at 64.0%, and DeepSeek-V3 at 62.67% among open-weight models.
Two main failure types are instruction adherence failures, such as invalid API calls or incorrect JSON formatting, and output mismatch where agents produce full sentences instead of required structured numerical values.
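Both failure modes can be screened with simple checks before an agent's action is accepted, as in the illustrative sketch below (the function names are hypothetical, not part of the benchmark).

```python
import json
import re

def is_valid_json_payload(raw: str) -> bool:
    """Instruction-adherence check: the agent's API payload must parse as JSON."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False

def is_structured_numeric(answer: str) -> bool:
    """Output-format check: accept a bare number such as '7.2' and
    reject a full sentence such as 'The potassium level is 7.2 mmol/L.'"""
    return re.fullmatch(r"-?\d+(\.\d+)?", answer.strip()) is not None

print(is_valid_json_payload('{"resourceType": "Observation"}'))  # True
print(is_structured_numeric("The value is 7.2"))                 # False
print(is_structured_numeric("7.2"))                              # True
```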
Limitations include reliance on data from a single institution and a focus mainly on EHR-related workflows, which may limit generalizability across varied healthcare systems and tasks.
It provides an open, reproducible, clinically relevant framework that measures real-world agentic AI performance beyond static QA, guiding improvements toward dependable AI agents for live clinical environments.