Design and implementation of real-world benchmarks for evaluating large language models in multi-step electronic health record tasks

Medical practices are adopting new technologies to manage the growing complexity of clinical workflows and patient data.
Large language models (LLMs) have drawn attention for their potential to help clinicians and administrative staff with electronic health record (EHR) tasks.
Before LLMs can be used reliably in everyday medical work, however, healthcare administrators, practice owners, and IT managers need to understand how these models perform on complex, multi-step tasks involving patient information.

This article examines how real-world benchmarks are designed and used to evaluate LLMs on multi-step EHR workflows in the U.S. healthcare system.
Such benchmarks are essential for verifying the reliability, safety, and efficiency of AI agents before they are deployed in clinical settings.
The article covers recent research, datasets, evaluation challenges, and workflow automation practices that can help healthcare leaders make informed decisions.

The Need for Real-World Benchmarks in Healthcare AI

Large language models have evolved from tools that answer single questions into agents that can reason through multiple steps and interact with clinical systems.
Unlike older evaluations built on fixed question sets, real medical practice requires systems that handle ongoing tasks such as retrieving patient data, documenting encounters, managing medications, and placing referrals.
These multi-step tasks mirror the workflows found in hospitals and clinics.

In medical administration, patient safety and regulatory compliance are paramount.
AI tools must meet strict requirements to avoid errors or biases that could harm patients.
Legacy AI evaluations built on fixed questions or single-step answers do not exercise models in realistic, complex workflows.
Benchmarks are needed that replicate real clinical environments and measure how well AI tools perform across ongoing, multi-step healthcare tasks.

Stanford’s MedAgentBench: A Step Towards Realistic AI Evaluation

Stanford University researchers created MedAgentBench, a benchmark suite designed to evaluate LLM agents in a realistic EHR environment.
It draws on 100 de-identified patient profiles from Stanford’s STARR repository, comprising over 700,000 clinical records such as lab results, vital signs, diagnoses, procedures, and medication orders.

MedAgentBench includes 300 clinician-authored tasks spanning ten medical categories.
The tasks mimic situations encountered in inpatient and outpatient care, such as tracking labs, retrieving patient information, ordering tests, placing referrals, and managing medications.
The environment follows the Fast Healthcare Interoperability Resources (FHIR) standard, so AI agents can not only read data (GET requests) but also write to EHR records (POST requests), mirroring real clinical actions.
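
To illustrate what these read and write operations look like in practice, the sketch below calls a hypothetical FHIR server with Python's requests library. The base URL, patient ID, and observation values are placeholders for illustration only, not MedAgentBench's actual endpoints or data.

```python
import requests

# Hypothetical FHIR server base URL (placeholder, not the MedAgentBench endpoint).
FHIR_BASE = "https://fhir.example.org/api"

# GET: read the most recent laboratory Observations for a patient.
resp = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "example-patient-1", "category": "laboratory",
            "_sort": "-date", "_count": 5},
    timeout=10,
)
resp.raise_for_status()
observations = resp.json().get("entry", [])

# POST: record a new vital-sign Observation (heart rate) for the same patient.
new_obs = {
    "resourceType": "Observation",
    "status": "final",
    "category": [{"coding": [{
        "system": "http://terminology.hl7.org/CodeSystem/observation-category",
        "code": "vital-signs"}]}],
    "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4",
                         "display": "Heart rate"}]},
    "subject": {"reference": "Patient/example-patient-1"},
    "valueQuantity": {"value": 72, "unit": "beats/minute"},
}
resp = requests.post(f"{FHIR_BASE}/Observation", json=new_obs, timeout=10)
resp.raise_for_status()
```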

Evaluation uses a strict pass@1 success rate metric, which checks whether the AI completes a task correctly on its first attempt, reflecting the accuracy and safety margin required in healthcare.
The model must follow instructions exactly, use correct data formats, and produce outputs that can be written directly into healthcare systems without manual correction.
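
Conceptually, pass@1 is simply the fraction of tasks whose single first attempt passes that task's correctness check. A minimal sketch follows; the task, agent, and checker interfaces are assumptions for illustration, not the benchmark's actual code.

```python
from typing import Callable, Sequence

def pass_at_1(tasks: Sequence[dict],
              run_agent: Callable[[dict], str],
              check: Callable[[dict, str], bool]) -> float:
    """Fraction of tasks solved correctly on the first (and only) attempt."""
    successes = 0
    for task in tasks:
        first_attempt = run_agent(task)   # one attempt per task, no retries
        if check(task, first_attempt):    # task-specific correctness check
            successes += 1
    return successes / len(tasks)
```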

In testing, Claude 3.5 Sonnet v2 scored highest with a 69.67% overall success rate and was particularly strong on data retrieval tasks, at 85.33%.
GPT-4o followed at 64%.
These results show that many models handle simple queries well but struggle to complete multi-step workflows safely.
Common failure modes include invalid API calls, malformed JSON, and free-text answers where structured data is required.
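
Many of these failures can be caught before anything reaches the EHR by validating the agent's output. The sketch below is a generic guard rather than MedAgentBench code; the required fields are illustrative assumptions.

```python
import json

# Illustrative schema; real integrations would validate against the target resource's spec.
REQUIRED_FIELDS = {"patient_id": str, "value": (int, float), "unit": str}

def validate_agent_output(raw: str) -> dict:
    """Reject free-text or malformed JSON before it is written to the EHR."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Agent returned non-JSON output: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("Expected a JSON object, got " + type(payload).__name__)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            raise ValueError(f"Missing required field: {field}")
        if not isinstance(payload[field], expected_type):
            raise ValueError(f"Field {field!r} has the wrong type")
    return payload
```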

Challenges in Evaluating LLMs for Healthcare Workflows

A major challenge in applying LLMs to healthcare administration is the complexity and sensitivity of the work.
Clinical settings require AI to protect patient privacy, keep data accurate, and integrate smoothly with existing EHR systems.

Current evaluation methods must balance fast automated checks of accuracy and data format against expert human review of the model's reasoning and the clinical appropriateness of its answers.
A thorough evaluation also needs to simulate multi-round interactions in which the AI converses, makes decisions, and updates records over several steps.
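
A harness for that kind of multi-round interaction can be sketched as a loop with a hard round budget, mirroring the interaction caps used by agentic benchmarks. The agent, tool, and completion-check interfaces below are placeholders, not any specific benchmark's API.

```python
from typing import Callable

def run_episode(task: dict,
                agent_step: Callable[[dict, list], dict],
                execute_tool: Callable[[dict], str],
                is_done: Callable[[dict], bool],
                max_rounds: int = 8) -> list:
    """Run one multi-round task episode, stopping at a fixed round budget."""
    history: list = []
    for _ in range(max_rounds):
        action = agent_step(task, history)   # agent proposes a tool call or a final answer
        if is_done(action):                  # agent signals it has finished the task
            history.append(action)
            break
        observation = execute_tool(action)   # e.g. a FHIR GET/POST against the test EHR
        history.append({"action": action, "observation": observation})
    return history  # scored afterwards by automated checks and, where needed, human review
```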

Many healthcare datasets come from a small number of institutions, so models trained or evaluated on them may not generalize to patient populations across the country.
Stanford’s STARR data, for example, is valuable but reflects the workflows of a single health system.
Newer datasets and benchmarks are being developed to cover broader needs.

Beyond Single Institutions: Longitudinal EHR Datasets and Broader Benchmarks

Building models that support long-term patient care requires datasets that follow patients over time.
Stanford HAI has released three such datasets, EHRSHOT, INSPECT, and MedAlign, covering almost 26,000 unique patients and close to 300 million clinical events.
They include repeated visits, detailed clinical notes, diagnosis codes, and medical images, giving AI models richer longitudinal context.

The datasets are distributed in the OMOP CDM 5.4 format and the Medical Event Data Standard (MEDS) for efficient processing.
These formats help researchers build models that reason over events spanning many visits, which improves AI support for chronic disease management and care planning.
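
To give a feel for what event-level longitudinal data enables, the sketch below groups a flat event table by patient and orders it by time. The column names and codes are illustrative assumptions, not the actual OMOP CDM or MEDS field names.

```python
import pandas as pd

# Illustrative event table; real OMOP CDM / MEDS extracts use their own schemas.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2021-03-01", "2021-06-15", "2022-01-10", "2021-04-20", "2021-11-02"]),
    "event_code": ["DX:E11.9", "LAB:HbA1c", "RX:metformin", "DX:I10", "LAB:SBP"],
    "value": [None, 7.8, None, None, 142.0],
})

# Build a time-ordered history per patient, the kind of longitudinal
# context a model needs for chronic-disease management tasks.
histories = {
    pid: list(zip(group["event_time"], group["event_code"], group["value"]))
    for pid, group in events.sort_values("event_time").groupby("patient_id")
}
print(histories[1])  # all events for patient 1, oldest first
```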

Access to these datasets requires signed data use agreements and research ethics training.
These safeguards keep the use of sensitive health data safe and legal, a key consideration for medical managers overseeing AI projects.

Multi-Step Clinical Diagnostic Benchmarks: The MSDiagnosis Framework

Clinical diagnosis typically unfolds in stages: a preliminary primary diagnosis, consideration of alternatives, and a final confirmed diagnosis.
Many AI evaluations oversimplify this process and omit the intermediate steps.
Researchers in China created MSDiagnosis, a large multi-step diagnostic benchmark that captures the full sequence.

MSDiagnosis contains 2,225 medical records from 12 departments, reflecting complex diagnostic workflows.
Its framework has two parts: a forward inference module retrieves similar past cases to inform the diagnosis, and a backward inference and refinement module checks the proposed diagnosis against patient data to reduce errors.
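
The two-stage idea can be expressed as a simple control loop, sketched below with placeholder functions. This illustrates the forward/backward pattern described above; it is not the MSDiagnosis authors' implementation.

```python
from typing import Callable

def multi_step_diagnose(patient_record: dict,
                        retrieve_similar_cases: Callable[[dict], list],
                        propose_diagnosis: Callable[[dict, list], dict],
                        find_conflicts: Callable[[dict, dict], list],
                        refine_diagnosis: Callable[[dict, dict, list], dict],
                        max_refinements: int = 2) -> dict:
    """Forward inference proposes a diagnosis; backward inference checks and refines it."""
    # Forward inference: use similar past cases as supporting context.
    similar_cases = retrieve_similar_cases(patient_record)
    diagnosis = propose_diagnosis(patient_record, similar_cases)

    # Backward inference and refinement: compare the proposal with the
    # patient's documented findings and revise while conflicts remain.
    for _ in range(max_refinements):
        conflicts = find_conflicts(patient_record, diagnosis)
        if not conflicts:
            break
        diagnosis = refine_diagnosis(patient_record, diagnosis, conflicts)
    return diagnosis
```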

Evaluation focuses on clinical accuracy rather than standard language-generation scores such as BLEU or ROUGE.
It annotates disease-linked evidence and checks whether the model's diagnoses match documented patient findings.
Although built on Chinese data, the framework offers useful ideas for U.S. practices that want to assess LLMs for diagnostic support with EHR data.

AI and Workflow Automation: Enhancing Clinical Office Efficiency

In U.S. medical offices, administrative work consumes substantial resources and cuts into time for patient care.
Front-office tasks such as phone triage, appointment scheduling, and patient messaging are both essential and well suited to technology support.
AI offers solutions here, and companies like Simbo AI focus on front-office phone automation using conversational AI.

Connecting AI to EHR systems makes it possible to automate multi-step processes.
When a patient calls, for example, an AI agent can retrieve the patient's record, check available appointment slots, and update the chart note while staying within policy constraints.
This goes beyond simple chatbots by tying the AI to clinical data and administrative workflows, lowering error rates and freeing staff for more complex work.
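
A sketch of that kind of multi-step call flow is shown below. The lookup, scheduling, and note-writing calls are hypothetical stand-ins for whatever EHR and telephony integrations a given practice uses.

```python
def handle_scheduling_call(caller_phone: str,
                           requested_reason: str,
                           ehr_client,
                           calendar,
                           notifier) -> str:
    """Multi-step flow for an inbound scheduling call (all integrations are placeholders)."""
    # Step 1: identify the caller in the EHR by phone number.
    patient = ehr_client.find_patient_by_phone(caller_phone)
    if patient is None:
        return "transfer_to_staff"          # unknown caller: hand off to a human

    # Step 2: find open appointment slots that fit the visit reason.
    slots = calendar.available_slots(provider=patient["primary_provider"],
                                     reason=requested_reason)
    if not slots:
        notifier.queue_callback(patient["id"], reason=requested_reason)
        return "callback_scheduled"

    # Step 3: book the earliest slot and document the interaction in the chart.
    booking = calendar.book(patient["id"], slots[0])
    ehr_client.write_note(patient["id"],
                          f"Phone scheduling: booked {booking['time']} for {requested_reason}.")
    return "booked"
```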

Benchmarks such as MedAgentBench and AI Structured Clinical Examinations (AI-SCE) help test whether AI can safely manage multi-turn interactions with patients and clinical tools.
These evaluations mimic real clinic conditions, including limits on conversation rounds and strict output formats, to confirm the AI is ready for production use.

Healthcare managers and IT staff should vet AI systems against these benchmarks before adoption.
Testing against realistic multi-step workflows helps confirm whether an AI system can handle the pressures of office operations while keeping patients safe and data accurate.

The Role of Interdisciplinary Collaboration in Benchmark Development

Building strong benchmarks for clinical LLMs requires collaboration among clinicians, medical researchers, and computer scientists.
Such collaboration grounds the tests in real workflow needs while maintaining technical depth and safety requirements.

At UC San Diego Health, for example, teams are working to integrate GPT-4 into the MyChart patient portal, with clinicians reviewing AI-drafted messages and summaries for accuracy and reliability.
Stanford’s work likewise shows how collaboration produces shared standards such as FHIR compliance, bridging clinical use and computing requirements.

U.S. practices considering AI should support this kind of collaboration, ensuring that vendors build and test systems with people who understand both healthcare workflows and the limits of AI.
This helps prevent errors in multi-step tasks, where misused or misinterpreted data could cause harm.

Evaluating AI for EHR Tasks in a Regulated Environment

Healthcare in the U.S. is heavily regulated under laws such as HIPAA.
AI systems that work with EHRs must comply with these rules by protecting data security, preserving patient privacy, and maintaining audit trails of access.

Benchmarks like MedAgentBench and the longitudinal datasets described above address not only performance but also compliance, using de-identified data and requiring verified researcher access.
This shows that AI tools built and tested on these datasets can be developed while protecting sensitive health information, an important consideration for U.S. medical managers.

Using datasets and benchmarks built on standards such as FHIR and OMOP CDM also helps AI systems integrate smoothly with common EHR platforms such as Epic, Cerner, or Allscripts.
This lowers technical friction and cost when putting AI into real use.

Recent Trends in Large Language Model Performance in Healthcare

Recent research shows LLMs approaching human expert performance on some clinical tasks.
Google’s Med-PaLM 2 and GPT-4, for instance, have scored around 85% on medical licensing examinations such as the USMLE, reflecting improved diagnostic and reasoning ability.

Still, multi-step clinical workflows remain difficult.
Models may do well on closed-ended questions yet struggle to synthesize information across multiple patient visits or to handle complex lab orders, medication management, and referrals.

Real-world benchmarks are essential for translating raw LLM accuracy into useful clinical practice.
They surface problems such as hallucination, where a model produces plausible but incorrect information.
Benchmarks also guide model improvement through retraining, human review, and workflow redesign.

Practical Considerations for Medical Administrators and IT Managers

  • Task Complexity: Confirm the AI can complete the multi-step workflows common in your practice, not just answer simple questions.
  • Benchmark Validation: Prefer AI tools that have been tested on realistic clinical benchmarks such as MedAgentBench or AI-SCE.
  • Standards Compliance: Make sure the AI follows standards such as FHIR and OMOP CDM for easier integration and future growth.
  • Data Privacy and Security: Verify that training and testing datasets meet HIPAA requirements and review how vendors handle data.
  • Interdisciplinary Engagement: Involve clinicians, IT, and legal experts in evaluation and deployment to balance usability, safety, and regulatory compliance.
  • Workflow Impact: Understand how AI will change existing workflows, identifying opportunities to automate repetitive tasks such as phone triage or documentation without disrupting care.
  • Monitoring and Post-Deployment Support: Establish ongoing monitoring to catch AI errors or performance drift after deployment, with clear escalation paths to clinical staff when needed.

Medical practices in the United States are beginning to rely more on AI in clinical administration.
Designing and applying real-world benchmarks to evaluate LLMs on multi-step EHR tasks is an important step toward ensuring these tools are safe and useful.
By understanding the challenges and testing methods described here, healthcare administrators, owners, and IT managers can make better-informed decisions about AI.
This supports safer and more efficient patient care.

Frequently Asked Questions

What is MedAgentBench and who introduced it?

MedAgentBench is a real-world benchmark suite developed by Stanford University researchers designed to evaluate large language model (LLM) agents in healthcare settings through interaction with virtual EHR systems and multi-step clinical tasks.

Why is an agentic benchmark needed in healthcare AI?

Healthcare requires agentic AI benchmarks because traditional datasets test static reasoning, whereas agentic AI can interpret instructions, call APIs, and automate complex workflows, addressing staff shortages and administrative burdens in clinical environments.

What kinds of tasks does MedAgentBench include?

It contains 300 tasks across 10 medical categories such as patient data retrieval, lab tracking, documentation, test ordering, referrals, and medication management, each typically involving 2-3 multi-step actions.

What patient data supports MedAgentBench?

The benchmark uses 100 de-identified patient profiles from Stanford’s STARR repository with over 700,000 records, including labs, vitals, diagnoses, procedures, and medication orders, maintaining clinical validity.

How is the MedAgentBench environment designed?

It is FHIR-compliant, supporting both GET and POST operations on EHR data, allowing AI agents to simulate real clinical actions like documenting vitals and placing medication orders.

How are AI models evaluated using MedAgentBench?

Models are evaluated on task success rate (SR) using a strict pass@1 metric to ensure safety, with a baseline orchestration using nine FHIR functions and limited to eight rounds of interaction per task.

Which AI models performed best on MedAgentBench?

Claude 3.5 Sonnet v2 led with 69.67% success, excelling in retrieval (85.33%), followed by GPT-4o at 64.0%, and DeepSeek-V3 at 62.67% among open-weight models.

What common errors do healthcare AI agents make in MedAgentBench?

Two main failure types are instruction adherence failures, such as invalid API calls or incorrect JSON formatting, and output mismatch where agents produce full sentences instead of required structured numerical values.

What are the limitations of MedAgentBench?

Limitations include reliance on data from a single institution and a focus mainly on EHR-related workflows, which may limit generalizability across varied healthcare systems and tasks.

How does MedAgentBench contribute to future healthcare AI development?

It provides an open, reproducible, clinically relevant framework that measures real-world agentic AI performance beyond static QA, guiding improvements toward dependable AI agents for live clinical environments.