Large language models are advanced AI systems that can understand and create human-like text using a lot of data. In healthcare, they help with many tasks, such as answering patient questions, assisting with clinical notes, supporting clinical decisions, and helping with medical education like exam prep. But there are still many challenges in testing them before using them widely, especially in hospitals, where mistakes can be serious.
Many studies (about 95%) use made-up or simple data like exam questions or example cases instead of real patient information. Only about 5% of studies use real clinical data. This means we don’t fully understand how these models work in actual practice. Patient information and hospital workflows in the US are complex, so the models need detailed testing using real medical records.
Groups like Stanford Human-Centered Artificial Intelligence (Stanford HAI) work on setting real-world tests for healthcare AI. They focus on continuous checks in real clinical settings to lower risks from wrong or unsafe AI advice. For example, a STATnews report found some AI messages to patients gave risky advice, showing the need for careful testing before using these tools in patient communication.
Testing large language models means more than just checking if they give right answers on medical exams or diagnoses. A good testing process looks at many points:
It is also important to test the AI in real clinical workflows, which are often complex. Most studies use tasks with clear, fixed answers, but real healthcare has open-ended questions and multiple types of data, like text and images. Testing should match these real-world complexities.
Hospitals and clinics in the US have many rules meant to keep patients safe and protect privacy. AI tools like large language models must follow these rules. A good evaluation framework means having clear, repeatable steps to check AI performance on:
Researchers like Michael Wornow, Jason Fries, and Nigam Shah say AI systems need constant evaluation using feedback from real users. This way, they can be improved step by step. AI changes fast, and this method helps keep up better than slow traditional tests.
Making these frameworks work requires teamwork. IT staff must keep data safe and help connect the AI. Doctors give advice on what to test. Administrators manage resources and rules.
One useful application of large language models is helping with front-office tasks like scheduling appointments, answering patient calls, handling billing questions, and making referrals. These non-medical tasks take up a lot of time and add to doctor burnout. Studies and surveys from the American Medical Association (AMA) show that automating these could help doctors spend more time with patients.
Simbo AI is a company that uses AI to handle these front-office phone tasks. Their AI agents understand patient requests, reply correctly, and manage scheduling or information without needing a person. This lowers the time doctors and staff spend on routine phone calls.
By automating these tasks, US healthcare providers can reduce phone hold times, make patients happier, and cut down on costs. This also helps reduce burnout by taking away repetitive work from both clinical and office staff.
However, the AI used must be reliable and safe. Automation must be checked for accuracy and follow privacy laws like HIPAA. This means evaluation systems must keep watching AI performance in different real-world situations.
Even with benefits, there are problems when using large language models in healthcare:
Stanford HAI has made progress with tools that grade LLMs and create benchmarks. Their “Constitutional AI” uses AI to spot and flag problems, which helps reduce the manual work doctors need to check AI results.
Healthcare in the US is very diverse, with many kinds of hospitals, clinics, and specialty centers in cities or rural areas. Evaluation methods must reflect this variety. For example:
Most research has not tested LLMs enough in these areas. This raises questions about whether one general AI model can safely cover all specialties.
Also, different states in the US have unique rules and insurance billing needs. These differences should be part of evaluations for AI in administrative tasks.
Using large language models in US healthcare could improve both clinical decisions and office work. But careful, step-by-step checking of safety, fairness, trustworthiness, and workflow fit is needed. With well-planned testing, continuous monitoring, and gradual use, hospitals and clinics can better handle AI in patient care and administration.
Challenges include safety errors in AI-generated responses, lack of real-world evaluation data, the emergent and rapidly evolving nature of generative AI that disrupts traditional implementation pathways, and the need for systematic evaluation to ensure clinical reliability and patient safety.
Systematic evaluation ensures LLMs are accurate, safe, and effective by rigorously testing on real patient data and diverse healthcare tasks, preventing harmful advice, addressing bias, and establishing trustworthy integration into clinical workflows.
Most evaluations use curated datasets like medical exam questions and vignettes, which lack the complexity and variability of real patient data. Only 5% of studies employ actual patient care data, limiting the real-world applicability of results.
LLMs mostly focus on medical knowledge enhancement (e.g., exams like USMLE), diagnostics (19.5%), and treatment recommendations (9.2%). Non-clinical administrative tasks—billing, prescriptions, referrals, clinical note writing—have seen less evaluation despite their potential impact on reducing clinician burnout.
By automating non-clinical and administrative tasks such as patient message responses, clinical trial enrollment screening, prescription writing, and referral generation, AI agents can free physicians’ time for higher-value clinical care, thus reducing workload and burnout.
Important dimensions include fairness and bias mitigation, toxicity, robustness to input variations, inference runtime, cost-efficiency, and deployment considerations, all crucial to ensure safe, equitable, and practical clinical use.
Since LLMs replicate biases in training data, unchecked bias can perpetuate health disparities and stereotypes, potentially causing harm to marginalized groups. Ensuring fairness is essential to maintain ethical standards and patient safety.
Use of AI agent evaluators guided by human preferences (‘Constitutional AI’) allows automated assessment at scale, such as detecting biased or stereotypical content, reducing reliance on costly manual review and accelerating evaluation processes.
Different medical subspecialties have unique clinical priorities and workflows; thus, LLM performance and evaluation criteria should be tailored accordingly. Some specialties like nuclear medicine and medical genetics remain underrepresented in current research.
Recommendations include developing rigorous and continuous evaluation frameworks using real-world data, expanding task coverage beyond clinical knowledge to administrative functions, addressing fairness and robustness, incorporating user feedback loops, and standardizing evaluation metrics across specialties.