Large Language Models (LLMs) have performed well on many healthcare tasks. Research by Chihung Lin, PhD, and Chang-Fu Kuo, MD, PhD, shows these models can match or outperform humans on medical exams. In areas like dermatology, radiology, and ophthalmology, LLMs support diagnosis by interpreting text and, in some cases, images. They also help doctors by finding important information in notes and other text, which saves time and reduces paperwork.
LLMs can also help patients understand their health better. They explain medical ideas in clear and kind ways, which helps patients follow treatment plans and stay involved in their care.
Still, even with these advances, using LLMs in healthcare has limits that need close attention.
A review of 519 studies on healthcare LLMs found that only 5% tested models with real patient care data. Most used exam questions or made-up patient stories. This makes it hard to know if the results hold true in actual medical settings.
In the United States, medical offices and electronic health record (EHR) systems vary a lot. Because of this, doctors and managers are careful about using LLMs without real-world testing. For example, the MedAlign study looked at LLM answers tied to real EHR data. It found LLMs might be helpful but need strict, ongoing checks to avoid unsafe advice.
News reports have also shown safety concerns, like AI giving dangerous advice to patients. These problems make healthcare workers less willing to fully trust LLM tools without careful supervision.
Different medical fields use different information, terms, and workflows. Many LLM studies focus only on common fields like general medicine or radiology.
Some specialties, like nuclear medicine, physical medicine, and medical genetics, involve complex work that LLMs may not handle well. These fields require specialized knowledge and careful attention to detailed diagnostic and treatment steps that general LLM training might miss.
Also, some specialties use unique ways to write and talk about cases. Psychiatry depends on patient stories and feelings. Pediatrics tracks growth and development closely. Allied health workers have their own terms for therapy and rehab. Without special training and testing, LLMs might give wrong or off-topic answers, risking patient care.
So, medical managers in the U.S. should know that one model does not fit all. LLMs should be chosen and tested to match the needs of each specialty.
Checking how well LLMs work in healthcare is not simple. Important factors include accuracy, fairness, bias, robustness, and practical concerns such as runtime and cost.
Accuracy gets checked most often, but fairness and bias are also very important and often get less attention. LLMs learn from data that can carry unfair biases from society. This can lead to unequal treatment based on race, gender, or income. Some studies use a method called “Constitutional AI,” where an AI reviewer guided by human ethical principles checks outputs for bias. But this approach is still being developed.
Also, there is no agreement on ways to measure LLM quality. Metrics like ROUGE or BERTScore work well for general language tasks but often miss serious medical mistakes, like made-up facts or missing key details. This means AI can look good on tests but still cause problems in real healthcare.
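As a rough illustration of this gap, the toy example below computes a simple unigram-recall score of the kind ROUGE-style metrics build on; the sentences and the score are invented for this sketch, not taken from any study.

```python
# Toy illustration of how overlap-based metrics can hide a clinical error.
# unigram_recall is a simplified ROUGE-1-style recall, kept inline so the
# example stays self-contained; the sentences below are made up.

def unigram_recall(reference: str, candidate: str) -> float:
    """Fraction of reference words that also appear in the candidate."""
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    return sum(tok in cand_tokens for tok in ref_tokens) / len(ref_tokens)

reference = "do not give penicillin because the patient has a severe allergy"
candidate = "give penicillin because the patient has a severe allergy"  # drops the negation

# Prints about 0.82, even though the clinical meaning is inverted.
print(round(unigram_recall(reference, candidate), 2))
```

A candidate summary that drops a single negation keeps most of its word overlap with the reference, so the score stays high while the advice becomes dangerous.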
No matter how good AI gets, health experts are needed to review AI outputs before using them with patients. In the U.S., laws like HIPAA and clinical rules require careful control and responsibility.
Training clinicians is important so they can review AI outputs, recognize errors quickly, and understand the limits of these tools.
Collaboration among doctors, AI developers, and healthcare managers helps AI support clinicians instead of replacing them.
LLMs do more than help with medical decisions. They also automate office tasks. This includes answering phones, handling patient questions, booking appointments, refilling prescriptions, and writing notes. For example, Simbo AI uses AI to handle front-office phone calls. This can improve patient communication and reduce staff workload.
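As a simplified sketch of how this kind of front-office automation can be structured (this is not Simbo AI's implementation; the intents, keywords, and function names are hypothetical), a transcribed patient request can be routed to the right workflow:

```python
# Minimal sketch of routing a transcribed patient request to a workflow.
# In practice an LLM or other NLU service would classify the intent; simple
# keyword rules are used here only to keep the example self-contained.

KEYWORDS = {
    "appointment": ["appointment", "schedule", "reschedule", "book"],
    "refill": ["refill", "prescription", "pharmacy"],
    "billing": ["bill", "invoice", "payment", "insurance"],
}

def classify_intent(message: str) -> str:
    """Return a coarse intent label, or 'staff_review' when unsure."""
    text = message.lower()
    for intent, words in KEYWORDS.items():
        if any(word in text for word in words):
            return intent
    return "staff_review"  # anything ambiguous goes to a human

print(classify_intent("Hi, I need to reschedule my appointment for next week."))
# -> "appointment"
```

Keeping a staff_review fallback for anything the system cannot classify reflects the human-oversight theme discussed throughout this article.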
Many U.S. doctors feel burned out because of too much paperwork. AI can take over routine tasks so staff can focus more on patients and make operations smoother.
One helpful tool is AI “ambient scribes” that listen to doctor-patient talks, write notes, and put them into EHRs. Studies in 2023 and 2024 show these scribes can cut visit time by over 26%, reduce note-taking by more than 50%, and lower after-work paperwork by 61%. Providers say these tools help reduce mental stress and improve connection with patients.
But studies on AI scribes mostly use synthetic data and cover only a few specialties. Good evaluations need both automated scores and clinician review to confirm the tools work well and safely. U.S. managers choosing AI tools should consider privacy, ease of use, specialty fit, and secure integration.
When using AI, healthcare offices in the U.S. must follow strict rules. HIPAA protects patient privacy and controls how health data is used and shared. AI must keep data safe in both training and daily use.
Privacy and security must be tested regularly. Staff should be trained to spot AI mistakes quickly and have plans in place to pause or correct AI advice when needed.
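One concrete safeguard is masking obvious identifiers before any text is sent to an outside AI service. The sketch below is illustrative only: the patterns and the sample note are made up, and rules like these miss names and many other identifiers, so real HIPAA de-identification still requires a vetted tool and a formal compliance review.

```python
# Illustrative pre-processing step that masks obvious identifiers before text
# leaves the practice's systems. This is NOT a complete de-identification
# solution: names, addresses, and many other identifiers slip through.
import re

PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.\w{2,}\b"),
}

def mask_identifiers(text: str) -> str:
    """Replace matched identifiers with labeled placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Call the patient at 415-555-0132 about the refill, MRN: 889231."
print(mask_identifiers(note))
# -> "Call the patient at [PHONE] about the refill, [MRN]."
```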
To improve LLMs in U.S. healthcare, future work should include more specialties like rare diseases, medical genetics, nuclear medicine, and allied health. Making models and tests fit each specialty will improve safety and usefulness.
Also, combining text, images, and other data types in LLMs can help with better diagnosis and treatment. For example, using CT scans with patient records might help doctors diagnose cancer or brain problems better.
The “Constitutional AI” method, where human values guide AI evaluation, may help monitor bias and safety without requiring too much manual effort. Medical managers and IT leaders should work with AI companies ready to include doctor feedback when updating models.
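A minimal sketch of what such a principle-guided review loop could look like is shown below; the principles are illustrative and call_reviewer_model is a hypothetical placeholder, not an established framework or a specific vendor's API.

```python
# Minimal sketch of a "Constitutional AI"-style review loop: a reviewer model
# checks each draft answer against human-written principles before release.
# The principles are examples only, and call_reviewer_model stands in for
# whatever LLM client an organization actually uses.

PRINCIPLES = [
    "Do not recommend medication changes without directing the patient to a clinician.",
    "Do not vary tone or thoroughness based on race, gender, or insurance status.",
    "State uncertainty instead of guessing when information is missing.",
]

def call_reviewer_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with your organization's API client."""
    raise NotImplementedError

def review_draft(draft: str) -> list[str]:
    """Return the principles the draft appears to violate, for human follow-up."""
    violations = []
    for principle in PRINCIPLES:
        verdict = call_reviewer_model(
            f"Principle: {principle}\n"
            f"Draft answer: {draft}\n"
            "Does the draft violate this principle? Reply YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            violations.append(principle)
    return violations
```

Drafts with flagged violations would go to a clinician for review rather than being released automatically.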
Hospital leaders, clinic owners, and IT managers in the United States should keep a few points in mind when using LLMs: match tools to the needs of each specialty, ask for real-world validation, monitor fairness and bias, keep clinicians in the review loop, and protect patient data under HIPAA.
Careful and responsible testing and use of LLMs will help realize their benefits safely in U.S. healthcare. From meeting specialty needs to ethics and workflow, medical leaders should adopt AI carefully to improve patient care and operations.
The main purpose is to enhance the efficiency of handling patient inquiries specific to specialties, reducing physician workload and improving patient interactions.
Current AI models, particularly LLMs, have shown safety errors in responses, which can pose risks to patient care.
Real patient care data helps assess the AI’s performance accurately, ensuring its responses are relevant and safe for clinical use.
Common tasks include medical knowledge enhancement, diagnostics, and treatment recommendations, but there is less focus on administrative tasks that can alleviate physician burnout.
Challenges include a lack of consensus on evaluation dimensions, reliance on curated data instead of real-world cases, and the need for more specialty-specific evaluations.
‘Constitutional AI’ refers to the use of AI agents that adhere to human-defined principles, aiming to improve the evaluation and decision-making process.
Utilizing agents with human preferences can automate evaluation processes, allowing for faster and more efficient assessments of AI outputs.
Specialties like nuclear medicine, physical medicine, and medical genetics are notably underrepresented, yet they require tailored evaluations due to their unique priorities.
Key dimensions include accuracy, fairness, bias, robustness, and considerations for real-world deployment such as inference runtime and cost efficiency.
With improved evaluation, training on diverse real-world data, and attention to specialty-specific needs, LLMs could significantly reduce physician workload and enhance patient outcomes.