Evaluating the Challenges Faced by Large Language Models: Focusing on Specialty-Specific Needs and Real-World Performance

Large Language Models (LLMs) have performed well on many healthcare tasks. Research by Chihung Lin, PhD, and Chang-Fu Kuo, MD, PhD, shows these models can match or outperform humans on medical exams. In fields like dermatology, radiology, and ophthalmology, LLMs support diagnosis by interpreting text and, in some cases, images. They also help doctors by finding important information in notes and other records, which saves time and reduces paperwork.

LLMs can also help patients understand their health. They explain medical concepts in clear, compassionate language, which helps patients follow treatment plans and stay engaged in their care.

Still, even with these advances, using LLMs in healthcare has limits that need close attention.

The Gap Between Promise and Real-World Application

A review of 519 studies on healthcare LLMs found that only 5% tested models with real patient care data. Most relied on exam questions or fabricated patient vignettes. This makes it hard to know whether the results hold up in actual medical settings.

In the United States, medical practices and electronic health record (EHR) systems vary widely. Because of this, doctors and administrators are cautious about adopting LLMs without real-world testing. For example, the MedAlign study evaluated LLM answers tied to real EHR data. It found that LLMs can be helpful but need strict, ongoing checks to avoid unsafe advice.

News reports have also raised safety concerns, such as AI giving dangerous advice to patients. These incidents make healthcare workers less willing to fully trust LLM tools without careful supervision.

Specialty-Specific Challenges for LLMs

Different medical fields use different information, terms, and workflows. Many LLM studies focus only on common fields like general medicine or radiology.

Some specialties, like nuclear medicine, physical medicine, and medical genetics, involve complex work that LLMs may not handle well. They require specialized knowledge and careful attention to detailed diagnostic and treatment steps that general LLM training might miss.

Also, some specialties use unique ways to write and talk about cases. Psychiatry depends on patient stories and feelings. Pediatrics tracks growth and development closely. Allied health workers have their own terms for therapy and rehab. Without special training and testing, LLMs might give wrong or off-topic answers, risking patient care.

So, medical managers in the U.S. should know that one model does not fit all. LLM selection and testing should match the needs of each specialty.

Evaluating LLMs: Dimensions and Difficulties

Checking how well LLMs work in healthcare is not simple. Important factors include the following (a short sketch after the list shows one way to track them together):

  • Accuracy: Does it give correct medical information?
  • Fairness and Bias: Does it treat all patient groups fairly and avoid stereotypes?
  • Robustness: Can it handle unclear or missing data well?
  • Toxicity: Does it avoid harmful or inappropriate language?
  • Speed and Cost: Does it answer quickly without needing too much computing power?

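These dimensions are most useful when they are recorded side by side rather than reported one at a time. Below is a minimal sketch, in Python, of one way to track them together; the field names, 0-to-1 scales, and thresholds are illustrative assumptions, not an accepted standard.

```python
# Minimal sketch: record several evaluation dimensions together instead of
# reporting accuracy alone. Scales and thresholds below are illustrative.
from dataclasses import dataclass

@dataclass
class EvalResult:
    model_name: str
    accuracy: float        # fraction of medically correct answers (0-1)
    fairness_gap: float    # worst-case score gap across patient groups (0-1)
    robustness: float      # score on inputs with missing or unclear data (0-1)
    toxicity_rate: float   # fraction of outputs with harmful language (0-1)
    latency_ms: float      # average response time in milliseconds

    def deployment_ready(self) -> bool:
        # Example thresholds only; real cutoffs belong to clinical governance.
        return (self.accuracy >= 0.95
                and self.fairness_gap <= 0.02
                and self.toxicity_rate == 0.0
                and self.latency_ms <= 2000)

result = EvalResult("example-model", 0.91, 0.05, 0.80, 0.0, 850)
print(result.deployment_ready())  # False: accuracy and fairness fall short
```
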
Accuracy is checked most often, but fairness and bias are just as important and often get less attention. LLMs learn from data that can carry society's unfair biases, which can lead to unfair treatment based on race, gender, or income. Some studies use a method called “Constitutional AI,” in which an AI reviewer guided by human-written ethical principles checks outputs for bias. This approach is still being developed.
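
As a rough illustration, the sketch below shows what a Constitutional AI-style review loop might look like: a second model pass critiques a draft answer against human-written principles and revises it if needed. The principles and the call_llm function are hypothetical placeholders, not a published implementation.

```python
# Sketch of a "Constitutional AI"-style review pass. The principles are
# examples, and call_llm is a stand-in for any chat-model API.
PRINCIPLES = [
    "Do not let recommendations differ by race, gender, or income.",
    "Flag uncertainty instead of guessing when clinical details are missing.",
]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a real model call.")

def constitutional_review(draft_answer: str) -> str:
    rules = "\n".join(f"- {p}" for p in PRINCIPLES)
    critique = call_llm(
        f"Principles:\n{rules}\n\nAnswer:\n{draft_answer}\n\n"
        "List any violations of the principles, or reply 'none'."
    )
    if critique.strip().lower() != "none":
        # Revise the answer only when the critique found problems.
        return call_llm(
            f"Rewrite this answer to fix these issues:\n{critique}\n\n"
            f"Answer:\n{draft_answer}"
        )
    return draft_answer
```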

Also, there is no consensus on how to measure LLM quality. Metrics like ROUGE or BERTScore work well for general language tasks but often miss serious medical mistakes, such as fabricated facts or missing key details. An AI can look good on these tests and still cause problems in real healthcare.
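
The sketch below makes this concrete with a simplified unigram-overlap score (in the spirit of ROUGE-1): a note containing a ten-fold dosage error still scores high because almost every word matches the reference.

```python
# Simplified word-overlap F1 score, to show why overlap metrics can
# miss clinically serious errors.
def unigram_f1(reference: str, candidate: str) -> float:
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum(min(ref.count(w), cand.count(w)) for w in set(cand))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "start metoprolol 25 mg twice daily for hypertension"
hallucinated = "start metoprolol 250 mg twice daily for hypertension"

# ~0.88 despite a 10x dose error -- a high score that hides real harm.
print(round(unigram_f1(reference, hallucinated), 2))
```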

The Role of Human Oversight and Clinical Training

No matter how good AI gets, health experts must review AI outputs before they are used with patients. In the U.S., laws like HIPAA and clinical governance rules require careful oversight and accountability.

Training clinicians is important for:

  • Understanding AI results carefully.
  • Spotting errors or bias.
  • Using AI advice wisely with their own judgment.

When doctors, AI developers, and healthcare managers work together, AI can support clinicians instead of replacing them.

AI and Workflow Automation in Healthcare Practices

LLMs do more than support medical decisions. They also automate office tasks, including answering phones, handling patient questions, booking appointments, processing prescription refills, and drafting notes. For example, Simbo AI uses AI to handle front-office phone calls. This can improve patient communication and reduce staff workload.
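
As a rough illustration of this kind of automation, the sketch below routes incoming requests by intent so routine tasks are handled automatically and everything else reaches a person. The intents and keyword rules are illustrative assumptions, not a description of how Simbo AI's product works.

```python
# Sketch: route front-office requests by intent; humans handle the rest.
def classify_intent(transcript: str) -> str:
    text = transcript.lower()
    if "appointment" in text or "schedule" in text:
        return "schedule"
    if "refill" in text or "prescription" in text:
        return "refill"
    return "other"

def route_call(transcript: str) -> str:
    intent = classify_intent(transcript)
    if intent == "schedule":
        return "Sent to scheduling workflow"
    if intent == "refill":
        return "Queued refill for clinician approval"
    return "Transferred to front-desk staff"  # never auto-handle the unknown

print(route_call("Hi, I need to book an appointment next week"))
```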

Many U.S. doctors feel burned out because of too much paperwork. AI can take over routine tasks so staff can focus more on patients and make operations smoother.

One helpful tool is the AI “ambient scribe,” which listens to doctor-patient conversations, drafts notes, and enters them into EHRs. Studies from 2023 and 2024 show these scribes can cut visit time by over 26%, reduce note-taking by more than 50%, and lower after-hours paperwork by 61%. Providers report that these tools reduce mental strain and improve connection with patients.
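
The sketch below shows the general shape of such a pipeline, with a clinician sign-off step before anything is filed. The transcribe_audio, call_llm, and write_to_ehr functions are hypothetical stubs, not any vendor's actual API.

```python
# Sketch of an ambient-scribe flow: transcript in, draft note out,
# clinician approval required before the EHR is touched.
def transcribe_audio(audio_path: str) -> str:
    raise NotImplementedError  # stand-in for a speech-to-text service

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for a chat-model API

def write_to_ehr(patient_id: str, note: str) -> None:
    raise NotImplementedError  # stand-in for an EHR integration

def draft_visit_note(audio_path: str) -> str:
    transcript = transcribe_audio(audio_path)
    return call_llm(
        "Summarize this visit as a SOAP note. Do not add facts "
        f"that are not in the transcript:\n{transcript}"
    )  # the draft is held for clinician review, never auto-filed

def file_after_review(patient_id: str, approved_note: str) -> None:
    # Only a note a clinician has reviewed and approved is recorded.
    write_to_ehr(patient_id, approved_note)
```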

But studies of AI scribes mostly use simulated data and cover only a few specialties. Sound evaluations need both automated metrics and clinician review to confirm the tools work well and safely. U.S. managers choosing AI tools should weigh privacy, ease of use, specialty fit, and secure integration.

Regulatory and Compliance Considerations

When using AI, healthcare offices in the U.S. must follow strict rules. HIPAA protects patient privacy and controls how health data is used and shared. AI must keep data safe in both training and daily use.

Privacy and security must be tested regularly. Staff should be trained to spot AI mistakes quickly and have plans in place to halt or correct AI advice when needed.
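
One routine check can be sketched simply: scan AI output for obvious identifiers before it leaves the system. The patterns below are illustrative; real HIPAA compliance requires far more than a regex pass.

```python
# Simplified sketch of a pre-release scan for obvious identifiers.
import re

PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}

def phi_findings(text: str) -> list[str]:
    """Return the names of any identifier patterns found in the text."""
    return [name for name, pat in PHI_PATTERNS.items() if pat.search(text)]

out = "Patient callback at 555-867-5309 regarding MRN: 00123456."
print(phi_findings(out))  # ['phone', 'mrn'] -> block output and escalate
```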

Addressing Underrepresented Specialties and Future Directions

To improve LLMs in U.S. healthcare, future work should include more specialties like rare diseases, medical genetics, nuclear medicine, and allied health. Making models and tests fit each specialty will improve safety and usefulness.

Also, combining text, images, and other data types in LLMs can support better diagnosis and treatment. For example, pairing CT scans with patient records might help doctors diagnose cancer or neurological conditions more accurately.

The “Constitutional AI” method, where human values guide AI evaluation, may help monitor bias and safety without requiring too much manual effort. Medical managers and IT leaders should work with AI companies ready to include doctor feedback when updating models.

Practical Takeaways for U.S. Medical Practice Leaders

Hospital leaders, clinic owners, and IT managers in the United States should keep these points in mind when using LLMs:

  • Test on Real Data: Use real patient information to check models, not just exams or made-up cases.
  • Specialty Checks: Make sure tests and training match the medical fields in the practice.
  • Multiple Metrics: Check accuracy, bias, fairness, and efficiency together.
  • Clinician Training: Train staff to understand AI and give them ways to correct errors.
  • Data Privacy: Follow HIPAA and other laws strictly to keep patient data safe.
  • Workflow Tools: Consider tools like Simbo AI to reduce office work and improve patient contact safely.

Careful, responsible testing and deployment of LLMs will help deliver their benefits safely in U.S. healthcare. From specialty needs to ethics and workflow, medical leaders should adopt AI deliberately to improve patient care and operations.

Frequently Asked Questions

What is the main purpose of training AI answering services in healthcare?

The main purpose is to enhance the efficiency of handling patient inquiries specific to specialties, reducing physician workload and improving patient interactions.

What gaps exist in current AI answering services?

Current AI models, particularly LLMs, have shown safety errors in responses, which can pose risks to patient care.

Why is real patient care data crucial for training AI?

Real patient care data helps assess the AI’s performance accurately, ensuring its responses are relevant and safe for clinical use.

What are the common tasks LLMs are evaluated on?

Common tasks include medical knowledge enhancement, diagnostics, and treatment recommendations, but there is less focus on administrative tasks that can alleviate physician burnout.

What are some challenges in evaluating LLMs for healthcare?

Challenges include a lack of consensus on evaluation dimensions, reliance on curated data instead of real-world cases, and the need for more specialty-specific evaluations.

What is ‘Constitutional AI’?

‘Constitutional AI’ refers to the use of AI agents that adhere to human-defined principles, aiming to improve the evaluation and decision-making process.

How can the AI evaluation process be scaled?

Utilizing agents with human preferences can automate evaluation processes, allowing for faster and more efficient assessments of AI outputs.

What specialties are underrepresented in LLM evaluations?

Specialties like nuclear medicine, physical medicine, and medical genetics are notably underrepresented, yet they require tailored evaluations due to their unique priorities.

What dimensions need to be prioritized for LLM evaluation?

Key dimensions include accuracy, fairness, bias, robustness, and considerations for real-world deployment such as inference runtime and cost efficiency.

What is the potential future for LLMs in healthcare?

With improved evaluation, training on diverse real-world data, and attention to specialty-specific needs, LLMs could significantly reduce physician workload and enhance patient outcomes.