Vision-language models are a class of artificial intelligence systems that interpret images and text together. In pathology, whole slide images (WSIs) are high-resolution digital scans of biopsy and tissue samples that reveal cellular detail and disease features. Pathologists traditionally review these images manually, a process that is time-consuming and prone to error.
Models like HistoGPT, the vision-language model at the core of Metrum AI's healthcare assistant, help labs analyze these detailed images using vision-to-language methods. HistoGPT can process large, high-resolution images, identify disease features, and write reports that clinicians can readily understand. It also works alongside other AI components, such as voice transcription and large language models, to streamline the entire process from image review to report generation.
Because it combines image and language processing in a single system, it can make results more consistent, give pathologists precise disease information, and reduce administrative overhead. Labs that adopt these models can turn reports around faster and spend less time on manual note-taking.
In the U.S., pathology caseloads are high, particularly for conditions such as skin cancer, placing sustained pressure on physicians and laboratory staff. Accurate, timely diagnosis is essential for patients to receive prompt care, yet labs frequently face delays because many steps remain manual.
Using vision-language models to automate image analysis helps hospitals and clinics handle larger patient volumes without sacrificing quality. Metrum AI's healthcare assistant, for example, combines four AI models, including HistoGPT for image interpretation, OpenAI Whisper for speech-to-text transcription, and large language models such as Llama 3.1 70B for generating patient summaries. The system processes voice, text, and image data concurrently on high-performance hardware such as Dell's PowerEdge XE9680 server, which carries eight AMD Instinct MI300X accelerators.
This configuration delivers up to 10.4 petaflops of combined BF16/FP16 compute across its eight accelerators and 1.5TB of fast HBM3 memory. The assistant can analyze pathology images, transcribe spoken clinical sessions, and produce complete reports in real time, shortening turnaround so clinics can serve more patients and deliver care faster.
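To make the flow concrete, here is a minimal orchestration sketch in Python. Only the Whisper call reflects a real library (openai-whisper); `analyze_wsi` and `summarize` are hypothetical stand-ins for the HistoGPT and Llama 3.1 steps, since the source does not publish Metrum AI's internal APIs.

```python
import whisper  # openai-whisper: pip install openai-whisper

def analyze_wsi(slide_path: str) -> str:
    """Stand-in for the HistoGPT whole-slide-image step (API not public)."""
    return f"[pathology report for {slide_path} would be generated here]"

def summarize(transcript: str, histo_report: str) -> str:
    """Stand-in for a Llama 3.1 70B summarization call served via vLLM."""
    return f"Summary of session notes and findings:\n{transcript}\n{histo_report}"

def run_session(audio_path: str, slide_path: str) -> str:
    # 1. Speech-to-text with Whisper (real openai-whisper API).
    transcript = whisper.load_model("base").transcribe(audio_path)["text"]
    # 2. Disease report from the whole slide image (hypothetical step).
    histo_report = analyze_wsi(slide_path)
    # 3. Combined patient summary from the language model (hypothetical step).
    return summarize(transcript, histo_report)
```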
The use of vision-language models in pathology touches many steps of the clinical workflow. For medical administrators and IT staff, the integration works like this: a user selects a patient, starts a session, uploads clinical audio for transcription, reviews the transcript, generates a patient summary that combines the notes with the pathology report, and saves the final report.
This automated process cuts down on paperwork by replacing the manual typing and note-taking that commonly introduce errors and consume staff time.
The system depends on server hardware built for large AI workloads. Dell's PowerEdge XE9680 servers use AMD Instinct MI300X accelerators to supply the memory capacity and throughput needed to run many AI applications at once. Each accelerator carries 192GB of fast HBM3 memory, and the eight accelerators together reach up to 10.4 petaflops, enough for hospitals to deploy large models such as Llama 3.1 with 70 billion parameters.
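A quick back-of-the-envelope check shows why 192GB per accelerator matters: a 70-billion-parameter model stored in 16-bit precision needs roughly 140GB just for its weights, which fits on a single MI300X with room left for the KV cache. A minimal sketch using only the figures quoted above:

```python
params = 70e9            # Llama 3.1 70B parameter count
bytes_per_param = 2      # FP16/BF16: 2 bytes per parameter
weights_gb = params * bytes_per_param / 1e9
print(f"Weights: ~{weights_gb:.0f} GB")  # ~140 GB

hbm_per_gpu_gb = 192     # MI300X HBM3 per accelerator
print(f"Headroom on one MI300X: ~{hbm_per_gpu_gb - weights_gb:.0f} GB")  # ~52 GB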
This hardware lets healthcare organizations serve several complex AI models (vision-language, text embeddings, and voice transcription) from a single server without degraded latency or reliability. The large memory and compute capacity support high volumes of pathology images and patient data, and the system can sustain many concurrent clinical sessions, keeping workflows fast.
Retrieval-Augmented Generation (RAG) is a technique that lets language models retrieve facts from external knowledge bases at response time. In healthcare, this grounding is essential to keep AI output accurate, relevant, and aligned with current medical guidance.
In pathology, an AI system that analyzes an image and drafts a report can also draw on recent medical literature, treatment guidelines, and prior patient records. This improves its recommendations and guards against stale or missing information that could compromise care. Frameworks such as LlamaIndex manage this retrieval step during report generation, keeping the output evidence-based.
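As a rough illustration of the retrieval step, here is a minimal LlamaIndex sketch. The `guidelines/` folder and the query string are hypothetical, and a production deployment would swap the default in-memory store for the MilvusDB vector database named later in the solution stack:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load reference material (hypothetical local folder of guideline documents).
documents = SimpleDirectoryReader("guidelines/").load_data()

# Embed and index the documents; uses the configured embedding model
# and defaults to an in-memory vector store.
index = VectorStoreIndex.from_documents(documents)

# At report time, retrieve relevant passages so the LLM grounds its answer.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query(
    "Current staging criteria for basal cell carcinoma?"  # hypothetical query
)
print(response)
```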
Deploying vision-language models in hospitals goes beyond image interpretation. These systems also support broader workflow and practice management, helping administrative staff and practice owners run operations and manage patients more effectively.
Main workflow benefits include faster report turnaround, less manual documentation, fewer transcription errors, and more consistent reporting. Applied this way, AI can improve how clinics and hospitals run at every level, from general practice offices to specialized pathology labs.
Although skin cancer and dermatology are the flagship examples, AI vision-language models benefit many other medical fields. Specialties such as radiology, cardiology, oncology, and pathology can all combine automated image interpretation with voice and text processing.
Metrum AI's healthcare assistant combines four AI components (a vision-language model, text embeddings, voice transcription, and a large language model) on a single high-performance system. This not only accelerates diagnosis but also allows continual model updates to maintain accuracy in fast-changing clinical settings.
For U.S. healthcare systems managing heavy patient volumes, AI tools like these can improve patient care and ease the strain on staff.
Healthcare leaders, practice owners, and IT staff considering vision-language AI tools should weigh several factors: the server hardware required to run large models, integration with existing systems such as electronic health records and DICOM image management, keeping retrieval sources current so outputs stay evidence-based, and the regulatory oversight that clinical AI requires.
AI technology continues to advance, and newer systems are being designed to operate more autonomously and flexibly in medical care. These systems combine multiple data types with more sophisticated reasoning to improve diagnosis, treatment planning, and patient monitoring.
In the near future, AI may go beyond reading images and transcribing speech to predicting patient outcomes, optimizing resource planning, and even assisting in robot-assisted procedures. Pursued carefully, with appropriate regulation and clinical collaboration, these capabilities could broaden access to quality healthcare across the U.S., including in underserved areas with few physicians.
By adopting vision-language models and related AI tools in pathology, U.S. medical practices can reduce diagnostic delays and administrative burden while making reports more accurate and consistent. These changes can translate into better patient care and more manageable workloads for healthcare workers, even as patient volumes grow.
Metrum AI's solution is an AI-powered healthcare assistant that integrates multiple data types, such as voice, text, and images, using Retrieval-Augmented Generation (RAG) to analyze pathology images, transcribe clinical audio, and generate comprehensive patient summaries, thereby improving clinical workflows and patient outcomes.
The Dell PowerEdge XE9680 server, equipped with eight AMD Instinct MI300X accelerators carrying 192GB of HBM3 memory each, provides the exceptional memory capacity and computational power needed to deploy large models like Llama 3.1 70B. It serves multiple AI models simultaneously, handling the language, vision, text-embedding, and voice tasks critical to RAG-based healthcare applications.
RAG enhances natural language generation by dynamically retrieving relevant external knowledge from large databases, improving the factual accuracy and contextual relevance of AI-generated responses. In healthcare, this is critical for accurate clinical documentation, decision support, and integration of up-to-date patient information.
The assistant leverages the HistoGPT vision-language model to analyze high-resolution pathology whole slide images and generate detailed disease reports. This automates and accelerates diagnostic image interpretation, reducing manual workload while giving clinicians precise, actionable insights.
The solution stack includes HistoGPT for pathology image analysis, the Orthanc DICOM server for medical image management, OpenEMR for electronic health records, OpenAI Whisper for audio transcription, high-performing text embedding models, the Llama 3.1 70B large language model, LlamaIndex as the RAG framework, the MilvusDB vector database, and vLLM for optimized LLM serving.
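To show how the serving layer fits together, here is a minimal vLLM sketch that shards Llama 3.1 70B across all eight accelerators with tensor parallelism. The model ID and prompt are illustrative; the source does not disclose Metrum AI's exact serving configuration:

```python
from vllm import LLM, SamplingParams

# Shard the 70B model across the server's eight accelerators.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model ID
    tensor_parallel_size=8,
)

prompt = "Summarize the key findings of the attached pathology report."
outputs = llm.generate([prompt], SamplingParams(max_tokens=256, temperature=0.2))
print(outputs[0].outputs[0].text)
```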
Using OpenAI Whisper transcription, the assistant converts clinical audio recordings into accurate text notes, reducing the administrative time and errors associated with manual record-keeping and letting providers focus more on patient care.
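A minimal transcription sketch with the open-source openai-whisper package (the audio filename is hypothetical); the per-segment timestamps can be useful for audit trails in clinical notes:

```python
import whisper  # pip install openai-whisper

# Load a Whisper checkpoint; larger checkpoints trade speed for accuracy.
model = whisper.load_model("medium")

# Transcribe a recorded clinical session (hypothetical file path).
result = model.transcribe("clinical_session.wav")

print(result["text"])  # plain-text notes ready for the summary step
for seg in result["segments"]:
    print(f'[{seg["start"]:.1f}s - {seg["end"]:.1f}s] {seg["text"]}')
```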
A user selects a patient, starts a session, uploads clinical audio for transcription, reviews the transcript, generates a patient summary that integrates the notes with pathology reports, reviews the histopathology report, saves the final report, and ends the session, all within a single interface that streamlines multimodal data management.
By automating documentation and pathology analysis, reducing wait times, and easing clinician workloads, the system lets practices see more patients efficiently while improving diagnostic accuracy and enabling timely, informed clinical decisions, directly enhancing the quality of patient care.
The AMD Instinct MI300X provides 192GB of GPU memory per accelerator, supporting full LLM deployment and multi-model serving. The PowerEdge XE9680 server with eight accelerators delivers up to 10.4 petaflops of BF16/FP16 compute, aggregates 1.5TB of HBM3 memory, and scales token throughput roughly 7.9x as concurrent requests increase.
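The aggregate figures follow directly from the per-accelerator numbers; a quick arithmetic check:

```python
accelerators = 8
hbm3_per_gpu_gb = 192        # MI300X HBM3 per accelerator
server_petaflops = 10.4      # BF16/FP16, whole XE9680 server

print(accelerators * hbm3_per_gpu_gb)    # 1536 GB, i.e. ~1.5TB aggregate HBM3
print(server_petaflops / accelerators)   # ~1.3 petaflops per MI300X
```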
Any medical specialty involving voice and imaging data, such as radiology, pathology, cardiology, or oncology, can leverage the assistant for automated image analysis, audio transcription, clinical documentation, and summary generation, enabling adoption across diverse healthcare workflows and improving patient management.