Many AI tools in healthcare rely on electronic health record (EHR) data, but most current datasets have problems. One widely used dataset, MIMIC, offers useful clinical data but covers only about 5% of the research done with EHR data. It also does not track patient health over time: it focuses on a single visit or snapshot, which makes it hard to study how diseases progress or how treatments work over the long run.
Without data from many visits and full patient histories, AI models cannot learn how chronic diseases develop or predict outcomes well. Many datasets also lack multimodal data, such as imaging paired with clinical notes, which makes it harder for AI to reason over visual and text information together. A further problem is that different studies split data in different ways, which hurts comparability and reproducibility.
For healthcare managers and IT staff, these issues make AI tools harder to trust and adopt. AI that cannot predict disease changes correctly may lead to poor decisions or legal exposure.
To address these gaps, Stanford University created three new datasets: EHRSHOT, INSPECT, and MedAlign. Together they include de-identified longitudinal EHR data from about 26,000 patients, covering over 441,000 visits and almost 295 million clinical events. The datasets span long time periods and capture detailed patient health journeys, which supports work on chronic disease management.
These datasets use standardized Common Data Models (CDMs), including the OMOP CDM 5.4 format and the Medical Event Data Standard (MEDS), which improves interoperability between AI tools. Consistent data formats let U.S. healthcare systems and researchers share the same tools and methods for building and testing AI.
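To make the format concrete, here is a hedged sketch of what records look like under a MEDS-style event-stream layout. The column names follow MEDS's minimal published schema (subject_id, time, code, numeric_value); the patients and the prescription code shown are invented for illustration.

```python
# Illustrative only: a few clinical events in the minimal event-stream
# layout MEDS uses (one row per coded event per patient). Patient IDs and
# the prescription code are invented; the diagnosis and lab codes are real.
import pandas as pd

events = pd.DataFrame(
    {
        "subject_id": [101, 101, 101, 202],
        "time": pd.to_datetime(
            ["2019-03-01 09:30", "2019-03-01 09:45",
             "2021-07-12 14:00", "2020-01-05 08:15"]
        ),
        "code": [
            "SNOMED/44054006",  # type 2 diabetes diagnosis
            "LOINC/4548-4",     # hemoglobin A1c lab result
            "LOINC/4548-4",
            "RxNorm/860975",    # a metformin prescription (illustrative code)
        ],
        "numeric_value": [None, 8.2, 7.1, None],
    }
)

# Sorting by patient and time recovers each longitudinal trajectory.
print(events.sort_values(["subject_id", "time"]))
```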
For AI to work well in healthcare, models must learn how diseases and treatments affect patients across many visits. These datasets help by capturing patient histories and tracking health events over time. That added context supports better predictions and can surface signals that a disease is improving or worsening.
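As a small illustration of why multi-visit data matters, the sketch below (toy values, pandas only) derives longitudinal features, such as a lab-value trend, that no single-visit snapshot could supply.

```python
# A minimal sketch of longitudinal feature building: prior-measurement
# counts, observation span, and a lab trend per patient. Values are toy data.
import pandas as pd

labs = pd.DataFrame(
    {
        "subject_id": [101, 101, 101, 202, 202],
        "time": pd.to_datetime(
            ["2019-03-01", "2020-02-15", "2021-07-12", "2020-01-05", "2020-06-20"]
        ),
        "a1c": [8.2, 7.6, 7.1, 6.9, 7.4],
    }
).sort_values(["subject_id", "time"])

grouped = labs.groupby("subject_id")
features = pd.DataFrame(
    {
        "n_prior_measurements": grouped["a1c"].count() - 1,
        "days_observed": (grouped["time"].max() - grouped["time"].min()).dt.days,
        # Negative change = improving glycemic control over time.
        "a1c_change": grouped["a1c"].last() - grouped["a1c"].first(),
    }
)
print(features)
```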
Datasets like INSPECT let AI examine both images and text, much as doctors read scans alongside medical notes. This supports training models that handle multiple data types at once, which matters for complex tasks such as monitoring cancer or stroke progression.
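A minimal PyTorch-style sketch of such image-text pairing is below. The file layout and naming are assumptions made for illustration, not INSPECT's actual distribution format.

```python
# A hedged sketch of multimodal pairing: each sample couples one CT volume
# with its radiology note, as a vision-language pretraining loop would expect.
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import Dataset


class CtNotePairs(Dataset):
    """Yields (CT volume, radiology note) pairs for vision-language pretraining."""

    def __init__(self, root: Path):
        # Assumed layout (hypothetical): root/<study_id>.npy holds the CT
        # volume and root/<study_id>.txt holds the matching note.
        self.root = root
        self.study_ids = sorted(p.stem for p in root.glob("*.npy"))

    def __len__(self) -> int:
        return len(self.study_ids)

    def __getitem__(self, idx: int):
        study = self.study_ids[idx]
        volume = torch.from_numpy(np.load(self.root / f"{study}.npy"))
        note = (self.root / f"{study}.txt").read_text()
        return volume, note
```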
Long-term datasets with fixed training, validation, and test splits make experiments reproducible. Researchers, companies, and hospitals can compare AI models fairly, without test data leaking into training. This matters for health systems subject to strict quality and safety rules, such as Joint Commission or CMS standards.
The new datasets define canonical splits for training, validation, and testing, so AI models are evaluated consistently. Previously, inconsistent testing methods made comparisons hard and slowed progress.
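In practice, honoring a canonical split takes little code. The sketch below assumes a hypothetical splits.csv file that assigns each patient to exactly one partition, then verifies there is no overlap.

```python
# A minimal sketch of honoring canonical splits. The splits.csv file and its
# columns are hypothetical stand-ins for a dataset's published assignment of
# each patient to exactly one of train/validation/test.
import pandas as pd

splits = pd.read_csv("splits.csv")  # columns: subject_id, split

train_ids = set(splits.loc[splits["split"] == "train", "subject_id"])
val_ids = set(splits.loc[splits["split"] == "val", "subject_id"])
test_ids = set(splits.loc[splits["split"] == "test", "subject_id"])

# Leakage check: no patient may appear in more than one partition.
assert train_ids.isdisjoint(val_ids)
assert train_ids.isdisjoint(test_ids)
assert val_ids.isdisjoint(test_ids)
```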
Standard benchmarks also respect privacy laws like HIPAA by using de-identified data and strict access rules. Researchers must apply through secure portals, sign data use agreements, and complete training such as the CITI certification.
For medical leaders and IT groups, these rules provide confidence that AI tools meet privacy and usage standards. Strong benchmarking also aligns with what payers and regulators expect when approving AI tools for clinical use.
Even with these datasets, problems like limited data and privacy remain. To help, synthetic data is used more often. Synthetic data is artificially generated but statistically resembles real patient data, and it is typically produced with methods like deep learning.
One review found that about 73% of studies used deep learning to create synthetic healthcare data across several modalities: tabular data, images, radiomics, time series, and omics. Most synthetic data tools are built in Python, a common programming language in medical IT.
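The sketch below shows the core idea in plain Python. It fits only per-column statistics, which is far simpler than the deep generative models the review describes, but it makes the point that sampled rows correspond to no real patient.

```python
# A deliberately simple synthetic-tabular-data sketch: fit per-column
# statistics on real records, then sample new rows that mimic the marginal
# distributions. Real tools use deep generative models that also capture
# correlations between columns.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

real = pd.DataFrame(
    {
        "age": [54, 61, 47, 70, 66],
        "systolic_bp": [128, 141, 119, 150, 137],
        "diabetic": ["yes", "no", "no", "yes", "yes"],
    }
)


def synthesize(df: pd.DataFrame, n_rows: int) -> pd.DataFrame:
    out = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric columns: sample from a fitted normal distribution.
            out[col] = rng.normal(df[col].mean(), df[col].std(), n_rows).round(1)
        else:
            # Categorical columns: resample with the observed frequencies.
            freqs = df[col].value_counts(normalize=True)
            out[col] = rng.choice(freqs.index.to_numpy(), size=n_rows, p=freqs.values)
    return pd.DataFrame(out)


print(synthesize(real, n_rows=3))  # fake rows, no real patient exposed
```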
Synthetic data helps by supplementing scarce real data and protecting patient privacy. For healthcare IT managers, it makes it possible to test AI tools inside their own systems without breaking privacy rules or depending only on outside data.
Simbo AI shows how front-office automation, such as phone answering and scheduling, can build on better AI tools in healthcare. The technology automates patient calls, appointment booking, and initial symptom checks, easing staff workload and keeping patient information accurate.
Good patient data from multimodal and longitudinal datasets strengthens this automation by giving systems accurate, current patient context. That reduces errors from manual entry, lowers patient wait times, and lets managers deploy human staff more efficiently.
Automation systems can also use synthetic data to test and improve AI models before using them with real patients. This lowers the chance of problems in daily operations.
As healthcare groups in the U.S. adopt more AI, it is important to understand these changing data tools: longitudinal datasets, common data models, standardized benchmarks, and synthetic data.
Healthcare AI in the U.S. is at an important moment, and fixing data problems will shape how quickly AI is adopted. The new datasets, standards, and synthetic data offer a stronger foundation for AI in managing long-term diseases and delivering personalized care. AI automation can help clinics run more smoothly and serve patients with better accuracy and speed.
Medical managers, practice owners, and IT leaders will benefit from understanding these changes so they can make sound choices about AI investments and deployments, choices that should match current data standards, privacy laws, and clinical needs in American healthcare.
Longitudinal EHR data provides complete patient trajectories over extended periods, essential for tasks like chronic disease management and care pathway optimization. It addresses the missing-context problem by capturing past and future health events, enabling AI models to learn the complex, long-term health patterns that static, single-visit datasets like MIMIC cannot supply.
MIMIC, while impactful, lacks longitudinal health data covering long-term patient care trajectories, limiting its use for evaluating AI models on tasks requiring multi-visit predictions and chronic disease management. It also presents gaps in population representation and does not facilitate standardized benchmarking due to inconsistent train/test splits among researchers.
Stanford developed three de-identified longitudinal EHR datasets—EHRSHOT, INSPECT, and MedAlign—containing nearly 26,000 patients, 441,680 visits, and 295 million clinical events. These datasets offer detailed multi-visit patient data, including structured and unstructured data like CT scans and clinical notes, to enable rigorous and standardized AI evaluation.
They include canonical train/validation/test splits and defined task labels, enabling reproducible and comparable model evaluations across research. This removes the need for costly retraining and prevents data leakage, promoting a unified leaderboard that tracks state-of-the-art performance on clinical prediction and classification tasks.
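A leaderboard comparison can then be as simple as scoring every model's predictions against the same frozen test labels. The sketch below uses toy arrays and scikit-learn's AUROC metric; the model names are invented.

```python
# A sketch of like-for-like benchmarking: every model is scored on the same
# frozen test labels, so leaderboard numbers are directly comparable. Labels
# and predictions here are toy arrays, not real dataset outputs.
from sklearn.metrics import roc_auc_score

y_test = [0, 1, 1, 0, 1, 0, 1, 1]  # canonical test-split labels (toy)

model_scores = {
    "baseline_logreg": [0.2, 0.7, 0.6, 0.4, 0.8, 0.3, 0.5, 0.9],
    "foundation_model": [0.1, 0.9, 0.8, 0.2, 0.7, 0.2, 0.8, 0.9],
}

for name, preds in model_scores.items():
    print(f"{name}: AUROC = {roc_auc_score(y_test, preds):.3f}")
```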
They are released in the OMOP CDM 5.4 format to support broad interoperability and statistical tools. Additionally, to enhance foundation model development, they adopt the Medical Event Data Standard (MEDS), developed collaboratively by leading institutions, alongside tools like MEDS Reader to accelerate data loading and usability.
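Because MEDS reduces each record to a simple event stream, the data can be consumed with ordinary tools. The sketch below uses plain pandas against a hypothetical parquet file; MEDS Reader exists precisely to make this kind of loading fast at scale.

```python
# A hedged sketch of consuming MEDS-formatted data with plain pandas. The
# file name is hypothetical; the columns reflect MEDS's minimal event schema
# (subject_id, time, code, numeric_value).
import pandas as pd

events = pd.read_parquet("meds_events.parquet")

# Replay one patient's longitudinal record in time order.
for subject_id, trajectory in events.sort_values("time").groupby("subject_id"):
    print(f"subject {subject_id}: {len(trajectory)} events")
    print(trajectory[["time", "code", "numeric_value"]].head())
    break  # just show the first patient
```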
Access requires application via a Redivis data portal, signing a data use agreement and behavioral rules, and possessing valid CITI training certificates. These protocols, modeled after PhysioNet’s approach with MIMIC, ensure responsible usage and protection of patient privacy despite de-identification.
They combine structured data with unstructured modalities such as paired CT scans and radiology notes (INSPECT) or extensive clinical notes across diverse types (MedAlign). This multimodal approach supports comprehensive context understanding, crucial for vision-language model pretraining and identifying prognostic markers.
Healthcare AI requires understanding a patient’s complete medical history and future outcomes to infer accurate prognoses and treatment effects. Missing context impedes models’ ability to learn meaningful correlations across longitudinal health events, limiting their clinical applicability and robustness.
Stanford released 20 pretrained EHR foundation models, including transformers like CLMBR and MOTOR, designed for diverse clinical tasks. These models respect dataset splits and serve as baselines for comparison, accelerating research by providing ready-to-use architectures for training and benchmarking.
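The intended workflow resembles the sketch below: patient representations from a pretrained foundation model feed a lightweight task head trained only on the canonical training split. The embedding arrays here are random stand-ins, since loading CLMBR or MOTOR themselves is outside this sketch.

```python
# A sketch of the baseline workflow: foundation-model patient embeddings
# (stubbed as random vectors) plus a light task head fit on the training
# split only, then scored on the held-out test split.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for foundation-model embeddings of train/test patients.
X_train, y_train = rng.normal(size=(80, 32)), rng.integers(0, 2, 80)
X_test, y_test = rng.normal(size=(20, 32)), rng.integers(0, 2, 20)

head = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", head.score(X_test, y_test))
```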
The FactEHR dataset is forthcoming, focusing on factual decomposition and verification using clinical notes from MIMIC and MedAlign. The roadmap emphasizes building a robust ecosystem with educational resources, open-source tools, and collaborations to enable scalable and equitable AI in healthcare.