This is a single-centre, retrospective observational study using an existing dataset of pharmacological (dobutamine) stress echocardiography (SE) reports generated within Milton Keynes University Hospital over approximately 15 years, starting from 2002. The SE dataset comprises reports/letters produced by a single, experienced clinician, which reduces inter-observer variability and supports consistent interpretation across the cohort.
Data sources and cohort construction
SE reports (in document format) will be converted into a structured research database. A computer science team will develop a generalisable approach to extract structured variables from the clinical SE reports, building on prior proof-of-concept work demonstrating feasibility of converting these reports into a database.
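As an illustration of the kind of extraction envisaged, the sketch below pulls a few structured variables out of free-text report sentences using simple pattern matching. The field names, patterns, and example sentence are hypothetical and do not reflect the study's actual schema or reporting conventions; a production pipeline would use the NLP approaches described later in this protocol.

```python
import re

def extract_se_fields(report_text):
    """Rule-based extraction of illustrative structured variables
    from a free-text SE report (hypothetical field names)."""
    fields = {}
    m = re.search(r"peak heart rate[:\s]+(\d+)\s*bpm", report_text, re.IGNORECASE)
    if m:
        fields["peak_heart_rate_bpm"] = int(m.group(1))
    # "No inducible ischaemia" vs "inducible ischaemia": the optional
    # negation in group 1 determines the boolean value.
    m = re.search(r"\b(no\s+)?inducible ischaemia\b", report_text, re.IGNORECASE)
    if m:
        fields["ischaemia_detected"] = m.group(1) is None
    m = re.search(r"wall motion score index[:\s]+(\d+(?:\.\d+)?)", report_text,
                  re.IGNORECASE)
    if m:
        fields["wmsi"] = float(m.group(1))
    return fields

report = ("Dobutamine stress echo. Peak heart rate: 142 bpm. "
          "No inducible ischaemia. Wall motion score index: 1.0.")
print(extract_se_fields(report))
```

In practice a single-reporter dataset such as this one favours rule-based extraction, since consistent phrasing makes patterns like the above far more reliable than they would be across multiple authors.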
The dataset includes clinical variables (e.g., cardiovascular risk factors, comorbidities, prescribed medications, and anthropometrics) alongside SE-derived measures (including ischaemia detection and wall motion scoring at rest and peak stress).
Stress echocardiography technique (context for imaging-derived variables)
The study dataset reflects contemporary dobutamine SE practice at MKUH, with contrast-enhanced imaging used in the majority of cases (SonoVue contrast with rota pump infusion equipment). Studies were performed predominantly on Philips echocardiography systems, with image acquisition across standard stages (resting, intermediate, peak stress, and recovery) and standard views (apical 4-, 2-, and 3-chamber; parasternal long- and short-axis). Reporting used dedicated platforms enabling stage-by-stage comparison.
Outcome ascertainment and linkage
Following database completion, a research nurse will query the hospital Electronic Data Management system to ascertain major adverse cardiovascular events (MACE) for the cohort. Where outcomes cannot be confirmed from hospital systems (e.g., patients no longer served by the hospital), missing outcome information will be explored via primary care physician contact and/or patient contact as appropriate.
Data processing, quality checks, and handling missingness
Extracted data will undergo cleaning prior to analysis. Natural Language Processing (NLP) and feature engineering approaches will be used to transform extracted information into model-ready features. As part of preprocessing, data fields will be checked for completeness and consistency before modelling. Missing outcome data will be addressed through the external outcome checks described above.
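A minimal sketch of the completeness and consistency checks described above, applied to extracted records before modelling. The required fields and plausibility ranges are illustrative assumptions, not the study's actual data dictionary.

```python
# Hypothetical required fields and plausibility ranges for one record.
REQUIRED = ["age", "peak_heart_rate_bpm", "ischaemia_detected"]

def check_record(record):
    """Return a list of data-quality issues for one extracted record."""
    issues = []
    for field in REQUIRED:
        if record.get(field) is None:
            issues.append(f"missing: {field}")
    hr = record.get("peak_heart_rate_bpm")
    if hr is not None and not (30 <= hr <= 250):
        issues.append(f"implausible peak heart rate: {hr}")
    age = record.get("age")
    if age is not None and not (18 <= age <= 110):
        issues.append(f"implausible age: {age}")
    return issues

records = [
    {"age": 67, "peak_heart_rate_bpm": 142, "ischaemia_detected": False},
    {"age": 71, "peak_heart_rate_bpm": None, "ischaemia_detected": True},
]
flagged = {}
for i, record in enumerate(records):
    issues = check_record(record)
    if issues:
        flagged[i] = issues  # records needing review before modelling
print(flagged)
```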
Statistical / machine learning approach and internal validation
After preprocessing, subset feature selection methods will be applied to identify the most informative predictors for risk classification. Supervised learning will be used to discriminate between lower-risk cases and cases requiring further investigation, and additional modelling approaches (including regression techniques) are planned to support quantification of disease stage in abnormal cases. Overfitting will be mitigated by using techniques that are robust to it (e.g., ensemble methods) and by internal validation using five-fold cross-validation, ensuring strict separation of training and validation data.
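The five-fold cross-validation scheme can be sketched as follows. This is a generic index-level illustration of the train/validation separation, not the study's actual modelling code; the fold count and seed are the only parameters assumed.

```python
import random

def five_fold_indices(n_samples, seed=0):
    """Yield (train_idx, val_idx) pairs for five-fold cross-validation,
    so each sample is validated exactly once and never appears in the
    training set of its own fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    k = 5
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # the last fold absorbs any remainder when n_samples % 5 != 0
        end = start + fold_size if fold < k - 1 else n_samples
        val_idx = idx[start:end]
        train_idx = idx[:start] + idx[end:]
        yield train_idx, val_idx

folds = list(five_fold_indices(100))
# Training and validation indices are disjoint within every fold.
assert all(set(tr).isdisjoint(va) for tr, va in folds)
```

At a cohort size of roughly 3,000 patients, each validation fold would hold around 600 cases, with the remaining ~2,400 used for training in that fold.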
Sample size and additional analyses
The study will utilise the full available dataset (approximately 3,000 patients) to maximise the data available for model development and internal validation. A cost analysis is also planned using the available data.