Unstructured Data: What Lies Beneath the Surface

An estimated 70-80% of the information in clinical EHR data is contained in unstructured notes. These notes capture details about patient conditions, treatment responses, and side effects as described by healthcare professionals, and much of this information is not found in structured data.

Using natural language processing (NLP) and other AI methods, patient notes can be mined to identify disease symptoms, treatment responses, and adverse events that may not be coded in structured data. Machine learning techniques can use unstructured data to monitor trends and identify potential safety issues earlier in the product life cycle. Longitudinal notes are valuable for analyzing disease progression and the impact of treatment. In addition, training data for models that predict which patients are at risk of poor outcomes and complications is often found only in unstructured data.

Unstructured notes are vast in diversity and scope. There are more than twenty primary note types (e.g., Anesthesia, Nursing, Progress, Procedure) and thousands of subtypes (for example, the Anesthesia group includes Anesthesiology Progress Note, Anesthesia Day of Surgery, and Post Anesthesia Evaluation).

While clinical notes are a valuable asset, separating the useful information from the full collection of available notes is necessary to harness the potential of unstructured EHR notes for analytics, model development, and clinical care. Some of the challenges in refining clinical notes include:

  • Data Volume and Complexity: Unstructured EHR data, and the tools needed to turn it into useful information (e.g., AI algorithms), require significant processing power and storage.

  • Lack of Standardization: EHR data varies significantly from one EHR to another, and from provider to provider within the same system, leading to inconsistent data representation. Terminology also varies, including the use of both metric and imperial measures, different coding systems (e.g., HCPCS and CPT codes), and even Latin abbreviations such as PO (by mouth), BID (twice a day), and QHS (every night at bedtime).

  • Data Quality: Typos, incorrect entries, and incomplete documentation reduce data reliability. Notes often contain repeated information and irrelevant narrative text.

  • Data Bias: Omitting important patient populations (e.g., rural and economically disadvantaged populations) may introduce bias into models and analytics.
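To make the standardization challenge concrete, here is a minimal sketch in Python of expanding the abbreviations mentioned above and converting an imperial weight to metric. It is an illustration only, not a description of any production pipeline: the lookup table, the function names, and the sample note are all hypothetical, and a real system would rely on a curated clinical vocabulary and context-aware NLP rather than simple pattern matching.

```python
import re

# Hypothetical lookup table for illustration; real pipelines would use a
# curated clinical vocabulary and handle context (e.g. "PO" is ambiguous).
ABBREVIATIONS = {
    "PO": "by mouth",
    "BID": "twice a day",
    "QHS": "every night at bedtime",
}

# Match the abbreviations only as whole, upper-case words.
_ABBREV_PATTERN = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\b")

# Match weights such as "180 lbs", "180 lb" or "72.5 pounds".
_POUNDS_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(?:lbs?|pounds?)\b", re.IGNORECASE)

LB_TO_KG = 0.45359237  # exact definition of the avoirdupois pound in kilograms


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its plain-English meaning."""
    return _ABBREV_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


def pounds_to_kilograms(text: str) -> str:
    """Rewrite weights given in pounds as kilograms, rounded to one decimal."""
    def convert(match: re.Match) -> str:
        kilograms = float(match.group(1)) * LB_TO_KG
        return f"{kilograms:.1f} kg"

    return _POUNDS_PATTERN.sub(convert, text)


def normalize_note(text: str) -> str:
    """Apply both normalization steps to a snippet of note text."""
    return pounds_to_kilograms(expand_abbreviations(text))


if __name__ == "__main__":
    # Hypothetical note text, not taken from any real record.
    print(normalize_note("Metformin 500 mg PO BID. Weight 180 lbs."))
```

Even this small example hints at why the problem is hard: the same two-letter abbreviation can carry different meanings in different note types, so rule-based expansion is usually only a first pass.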

All of these challenges can most productively be viewed as “features” of clinical notes and not “bugs.” Applying a model or analytic framework to a new set of hospitals will reveal the same kinds of challenges in different forms.

The Institute for Health Metrics (IHM) manages a collaborative database sourced from community and rural hospitals across the United States. These hospitals and their patients are highly underrepresented in most other commercially available datasets. Overlooking such patients introduces the risk of bias into analytic and modeling initiatives.

IHM extracts both structured and unstructured data daily from each member hospital’s EHR and standardizes it into a consistent data model. These data are deidentified and certified using the expert determination method, then licensed to pharmaceutical firms, AI developers, and medical device companies for analytics and modeling. In return, our member hospitals receive data and services at no charge, which they use to meet regulatory and community requirements.

For more information about unstructured clinical notes and IHM, contact us.