“Medical Notes — An Underutilized Resource.”

Getting the most out of the Google Healthcare Natural Language Processing API to derive insights from unstructured medical text.

Published in Google Cloud - Community · 5 min read · Dec 30, 2020

According to HealthMed Magazine, “80 percent of healthcare data is unstructured. Not only do organizations need tools to look back at the legacy data they have already stored, but they also have to deal with increasing amounts of data being produced every day. With the addition of connected medical and Internet of Things (IoT) devices, organizations are collecting unstructured data at an alarming rate.”

From parcel delivery to healthcare and every industry in between, the trend toward collecting ever more data is clear. Thought leaders recognize that collecting real-time data at the device level makes it possible to optimize processes, get earlier warnings, detect functional gaps, and automate routine work. In aggregate, the same data can be leveraged for larger system-wide analytics to determine patterns and, in the case of healthcare, improve population health outcomes.

Despite the enormous potential, processing unstructured medical data, such as caregiver notes, has always been a challenge. A lack of standardization in medical terminology means that different clinicians may use different terms to refer to the same medication or condition. Even when the underlying language and terminology are the same, regional speech patterns across the globe further complicate the situation.

Until now, data scientists in healthcare organizations who wanted to turn their medical notes into structured data had to build entire language models, exhaustively test those models, and then, assuming a successful model could be created, deploy that model at scale. The process to get to that point is arduous, and even then the model must be iterated on over time to keep it up to date.

The Google Healthcare Natural Language Processing (NLP) API offers a REST API that you can use with your own medical text. It returns structured data in which the terminology used in the text is mapped to standardized names. This allows data scientists to derive insights using these standardized names for conditions, medications, and so forth without having to create those mappings themselves, which also reduces the potential for human error in this kind of task. The Healthcare NLP API provides details such as:

Subject (“Who or What”) — Is this text about the patient, a family member, or in reference to family medical history?
Temporal Relevance (“When”) — Is this a current condition, one that was experienced in the past, or possibly a future diagnosis?
Entities (“What”) — Medical objects referred to in the text.
Linked Entities (“What” extended) — Entities mapped to standardized names.
Relationships — A relationship graph visually depicting how entities are related.
Likelihood — Given the nuances of language, how certain is the algorithm that a given statement is accurate?
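
As a rough illustration of what calling the REST API involves, the sketch below builds (but does not send) a request for a single note. The endpoint path and the `documentContent` field reflect my reading of the public REST reference and should be checked against the API version enabled for your project; the function name, project ID, location, and token are placeholders invented for this example.

```python
import json
import urllib.request

# Assumption: base URL and path follow the public REST reference for the
# Healthcare NLP analyzeEntities method. Verify the version for your project.
API_ROOT = "https://healthcare.googleapis.com/v1"


def build_analyze_request(project_id, location, text, access_token):
    """Build, without sending, an analyzeEntities request for one note."""
    url = (
        f"{API_ROOT}/projects/{project_id}/locations/{location}"
        "/services/nlp:analyzeEntities"
    )
    # Assumption: the note text is passed in the "documentContent" field.
    body = json.dumps({"documentContent": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
    )


# Placeholder values; a real call needs a valid OAuth 2.0 access token.
request = build_analyze_request(
    "my-project", "us-central1", "She does have asthma.", "TOKEN"
)
```

Passing the request to `urllib.request.urlopen` would send it; in a pipeline you would more likely do this once per record inside a transform.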

Reference Links:

Using the Healthcare Natural Language API

These fields are shown in more detail in the example below.

To demonstrate, we will start with a Kaggle dataset that contains medical transcriptions. Then we will build an Apache Beam pipeline that incorporates the Healthcare NLP API, and then show what results are returned for a given text example.

The data pipeline is built in a Dataflow notebook on Google Cloud Platform and is based on this example found on GitHub. While the example shows batch processing, the PubSubIO connector could be used to adapt the pipeline for streaming records.

This pipeline stores the Healthcare NLP API results in BigQuery, which makes it possible for anyone with basic SQL skills to query the data and derive insights. The chart below just scratches the surface of what can be gleaned. In addition, the administrative burden of reading through and summarizing this mountain of data is removed: the work is handled in a matter of minutes.
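
To make the storage step more concrete, here is a minimal sketch of flattening one API response into table-like rows before loading. It is not the pipeline's actual code: the function name and column names are hypothetical, and the response shape is assumed to match the annotated sample shown later in this post.

```python
def mention_rows(note_id, response):
    """Flatten the entityMentions of one response into flat row dictionaries."""
    rows = []
    for mention in response.get("entityMentions", []):
        rows.append({
            "note_id": note_id,  # hypothetical key identifying the source note
            "mention_id": mention["mentionId"],
            "type": mention["type"],
            "text": mention["text"]["content"],
            # The assessment blocks are read defensively in case one is absent.
            "temporal": mention.get("temporalAssessment", {}).get("value"),
            "certainty": mention.get("certaintyAssessment", {}).get("value"),
            "subject": mention.get("subject", {}).get("value"),
            "linked_entity_ids": [
                linked["entityId"] for linked in mention.get("linkedEntities", [])
            ],
        })
    return rows


# One mention taken from the sample response in this post (confidences omitted).
sample_response = {
    "entityMentions": [{
        "mentionId": "1",
        "type": "PROBLEM",
        "text": {"content": "allergies", "beginOffset": 71},
        "linkedEntities": [
            {"entityId": "UMLS/1527304"},
            {"entityId": "UMLS/20517"},
        ],
        "temporalAssessment": {"value": "CURRENT"},
        "certaintyAssessment": {"value": "LIKELY"},
        "subject": {"value": "PATIENT"},
    }]
}

rows = mention_rows("note-1", sample_response)
```

Keeping one row per mention, with the linked entity IDs as a repeated field, is one possible layout; a separate table for linked entities would work equally well.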

We can now evaluate the number of past, upcoming, and possible future events for a given procedure. For example, anesthesia was administered 297 times the day the clinical notes were captured, and we can see that there are 11 more procedures scheduled. This data can be used to improve planning and optimize resource utilization.

Upcoming/Past/Future Summary for Number of Procedures
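
The summary in the chart amounts to a grouped count. In the post this is done with SQL in BigQuery; the sketch below shows the same idea in plain Python on a handful of made-up rows. The rows, the column names, and the specific temporal labels are illustrative assumptions and are not taken from the dataset.

```python
from collections import Counter

# Made-up rows shaped like flattened entity mentions (toy data only).
toy_rows = [
    {"type": "PROCEDURE", "text": "anesthesia", "temporal": "CURRENT"},
    {"type": "PROCEDURE", "text": "Anesthesia", "temporal": "CURRENT"},
    {"type": "PROCEDURE", "text": "anesthesia", "temporal": "UPCOMING"},
    {"type": "PROBLEM", "text": "asthma", "temporal": "CURRENT"},
]


def summarize_procedures(rows):
    """Count procedure mentions per (procedure name, temporal value)."""
    counts = Counter()
    for row in rows:
        if row["type"] == "PROCEDURE":
            # Lowercase the name so "Anesthesia" and "anesthesia" group together.
            counts[(row["text"].lower(), row["temporal"])] += 1
    return counts


summary = summarize_procedures(toy_rows)
```

In SQL the equivalent would be a `GROUP BY` over the procedure name and the temporal value, filtered to procedure mentions.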

Let’s take a look at an actual sample of medical text and the results provided:

“SUBJECTIVE:, This 23-year-old white female presents with complaint of allergies. She used to have allergies when she lived in Seattle but she thinks they are worse here. In the past, she has tried Claritin, and Zyrtec. Both worked for short time but then seemed to lose effectiveness. She has used Allegra also. She used that last summer and she began using it again two weeks ago. It does not appear to be working very well. She has used over-the-counter sprays but no prescription nasal sprays. She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS: , Her only medication currently is Ortho Tri-Cyclen and the Allegra.,ALLERGIES: , She has no known medicine allergies.,OBJECTIVE:,Vitals: Weight was 130 pounds and blood pressure 124/78.,HEENT: Her throat was mildly erythematous without exudate. Nasal mucosa was erythematous and swollen. Only clear drainage was seen. TMs were clear.,Neck: Supple without adenopathy.,Lungs: Clear.,ASSESSMENT:, Allergic rhinitis.,PLAN:,1. She will try Zyrtec instead of Allegra again. Another option will be to use loratadine. She does not think she has prescription coverage so that might be cheaper.,2. Samples of Nasonex two sprays in each nostril given for three weeks. A prescription was written as well.”

{
  "entityMentions": [ ← This array holds information about the entities identified in the medical transcription
    {
      "mentionId": "1", ← A unique ID that you will use in the “relationships” array
      "type": "PROBLEM",
      "text": {
        "content": "allergies",
        "beginOffset": 71
      },
      "linkedEntities": [
        {
          "entityId": "UMLS/1527304" ← Look for this entity in the “Entities array”
        },
        {
          "entityId": "UMLS/20517" ← Look for this entity in the “Entities array”
        }
      ],
      "temporalAssessment": {
        "value": "CURRENT", ← Is this current or past?
        "confidence": 0.9958259463310242
      },
      "certaintyAssessment": { ← Likelihood of the “PROBLEM”
        "value": "LIKELY",
        "confidence": 0.9998759031295776
      },
      "subject": {
        "value": "PATIENT", ← The subject of the entity could be the patient or a family member
        "confidence": 0.999998927116394
      },
      "confidence": 0.9866036176681519
    },
    …
  ],
  "relationships": [ ← This maps relationships between the various "entityMentions"
    {
      "subjectId": "9", ← Looking up mentionId = 9 is PROBLEM:asthma
      "objectId": "11", ← Looking up mentionId = 11 is MEDICINE:
      "confidence": 0.7835226058959961
    }
    ...
  ]
}
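
The lookups described in the annotations above (following a `subjectId` or `objectId` back to its mention) can be sketched as follows. This is an illustration under the assumption that the response has the shape shown above; the function name is hypothetical, and the two mentions in the toy response are invented rather than taken from the sample note.

```python
def describe_relationships(response):
    """Resolve each relationship to ("TYPE:text", "TYPE:text", confidence)."""
    # Index mentions by ID so each relationship is a dictionary lookup.
    by_id = {m["mentionId"]: m for m in response.get("entityMentions", [])}
    described = []
    for rel in response.get("relationships", []):
        subject = by_id.get(rel["subjectId"])
        obj = by_id.get(rel["objectId"])
        if subject is None or obj is None:
            continue  # skip relationships that point at a missing mention
        described.append((
            f'{subject["type"]}:{subject["text"]["content"]}',
            f'{obj["type"]}:{obj["text"]["content"]}',
            rel["confidence"],
        ))
    return described


# Invented toy response (not from the sample note above).
toy_response = {
    "entityMentions": [
        {"mentionId": "1", "type": "PROBLEM", "text": {"content": "headache"}},
        {"mentionId": "2", "type": "MEDICINE", "text": {"content": "ibuprofen"}},
    ],
    "relationships": [
        {"subjectId": "1", "objectId": "2", "confidence": 0.9},
        {"subjectId": "1", "objectId": "99", "confidence": 0.5},
    ],
}

pairs = describe_relationships(toy_response)
```

Relationships whose IDs cannot be resolved are skipped here; depending on your needs, you might prefer to log them instead.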

Now that you have data on how the pipeline identifies entities and relationships, the next step would be to share the pipeline.

Summary

In this post, we discussed the challenges of unstructured medical data and how to use Google Cloud Platform (in particular Dataflow, BigQuery, and the Healthcare NLP API) to analyze, evaluate, and gain insights from it.


Written by Ruchika Kharwar