Mon, 08/26/2019

You hear a lot in the news about Artificial Intelligence (AI) and Machine Learning (ML) applied to the healthcare industry, and there is no doubt that this technology trend has significant potential to change healthcare delivery. Nor is it just care management and delivery: AI/ML can also contribute to billing management, claims, cost management and control, staffing, and inventory management. But the key to unlocking the potential in all of those areas is data, and that is where the biggest barrier to success lurks, ready to sabotage these initiatives.

A typical workflow for a machine learning project involves (at least) the following functional activities (a minimal code sketch follows the list):

  1. Acquire data

  2. Prepare the data

  3. Train the model

  4. Evaluate the model

  5. Deploy the model
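As a rough illustration, the sketch below walks through these five steps with scikit-learn. The file name, column names, and model choice are placeholders I have invented for the example, not a prescription for any particular healthcare use case.

```python
# Minimal sketch of the five workflow steps using scikit-learn.
# File name, column names, and model choice are illustrative placeholders.
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1. Acquire data (here, a hypothetical extract of encounter records)
df = pd.read_csv("encounters.csv")

# 2. Prepare the data: drop rows missing the label, split features from label
df = df.dropna(subset=["readmitted_30d"])
X = df[["age", "length_of_stay", "num_prior_visits"]]
y = df["readmitted_30d"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 3. Train the model (imputation and scaling folded into the pipeline)
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# 4. Evaluate the model on held-out data
print(classification_report(y_test, model.predict(X_test)))

# 5. Deploy: persist the fitted pipeline for a serving environment to load
joblib.dump(model, "readmission_model.joblib")
```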

Note the critical dependence on data throughout this process. In healthcare, we clearly do not have a problem with the availability of data: in an IDC study last year sponsored by Seagate, analysts projected that healthcare data will grow faster than that of any other industry through 2025[1]. But the sheer volume of data presents both an opportunity and a challenge for its practical use, and the success of a machine learning initiative will depend directly on the quality and reliability of the data used to train and evaluate the model. Inconsistent, incomplete, or noisy data can severely skew evaluation, rendering the project's results unreliable.
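Because model quality tracks data quality so closely, it is worth profiling a dataset before any training happens. The checks below are a hedged sketch, assuming a pandas DataFrame with hypothetical column names; a real project will add far more domain-specific rules.

```python
# Quick data-quality profile before training; column names are hypothetical.
import pandas as pd

df = pd.read_csv("encounters.csv")

# Completeness: how much of each column is missing?
missing_pct = df.isna().mean().sort_values(ascending=False) * 100
print("Percent missing per column:\n", missing_pct.head(10))

# Consistency: values that fall outside plausible clinical ranges
implausible_age = df[(df["age"] < 0) | (df["age"] > 120)]
print(f"Rows with implausible age: {len(implausible_age)}")

# Duplicates: the same encounter recorded more than once
dupes = df.duplicated(subset=["patient_id", "encounter_id"]).sum()
print(f"Duplicate encounter rows: {dupes}")

# Skew check: does the label distribution look sensible?
print(df["readmitted_30d"].value_counts(normalize=True))
```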

These data challenges in healthcare are well known, and they are why we invest so much time and effort in building systems to improve the quality and reliability of data. The real issue is not that healthcare data is inherently more flawed than that of any other industry; it stems from the fact that the data is created and curated in a purpose-built way. The primary purpose of the systems used to manage a healthcare delivery network is to automate the processes that deliver care to patients, manage the workforce, or manage finances. When data from those various systems is combined to support new analysis, a new business process, or a machine learning model, it has to be prepared properly.

That means we need to check and govern the data to ensure it is consistent. Data often has gaps when we compare system to system, because not all data is required by all systems. The data also needs harmonization, so that an agreed set of definitions drives classification regardless of how the source system represents the data. For example, if two systems code observations differently (LOINC in one and a proprietary code set in another), we need to harmonize those codes before a model can consume them sensibly; otherwise the same observation will be evaluated two different ways. We also often need to ensure that master data for key domains is reliable, since these elements are the anchor points (patient, provider, facility, employee, machine, etc.) that tie together the measurements we intend to model and evaluate. All of this amounts to developing an approach not just for one dataset (although we may often do that for expediency's sake) but for the entire population of data we might identify and use in these projects.
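To make the harmonization point concrete, here is a small sketch that maps a proprietary observation code to a LOINC code before records from two systems are combined. The proprietary codes, values, and crosswalk are invented for illustration; in practice the mapping would come from a terminology service or a governed reference table.

```python
# Harmonize observation codes from two source systems before modeling.
# Proprietary codes and the crosswalk below are invented examples.
import pandas as pd

# System A already uses LOINC; System B uses a proprietary code set.
system_a = pd.DataFrame({
    "patient_id": [101, 102],
    "code": ["2345-7", "718-7"],   # LOINC: glucose, hemoglobin
    "value": [98.0, 13.5],
})
system_b = pd.DataFrame({
    "patient_id": [103, 104],
    "code": ["GLU_SER", "HGB"],    # proprietary codes
    "value": [105.0, 14.1],
})

# Crosswalk from proprietary codes to LOINC (normally maintained by a
# terminology service or data-governance team, not hard-coded like this).
crosswalk = {"GLU_SER": "2345-7", "HGB": "718-7"}
system_b["code"] = system_b["code"].map(crosswalk)

# After harmonization the same observation is represented the same way,
# so the combined dataset can feed one model without being counted twice.
observations = pd.concat([system_a, system_b], ignore_index=True)
print(observations)
```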

And if I am not sure whether my data is in good condition, there is a high probability that it is not. If you proactively ensure that your data accurately and consistently represents the process you want to model, evaluate, and improve through machine learning, and eventually feed into artificial intelligence, your probability of success improves dramatically.


[1] https://www.seagate.com/www-content/our-story/trends/files/idc-seagate-datcon-healthcare.pdf