Method for Anonymization of Clinical Free-Text Reports

Application

An automated anonymization method for processing many DICOM files and clinical records from multiple institutions.

Key Benefits

Potentially improves the quality and efficiency of clinical studies.
Provides a faster method for anonymizing protected health information (PHI).

Market Summary

The management of sensitive clinical and personal information is of paramount concern in today’s world of digital file creation, sharing, and storage. De-identification involves replacing health information that could identify an individual, such as patient identifiers, addresses, dates, or any of the 18 categories of identifiers protected under the Health Insurance Portability and Accountability Act (HIPAA), with new values that are linked to the original patient only through a separately stored key. Conversely, data anonymization involves destroying all links between the original and the anonymized datasets so that patients can never be re-identified. De-identification is preferred when subsequent records from the same patient may need to be identified and extracted to merge with an existing database, whereas anonymization is irreversible. Both methods enable safe record keeping and allow for data sharing and collaboration between clinical research groups, a key factor in advancing scientific discovery.

To date, most methods of de-identification focus on structured or tabular data that exist in expected locations in the electronic medical record (EMR) or in the metadata of imaging files. These methods function well if the end user knows where PHI is expected to appear. However, there is no existing robust technique to fully de-identify free-text clinical records, which can contain PHI in any location. These clinical notes are heterogeneous in type, style, prose, and author, which presents a unique challenge for de-identification. Novel mechanisms are needed to de-identify or anonymize such records without destroying clinically relevant portions of the note. These methods must also be robust, testable, and able to be continually updated to account for data shifts.

Technical Summary

Emory inventors have developed a technology that de-identifies free-text clinical records using a novel whitelist approach. The method ingests hundreds of thousands of clinical notes and generates a dictionary of terms based on their frequency, stems, and context. The system comprises i) a method of labeling words as either “safe” or “unsafe” based on their association with PHI and ii) a library of such labeled words to be referenced by an automated de-identification tool during the anonymization process. The method for generating clean (sensitive-information-free) versions of clinical reports uses a six-step process to eliminate and replace words identified as sensitive (“unsafe”). Through this process, a library of words categorized as safe, unsafe, or unsure (flagged for manual labeling) is generated, which can be licensed for use in anonymization software.
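
The whitelist idea can be illustrated with a minimal Python sketch. It is not the inventors' six-step process, which this summary does not detail. It assumes that a word's document frequency alone decides its label, whereas the technology described above also uses stems and context. The function names (build_library, scrub), the thresholds (safe_min, unsure_min), and the "[REDACTED]" placeholder are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters and digits count as words.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def build_library(notes, safe_min=3, unsure_min=2):
    """Label every word seen in the corpus as safe, unsure, or unsafe.

    Assumption for this sketch: words that appear in many notes are
    likely clinical vocabulary, while rare words are more likely names
    or other identifiers. The thresholds are illustrative only.
    """
    doc_freq = Counter()
    for note in notes:
        # Count each word once per note (document frequency).
        doc_freq.update({w.lower() for w in WORD_RE.findall(note)})

    library = {}
    for word, count in doc_freq.items():
        if any(ch.isdigit() for ch in word):
            library[word] = "unsafe"   # numbers may encode dates or IDs
        elif count >= safe_min:
            library[word] = "safe"
        elif count >= unsure_min:
            library[word] = "unsure"   # left for manual labeling
        else:
            library[word] = "unsafe"
    return library


def scrub(note, library, placeholder="[REDACTED]"):
    """Keep only words labeled safe; replace everything else.

    Words missing from the library are replaced as well, which is what
    makes this a whitelist rather than a blacklist.
    """
    def replace(match):
        word = match.group(0)
        return word if library.get(word.lower()) == "safe" else placeholder

    return WORD_RE.sub(replace, note)


notes = [
    "Patient John Smith has pneumonia.",
    "Patient Mary Jones has pneumonia.",
    "Patient Ann Lee has fracture.",
]
library = build_library(notes)
print(scrub(notes[0], library))
```

With this toy three-note corpus, only "patient" and "has" reach the safe threshold, so clinically useful words such as "pneumonia" are replaced too. That over-redaction is the reason a realistic library would be built from a very large corpus and would route "unsure" words to manual review.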