A de-identifier for medical discharge summaries

https://doi.org/10.1016/j.artmed.2007.10.001

Summary

Objective

Clinical records contain significant medical information that can be useful to researchers in various disciplines. However, these records also contain personal health information (PHI) whose presence limits the use of the records outside of hospitals.

The goal of de-identification is to remove all PHI from clinical records. This is a challenging task because many records contain foreign and misspelled PHI; they also contain PHI that are ambiguous with non-PHI. These complications are compounded by the linguistic characteristics of clinical records. For example, medical discharge summaries, which are studied in this paper, are characterized by fragmented, incomplete utterances and domain-specific language; they cannot be fully processed by tools designed for lay language.

Methods and results

In this paper, we show that we can de-identify medical discharge summaries using a de-identifier, Stat De-id, based on support vector machines and local context (F-measure = 97% on PHI). Our representation of local context aids de-identification even when PHI include out-of-vocabulary words and even when PHI are ambiguous with non-PHI within the same corpus. Comparison of Stat De-id with a rule-based approach shows that local context contributes more to de-identification than dictionaries combined with hand-tailored heuristics (F-measure = 85%). Comparison with two well-known named entity recognition (NER) systems, SNoW (F-measure = 94%) and IdentiFinder (F-measure = 36%), on five representative corpora shows that when the language of documents is fragmented, a system with a relatively thorough representation of local context can be a more effective de-identifier than systems that combine (relatively simpler) local context with global context. Comparison with a Conditional Random Field De-identifier (CRFD), which utilizes global context in addition to the local context of Stat De-id, confirms this finding (F-measure = 88%) and establishes that strengthening the representation of local context may be more beneficial for de-identification than complementing local with global context.

Introduction

Medical discharge summaries can be a major source of information for many studies. However, like all other clinical records, discharge summaries contain explicit personal health information (PHI) which, if released, would jeopardize patient privacy. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) provides guidelines for protecting the confidentiality of patient records. Paragraph 164.514 of the Administrative Simplification Regulations promulgated under HIPAA states that for data to be treated as de-identified, it must clear one of two hurdles:

  1. An expert must determine and document “that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information.”

  2. Or, the data must be purged of a specified list of seventeen categories of possible identifiers, i.e., PHI, relating to the patient or relatives, household members and employers, and any other information that may make it possible to identify the individual [1]. Many institutions consider the clinicians caring for a patient and the names of hospitals, clinics, and wards to fall under this final category because of the heightened risk of identifying patients from such information [2], [3].

Of the 17 categories of PHI listed by HIPAA, the following appear in medical discharge summaries: first and last names of patients, of their health proxies, and of their family members; identification numbers; telephone, fax, and pager numbers; geographic locations; and dates. In addition, names of doctors and hospitals are frequently mentioned in discharge summaries; for this study, we add them to the list of PHI. Given discharge summaries, our goal is to find the above listed PHI and to replace them with either anonymous tags or realistic surrogates.

Medical discharge summaries are characterized by fragmented, incomplete utterances and domain-specific language. As such, they cannot be effectively processed by tools designed for lay language text such as news articles [4]. In addition, discharge summaries contain some words that can appear both as PHI and non-PHI within the same corpus, e.g., the word Huntington can be both the name of a person, “Dr. Huntington”, and the name of a disease, “Huntington's disease”. They also contain foreign and misspelled words as PHI, e.g., John misspelled as Jhn and foreign variants such as Ioannes. These complexities pose challenges to de-identification.

An ideal de-identification system needs to identify PHI perfectly. However, while anonymizing the PHI, such a system needs to also protect the integrity of the data by maintaining all of the non-PHI, so that medical records can later be processed and retrieved based on their inclusion of these terms. Almost all methods that determine whether a target word, i.e., the word to be classified as PHI or non-PHI, is PHI base their decision on a combination of features related to the target itself, to words that surround the target, and to discourse segments containing the target. We call the features extracted from the words surrounding the target and from the discourse segment containing the target the context of the target. In this paper, we are particularly interested in comparing methods that rely on what we call local context, by which we mean the words that immediately surround the target (local lexical context) or that are linked to it by some immediate syntactic relationship (local syntactic context), and global context, which refers to the relationships of the target with the contents of the discourse segment containing the target. For example, the surrounding k-tuples of words to the left and right of a target are common components of local context, whereas a model that selects the highest probability interpretation of an entire sentence by a Markov model employs sentential global context (where the discourse segment is a sentence).

In this paper, we present a de-identifier, Stat De-id, which uses local context to de-identify medical discharge summaries. We treat de-identification as a multi-class classification task; the goal is to consider each word in isolation and to decide whether it represents a patient, doctor, hospital, location, date, telephone, ID, or non-PHI. We use support vector machines (SVMs), as implemented by LIBSVM [5], trained on human-annotated data as a means to this end.

Our representation of local context benefits from orthographic, syntactic, and semantic characteristics of each target word and the words within a ±2 context window of the target. Other models of local context have used the features of words immediately adjacent to the target word; our representation is more thorough as it includes (for a ±2 context) local syntactic context, i.e., the features of words that are linked to the target by syntactic relations identified by a parse of the sentence. This novel representation of local syntactic context uses the Link Grammar Parser [6], which can provide at least a partial syntactic parse even for incomplete and fragmented sentences [7]. Note that syntactic parses can be generally regarded as sentential features. However, in our corpora, more than 40% of the sentences only partially parse. The features extracted from such partial parses represent phrases rather than sentences and contribute to local context. For sentences that completely parse, our representation benefits from syntactic parses only to the extent that they help us relate the target to its immediate neighbors (within two links), again extracting local context.
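As an illustration of the lexical part of such a representation, the following is a minimal Python sketch that collects features for a target word and its neighbours within a ±2 window. It is not the feature set of Stat De-id: the function names, the feature names, and the small title list are hypothetical, and the local syntactic context derived from the Link Grammar Parser is omitted entirely.

```python
WINDOW = 2  # the ±2 context window described in the text

def orthographic_features(word):
    """A few simple, hypothetical orthographic cues for one word."""
    return {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "is_title": word.lower().rstrip(".") in {"dr", "mr", "mrs", "ms"},
    }

def local_lexical_context(tokens, i, window=WINDOW):
    """Features of the target token and of its neighbours within +/-window."""
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            word_feats = orthographic_features(tokens[j])
        else:
            # Positions outside the sentence get a padding marker.
            word_feats = {"lower": "<PAD>"}
        for name, value in word_feats.items():
            # Keys look like "-1:is_title" or "+0:lower".
            features[f"{offset:+d}:{name}"] = value
    return features

tokens = "He met with Dr. Huntington yesterday".split()
feats = local_lexical_context(tokens, tokens.index("Huntington"))
```

In this toy sentence the feature `-1:is_title` is true for the target Huntington, which is the kind of cue that separates “Dr. Huntington” from “Huntington's disease”. A dictionary of this shape could then be passed to any classifier that accepts sparse categorical features.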

On five separate corpora obtained from Partners Healthcare and Beth Israel Deaconess Medical Center, we show that despite the fragmented and incomplete utterances and the domain-specific language that dominate the text of discharge summaries, we can capture the patterns in the language of these documents by focusing on local context; we can use these patterns for de-identification. Stat De-id, presented in this paper, is built on this hypothesis. It finds more than 90% of the PHI even in the face of ambiguity between PHI and non-PHI, and even in the presence of foreign words and spelling errors in PHI.

We compare Stat De-id with a rule-based heuristic + dictionary approach [8], two named entity recognizers, SNoW [9] and IdentiFinder [10], and a Conditional Random Field De-identifier (CRFD). SNoW and IdentiFinder also use local context; however, their representation of local context is relatively simple and, for named entity recognition (NER), is complemented with information from sentential global context, i.e., the dependencies of entities with each other and with non-entity tokens in a single sentence. CRFD, developed by us for the studies presented in this paper, employs the exact same local context used by Stat De-id and reinforces this local context with sentential global context. In this manuscript, we refer to sentential global context simply as global context. Because medical discharge summaries contain many short, fragmented sentences, we hypothesize that global context will add limited value to local context for de-identification, and that strengthening the representation of local context will be more effective for improving de-identification. We present experimental results to support this hypothesis: on our corpora, Stat De-id significantly outperforms all of SNoW, IdentiFinder, CRFD, and the heuristic + dictionary approach.

The performance of Stat De-id is encouraging and can guide research in identification of entities in corpora with fragmented, incomplete utterances and even domain-specific language. Our results show that even on such corpora, it is possible to create a useful representation of local context and to identify the entities indicated by this context.

Section snippets

Background and related work

A number of investigators have developed methods for de-identifying medical corpora or for recognizing named entities in non-clinical text (which can be directly applied to at least part of the de-identification problem). The two main approaches taken have been either (a) use of dictionaries, pattern matching, and local rules or (b) statistical methods trained on features of the word(s) in question and their local or global context. Our work on Stat De-id falls into the second of these

Definitions

We define the PHI found in medical discharge summaries as follows:

  • Patients: include the first and last names of patients, their health proxies, and family members. Titles, such as Mr., are excluded, e.g., “Mrs. [Lunia Smith]_patient was …”.

  • Doctors: include medical doctors and other practitioners. Again titles, such as Dr., are not considered part of PHI, e.g., “He met with Dr. [John Doe]_doctor”.

  • Hospitals: include names of medical organizations. We categorize the entire institution name as PHI

Hypotheses

We hypothesize that we can de-identify medical discharge summaries even when the documents contain many fragmented and incomplete utterances, even when many words are ambiguous between PHI and non-PHI, and even in the presence of foreign words and spelling errors in PHI. Given the nature of the domain-specific language of discharge summaries, we hypothesize that a thorough representation of local context will be more effective for de-identification than (relatively simpler) local context

Corpora

We tested our methods on five different corpora, three of which were developed from a corpus of 48 discharge summaries from various medical departments at the Beth Israel Deaconess Medical Center (BIDMC), the fourth of which consisted of authentic data including actual PHI from 90 discharge summaries of deceased patients from Partners HealthCare, and the fifth of which came from a corpus of 889 de-identified discharge summaries, also from Partners. The sizes of these corpora and the

Methods: Stat De-id

Categories of PHI are often characterized by local context. For example, the word Dr. before a name invariably suggests that the name is that of a doctor. While titles such as Dr. provide easy context markers, other clues may not be as straightforward, especially when the language of documents is dominated by fragmented and incomplete utterances. We created a representation of local context that is useful for recognizing PHI even in fragmented, incomplete utterances.

We devised Stat De-id, a

Baseline approaches

We compared Stat De-id with a scheme that relies heavily on dictionaries and hand-built heuristics [8], with Roth and Yih's SNoW [9], with BBN's IdentiFinder [10], and with our in-house Conditional Random Field De-identifier (CRFD). SNoW, IdentiFinder, and CRFD take into account dependencies of entities with each other and with non-entity tokens in a sentence, i.e., sentential global context, while Stat De-id focuses on each word in the sentence in isolation, using only local context provided

Precision, recall, and F-measure

We evaluated the de-identification and NER systems on four artificial and one authentic corpora. We evaluated Stat De-id using 10-fold cross-validation; in each round of cross-validation we extracted features only from the training corpus, trained the SVM only on these features, and evaluated performance on a held-out validation set. To compare with the performance of baseline systems, we computed precision, recall, and F-measures for each system. Precision for class x is defined as β/B where β
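For reference, the three measures can be computed per class as in the Python sketch below. It uses the standard token-level definitions and the balanced F-measure. The paper's own formulation is cut off in this excerpt, so the exact counting unit and weighting are assumptions, and the function name and toy labels are hypothetical.

```python
def precision_recall_f(gold, predicted, label):
    """Token-level precision, recall, and balanced F-measure for one class."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    n_predicted = sum(1 for p in predicted if p == label)  # tokens the system assigned to the class
    n_gold = sum(1 for g in gold if g == label)            # tokens that truly belong to the class
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy gold-standard and system labels, one per token (illustrative only).
gold      = ["doctor", "non-PHI", "patient", "patient", "non-PHI", "date"]
predicted = ["doctor", "non-PHI", "patient", "non-PHI", "patient", "date"]
p, r, f = precision_recall_f(gold, predicted, "patient")
```

Here one of the two tokens predicted as patient is correct and one of the two gold patient tokens is found, so precision, recall, and F-measure are all 0.5 for that class.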

De-identifying random and authentic corpora

We first de-identified the random and authentic corpora. On the random corpus, Stat De-id significantly outperformed all of IdentiFinder, CRFD, and the heuristic + dictionary baseline. Its F-measure on PHI was 97.63% compared to IdentiFinder's 68.35%, CRFD's 81.55%, and the heuristic + dictionary scheme's 77.82% (see Table 4).

Multi-class SVM results and implications for future research

The goal of this paper is to separate PHI from non-PHI for de-identification purposes. De-identification is achieved simply by discarding the PHI that are found, or by replacing the PHI with anonymous tags or surrogates. We have so far shown that Stat De-id recognizes 94–97% of the PHI and outperforms all other systems. However, from a policy perspective, the adequacy of the performance of Stat De-id depends on the PHI that are missed. Not all PHI are equally strong identifiers of individuals.
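The replacement step itself is mechanical once each word carries a predicted class. The Python sketch below shows one way it could be done, assuming per-token labels are already available; the function name, the tag format, and the example sentence are hypothetical and are not taken from Stat De-id.

```python
def replace_phi(tokens, labels, non_phi="non-PHI"):
    """Replace each run of same-class PHI tokens with a single anonymous tag."""
    output = []
    previous = non_phi
    for token, label in zip(tokens, labels):
        if label == non_phi:
            output.append(token)                # non-PHI text is kept verbatim
        elif label != previous:
            output.append(f"[{label.upper()}]")  # first token of a new PHI run
        # later tokens of the same run are dropped
        previous = label
    return " ".join(output)

tokens = ["He", "met", "with", "Dr.", "John", "Doe", "on", "05/12"]
labels = ["non-PHI", "non-PHI", "non-PHI", "non-PHI",
          "doctor", "doctor", "non-PHI", "date"]
scrubbed = replace_phi(tokens, labels)
# scrubbed == "He met with Dr. [DOCTOR] on [DATE]"
```

This simple version merges two adjacent entities of the same class into one tag. Replacing PHI with realistic surrogates rather than tags would need a separate generation step.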

Conclusions

In this paper, we have shown that we can de-identify clinical text, characterized by fragmented and incomplete utterances, using local context 94–97% of the time. Our representation of local context is novel; it includes novel syntactic features which provide us with useful linguistic information even when the language of documents is fragmented. The results presented imply that de-identification can be performed even when corpora are dominated by fragmented and incomplete utterances, even when

Acknowledgements

This work was supported in part by the National Institutes of Health through research grants 1 RO1 EB001659 from the National Institute of Biomedical Imaging and Bioengineering and through the NIH Roadmap for Medical Research, Grant U54LM008748. IRB approval has been granted for the studies presented in this manuscript. We thank the anonymous reviewers for their insightful comments and constructive feedback.

References (39)

  • Ö. Uzuner et al.

    Evaluating the state-of-the-art in automatic de-identification

    J Am Med Inform Assoc

    (2007)
  • S. Pyysalo et al.

    Evaluation of two dependency parsers on biomedical corpus targeted at protein–protein interactions

    Int J Med Inform

    (2006)
  • Health Insurance Portability and Accountability Act, Section 164.514, <http://www.hhs.gov/ocr/AdminSimpRegText.pdf>;...
  • B. Malin et al.

    The effects of location access behavior on re-identification risk in a distributed environment

  • L. Sweeney

    Replacing personally-identifying information in medical records, the Scrub system

  • C. Lovis et al.

    Fast exact string pattern matching algorithms adapted to the characteristics of the medical language

    J Am Med Inform Assoc

    (2000)
  • Chang C, Lin C. LIBSVM: a library for support vector machines. Manual, Department of Computer Science and Information...
  • Sleator D, Temperley D. Parsing English with a link grammar. Technical report CMU-CS-91-196. Pittsburgh, PA, USA:...
  • Grinberg D, Lafferty J, Sleator D. A robust parsing algorithm for link grammars. Technical report CMU-CS-95-125....
  • Douglass M. Computer assisted de-identification of free text nursing notes. Master's thesis. Cambridge, MA, USA:...
  • D. Roth et al.

    Probabilistic reasoning for entity and relation recognition

  • D. Bikel et al.

    An algorithm that learns what's in a name

    Mach Learn J Spec Issue Nat Lang Learn

    (1999)
  • D. Gupta et al.

    Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research

    Am J Clin Pathol

    (2004)
  • B. Beckwith et al.

    Development and evaluation of an open source software tool for de-identification of pathology reports

    BMC Med Inform Decis Making

    (2006)
  • J. Berman

    Concept-match medical data scrubbing: how pathology text can be used in research

    Arch Pathol Lab Med

    (2003)
  • R. Taira et al.

    Identification of patient name references within medical documents using semantic selectional restrictions

  • The ACE 2007 evaluation plan, <http://www.nist.gov/speech/tests/ace/ace07/doc/ace07-evalplan.v1.3a.pdf>; 2007 [accessed...
  • H. Isozaki et al.

    Efficient support vector classifiers for named entity recognition

  • Unified Medical Language System [web page], <http://www.nlm.nih.gov/pubs/factsheets/umls.html>; 2006 [accessed...

    This is a thoroughly revised and extended version of the preliminary draft “Role of Local Context in De-identification of Ungrammatical, Fragmented Text” which was presented at the conference of the North American Chapter of Association for Computational Linguistics/Human Language Technology (NAACL-HLT 2006) in June 2006.
