A de-identifier for medical discharge summaries☆
Introduction
Medical discharge summaries can be a major source of information for many studies. However, like all other clinical records, discharge summaries contain explicit personal health information (PHI) which, if released, would jeopardize patient privacy. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) provides guidelines for protecting the confidentiality of patient records. Paragraph 164.514 of the Administrative Simplification Regulations promulgated under HIPAA states that for data to be treated as de-identified, it must clear one of two hurdles:
1. An expert must determine and document “that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information.”
2. Or, the data must be purged of a specified list of seventeen categories of possible identifiers, i.e., PHI, relating to the patient or relatives, household members and employers, and any other information that may make it possible to identify the individual [1]. Many institutions consider the clinicians caring for a patient and the names of hospitals, clinics, and wards to fall under this final category because of the heightened risk of identifying patients from such information [2], [3].
Of the 17 categories of PHI listed by HIPAA, the following appear in medical discharge summaries: first and last names of patients, of their health proxies, and of their family members; identification numbers; telephone, fax, and pager numbers; geographic locations; and dates. In addition, names of doctors and hospitals are frequently mentioned in discharge summaries; for this study, we add them to the list of PHI. Given discharge summaries, our goal is to find the above listed PHI and to replace them with either anonymous tags or realistic surrogates.
Medical discharge summaries are characterized by fragmented, incomplete utterances and domain-specific language. As such, they cannot be effectively processed by tools designed for lay language text such as news articles [4]. In addition, discharge summaries contain some words that can appear both as PHI and non-PHI within the same corpus, e.g., the word Huntington can be both the name of a person, “Dr. Huntington”, and the name of a disease, “Huntington's disease”. They also contain foreign and misspelled words as PHI, e.g., John misspelled as Jhn and foreign variants such as Ioannes. These complexities pose challenges to de-identification.
An ideal de-identification system needs to identify PHI perfectly. However, while anonymizing the PHI, such a system needs to also protect the integrity of the data by maintaining all of the non-PHI, so that medical records can later be processed and retrieved based on their inclusion of these terms. Almost all methods that determine whether a target word, i.e., the word to be classified as PHI or non-PHI, is PHI base their decision on a combination of features related to the target itself, to words that surround the target, and to discourse segments containing the target. We call the features extracted from the words surrounding the target and from the discourse segment containing the target the context of the target. In this paper, we are particularly interested in comparing methods that rely on what we call local context, by which we mean the words that immediately surround the target (local lexical context) or that are linked to it by some immediate syntactic relationship (local syntactic context), and global context, which refers to the relationships of the target with the contents of the discourse segment containing the target. For example, the surrounding k-tuples of words to the left and right of a target are common components of local context, whereas a model that selects the highest probability interpretation of an entire sentence by a Markov model employs sentential global context (where the discourse segment is a sentence).
In this paper, we present a de-identifier, Stat De-id, which uses local context to de-identify medical discharge summaries. We treat de-identification as a multi-class classification task; the goal is to consider each word in isolation and to decide whether it represents a patient, doctor, hospital, location, date, telephone, ID, or non-PHI. We use support vector machines (SVMs), as implemented by LIBSVM [5], trained on human-annotated data as a means to this end.
Our representation of local context benefits from orthographic, syntactic, and semantic characteristics of each target word and the words within a ±2 context window of the target. Other models of local context have used the features of words immediately adjacent to the target word; our representation is more thorough as it includes (for a ±2 context) local syntactic context, i.e., the features of words that are linked to the target by syntactic relations identified by a parse of the sentence. This novel representation of local syntactic context uses the Link Grammar Parser [6], which can provide at least a partial syntactic parse even for incomplete and fragmented sentences [7]. Note that syntactic parses can be generally regarded as sentential features. However, in our corpora, more than 40% of the sentences only partially parse. The features extracted from such partial parses represent phrases rather than sentences and contribute to local context. For sentences that completely parse, our representation benefits from syntactic parses only to the extent that they help us relate the target to its immediate neighbors (within two links), again extracting local context.
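As an illustration only, the local lexical part of such a representation can be sketched as follows. This is not the Stat De-id feature set: the particular orthographic features, the `PAD` token, and the function names are assumptions made for the example, and the local syntactic context derived from the Link Grammar Parser is not modeled here.

```python
import re

PAD = "<PAD>"  # hypothetical padding token for positions outside the sentence


def orthographic_features(word):
    """Return a few simple surface features of a single word (illustrative choice)."""
    return {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_punct": bool(re.search(r"[^\w\s]", word)),
        "length": len(word),
    }


def local_lexical_context(tokens, index, window=2):
    """Collect features of the target and of the words within +/- window positions."""
    features = {}
    for offset in range(-window, window + 1):
        position = index + offset
        word = tokens[position] if 0 <= position < len(tokens) else PAD
        for name, value in orthographic_features(word).items():
            # Keys look like "-1:lower" or "+0:is_capitalized".
            features[f"{offset:+d}:{name}"] = value
    return features


tokens = ["He", "met", "with", "Dr.", "Huntington", "today"]
features = local_lexical_context(tokens, tokens.index("Huntington"))
print(features["-1:lower"])  # the word immediately to the left of the target
```

In this sketch, the title to the left of the target "Huntington" becomes a feature of the target, which is the kind of cue that can separate a doctor's name from the name of a disease. A feature dictionary of this form would still need to be converted to a numeric vector before being passed to a classifier such as an SVM.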
On five separate corpora obtained from Partners Healthcare and Beth Israel Deaconess Medical Center, we show that despite the fragmented and incomplete utterances and the domain-specific language that dominate the text of discharge summaries, we can capture the patterns in the language of these documents by focusing on local context; we can use these patterns for de-identification. Stat De-id, presented in this paper, is built on this hypothesis. It finds more than 90% of the PHI even in the face of ambiguity between PHI and non-PHI, and even in the presence of foreign words and spelling errors in PHI.
We compare Stat De-id with a rule-based heuristic + dictionary approach [8], two named entity recognizers, SNoW [9] and IdentiFinder [10], and a Conditional Random Field De-identifier (CRFD). SNoW and IdentiFinder also use local context; however, their representation of local context is relatively simple and, for named entity recognition (NER), is complemented with information from sentential global context, i.e., the dependencies of entities with each other and with non-entity tokens in a single sentence. CRFD, developed by us for the studies presented in this paper, employs the exact same local context used by Stat De-id and reinforces this local context with sentential global context. In this manuscript, we refer to sentential global context simply as global context. Because medical discharge summaries contain many short, fragmented sentences, we hypothesize that global context will add limited value to local context for de-identification, and that strengthening the representation of local context will be more effective for improving de-identification. We present experimental results to support this hypothesis: on our corpora, Stat De-id significantly outperforms all of SNoW, IdentiFinder, CRFD, and the heuristic + dictionary approach.
The performance of Stat De-id is encouraging and can guide research in identification of entities in corpora with fragmented, incomplete utterances and even domain-specific language. Our results show that even on such corpora, it is possible to create a useful representation of local context and to identify the entities indicated by this context.
Background and related work
A number of investigators have developed methods for de-identifying medical corpora or for recognizing named entities in non-clinical text (which can be directly applied to at least part of the de-identification problem). The two main approaches taken have been either (a) use of dictionaries, pattern matching, and local rules or (b) statistical methods trained on features of the word(s) in question and their local or global context. Our work on Stat De-id falls into the second of these
Definitions
We define the PHI found in medical discharge summaries as follows:
- Patients: include the first and last names of patients, their health proxies, and family members. Titles, such as Mr., are excluded, e.g., “Mrs. [Lunia Smith]patient was …”.
- Doctors: include medical doctors and other practitioners. Again titles, such as Dr., are not considered part of PHI, e.g., “He met with Dr. [John Doe]doctor”.
- Hospitals: include names of medical organizations. We categorize the entire institution name as PHI
Hypotheses
We hypothesize that we can de-identify medical discharge summaries even when the documents contain many fragmented and incomplete utterances, even when many words are ambiguous between PHI and non-PHI, and even in the presence of foreign words and spelling errors in PHI. Given the nature of the domain-specific language of discharge summaries, we hypothesize that a thorough representation of local context will be more effective for de-identification than (relatively simpler) local context
Corpora
We tested our methods on five different corpora, three of which were developed from a corpus of 48 discharge summaries from various medical departments at the Beth Israel Deaconess Medical Center (BIDMC), the fourth of which consisted of authentic data including actual PHI from 90 discharge summaries of deceased patients from Partners HealthCare, and the fifth of which came from a corpus of 889 de-identified discharge summaries, also from Partners. The sizes of these corpora and the
Methods: Stat De-id
Categories of PHI are often characterized by local context. For example, the word Dr. before a name invariably suggests that the name is that of a doctor. While titles such as Dr. provide easy context markers, other clues may not be as straightforward, especially when the language of documents is dominated by fragmented and incomplete utterances. We created a representation of local context that is useful for recognizing PHI even in fragmented, incomplete utterances.
We devised Stat De-id, a
Baseline approaches
We compared Stat De-id with a scheme that relies heavily on dictionaries and hand-built heuristics [8], with Roth and Yih's SNoW [9], with BBN's IdentiFinder [10], and with our in-house Conditional Random Field De-identifier (CRFD). SNoW, IdentiFinder, and CRFD take into account dependencies of entities with each other and with non-entity tokens in a sentence, i.e., sentential global context, while Stat De-id focuses on each word in the sentence in isolation, using only local context provided
Precision, recall, and F-measure
We evaluated the de-identification and NER systems on four artificial and one authentic corpora. We evaluated Stat De-id using 10-fold cross-validation; in each round of cross-validation we extracted features only from the training corpus, trained the SVM only on these features, and evaluated performance on a held-out validation set. To compare with the performance of baseline systems, we computed precision, recall, and F-measures for each system. Precision for class x is defined as β/B where β
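The per-class measures can be sketched in a few lines. This is a minimal illustration using the standard token-level definitions (precision as correct predictions of a class over all predictions of that class, recall as correct predictions over all gold instances of that class); the balanced F-measure and the function name are assumptions, since the exact weighting used in the paper is not shown in this excerpt.

```python
def precision_recall_f(gold, predicted, label):
    """Compute precision, recall, and balanced F-measure for one class."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p == label)
    predicted_count = sum(1 for p in predicted if p == label)
    gold_count = sum(1 for g in gold if g == label)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = correct / predicted_count if predicted_count else 0.0
    recall = correct / gold_count if gold_count else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Toy labels invented for this example, one label per token.
gold = ["patient", "non-PHI", "doctor", "doctor", "non-PHI", "date"]
predicted = ["patient", "non-PHI", "doctor", "non-PHI", "doctor", "date"]
p, r, f = precision_recall_f(gold, predicted, "doctor")
print(p, r, f)
```

In the toy data, one of the two predicted doctor tokens is correct and one of the two gold doctor tokens is found, so precision, recall, and F-measure for that class are all 0.5.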
De-identifying random and authentic corpora
We first de-identified the random and authentic corpora. On the random corpus, Stat De-id significantly outperformed all of IdentiFinder, CRFD, and the heuristic + dictionary baseline. Its F-measure on PHI was 97.63% compared to IdentiFinder's 68.35%, CRFD's 81.55%, and the heuristic + dictionary scheme's 77.82% (see Table 4).
Multi-class SVM results and implications for future research
The goal of this paper is to separate PHI from non-PHI for de-identification purposes. De-identification is achieved simply by discarding the PHI that are found, or by replacing the PHI with anonymous tags or surrogates. We have so far shown that Stat De-id recognizes 94–97% of the PHI and outperforms all other systems. However, from a policy perspective, the adequacy of the performance of Stat De-id depends on the PHI that are missed. Not all PHI are equally strong identifiers of individuals.
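Once each word carries a predicted class, replacement with anonymous tags is mechanical. The sketch below assumes one label per token and a hypothetical tag format such as `[PATIENT]`; neither is taken from the paper, and surrogate generation is not shown.

```python
def replace_phi(tokens, labels, non_phi="non-PHI"):
    """Replace each maximal run of same-class PHI tokens with one anonymous tag."""
    output = []
    previous = non_phi
    for token, label in zip(tokens, labels):
        if label == non_phi:
            output.append(token)  # non-PHI words are kept unchanged
        elif label != previous:
            output.append(f"[{label.upper()}]")  # first token of a new PHI run
        # Later tokens of the same run are dropped; the tag already covers them.
        previous = label
    return output


# Toy sentence built from the example names used in the definitions.
tokens = ["Mrs.", "Lunia", "Smith", "met", "with", "Dr.", "John", "Doe"]
labels = ["non-PHI", "patient", "patient", "non-PHI",
          "non-PHI", "non-PHI", "doctor", "doctor"]
deidentified = " ".join(replace_phi(tokens, labels))
print(deidentified)
```

One limitation of this simple scheme is that two distinct entities of the same class that appear back to back collapse into a single tag; span-level annotations would be needed to keep them apart.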
Conclusions
In this paper, we have shown that we can de-identify clinical text, characterized by fragmented and incomplete utterances, using local context 94–97% of the time. Our representation of local context is novel; it includes novel syntactic features which provide us with useful linguistic information even when the language of documents is fragmented. The results presented imply that de-identification can be performed even when corpora are dominated by fragmented and incomplete utterances, even when
Acknowledgements
This work was supported in part by the National Institutes of Health through research grants 1 RO1 EB001659 from the National Institute of Biomedical Imaging and Bioengineering and through the NIH Roadmap for Medical Research, Grant U54LM008748. IRB approval has been granted for the studies presented in this manuscript. We thank the anonymous reviewers for their insightful comments and constructive feedback.
References (39)
- et al. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc (2007)
- et al. Evaluation of two dependency parsers on biomedical corpus targeted at protein–protein interactions. Int J Med Inform (2006)
- Health Information Portability and Accountability Act, Section 164.514, <http://www.hhs.gov/ocr/AdminSimpRegText.pdf>;...
- et al. The effects of location access behavior on re-identification risk in a distributed environment
- Replacing personally-identifying information in medical records, the Scrub system
- et al. Fast exact string pattern matching algorithms adapted to the characteristics of the medical language. J Am Med Inform Assoc (2000)
- Chang C, Lin C. LIBSVM: a library for support vector machines. Manual, Department of Computer Science and Information...
- Sleator D, Temperley D. Parsing English with a link grammar. Technical report CMU-CS-91-196. Pittsburgh, PA, USA:...
- Grinberg D, Lafferty J, Sleator D. A robust parsing algorithm for link grammars. Technical report CMU-CS-95-125....
- Douglass M. Computer assisted de-identification of free text nursing notes. Master's thesis. Cambridge, MA, USA:...
- Probabilistic reasoning for entity and relation recognition
- An algorithm that learns what's in a name. Mach Learn J Spec Issue Nat Lang Learn
- Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research. Am J Clin Pathol
- Development and evaluation of an open source software tool for de-identification of pathology reports. BMC Med Inform Decis Making
- Concept-match medical data scrubbing: how pathology text can be used in research. Arch Pathol Lab Med
- Identification of patient name references within medical documents using semantic selectional restrictions
- Efficient support vector classifiers for named entity recognition
☆ This is a thoroughly revised and extended version of the preliminary draft “Role of Local Context in De-identification of Ungrammatical, Fragmented Text” which was presented at the conference of the North American Chapter of Association for Computational Linguistics/Human Language Technology (NAACL-HLT 2006) in June 2006.