
Data & Knowledge Engineering

Volume 68, Issue 12, December 2009, Pages 1441-1451

An integrated framework for de-identifying unstructured medical data

https://doi.org/10.1016/j.datak.2009.07.006

Abstract

While there is an increasing need to share medical information for public health research, such data sharing must preserve patient privacy without disclosing any information that can be used to identify a patient. A considerable amount of research in the data privacy community has been devoted to formalizing the notion of identifiability and developing anonymization techniques, but that work focuses exclusively on structured data. On the other hand, efforts on de-identifying medical text documents in the medical informatics community rely on simple identifier removal or grouping techniques without taking advantage of the research developments in the data privacy community. This paper attempts to fill these gaps and presents a framework and prototype system for de-identifying health information including both structured and unstructured data. We empirically study a simple Bayesian classifier, a Bayesian classifier with a sampling-based technique, and a conditional random field based classifier for extracting identifying attributes from unstructured data. We deploy a k-anonymization-based technique for de-identifying the extracted data while preserving maximum data utility. We present a set of preliminary evaluations showing the effectiveness of our approach.

Introduction

Current information technology enables many organizations to collect, store, and use various types of information about individuals. Governments and organizations are increasingly recognizing the critical value of sharing such a wealth of information. However, individually identifiable health information is protected under the Health Insurance Portability and Accountability Act (HIPAA).1

The National Cancer Institute initiated the Shared Pathology Informatics Network (SPIN)2 to allow researchers throughout the country to share pathology-based data sets annotated with clinical information in order to discover and validate new diagnostic tests and therapies. Fig. 1 shows a sample pathology report section with personally identifying information such as age and medical record number highlighted. Each institution must de-identify or anonymize its data before making it accessible to the network. This network of shared data consists of both structured and unstructured data of various formats. Most medical data is heterogeneous, meaning that even structured data from different institutions are labeled differently, and unstructured data is inherently heterogeneous. We use the terms heterogeneous and unstructured data interchangeably throughout this paper.

Currently, investigators or institutions wishing to use medical records for research purposes have three options: obtain permission from the patients, obtain a waiver of informed consent from their Institutional Review Boards (IRB), or use a data set that has had all or most of the identifiers removed. The last option can be generalized into the problem of de-identification or anonymization (both terms are used interchangeably throughout this paper), where a data custodian distributes to a data recipient an anonymized view of the data that does not contain individually identifiable information. This approach provides a scalable way to share medical information in large-scale environments while preserving the privacy of patients.

At first glance, the general problem of data anonymization appears well covered: it has been extensively studied in recent years in the data privacy community [11]. The seminal work by Sweeney et al. shows that a dataset that simply has identifiers removed is subject to linking attacks [35]. Since then, a large body of work [15], [41], [4], [1], [12], [5], [47], [21], [22], [43], [46] has contributed data anonymization methods that transform a dataset to satisfy a privacy principle such as k-anonymity, using techniques such as generalization, suppression (removal), permutation, and swapping of certain data values so that the result does not contain individually identifiable information.
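
To ground the k-anonymity principle mentioned above, the following minimal Python sketch checks whether a toy table satisfies k-anonymity over a chosen set of quasi-identifiers and applies one generalization step when it does not. It is illustrative only: the records, the choice of ZIP code and age as quasi-identifiers, and the coarsening rules are assumptions, not taken from any of the cited algorithms.

    from collections import Counter

    # Toy records with hypothetical quasi-identifiers (ZIP code, age).
    records = [
        {"zip": "30322", "age": 34, "diagnosis": "flu"},
        {"zip": "30322", "age": 36, "diagnosis": "asthma"},
        {"zip": "30329", "age": 52, "diagnosis": "flu"},
        {"zip": "30329", "age": 57, "diagnosis": "diabetes"},
    ]
    QUASI_IDENTIFIERS = ("zip", "age")

    def is_k_anonymous(rows, quasi_ids, k):
        """A table is k-anonymous if every combination of quasi-identifier
        values is shared by at least k records."""
        counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
        return all(c >= k for c in counts.values())

    def generalize(row):
        """One generalization step: coarsen age to a 10-year range and keep
        only the 3-digit ZIP prefix (hypothetical hierarchies)."""
        g = dict(row)
        decade = (row["age"] // 10) * 10
        g["age"] = f"{decade}-{decade + 9}"
        g["zip"] = row["zip"][:3] + "**"
        return g

    print(is_k_anonymous(records, QUASI_IDENTIFIERS, k=2))                         # False
    print(is_k_anonymous([generalize(r) for r in records], QUASI_IDENTIFIERS, 2))  # True

In the cited k-anonymization algorithms, generalization is driven by value hierarchies and chosen to minimize information loss rather than applied uniformly as above.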

While research on data anonymization has made great progress, its practical utilization in medical fields lags behind. An overarching complexity of medical data, often overlooked in data privacy research, is data heterogeneity. A considerable amount of medical data resides in unstructured text such as clinical notes, radiology and pathology reports, and discharge summaries. While some identifying attributes can be clearly defined in structured data, an extensive set of identifying information is often hidden in the text or referred to in multiple, differing ways. Unfortunately, the bulk of data privacy research focuses exclusively on structured data.

On the other hand, efforts on de-identifying medical text documents in the medical informatics community [33], [34], [37], [36], [14], [32], [3], [39] are mostly specialized for specific document types or a subset of the HIPAA identifiers. Most importantly, they rely on simple identifier removal techniques without taking advantage of the developments in the data privacy community that guarantee a more formal notion of privacy while maximizing data utility.

Our work attempts to fill the above gaps and to bridge the data privacy and medical informatics communities by developing a framework and prototype system, HIDE, for Health Information DE-identification of both structured and unstructured data. The contributions of our work are twofold. First, our system advances the medical informatics field by adopting information extraction (also referred to as attribute extraction) and data anonymization techniques for de-identifying heterogeneous health information. Second, the conceptual framework of our system advances the data privacy field by integrating the anonymization process for both structured and unstructured data. The specific components and contributions of our system are as follows:

  • Identifying and sensitive information extraction. We leverage and empirically study existing named entity extraction techniques [25], [30], in particular a simple Bayesian classifier, a Bayesian classifier with a sampling-based technique, and a conditional random field based classifier, to effectively extract identifying and sensitive information from unstructured data (a minimal token-classification sketch follows this list).

  • Data linking. In order to preserve privacy for individuals and apply advanced anonymization techniques in the heterogeneous data space, we propose a structured identifier view with identifying attributes linked to each individual.

  • Anonymization. We perform data suppression and generalization on the identifier view to anonymize the data with different options including full de-identification, partial de-identification, and statistical anonymization based on k-anonymization.
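
As a concrete, deliberately toy illustration of the extraction component (first item above), the sketch below trains a token-level classifier with scikit-learn. Everything in it is an assumption made for illustration: the feature set, the single hand-labeled sentence, and the NAME/MRN/AGE tags merely stand in for the classifiers and tag set studied in the paper. A naive Bayes model is used for brevity; a conditional random field, as evaluated in the paper, would additionally model dependencies between adjacent labels instead of classifying each token independently.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def token_features(tokens, i):
        """Simple lexical and contextual features for the i-th token
        (a hypothetical feature set, for illustration only)."""
        tok = tokens[i]
        return {
            "lower": tok.lower(),
            "is_digit": str(tok.isdigit()),
            "is_capitalized": str(tok[0].isupper()),
            "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
            "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        }

    # A tiny hand-labeled training sentence with token-level identifier tags.
    train_tokens = "Patient John Doe , MRN 123456 , age 47 .".split()
    train_labels = ["O", "NAME", "NAME", "O", "O", "MRN", "O", "O", "AGE", "O"]

    X_train = [token_features(train_tokens, i) for i in range(len(train_tokens))]
    model = make_pipeline(DictVectorizer(), MultinomialNB())
    model.fit(X_train, train_labels)

    test_tokens = "Patient Jane Roe , MRN 654321 .".split()
    X_test = [token_features(test_tokens, i) for i in range(len(test_tokens))]
    print(list(zip(test_tokens, model.predict(X_test))))

With a single training sentence the predictions are of course meaningless; the point is only the token-and-feature framing that both the Bayesian and CRF classifiers share.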

While we utilize off-the-shelf techniques for some of these components, the main contribution of our system is that it bridges research on data privacy and text management and provides an integrated framework for anonymizing heterogeneous data in practical applications. We evaluate our prototype system on real-world data and show the effectiveness of our approach.
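
To make the integration concrete, the following sketch links extracted (report, attribute, value) triples, such as those produced by the classifier sketched above, into a per-individual identifier view and then shows the three de-identification options named earlier: full removal, partial removal, and generalization of quasi-identifiers as used in k-anonymization. The report identifiers, attribute tags, the naive name-based linking rule, and the de-identification functions are all assumptions for illustration, not HIDE's actual algorithms.

    from collections import defaultdict

    # Hypothetical (report_id, attribute, value) triples, e.g. produced by a
    # token classifier like the one sketched above.
    extracted = [
        ("rpt-001", "NAME", "Jane Roe"),
        ("rpt-001", "MRN", "654321"),
        ("rpt-001", "AGE", "47"),
        ("rpt-002", "NAME", "Jane Roe"),
        ("rpt-002", "AGE", "47"),
    ]

    # Data linking: fold per-report extractions into structured rows, then
    # merge rows that share a NAME value (a deliberately naive linking rule).
    per_report = defaultdict(dict)
    for report_id, attr, value in extracted:
        per_report[report_id][attr] = value

    identifier_view = defaultdict(dict)
    for attrs in per_report.values():
        identifier_view[attrs.get("NAME")].update(attrs)

    # De-identification options on the linked identifier view.
    def full_deidentify(row):
        """Suppress every extracted attribute."""
        return {a: "[REMOVED]" for a in row}

    def partial_deidentify(row, keep=("AGE",)):
        """Suppress direct identifiers but keep selected attributes."""
        return {a: (v if a in keep else "[REMOVED]") for a, v in row.items()}

    def generalize(row):
        """Drop direct identifiers and generalize quasi-identifiers (age to a
        10-year range), a building block for k-anonymization."""
        out = {a: v for a, v in row.items() if a not in ("NAME", "MRN")}
        if "AGE" in out:
            decade = (int(out["AGE"]) // 10) * 10
            out["AGE"] = f"{decade}-{decade + 9}"
        return out

    row = identifier_view["Jane Roe"]
    print(full_deidentify(row))     # all attributes suppressed
    print(partial_deidentify(row))  # direct identifiers suppressed, AGE kept
    print(generalize(row))          # quasi-identifiers generalized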

In the rest of the paper we first describe related work. Then we describe our de-identification system including privacy models, the conceptual framework, identifier/attribute extraction, data linking, and anonymization. We then describe our experiments and results. Finally we conclude and describe further avenues of future work.

Section snippets

Related work

Our work is inspired and informed by a number of areas. We briefly review the most relevant areas below and discuss how our work leverages and advances the current state-of-the-art.

De-identification system

We first present the privacy and de-identification models used in our system, then present the conceptual framework behind our system, followed by a discussion on each component with its research challenges and proposed solutions.

Experiments

We conducted a set of preliminary experiments on a real-world dataset. In this section, we first describe our dataset and experiment setup and then present the preliminary results demonstrating the effectiveness of our approach.

Conclusion and future work

We presented a conceptual framework as well as a prototype system for anonymizing heterogeneous health information including both structured and unstructured data. Our initial experimental results show that our system effectively detects a variety of identifying attributes with high precision, and provides flexible de-identification options that anonymize the data with a given privacy guarantee while maximizing data utility to the researchers. While our work is a convincing proof-of-concept,

Acknowledgements

This research is partially supported by an Emory URC grant and Emory ITSC grant. We thank the guest editors and anonymous reviewers for their valuable comments that improved this paper.


References (47)

  • C.C. Aggarwal, On k-anonymity and the curse of dimensionality, in: Thirty-first International Conference on Very Large...
  • G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, A. Zhu, Achieving anonymity via clustering, in:...
  • R.M.B.A. Beckwith et al., Development and evaluation of an open source software tool for deidentification of pathology reports, BMC Medical Informatics and Decision Making (2006)
  • R.J. Bayardo, R. Agrawal, Data privacy through optimal k-anonymization, in: ICDE’05: Proceedings of the 21st...
  • E. Bertino, B. Ooi, Y. Yang, R.H. Deng, Privacy and ownership preserving of outsourced medical data, in: Proceedings of...
  • I. Bhattacharya, L. Getoor, Iterative record linkage for cleaning and integration, in: DMKD’04: Proceedings of the 9th...
  • Y. Bu, A. Fu, R. Wong, L. Chen, J. Li, Privacy preserving serial data publishing by role composition, in: Thirty-fourth...
  • G. Cormode, D. Srivastava, T. Yu, Q. Zhang, Anonymizing bipartite graph data using safe groupings, in: Thirty-fourth...
  • A. Culotta, A. McCallum, J. Betz, Integrating probabilistic extraction models and data mining to discover relations and...
  • X. Dong, A. Halevy, J. Madhavan, Reference reconciliation in complex information spaces, in: SIGMOD’05: Proceedings of...
  • B.C.M. Fung, K. Wang, R. Chen, P.S. Yu, Privacy-preserving data publishing: a survey on recent developments, ACM...
  • B.C.M. Fung, K. Wang, P.S. Yu, Top-down specialization for information and privacy preservation, in: Proceedings of the...
  • L. Gu, R. Baxter, D. Vickers, C. Rainsford, Record linkage: current practice and future...
  • D. Gupta et al., Evaluation of a deidentification (de-id) software engine to share pathology reports and clinical documents for research, American Journal of Clinical Pathology (2004)
  • V.S. Iyengar, Transforming data to satisfy privacy constraints, in: Proceedings of the Eighth ACM SIGKDD International...
  • P. Jurczyk, J.J. Lu, L. Xiong, J.D. Cragan, A. Correa, Fril: a tool for comparative record linkage, in: AMIA 2008...
  • D.V. Kalashnikov, S. Mehrotra, Z. Chen, Exploiting relationships for domain-independent data cleaning, in: SIAM...
  • D. Kifer, J. Gehrke, Injecting utility into anonymized datasets, in: SIGMOD Conference, 2006, pp....
  • J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling...
  • R. Leaman, G. Gonzalez, Banner: an executable survey of advances in biomedical named entity recognition, in: Pacific Symposium...
  • K. LeFevre, D. DeWitt, R. Ramakrishnan, Incognito: efficient full-domain k-anonymity, in: ACM SIGMOD International...
  • K. LeFevre, D. DeWitt, R. Ramakrishnan, Mondrian multidimensional k-anonymity, in: Proceedings of the 22nd...
  • N. Li, T. Li, T-closeness: privacy beyond k-anonymity and l-diversity, in: Proceedings of the 23rd International...

James Gardner received his BS in Mathematics and Computer Science from East Tennessee State University and his MS in Computer Science from Emory University. He is currently pursuing his Ph.D. at Emory. His current research is focused on Machine Learning, Natural Language Processing, and Semantic Web Technologies.

Li Xiong is an Assistant Professor of Mathematics and Computer Science at Emory University. She holds a Ph.D. from Georgia Institute of Technology and an MS from Johns Hopkins, both in Computer Science. She also worked as a software engineer in the IT industry for several years prior to pursuing her doctorate. Her areas of interest are in data and information management, data privacy and security, and bio and health informatics. She is a recipient of a Career Enhancement Fellowship from the Woodrow Wilson Foundation.
