Tools for De-Identification of Personal Health Information


This report identifies useful and available tools and techniques for the de-identification of personal information from interoperable electronic health records and health information data warehouses.

Section 1 contains a general introduction and defines some of the terms commonly used when discussing de-identification.

Section 2 describes the principles of de-identification, including two basic models of how data warehouses are operated; the distinction between record-level and aggregate data; the distinction between direct and indirect identifiers; a description of the types of secondary uses that data warehouses support (health research, health system planning, public health surveillance, and generation of de-identified data for system testing); an explanation of k-anonymity as a measure of de-identification; a discussion of the special problems inherent in free-text data; a discussion of the special problems posed by genetic information; and guidance to prevent unintended disclosures through lapses in security.

Section 3 describes the various approaches to de-identification, including a flow diagram of when to use each approach. Record-level data can be de-identified through data reduction (including removal of direct identifiers, reduction in the detail of the data, and sampling), data modification (random addition of noise to the data, randomization of data values, and data swapping), and data suppression. Each approach is briefly described and examples are given. Pseudonymisation is described, including the distinction between reversible and irreversible pseudonymisation and the two basic ways in which pseudonymisation is carried out. Aggregate data has its own approaches to de-identification, including restriction-based methods (cell suppression and changing the classification scheme for the data) and heuristics. These are described with examples.

Section 4 contains some best practices for de-identification. These include how direct identifiers should be handled in data warehouses; how date variables should be presented in released datasets; how location data such as postal codes should be handled in released datasets; special guidelines for diagnostic imaging data; how pseudonymous IDs should be handled in released datasets to prevent unintended data linkages to other datasets; and elements of contractual agreements on use and disclosure of datasets.

Section 5 contains a description of readily available tools for de-identification of both record-level data and aggregate data. Tools described for handling direct identifiers in record-level data include Oracle Data Masking Pack, Camouflage, Informatica Data Privacy, and Data Masker. Tools for handling indirect identifiers in record-level data include PARAT, μ-ARGUS, and the Cornell Anonymization Toolkit. Additional tools for aggregate data include τ-ARGUS. Tools for postal code conversion are also discussed. Third-party evaluation of de-identification (or the lack thereof) is also briefly discussed.

Section 6 briefly discusses re-identification risks.

The report concludes with two observations: that the tools described can significantly reduce the risk of re-identification, but only when sensibly combined with administrative controls such as end-user agreements and good security practices; and that the de-identified data will only be of value to end-users if the approach to de-identification supports the intended use.
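
The abstract cites k-anonymity as a measure of de-identification. As a rough illustration only (not taken from the report), a dataset is k-anonymous when every combination of indirect-identifier values is shared by at least k records. The minimal Python sketch below computes that k for a small made-up dataset; the field names (`age_band`, `postal_prefix`, `diagnosis`), the sample values, and the helper `k_anonymity` are hypothetical.

```python
from collections import Counter


def k_anonymity(records, indirect_identifiers):
    """Return the size of the smallest group of records that share
    the same values on all of the given indirect identifiers."""
    if not records:
        return 0
    # Count how many records fall into each combination of identifier values.
    groups = Counter(
        tuple(record[name] for name in indirect_identifiers)
        for record in records
    )
    return min(groups.values())


# Hypothetical, already generalized records (age bands, postal code prefixes).
records = [
    {"age_band": "30-39", "postal_prefix": "K1A", "diagnosis": "asthma"},
    {"age_band": "30-39", "postal_prefix": "K1A", "diagnosis": "diabetes"},
    {"age_band": "40-49", "postal_prefix": "K1A", "diagnosis": "asthma"},
    {"age_band": "40-49", "postal_prefix": "K1A", "diagnosis": "influenza"},
    {"age_band": "40-49", "postal_prefix": "K1A", "diagnosis": "asthma"},
]

k = k_anonymity(records, ["age_band", "postal_prefix"])
print(k)
```

In this sketch the smallest group (age band 30-39, prefix K1A) contains two records, so the dataset is 2-anonymous on those two fields. Which fields count as indirect identifiers, and what value of k is acceptable, depends on the dataset and its intended use.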

Link to resource:

Type of resources: Reading

Education level(s): College / Upper Division (Undergraduates)

Primary user(s): Administrator, Researcher

Subject area(s): Life Science, Social Science

Language(s): English