05/14/2020

Redaction and Re-identification Risk

This post, authored by now-retired attorney Peter Guffin, is part one in a series examining privacy and transparency issues in the context of public access to digital court records.

In its proposed electronic court records access rules, the Maine Supreme Judicial Court (SJC) imposes on litigants new and extensive filing obligations, including requiring litigants to redact certain categories of sensitive personal information.

Regardless of what one might think about the wisdom of placing this burden on litigants, it is important to ask what the SJC hopes to achieve by this requirement. Even assuming full compliance, which is doubtful, redaction as a de-identification technique, without more, would be wholly inadequate to protect the privacy of Maine citizens.

In today’s big data world, given the sophistication of data handlers, it is well recognized that de-identification alone is not enough to prevent re-identification of individuals, and the SJC’s reliance on it promotes a false sense of security. The risk of re-identifying individuals from purportedly de-identified databases is significant.

As pointed out in my essay, “As long ago as 2010, Paul Ohm, a leading privacy scholar, brought attention to the fact that computer scientists ‘have demonstrated that they can often “reidentify” or “deanonymize” individuals hidden in anonymized data with astonishing ease.’ In his groundbreaking article examining this research, Ohm described in detail three spectacular failures of anonymization to reinforce his point that ‘we have made a mistake, labored beneath a fundamental misunderstanding, which has assured us much less privacy than we have assumed.’ Each of these incidents – the 2006 AOL data release, the Massachusetts Group Insurance Commission’s release of ‘de-identified’ medical records, and the 2006 Netflix prize data study – has been widely publicized.”

In each of these incidents, said Ohm, “even though administrators had removed any data fields they thought might uniquely identify individuals, researchers . . . unlocked identity by discovering pockets of surprising uniqueness remaining in the data.”

The Federal Trade Commission (FTC), in its privacy framework published in 2012, likewise concluded that “[t]here is significant evidence demonstrating that technological advances and the ability to combine disparate pieces of data can lead to identification of a consumer, computer, or device even if the individual pieces of data do not constitute [personally identifiable information (PII)].” Moreover, continued the FTC, “not only is it possible to re-identify non-PII data through various means, businesses have strong incentives to actually do so.”
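For readers who want to see how combining “disparate pieces of data” can lead to identification, the following is a minimal Python sketch of a so-called linkage attack. Everything in it is hypothetical and invented for illustration: the records, the field names (zip, birth_year, sex), the placeholder names, and the reidentify function. It does not reproduce the method used in any of the incidents Ohm describes. It only shows the general idea that a record stripped of names can still be matched to a named person when the remaining fields are unique in both data sets.

```python
from collections import Counter

# Hypothetical "redacted" records: names removed, other fields left in place.
redacted_records = [
    {"zip": "04101", "birth_year": 1961, "sex": "F", "detail": "case A"},
    {"zip": "04101", "birth_year": 1961, "sex": "M", "detail": "case B"},
    {"zip": "04330", "birth_year": 1985, "sex": "F", "detail": "case C"},
    {"zip": "04330", "birth_year": 1985, "sex": "F", "detail": "case D"},
]

# Hypothetical public list that includes names alongside the same fields.
public_list = [
    {"name": "Person One", "zip": "04101", "birth_year": 1961, "sex": "F"},
    {"name": "Person Two", "zip": "04101", "birth_year": 1961, "sex": "M"},
    {"name": "Person Three", "zip": "04330", "birth_year": 1985, "sex": "F"},
    {"name": "Person Four", "zip": "04330", "birth_year": 1985, "sex": "F"},
]

# Fields assumed (for this sketch only) to survive redaction.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Return the combination of quasi-identifier values for a record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(redacted, public):
    """Match redacted records to names when the combination is unique in both sets."""
    redacted_counts = Counter(key(r) for r in redacted)
    names_by_key = {}
    for person in public:
        names_by_key.setdefault(key(person), []).append(person["name"])

    matches = {}
    for record in redacted:
        k = key(record)
        names = names_by_key.get(k, [])
        # Only an unambiguous one-to-one match counts as a re-identification.
        if redacted_counts[k] == 1 and len(names) == 1:
            matches[record["detail"]] = names[0]
    return matches


matches = reidentify(redacted_records, public_list)
print(matches)
```

In this toy data, cases A and B are matched to a single name each because their combination of fields is unique, while cases C and D share the same combination and stay ambiguous. The point for redaction policy is that removing obvious identifiers does not remove this kind of uniqueness.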

The FTC’s privacy framework, which requires organizations to implement three significant protections for data to minimize the risk of re-identification, established a best practices standard that is widely accepted. First, the organization “must take reasonable measures to ensure that the data is de-identified. This means that the [organization] must achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular [individual], computer, or other device.” Second, the organization must “publicly commit to maintain and use the data in a de-identified fashion, and not to attempt to re-identify the data.” Third, if the organization “makes such de-identified data available to other [persons], it should contractually prohibit such persons from attempting to re-identify the data.”

Echoing the FTC, in 2014 the President’s Council of Advisors on Science and Technology (PCAST) concluded that “[a]nonymization remains somewhat useful as an added safeguard, but it is not robust against near-term future re-identification methods. PCAST does not see it as being a useful basis for policy.”

Closer to home, the Maine Health Data Organization (MHDO), an independent executive branch agency that maintains a comprehensive health information database comprising health care information about Maine citizens collected from health care facilities and payors, has established rules designed to make data publicly available and accessible to the broadest extent consistent with the laws protecting individual privacy and proprietary information.

Recognizing the ease with which purportedly de-identified data can be re-identified, the MHDO rules define “direct patient identifiers” as “[i]nformation such as name, social security number, and date of birth, that uniquely identifies an individual or that can be combined with other readily available information to uniquely identify an individual.” The MHDO definition mirrors the safe harbor definition of “de-identified” health data under HIPAA, which mandates removal of 18 categories of identifiers from a data file, in addition to requiring that the data file “[cannot without actual knowledge of the covered entity] be used alone or in combination with other information to identify an individual who is a subject of the information.”

The MHDO rules, both in design and practice, essentially implement each of the protections for data established as a best practices standard by the FTC to minimize the risk of re-identification. For example, under the MHDO rules, release to the public of any de-identified data or limited data sets is made conditional on the recipient’s agreement to abide by the terms of the MHDO’s standard data use agreement which, among other things, requires that the recipient only use the released data in ways that maintain patient anonymity and prohibits the recipient from “link[ing] these data to other records or data bases” in an attempt to identify any individuals.

Two recent trial court orders in Maine involving the release of sensitive personal information in the context of discovery provide a useful window into how judges attempt to minimize the risk of re-identification to protect the privacy of Maine citizens. They offer valuable lessons for digital court records access, even though they involve discovery documents, which are not part of the public record unless introduced at trial and are therefore somewhat different from redacted pleadings filed by litigants.

Both orders in effect implement the FTC’s best practices standard for protection of data and adopt the data use agreements model used by the MHDO. In doing so, they reflect a reasonable and pragmatic approach to balancing competing interests.

In his 2018 order on plaintiff’s motion to compel production of discovery in the Kennelly v. Mid Coast Hospital case currently pending before the Maine Law Court, Justice Lance Walker took reasonable measures to ensure that the released data was sufficiently de-identified. He also required that the data be used by plaintiff “solely for the purposes of prosecuting her claim before the court” and ordered “[p]laintiff’s counsel [not to] attempt to identify persons whose identities have been redacted and . . . not provide copies of the [data] to anyone, other than expert witnesses in the case.” Finally, he ordered that “[a]ny expert witness shall be required to not share the [data] with anyone, to use such [data] only for the purposes of this case, and to return the [data] to [p]laintiff’s counsel at the end of the case.”

The Kennelly order echoed the standards set forth by Justice Ann Murray in McCain v. Vanadia in 2017.

While one might quibble over whether Justices Walker and Murray should have put additional safeguards in place to protect patient privacy, the important point is that in both cases, attuned to the risk of re-identification, the court in effect adopted as its analytical guidepost the core principles in the privacy framework established by the FTC and implemented the same “data use agreement model” used by the MHDO.

In crafting rules addressing electronic court records access, given the significant risk of re-identification, the SJC will need to do much more than simply impose redaction responsibility on litigants if it hopes to protect the privacy of Maine citizens.