Data sharing without sharing data

What are the ways that data and insights can be shared without having to expose individual-level data to others? At the 2022 All-Actuaries Summit, Basem Morris and Tim Scott presented on modern Privacy Enhancing Techniques that can deliver on this promise.

There are many potential advantages from organisations and individuals sharing data. Retailers could understand demographic and share-of-wallet information about their customers via financial institutions.

Combining insurance data can give a market view of coverage, concentration risks and underinsurance. Pooling health data can reveal important trends or suggest risk factors. Autonomous vehicles can share up-to-date road information when traffic conditions are unexpected.

However, data sharing creates its own set of problems. Do customers reasonably expect that their data will be shared? And how do you ensure that a user of your data will protect it securely and dispose of it when requested?

In their session, Basem and Tim introduced a range of Privacy Enhancing Techniques (PETs) that enable insights to be generated and shared without having to provide individual-level information. They covered five examples of PETs:

  1. Differential privacy: Noise is randomly added to the data (either at input or output), so that even if part of the underlying data is known, you cannot infer much information about the remainder (a short code sketch follows this list).

  2. Federated analysis: Each organisation builds a model on its own portion of the data, and the resulting models are then (sensibly) combined. This removes the need to collate the entire dataset in one place before analysis (also sketched below).

  3. Zero-knowledge proofs: Techniques where a person demonstrates they know a value without the other party needing to see it.

  4. Secure multiparty computation: More complex transformations and exchanges of data so that a combined analysis can be completed, while each party only sees obfuscated versions of the other parties’ data (sketched below).

  5. Homomorphic encryption: Data is encrypted but retains some mathematical properties after encryption, so analysis can be done on the encrypted data and model parameters extracted (sketched below).
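
To make the first technique concrete, here is a minimal sketch of differential privacy via output perturbation (the Laplace mechanism). The toy dataset, the epsilon value and the counting query are illustrative assumptions, not taken from the presentation.

```python
import numpy as np

# Minimal sketch of output (Laplace) perturbation for a count query.
# The dataset, epsilon and sensitivity values are illustrative only.

rng = np.random.default_rng(42)

ages = np.array([34, 29, 61, 45, 52, 38, 47, 55, 31, 66])  # toy dataset

def dp_count_over_50(data, epsilon=1.0):
    """Return a differentially private count of records with age > 50.

    A counting query changes by at most 1 when one record is added or
    removed, so its sensitivity is 1. Laplace noise with scale
    sensitivity / epsilon gives epsilon-differential privacy.
    """
    true_count = np.sum(data > 50)
    sensitivity = 1.0
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

print("True count:", int(np.sum(ages > 50)))
print("DP count:  ", round(dp_count_over_50(ages, epsilon=1.0), 2))
```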
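
Federated analysis can be sketched in a similarly small way. Below, two hypothetical organisations each fit a linear model on their own synthetic data, and only the fitted coefficients, weighted by sample size, are shared and combined. Real federated learning schemes iterate this exchange; the single-round combination here is a deliberate simplification.

```python
import numpy as np

# Minimal sketch of federated analysis: each organisation fits a model
# on its own data and only the fitted coefficients (weighted by sample
# size) are shared and combined. The data below is synthetic.

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Ordinary least squares with an intercept, fitted locally."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def make_data(n):
    """Generate one organisation's private data (synthetic)."""
    X = rng.normal(size=(n, 1))
    y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)
    return X, y

org_a = make_data(200)
org_b = make_data(500)

# Local fits: individual records never leave each organisation.
coef_a = fit_linear(*org_a)
coef_b = fit_linear(*org_b)

# Central combination: weight each model by its sample size.
n_a, n_b = len(org_a[1]), len(org_b[1])
combined = (n_a * coef_a + n_b * coef_b) / (n_a + n_b)

print("Org A coefficients:", np.round(coef_a, 3))
print("Org B coefficients:", np.round(coef_b, 3))
print("Combined model:    ", np.round(combined, 3))
```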
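
Secure multiparty computation covers a family of protocols; the sketch below shows one of its simplest building blocks, additive secret sharing, where three hypothetical parties learn the total of their private values without revealing the values themselves. The parties and their values are assumptions for illustration.

```python
import random

# Minimal sketch of additive secret sharing, a building block of secure
# multiparty computation. Each party splits its private value into
# random-looking shares that sum to it modulo a large prime, so no
# single share reveals anything about the original value.

PRIME = 2**61 - 1  # large prime modulus

def make_shares(secret, n_parties):
    """Split a secret into n random shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

# Each party's private value (e.g. a claim cost) -- illustrative only.
private_values = {"party_A": 120, "party_B": 340, "party_C": 75}

n = len(private_values)
all_shares = {name: make_shares(v, n) for name, v in private_values.items()}

# Party i receives the i-th share from every party and sums them.
partial_sums = [sum(all_shares[name][i] for name in all_shares) % PRIME
                for i in range(n)]

# Publishing and adding the partial sums reveals only the total.
total = sum(partial_sums) % PRIME
print("Secure total:", total)                    # 535
print("Check:       ", sum(private_values.values()))
```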
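
Finally, a minimal sketch of homomorphic encryption using the open-source python-paillier (phe) library, which is assumed here for illustration rather than referenced in the presentation. Paillier encryption is additively homomorphic: ciphertexts can be added together, and multiplied by plaintext constants, without being decrypted.

```python
# Minimal sketch of (partially) homomorphic encryption with the
# python-paillier library (pip install phe). The values are illustrative.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Two confidential values, encrypted before being shared.
claims_q1 = public_key.encrypt(120_000)
claims_q2 = public_key.encrypt(85_000)

# A third party can compute on the ciphertexts without seeing the data.
encrypted_total = claims_q1 + claims_q2       # ciphertext addition
encrypted_scaled = encrypted_total * 1.1      # multiply by a plaintext scalar

# Only the holder of the private key can decrypt the results.
print(private_key.decrypt(encrypted_total))   # 205000
print(private_key.decrypt(encrypted_scaled))  # ~225500.0
```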

Most of these techniques demonstrate an ingenuity of thought. A toy example given by Basem for zero-knowledge proofs was whether Person A can prove that they can taste the difference between a Coke and a Pepsi without revealing which glass is which.

While Person A looks away, Person B can switch the glasses (or not), and then ask Person A if the glasses have been changed. If A answers correctly enough times, this amounts to ‘proof’ they can distinguish the difference.
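
The interactive structure of this toy proof is easy to simulate. In the sketch below, the number of rounds and the coin-flip model of a bluffing taster are assumptions for illustration: an honest taster always answers correctly, while a guesser survives n rounds only with probability 1/2^n.

```python
import random

# Simulation of the Coke/Pepsi interactive proof described above.
# Person B swaps (or doesn't swap) the glasses at random; Person A says
# whether a swap happened. A genuine taster always answers correctly;
# a guesser is right only half the time per round.

def run_rounds(can_taste, n_rounds=20):
    correct = 0
    for _ in range(n_rounds):
        swapped = random.choice([True, False])    # Person B's choice
        if can_taste:
            answer = swapped                      # taster always knows
        else:
            answer = random.choice([True, False])  # guesser flips a coin
        correct += (answer == swapped)
    return correct

rounds = 20
print("Taster :", run_rounds(True, rounds), "of", rounds, "correct")
print("Guesser:", run_rounds(False, rounds), "of", rounds, "correct")
print("Chance of guessing every round:", 1 / 2**rounds)
```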

With these techniques established, we can look at the implications. A few thoughts were offered during the session:

  • Many of the techniques are not especially new. For instance, the UN even published a handbook of privacy-preserving techniques in 2015. However, the use of these tools is not routine, and so the question is how quickly demand grows as more people become aware of the opportunities.

  • Many techniques also carry trade-offs. Differential privacy reduces the accuracy of the output (in exchange for obfuscating the underlying data). Homomorphic encryption carries significant computational overhead, and recent efforts have focused on reducing it; for example, Intel is developing hardware that can support homomorphic encryption more efficiently.

  • The technology sector offers examples of these techniques applied at scale. For example, predictive text on smartphones relies on language models built on data collected from people’s typing.

  • However, collecting all typed text from devices would be a very large privacy intrusion. So instead both Google and Apple use a federated modelling approach – the language model is updated on the device, and those changes, rather than the underlying data, are fed back to the central model. Similarly, digital advertising is making greater use of privacy-preserving techniques. The Chrome browser homomorphically encrypts passwords, which allows secure storage and syncing, and the ability to test whether passwords have appeared in known data breaches.

  • Companies looking to ‘share’ data now have more options. At one extreme, they could encrypt their data, receive queries that run on their own servers over the encrypted data, and return only the results. Data would then never leave the company, simplifying many data governance considerations.

While these techniques are a potentially exciting option for companies interested in sharing data, it is also important to recognise that many things do not change. The reasons why data is being shared, and the use of the outputs, still need to be clearly set out and consistent with a company’s privacy obligations.

Competition laws are still relevant when it comes to sharing data and statistics within an industry. And there will likely be grey areas where judgements must be made on whether data is actually being shared, particularly for fields derived from personal or identifying information. For better or worse, we are not done with data governance committees just yet.

CPD: Actuaries Institute Members can claim two CPD points for every hour of reading articles on Actuaries Digital.