What is Differential Privacy?
Differential Privacy addresses the paradox of learning nothing about an individual while learning useful information about a population. Generally speaking, differential privacy aims to provide rigorous statistical bounds on what an adversary can infer from the result of a randomised algorithm.
Typically, differentially private techniques protect the privacy of individual data subjects by adding random noise when producing statistics. In a nutshell, differential privacy guarantees that an individual is exposed to essentially the same privacy risk whether or not their data is included in a differentially private analysis. This means applying differential privacy when analysing a dataset shouldn't lead the analyst to learn anything about a specific individual that they wouldn't already have been able to learn if that individual's data weren't in the dataset at all.
More formally, differential privacy is a rigorous mathematical definition of privacy. In the simplest setting, consider an algorithm that analyses a dataset and computes statistics about it (such as the data's mean, variance, or median). Such an algorithm is said to be differentially private if, by looking at the output, one cannot reliably tell whether any particular individual's data was included in the original dataset. In other words, the guarantee of a differentially private algorithm is that its behaviour hardly changes when a single individual joins or leaves the dataset: anything the algorithm might output on a database containing some individual's information is almost as likely to have come from a database without that individual's information. Most notably, this guarantee holds for any individual and any dataset. Therefore, regardless of how distinctive any single individual's details are, and regardless of the details of anyone else in the database, the guarantee of differential privacy still holds. This gives a formal assurance that individual-level information about participants in the database is not leaked.
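Written out, the standard definition says that a randomised algorithm M is ε-differentially private if, for every pair of datasets D and D′ that differ in a single individual's record, and for every set S of possible outputs:

Pr[M(D) ∈ S] ≤ e^ε × Pr[M(D′) ∈ S]

The smaller ε is, the closer these two probabilities must be, and so the less the output can reveal about any one individual.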
Of course, a mathematical definition alone does not protect privacy by itself. There must also be a mechanism for producing outputs, together with a proof that the mechanism actually satisfies the definition. Typically, differentially private mechanisms work by adding noise to the statistics being released. The amount of noise added is a trade-off: adding more noise improves privacy, but it also makes the results less useful. In differential privacy, this trade-off is formally controlled using a parameter called epsilon (ε).
When you use truly random noise to protect your data, every additional query against the same data erodes that protection: an analyst can average the results of repeated queries to filter out the noise and reconstruct the underlying values. To guard against this, the value of ε is used to set how strict the privacy protection is, in line with the sensitivity of the data. Each time an analyst runs a query against a protected dataset, they 'spend' a portion of the ε set for that dataset (the 'budget'). Perhaps counter-intuitively, the smaller the privacy budget, the stronger the privacy protection. However, a small value also reduces the accuracy of any results from analysing the data. This is because the smaller ε is, the less an analyst can 'spend' per query and the fewer queries they can run before the budget is exhausted.
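As a concrete illustration, here is a minimal sketch (plain Python, not Bitfount code) of the classic Laplace mechanism for a counting query. The dataset and the ε values are made up purely for illustration, but the sketch shows how the noise scale grows as ε shrinks:

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Differentially private count via the Laplace mechanism.

    A single individual can change a count by at most 1 (the query's
    sensitivity), so noise is drawn from Laplace(0, 1/epsilon):
    smaller epsilon -> larger noise -> stronger privacy, lower accuracy.
    """
    true_count = sum(1 for record in data if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical dataset of ages; count how many people are over 40.
ages = [23, 45, 31, 62, 58, 37, 41, 29]
for eps in (0.1, 1.0, 10.0):
    print(eps, laplace_count(ages, lambda age: age > 40, eps))

# Repeating a query spends budget: two runs at epsilon/2 each give the
# same overall guarantee as one run at epsilon (basic composition).
```

Running this a few times makes the trade-off visible: at ε = 10 the noisy count stays close to the true answer, while at ε = 0.1 it swings widely.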
Why Differential Privacy?
Differential privacy is one of the few privacy-preserving techniques that provides a flexible, context-dependent mechanism for privacy protection. This makes it a great tool for data collaboration between parties: it provides mathematical guarantees of privacy while still enabling a data custodian to decide how much privacy needs to be preserved for a given relationship.
For example, a data custodian for sensitive patient health data would likely want to set a small value of ε for their dataset regardless of who is analysing it, because this provides a high level of privacy protection for the patients in the dataset. However, a data custodian with grocery store or retail transaction data might set a higher value of ε, because they are mainly protecting against competitive or reputational risk rather than against discriminatory or other abuse of the data itself. This higher value of ε makes the data more useful to the data custodian's partners who are analysing it, without putting the consumers in the dataset at undue risk.
Differential privacy is currently in limited use at scale for commercial purposes, owing to the need for market education and a lack of agreed-upon standards for epsilon across common use cases (for example, there was heated debate over the US Census Bureau's decision to assign an ε of 6 for access to census data, with some commentators arguing that this was too high and others questioning how the Bureau arrived at that number). However, DP is experiencing rapid uptake in the data science platform, healthcare, financial services, advertising technology, and cloud warehouse platform industries. Deep learning with differential privacy opens up a whole new world of important use cases.
How do we use Differential Privacy at Bitfount?
When collaborating on sensitive datasets or performing research on the use of differential privacy, data custodians or data scientists may wish to apply differential privacy to the analysis of their data. The techniques in the Bitfount platform allow data scientists to apply differential privacy to SQL queries as well as to models. The Bitfount platform also comes equipped with differential privacy support for large language models.
Privately Query a Pod. We can run a private SQL query on a Pod simply by specifying the same query we would normally run as a parameter to the PrivateSQLQuery algorithm. However, there are some limitations on what can be included in the query. Primarily, any column that is SELECTed must be used either as part of a GROUP BY or inside an aggregate function (AVG, SUM, COUNT, etc.). We also do not yet support JOIN statements, as these would require additional checks and guarantees: the privacy guarantees offered by differential privacy are highly dependent on the data, so such checks cannot easily be automated.
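To make the constraint concrete, here are some example query shapes, written as Python strings with hypothetical table and column names (they are not tied to any real Pod or schema), showing what the PrivateSQLQuery algorithm can and cannot accept:

```python
# Hypothetical table and column names, for illustration only.

# Accepted: every selected column is either grouped on or aggregated.
accepted_query = """
    SELECT region, COUNT(*) AS num_patients, AVG(age) AS mean_age
    FROM patients
    GROUP BY region
"""

# Rejected: 'name' is selected raw -- it appears in neither a GROUP BY
# nor an aggregate function, so it could expose individual rows.
rejected_query = """
    SELECT name, age
    FROM patients
"""

# Also rejected for now: JOIN statements are not yet supported.
rejected_join = """
    SELECT p.region, COUNT(*)
    FROM patients p JOIN visits v ON p.id = v.patient_id
    GROUP BY p.region
"""
```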
Training a Model on a Dataset with Differential Privacy. We also enable differentially private model training by leveraging the Opacus library. Opacus is a high-speed library whose goal is to preserve the privacy of each training sample while limiting the impact on the accuracy of the final model. Opacus does this by modifying a standard PyTorch optimizer in order to enforce (and measure) DP during training. More specifically, its approach is centred on differentially private stochastic gradient descent (DP-SGD).
The core idea behind this algorithm is that we can protect the privacy of a training dataset by intervening on the parameter gradients that the model uses to update its weights, rather than on the data directly. By adding noise to the gradients in every iteration, the model is prevented from memorising individual training examples while still being able to learn in aggregate. The (unbiased) noise naturally tends to cancel out over the many batches seen during the course of training.
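For a sense of what this looks like in practice, below is a minimal standalone sketch of DP-SGD with Opacus, using a toy model and toy data. It is based on the Opacus 1.x PrivacyEngine API (argument names may differ between versions) and is not the Bitfount integration itself, just an illustration of how the optimizer gets wrapped:

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy data and model, purely for illustration.
features = torch.randn(256, 10)
labels = torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(features, labels), batch_size=32)

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# PrivacyEngine swaps in DP-SGD: per-sample gradients are clipped to
# max_grad_norm and Gaussian noise (scaled by noise_multiplier) is added.
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
)

# Standard PyTorch training loop; the DP machinery is inside the optimizer.
for epoch in range(3):
    for batch_features, batch_labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_features), batch_labels)
        loss.backward()
        optimizer.step()

# Privacy accounting: how much epsilon has been spent for a given delta.
print(privacy_engine.get_epsilon(delta=1e-5))
```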
If you are interested in seeing how all of this works in practice, have a look at our Privacy Preserving Techniques Tutorial.
References for further learning:
- https://scholarship.law.vanderbilt.edu/cgi/viewcontent.cgi?article=1058&context=jetlaw
- https://emilianodc.com/PAPERS/PPGM-report.pdf
- https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
- https://privacytools.seas.harvard.edu/differential-privacy