The token as a concept has several different meanings in our modern lexicon. When thinking of tokens, you might picture the coin you used as a child to play an arcade game, the latest cryptocurrency craze, or something less concrete, such as a token of your affection for others. In all of these cases, the ‘physical’ token stands in for some other value we all agree to acknowledge. This is no less true in the privacy-preserving context!
Tokens are generated from data, typically personally identifiable information (PII), to act as a persistent stand-in for the individual or identifying characteristics associated with a data record. This substitution is performed via a de-identification process called tokenization.
What is Tokenization?
Tokenization is the process through which one “substitutes a sensitive identifier (e.g., a unique ID number or other PII) with a non-sensitive equivalent (i.e., a ‘token’) that has no extrinsic or exploitable meaning or value” [1]. In simpler terms, a computer takes identifying columns from data, such as a person’s name, phone number, email address, home address, or national identifiers, and “scrambles” them, typically via hashing or encryption, into a new tokenized identifier which is not human-readable. This means that if a database with tokenized data were hacked, the risk of data exploitation would be minimised: the hacker could not identify anyone in the dataset solely by stealing the data itself (they would have to put in a little more elbow grease than that!).
Tokenization can be applied across individual columns of a dataset or against a combination of columns depending on the purpose for which the data is being used. The basic idea is that the same combination of input identifying information should result in the same output token across disparate datasets. An example of this process is as follows:
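To make this concrete, here is a minimal sketch of one common approach: a keyed hash (HMAC-SHA256) computed over normalised identifying fields. The key, field names, and records are illustrative assumptions, not any particular vendor’s implementation:

```python
import hashlib
import hmac

# Hypothetical secret key; in practice this would be generated and held
# in a secure key-management system, never hard-coded.
SECRET_KEY = b"replace-with-a-securely-managed-key"

def tokenize(record: dict, fields: list) -> str:
    """Deterministically tokenize a combination of identifying fields.

    Each field is normalised, the fields are concatenated, and a keyed
    hash (HMAC-SHA256) is applied so the token cannot be reproduced
    without the key. The digest is truncated purely for readability.
    """
    normalised = "|".join(str(record.get(f, "")).strip().lower() for f in fields)
    return hmac.new(SECRET_KEY, normalised.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

ID_FIELDS = ["name", "phone", "email", "tax_no"]

record_1 = {"name": "Jane Doe", "phone": "555-0100",
            "email": "jane@work.example", "tax_no": "123-45-6789"}
record_2 = {"name": "Jane Doe", "phone": "555-0100",
            "email": "jane@home.example", "tax_no": "123-45-6789"}

# Same person, but the email differs, so the combined-field token differs.
print(tokenize(record_1, ID_FIELDS))  # one token
print(tokenize(record_2, ID_FIELDS))  # a different token
```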
Notice how the two records above likely represent the same person, yet they output different tokens. That is because, in this example, the algorithm acts on the combination of the identifying fields: if any of them differ, as the email field does here, the output token will differ. This is not the case for all tokenization processes!
Different algorithms may include rules to handle different combinations of identifying data. For example, a different algorithm might decide to ignore the other fields whenever the Tax_No is the same across records and assign those records the same token. This is why it’s important to understand how the tokenization process will make decisions based on your data inputs prior to engaging in token-based data collaboration: if you and your collaborator use different algorithms, even ones with slightly different rules, you are likely to drastically reduce or eliminate any potential for collaboration across datasets!
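As a hedged illustration, an alternative rule set might look like the following sketch, which reuses the hypothetical tokenize function and records from the example above; the priority given to Tax_No is an assumption for illustration only:

```python
def tokenize_with_priority(record: dict) -> str:
    # Alternative rule: if a Tax_No is present, token identity is decided
    # by the Tax_No alone and the other identifying fields are ignored.
    if record.get("tax_no"):
        return tokenize(record, ["tax_no"])
    # Otherwise fall back to the combination of the remaining identifiers.
    return tokenize(record, ["name", "phone", "email"])

# Under this rule set, record_1 and record_2 from the earlier sketch now
# receive the SAME token, because their Tax_No values match.
print(tokenize_with_priority(record_1) == tokenize_with_priority(record_2))  # True
```

Two collaborators running these two rule sets over the same records would produce incompatible tokens, which is exactly the failure mode described above.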
It’s also useful to keep in mind that there are different types of token creation logic. Some tokens are created deterministically, meaning the exact combination of data will always output the same token, as in the example above. Others are created probabilistically, meaning additional data is used to determine whether the token is likely to be linked to a known deterministic token and should thus be assigned the same value. Still others are ephemeral tokens, meaning the same data will generate the same token only for a given period of time; thereafter, the same data will generate a different token [2]. Each logic suits a different set of use cases, depending on the goal of tokenization.
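For instance, one simple way an ephemeral token could be implemented (a sketch only, reusing the keyed-hash setup from the first example; the 24-hour window is an arbitrary assumption) is to mix a time window into the hash input:

```python
import time

def ephemeral_token(record: dict, fields: list, ttl_seconds: int = 86400) -> str:
    # Mix the current time window into the hash input: the same record
    # yields the same token within a window, and a new token once the
    # window rolls over.
    window = int(time.time() // ttl_seconds)
    normalised = "|".join(str(record.get(f, "")).strip().lower() for f in fields)
    payload = f"{window}|{normalised}".encode("utf-8")
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()[:16]
```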
Another important consideration in the tokenization process is how access controls will be managed. A party wishing to claim its data has been pseudonymised or de-identified via tokenization typically needs to ensure that none of its users has access to a combination of the input databases/files, the tokenization algorithm/environment, and the output databases/files. This is because, used in combination, these parts of the process can be exploited to ‘reverse-engineer’ a token back to its original PII.
Most commercially available tokenization processes enforce various access controls, including physical or logical separation of databases, role-based access controls, and other measures to ensure de-identification is upheld. However, be aware that when using open-source algorithms to resolve entities to tokens, the process itself does not guarantee pseudonymisation of the data unless these types of controls are also in place.
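As a rough sketch of what a separation-of-duties check over such access controls might look like (the role and resource names here are hypothetical):

```python
# Each role may touch at most one part of the pipeline; no single role
# spans the input data, the tokenization algorithm, and the output data.
ROLE_PERMISSIONS = {
    "data_engineer": {"input_db"},
    "token_service_admin": {"tokenization_env"},
    "analyst": {"output_db"},
}

def check_separation_of_duties(permissions: dict) -> bool:
    # De-identification claims weaken if any single role can reach more
    # than one of the three sensitive resources.
    sensitive = {"input_db", "tokenization_env", "output_db"}
    return all(len(resources & sensitive) <= 1 for resources in permissions.values())

print(check_separation_of_duties(ROLE_PERMISSIONS))  # True
```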
How is tokenization applied in practice?
Tokenization is used in many industries for practical applications where de-identification is required or preferred to mitigate privacy breach risk. Tokens are standard identifiers in advertising technology systems and financial services databases, and are commonly used in healthcare contexts to prevent re-identification of sensitive patient data records.
Tokenization is currently the gold-standard process for enabling data sharing and collaboration under the United States’ Health Insurance Portability and Accountability Act (HIPAA), which was enacted in 1996 and includes provisions regulating privacy requirements for the handling of patient data. Under HIPAA, an expert must determine that there is minimal risk of re-identification of de-identified patient data any time two data providers wish to combine or share datasets. Tokens are typically the vehicle for joining disparate healthcare datasets, and before such a join occurs, an expert statistician verifies that it will not put the patients in the dataset(s) at risk. This process has enabled a number of healthcare use cases, including the incorporation of social determinants of health (SDOH) data, such as transaction data, into clinical trials; de-identified COVID-19 research databases spanning institutions; and collaborations between academia and industry [3]. Other countries and institutions, including the United Kingdom’s National Health Service (NHS), have also adopted tokens to varying degrees for similar collaborations.
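Once both datasets carry tokens produced by the same process, the join itself is straightforward. A hypothetical illustration using pandas, with made-up tokens and columns:

```python
import pandas as pd

clinical = pd.DataFrame({
    "token": ["a1b2c3", "d4e5f6", "g7h8i9"],
    "diagnosis": ["Type 2 diabetes", "Hypertension", "Asthma"],
})
sdoh = pd.DataFrame({
    "token": ["a1b2c3", "g7h8i9"],
    "food_access_score": [0.2, 0.9],
})

# Join the de-identified datasets on the shared token, never on raw PII.
joined = clinical.merge(sdoh, on="token", how="inner")
print(joined)
```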
Bitfount supports the use of tokenized data within our platform and performs a form of tokenization when enabling data scientists to run private set intersections between a local dataset and a partner’s dataset.
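At its simplest, token-based matching between two parties can be thought of as a set intersection over tokens. The sketch below, reusing the hypothetical tokenize function from earlier, is a deliberate oversimplification and not a description of Bitfount’s actual implementation: production private set intersection protocols use cryptographic techniques (such as oblivious pseudorandom functions) so that neither party reveals its full token set.

```python
# Naive sketch: each party tokenizes its own identifiers with the same
# agreed keyed-hash process, and only the tokens are compared.
party_a = [record_1, record_2]   # from the earlier sketch
party_b = [record_1]             # partner holds one overlapping record

tokens_a = {tokenize(r, ID_FIELDS) for r in party_a}
tokens_b = {tokenize(r, ID_FIELDS) for r in party_b}

overlap = tokens_a & tokens_b    # shared records; raw PII is never exchanged
print(len(overlap))              # 1
```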
What are the limitations of tokenization?
As briefly described above, tokenization is not a foolproof privacy-preservation tool. The primary reason most tokenization processes refer to output data as ‘de-identified’ or ‘pseudonymised’ rather than ‘anonymised’ is that, while it may be challenging for an unsophisticated attacker, it is possible to re-identify tokenized data under the right conditions [4].
Re-identification can occur if an attacker (or even an accidental attacker):
1. Accesses the tokenization algorithm and reverse-engineers it against tokenized data.
2. Has access to both the input and output data and is able to match unique or semi-unique records across the datasets based on the attributes associated with the tokens and PII records.
3. Has access to additional tokenized data sources not contemplated in the original tokenization process architecture and is able to perform an attack which allows them to exploit the similarities and differences in the datasets to identify unique data subjects.
4. Has access to the end-to-end process and can use an individual’s PII to generate a token and subsequently search for that token in a tokenized dataset (see the sketch after this list).
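To make the fourth scenario concrete, here is a minimal sketch, again reusing the hypothetical tokenize function and records from the earlier examples:

```python
# With end-to-end access, an attacker can tokenize a known person's PII
# and simply look them up in the supposedly de-identified dataset.
tokenized_dataset = {
    tokenize(record_1, ID_FIELDS): {"diagnosis": "Type 2 diabetes"},
}

target_pii = record_1                           # PII the attacker already knows
target_token = tokenize(target_pii, ID_FIELDS)  # requires access to the algorithm and key

if target_token in tokenized_dataset:
    print("Re-identified:", tokenized_dataset[target_token])
```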
For these reasons, tokenized systems often put several risk mitigation measures in place to maintain tokenized data’s status as de-identified:
1. Access controls: Users with access to the algorithm typically are not permitted access to the data environments, and vice versa. Ideally, no one has access to both the input and output data environments; however, in some instances, commercial tokenization platforms will maintain a named list of employees with access to multiple systems and will log actions on the system to ensure no breach occurs.
2. Environmental controls: De-identified data is always kept at least logically or physically separated from the raw identifiable data in the original database or files. Access to these environments is typically monitored, and controls are in place to prevent raw data from entering the de-identified environment and vice versa. These controls can also include restricting data sources to keep datasets which are not approved for analysis out of the environment.
3. Statistical controls: In some cases, systems will also perform statistical tests to mitigate the risk of re-identification. This could include analysing any incoming dataset to ensure its addition to the de-identified database/environment will not, in combination with any dataset already in the environment, put data subjects at risk of identification (a simple sketch of such a check follows this list).
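One simple statistical control of this kind is a k-anonymity style check before admitting a new dataset, sketched below with pandas; the quasi-identifier columns and the threshold k are illustrative assumptions:

```python
import pandas as pd

def smallest_group_size(df: pd.DataFrame, quasi_identifiers: list) -> int:
    # Size of the smallest group of records sharing the same combination
    # of quasi-identifier values; small groups are easier to re-identify.
    return int(df.groupby(quasi_identifiers).size().min())

incoming = pd.DataFrame({
    "token": ["a1b2c3", "d4e5f6", "g7h8i9", "j0k1l2"],
    "zip3": ["941", "941", "100", "100"],
    "birth_year": [1980, 1980, 1975, 1975],
})

K = 2  # illustrative threshold
if smallest_group_size(incoming, ["zip3", "birth_year"]) < K:
    raise ValueError("Dataset fails the k-anonymity check; do not admit it.")
```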
While most commercial applications for tokenization will strictly enforce these controls, it is important to recognise that tokenization alone does not prevent privacy leakage. For this reason, it is often used in combination with other privacy-preserving data analysis techniques, such as differential privacy, when applied for data collaboration purposes.
Resources for further learning
[1] https://id4d.worldbank.org/guide/tokenization
[2] https://piiano.com/blog/practical-pseudonymization-by-tokenization/
[4] https://pubmed.ncbi.nlm.nih.gov/22164229/