GDPR: Test Data Privacy Back in the Spotlight

By Marcin Grabinski, Technical Solution Specialist, Compuware Corp.


In April, the EU passed the General Data Protection Regulation (GDPR), a set of rules aimed at strengthening data protection for EU consumers. The GDPR will apply from May 25, 2018, and any company – yes, even one outside of the EU – possessing data on EU consumers must not only comply, but also prove its ability to do so. May 2018 may seem like a long time away, but it is not, considering the degree of change required across the enterprise, including in application development and testing.

Ensuring privacy for customer data used in testing has always been a challenge. Testing teams strongly prefer to use real customer data extracted from production environments, since this provides the most realistic representation of how an application will behave “in the wild.” A recent global CIO survey found that 86 percent of testers routinely use real customer data extracted from production systems for this very reason. Yet 43 percent either don’t anonymise this data at all or are unsure whether they do. Anonymising test data not only ensures better protection for customer data overall; it also removes the GDPR mandate to obtain customers’ explicit consent before their data is used in testing. More than half (53 percent) of this survey’s respondents cited securing such permissions as one of the biggest hurdles in GDPR compliance.

In a pre-GDPR world, failure to disguise test data meant increased exposure to potentially embarrassing and costly breaches. This concern still stands, but GDPR increases the stakes by subjecting non-compliant companies to eye-watering fines. So organizations facing GDPR must implement test data privacy projects. Here are some suggestions for getting started:

Take Inventory of What is Truly Sensitive Data – It’s Not Necessarily Names

The first step is to take inventory of all sensitive data by identifying and listing the columns of information that need to be disguised. Contrary to popular belief, this does not include names. Names as such are not sensitive – this very article is signed with my – real! – name. What is sensitive is the combination of information that uniquely identifies a person (a name and address, or a name and a social security number). Eliminating names can make it unduly difficult to discern individual customer records as they move through the transaction funnel.
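As a rough illustration of this inventory step, the sketch below looks for combinations of columns whose values single out every record in a sample. The function name, the column names and the sample records are all invented for the example, and uniqueness within a small sample is only a heuristic starting point, not a complete test of what counts as identifying.

```python
from itertools import combinations


def identifying_combinations(rows, columns, max_size=2):
    """Return column combinations whose values differ for every row.

    Hypothetical helper: a combination is flagged when no two rows
    share the same values for all of its columns.
    """
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            values = [tuple(row[c] for c in combo) for row in rows]
            if len(set(values)) == len(rows):
                found.append(combo)
    return found


# Made-up sample records, for illustration only.
customers = [
    {"name": "Anna Kowalska", "city": "Leeds", "postcode": "LS1 4AP"},
    {"name": "Anna Kowalska", "city": "York", "postcode": "YO1 7HH"},
    {"name": "John Smith", "city": "Leeds", "postcode": "LS1 4AP"},
]

print(identifying_combinations(customers, ["name", "city", "postcode"]))
# prints [('name', 'city'), ('name', 'postcode')]
```

In this sample no single column identifies a record, but a name paired with a city or postcode does, so it is the link between those columns that needs to be broken.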

For example, in a basic encryption model, an input of “Marcin Grabinski” may deliver an output of “XP/JHrCWEAJssPeBCrWkXniHAdo” – which testers cannot readily recognise unless the process is automated (and most testing continues to be manual). Given their rigid timelines, testers cannot be expected to manually compare random outputs such as this, one by one, in order to ensure that applications are working. Not only would this lengthen required testing timeframes to unacceptable levels, but it would also substantially widen the margin for tester error.

The goal of test data privacy is not to disguise the data itself, but to make it reasonably difficult to identify individuals – a concept the GDPR refers to as “pseudonymisation.” It’s OK to use real customer names from the production database, as long as these names are not linked to home addresses, dates of birth, passport or license numbers, or any other identifying information. Retaining real, easily recognizable names makes it much easier and faster for testers to accurately track application execution in their testing environments.

Format-Preserving Encryption as a Disguise Technique

As discussed, the challenge with standard encryption is that it can make it very difficult for testers to identify the information they are viewing. This flies in the face of the reality of modern testing, which places a premium on speed and accuracy.

A new type of encryption, known as format-preserving encryption, tends to work better. Format-preserving encryption keeps the original format of input data while masking it, thus making it more useful for testing purposes. An example is phone numbers in the U.K.: these may be rendered as +44 followed by a set of 10 seemingly random digits. The tester knows he or she is looking at a U.K. phone number, but apart from the +44, the remaining data is encrypted. Just like keeping real names, format-preserving encryption makes it much easier and more intuitive for testers to track disguised test data throughout end-to-end test execution.
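To make the idea concrete, here is a toy sketch of a format-preserving transformation for the phone number example. It assumes a number written as +44 followed by exactly 10 digits and uses a small Feistel network over decimal digits, keyed with HMAC-SHA256. That construction is chosen purely for illustration: it is not a standardised scheme such as NIST FF1, it has not been vetted for real protection, and the function names and key are hypothetical.

```python
import hashlib
import hmac

ROUNDS = 8
HALF = 5                  # 10 digits are split into two 5-digit halves
MODULUS = 10 ** HALF


def _round_value(key: bytes, round_no: int, half: int) -> int:
    """Keyed pseudo-random 5-digit value derived from one half."""
    msg = f"{round_no}:{half:05d}".encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % MODULUS


def mask_digits(digits: str, key: bytes) -> str:
    """Map a 10-digit string to another 10-digit string (reversible)."""
    left, right = int(digits[:HALF]), int(digits[HALF:])
    for r in range(ROUNDS):
        left, right = right, (left + _round_value(key, r, right)) % MODULUS
    return f"{left:05d}{right:05d}"


def unmask_digits(digits: str, key: bytes) -> str:
    """Invert mask_digits by running the rounds backwards."""
    left, right = int(digits[:HALF]), int(digits[HALF:])
    for r in reversed(range(ROUNDS)):
        left, right = (right - _round_value(key, r, left)) % MODULUS, left
    return f"{left:05d}{right:05d}"


def mask_uk_phone(phone: str, key: bytes) -> str:
    """Keep the +44 prefix and mask the 10 digits that follow it."""
    prefix, national = phone[:3], phone[3:]
    if prefix != "+44" or len(national) != 10 or not national.isdigit():
        raise ValueError("expected +44 followed by 10 digits")
    return prefix + mask_digits(national, key)


key = b"test-environment-only-key"   # placeholder key for the example
masked = mask_uk_phone("+447911123456", key)
print(masked)                          # still +44 plus 10 digits
print("+44" + unmask_digits(masked[3:], key))
```

The output keeps the shape a tester expects, and the same key always produces the same masked value, so a record can be followed across test runs.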

Using Real Data Values for Data Disguising

One drawback of format-preserving encryption is that it is not ideal for tests involving non-numerical data values, like addresses – since there is no uniform marker or symbol like +44.

When disguising addresses, some organizations use real, unrelated address values from other files to mask real address data. One leading bank recently executed a test in which it swapped in addresses from its U.K. branch locations for U.K. customer addresses. An important note on addresses – which is particularly relevant to organizations facing GDPR compliance – is that address formats vary across countries. In some instances, replacement values must come from a defined range: for example, U.K. addresses must be mapped to U.K. addresses, French addresses must be mapped to French addresses, and so on.
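A minimal sketch of this country-constrained replacement follows. The replacement pools, country codes, customer identifiers and function name are all invented placeholders; the only point is that each replacement is drawn from the pool for the record's own country, and that hashing the identifier keeps the choice consistent between runs.

```python
import hashlib

# Invented placeholder pools; in practice these might come from
# a file of branch or office addresses for each country.
REPLACEMENT_ADDRESSES = {
    "GB": ["1 Example Street, London", "2 Sample Road, Leeds"],
    "FR": ["1 Rue Exemple, Paris", "2 Avenue Modele, Lyon"],
}


def replacement_address(customer_id: str, country: str) -> str:
    """Pick a replacement from the same country's pool.

    The customer id is hashed so the same customer always gets
    the same replacement address.
    """
    pool = REPLACEMENT_ADDRESSES[country]
    digest = hashlib.sha256(customer_id.encode()).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]


print(replacement_address("C-1001", "GB"))
print(replacement_address("C-2002", "FR"))
```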

When using real data from other files for masking, there are several techniques to consider, including random and sequential swapping. Picking the best technique depends on questions such as whether the resulting sets must be unique, whether they must be consistent across repeated executions, and whether the disguising needs to be reversible, so that test data sets can be reverted to their original (and true) values. Random swapping tends to work best when there is only one test execution being done. If there will be numerous runs, the constant random swapping of data can make it difficult and confusing to track the execution of a single application. In addition, in random swapping, output does not reflect input, so once data values are swapped, it can be difficult and time-consuming to accurately switch them back.
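Random swapping can be sketched as a shuffle of one column. The function name and the optional seed are assumptions added for the example: without a seed each run produces a different arrangement, which is the behaviour described above, while a fixed seed makes a run repeatable.

```python
import random


def random_swap(values, seed=None):
    """Return the values in a randomly shuffled order.

    Note that a plain shuffle can leave some values in their
    original row; it does not guarantee that every row changes.
    """
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


addresses = ["Address A", "Address B", "Address C", "Address D"]
print(random_swap(addresses))            # differs from run to run
print(random_swap(addresses, seed=42))   # same order every run
```

Unless the seed or the resulting order is recorded somewhere, there is nothing in the output that indicates which row a value came from, which is why reversing the swap is hard.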

A different method, sequential swapping, works well when a test is executed multiple times, or when the test data needs to be reverted to its true form. Sequential swapping is self-explanatory: the address from the second row is swapped into the first, the address from the third row is swapped in for the second, and so on. In this way, testers can more easily track the identity of data records based on the number of test runs, as well as use the number of test runs to guide them back to the true data value (making the method reversible).
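Sequential swapping amounts to rotating a column by one row per test run. In the sketch below the function names are hypothetical, and wrapping the first row's value around to the last row is an assumption, since the description above does not say what the final row receives.

```python
def sequential_swap(values, runs=1):
    """Shift each value up by `runs` rows.

    Row 1 receives the value from row 1 + runs, and so on.
    Assumption: values pushed off the top wrap around to the bottom.
    """
    if not values:
        return []
    k = runs % len(values)
    return list(values[k:]) + list(values[:k])


def undo_sequential_swap(values, runs=1):
    """Restore the original order, given the number of runs applied."""
    if not values:
        return []
    k = runs % len(values)
    if k == 0:
        return list(values)
    return list(values[-k:]) + list(values[:-k])


addresses = ["Address A", "Address B", "Address C", "Address D"]
swapped = sequential_swap(addresses, runs=1)
print(swapped)
# prints ['Address B', 'Address C', 'Address D', 'Address A']
print(undo_sequential_swap(swapped, runs=1))
# prints ['Address A', 'Address B', 'Address C', 'Address D']
```

Because the only thing needed to reverse the swap is the run count, no mapping table has to be stored alongside the test data.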

Creating a Look-Up Table – Less is More

Many organizations think that in order to deliver reliable, comprehensive results, testing tables need to be large, comprising the same number of rows as the production database. This is not true: it is perfectly acceptable to test with only one to five percent of data records.

One important issue, particularly with small test samples, is that the data must be free of any format discrepancies that can make it too easy to connect people with address information. For example, if there is a set of ten true data records and only one has the address information in lowercase, testers may more easily remember the real name associated with this address after the data is randomly swapped. The goal of test data privacy is to make it reasonably difficult for anyone, including testers, to identify individuals. A high degree of data quality is therefore an important prerequisite in the look-up table creation process.
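The two points above, a small sample and consistent formatting, can be sketched together. The function name, the default five percent fraction, the fixed seed and the choice of upper case as the common format are all assumptions made for the example; any single consistent format would serve the same purpose.

```python
import random


def build_lookup_table(addresses, fraction=0.05, seed=0):
    """Sample a small share of addresses and give them one format.

    Whitespace is collapsed and text is upper-cased so that no
    entry stands out because of how it happens to be written.
    """
    addresses = list(addresses)
    if not addresses:
        return []
    size = max(1, round(len(addresses) * fraction))
    sample = random.Random(seed).sample(addresses, size)
    return [" ".join(a.split()).upper() for a in sample]


# Made-up addresses with deliberately inconsistent formatting.
source = [f"{n} high  street, leeds" if n % 10 == 0 else f"{n} High Street, Leeds"
          for n in range(1, 101)]
table = build_lookup_table(source)
print(len(table))   # prints 5
print(table[0])
```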


The pace of software development and modifications has reached a new level, and the onus is on testers to keep up. Like virtually all companies, those facing EU compliance want software updates and changes made as soon as possible. But they tend to focus more on the cost of not having the feature they need, which will likely pale in comparison to the fines they could face if these changes are implemented in a haphazard, non-secure manner.

Test data privacy projects may require a fair amount of time and work, but the overall benefits of better customer data security are well worth it. For many organizations, GDPR just might be the prescription they need – a wake-up call to end the procrastination and eliminate some long-standing bad habits.  With a properly implemented test data privacy project, organizations can achieve the dual benefits of using test data that is intuitive, usable and reliable, while protecting EU citizens’ right to privacy. The time to start these projects is now.

