Why Financial Firms need Data Masking, now more than ever!
Risk managers struggle even to define operational risk, let alone build models for Operational Risk Management (ORM), and rightly so, because operational risk is a very broad category. Some define it as “all risks other than market and credit risk.”
According to Basel – “Operational risk is defined as the risk of loss resulting from inadequate or failed internal processes, people and systems or from external events. This definition includes legal risk, but excludes strategic and reputational risk.”
Trade failures, non-compliance with regulations, loss of data due to natural calamity, loss of key personnel, lawsuits and so on are therefore all examples of operational risk. While a firm is usually aware of its exposure to market and credit risk, the same cannot be said of operational risk. How can one define the probability and severity of a failed trade, or of data theft?
One specific area of operational risk management that is making headlines today is sensitive data protection or, to be precise, client identity protection. Regulations on this issue already existed, but in the aftermath of the sub-prime crisis regulators are raising it with much more emphasis. One of the main reasons for their concern is the use of production data in non-production environments, such as the development or testing of IT applications. This article reviews various means of protecting such data and suggests guidelines for effective data protection policies.
The Regulations
The Gramm-Leach-Bliley Act: Offers Privacy and Safeguards Rules to protect personal information held by U.S. financial institutions. The Privacy Rule speaks largely to information collection and sharing – with respect to data exposure, this rule mandates that certain information, such as account numbers, cannot be shared with third parties. The Safeguards Rule speaks more to protecting information.
The Identity Theft Red Flags Rules: The final rules require each financial institution and creditor that holds any consumer account, or other account for which there is a reasonably foreseeable risk of identity theft, to develop and implement an Identity Theft Prevention Program (Program) for combating identity theft in connection with new and existing accounts. The Program must include reasonable policies and procedures for detecting, preventing, and mitigating identity theft, and must enable a financial institution or creditor to: 1) identify relevant patterns, practices, and specific forms of activity that are “red flags” signaling possible identity theft and incorporate those red flags into the Program; 2) detect red flags that have been incorporated into the Program; 3) respond appropriately to any red flags that are detected to prevent and mitigate identity theft; and 4) ensure the Program is updated periodically to reflect changes in risks from identity theft.
PCI DSS: The Payment Card Industry Data Security Standard is a set of requirements for securing payment account data. It applies to every company that handles payment card data, and there are a great many of them. The requirements are straightforward and include “protect stored cardholder data” and “restrict access to cardholder data by business need-to-know”.
OCC BULLETIN 2008-16: This bulletin reminds national banks and their technology service providers that application security is an important component of their information security program. All applications, whether internally developed, vendor-acquired, or contracted for, should be subject to appropriate security risk assessment and mitigation processes. Vulnerabilities in applications increase operational and reputation risk as unplanned or unknown weaknesses may compromise the confidentiality, availability, and integrity of data.
Of these, the last two are dedicated to the financial services industry. A study shows that financial services firms are responsible for protecting almost 85% of their entire data.
The Cost of Data Theft
Below are a few figures on data loss and theft.
Since 2005 over 250 million customer records containing sensitive information have been lost or stolen. (Privacy Rights Clearinghouse)
“The 2008 breach report revealed 656 reported breaches at the end of 2008, reflecting an increase of 47% over last year’s total of 446.” (Identity Theft Resource Center, 2009)
“62% [of respondents] use live data for testing of applications and 62% of respondents use live data for software development.” (Ponemon Institute, December 2007)
The average cost of a data breach has risen to $202 per customer record, which translates to roughly USD 20 million per 100,000 records. (Ponemon Institute, 2009)
Recently, the credit card processor associated with the TJX data breach was fined $880,000 for failing to meet the PCI DSS standard. In the same incident, TJX paid a $40.9 million settlement to Visa.
Many firms have settled lawsuits for millions of dollars. While the cost of fines and lawsuits can be measured, the cost of lost reputation and customer trust is harder to quantify. Data protection is clearly a serious issue, and, as the Ponemon Institute study shows, it is becoming a bigger concern with each passing year.
As discussed above, one of the regulators’ main concerns is the use of production (live) data in non-production environments. Firms often prepare test beds for testing IT applications and products before deploying them. Because testing requires “real-like” data, almost 62% of firms use production data for it, which poses serious risks of identity and data theft. Firms take great pains to secure their live production data, but the same standards are often not applied once that data is copied to a non-production environment.
What is Data Masking
Data masking is the process of obscuring (masking) specific data elements within data stores. It ensures that sensitive data is replaced with realistic but not real data. The goal is that sensitive customer information is not available outside of the authorized environment. Data masking is typically done while provisioning non-production environments so that copies created to support test and development processes are not exposing sensitive information. Masking algorithms are designed to be repeatable so referential integrity is maintained.
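The point about repeatable masking can be illustrated with a minimal Python sketch. It assumes a keyed hash (HMAC-SHA256) as the masking function; the key, the `mask_id` helper and the sample tables are hypothetical and are not taken from any particular masking product. Because the same input always yields the same surrogate, a value that links two tables still links them after masking.

```python
import hashlib
import hmac

# Hypothetical secret for illustration; it would be kept out of the non-production environment.
SECRET_KEY = b"example-masking-key"

def mask_id(value: str, digits: int = 9) -> str:
    """Map a sensitive identifier to a stable, numeric-looking surrogate."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Reduce the keyed hash to a fixed-width decimal string so the column stays numeric.
    return str(int(digest, 16) % 10**digits).zfill(digits)

# Two tables that reference the same customer through the same sensitive key.
customers = [{"ssn": "123-45-6789", "segment": "retail"}]
accounts = [{"ssn": "123-45-6789", "balance": 2500.0}]

masked_customers = [{**c, "ssn": mask_id(c["ssn"])} for c in customers]
masked_accounts = [{**a, "ssn": mask_id(a["ssn"])} for a in accounts]

# The join key is no longer the real value, but it still matches across tables.
print(masked_customers[0]["ssn"] == masked_accounts[0]["ssn"])
```

Reducing the hash to nine digits means two different inputs could occasionally collide, so a real implementation would need to check for that.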
Differences between encryption and masking
Encryption is appropriate when only people holding the right “keys” should be able to view the data. Encrypted data loses all its properties, however, so it cannot be used by developers and QA professionals who need “real-like” data for testing applications. Data masking, in contrast, prevents abuse while ensuring that the properties of the data remain as they are in the production environment.
Different Methods For Data Masking
The following is a brief discussion of the various methods used for data masking, along with guidance on when to use each.
Nulling
Deletes a column of data by replacing it with NULL values
Useful with role-based access, when the data must not be revealed
Cannot be used in a testing environment, as the data properties are lost
On its own, nulling is not really a data masking technique, but it is used together with other methods, e.g. credit card numbers masked as 4234-XXXX-XXXX-6565
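The two ideas above, nulling a whole column and masking only part of a value, can be sketched in Python as follows. The function names and the choice of keeping the first and last four digits are assumptions made for illustration.

```python
def null_column(rows, column):
    """Nulling: return copies of the rows with one column replaced by None (NULL)."""
    return [{**row, column: None} for row in rows]

def mask_card_number(card_number: str, keep_first: int = 4, keep_last: int = 4) -> str:
    """Partial masking: replace the middle digits with 'X', keeping separators in place."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    seen = 0
    out = []
    for ch in card_number:
        if ch.isdigit():
            seen += 1
            visible = seen <= keep_first or seen > total_digits - keep_last
            out.append(ch if visible else "X")
        else:
            out.append(ch)  # keep dashes and spaces so the format is preserved
    return "".join(out)

print(mask_card_number("4234-1111-2222-6565"))  # 4234-XXXX-XXXX-6565
print(null_column([{"id": 1, "ssn": "123-45-6789"}], "ssn"))
```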
Substitution
Randomly replaces the contents of a column of data with information that looks similar but is completely unrelated to the real details
Preserves the data properties
Because no logic or relationship is involved (unlike ageing and reordering), a large amount of random substitute data has to be stored
Finding the required random data to substitute and developing the procedures to accomplish the substitution can be a major effort
Generating large volumes of random “real-like” data is difficult in some cases, e.g. street addresses
Useful for generic data such as names, addresses and numerical data with no properties (credit card prefixes and suffixes etc.)
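A minimal Python sketch of substitution is shown below. The lookup lists and the `substitute_names` helper are hypothetical; as noted above, a real project would need far larger sets of substitute data.

```python
import random

# Hypothetical substitute data; real lookup tables would be much larger.
FIRST_NAMES = ["Alex", "Jordan", "Morgan", "Taylor", "Casey"]
LAST_NAMES = ["Smith", "Patel", "Garcia", "Chen", "Brown"]

def substitute_names(rows, rng):
    """Replace each name with a randomly chosen, unrelated but realistic-looking name."""
    masked = []
    for row in rows:
        fake = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
        masked.append({**row, "name": fake})  # other columns are left untouched
    return masked

rows = [
    {"id": 1, "name": "John Doe", "balance": 1200.50},
    {"id": 2, "name": "Jane Roe", "balance": 830.00},
]
masked_rows = substitute_names(rows, random.Random(42))
print(masked_rows)
```

Because each name is drawn independently, the same customer would receive a different substitute on every run; where referential integrity matters, the choice would have to be made repeatable.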
Shuffling/Reorder
The data in a column is randomly moved between rows until there is no longer any reasonable correlation with the remaining information in the row
Since the data is only jumbled, the end user still has access to the entire data set and can run meaningful queries on it
Shuffling algorithms fail if they are simple enough to be easily decoded
Useful only on large amounts of data
Should be used along with techniques such as ageing and variation, so that the data is not only shuffled but also increased or decreased by some fixed percentage
On the plus side, this is one of the easiest and fastest ways of masking data
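Shuffling a single column can be sketched in a few lines of Python. The `shuffle_column` helper and the sample rows are hypothetical.

```python
import random

def shuffle_column(rows, column, rng):
    """Randomly reassign the values of one column across rows; other columns stay put."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]

rows = [
    {"id": 1, "name": "A", "salary": 52000},
    {"id": 2, "name": "B", "salary": 61000},
    {"id": 3, "name": "C", "salary": 75500},
    {"id": 4, "name": "D", "salary": 48000},
]
shuffled = shuffle_column(rows, "salary", random.Random(3))

# Aggregates over the column are unchanged, which is why queries remain meaningful.
print(sum(r["salary"] for r in rows) == sum(r["salary"] for r in shuffled))
```

With only a handful of rows a value can easily land back in its original row, which is one reason the technique is useful only on large amounts of data.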
Numeric alteration
Increases or decreases numerical values by a percentage
The percentage can be fixed or random, but is selected so that the data stays within permissible or probable values
It is generally used in combination with other techniques
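As a rough Python sketch of this method, the hypothetical `vary_amount` function below shifts a value by a random percentage and clamps the result to a permitted range. The 10% limit and the bounds are assumptions chosen for illustration.

```python
import random

def vary_amount(value, max_pct, rng, lower=0.0, upper=float("inf")):
    """Shift value by a random percentage in [-max_pct, +max_pct], clamped to [lower, upper]."""
    pct = rng.uniform(-max_pct, max_pct)
    varied = value * (1 + pct / 100.0)
    return round(min(max(varied, lower), upper), 2)

rng = random.Random(7)
salaries = [52000.0, 61000.0, 75500.0]
masked_salaries = [vary_amount(s, max_pct=10, rng=rng, lower=0.0) for s in salaries]
print(masked_salaries)
```

Passing a fixed percentage instead of drawing one at random would give the fixed variant described above.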
Gibberish Generation
Given any input, the computer generates output that is random but has the same statistical distribution of characters or combinations of characters. (A character may be a letter, a digit, a space, a punctuation mark, etc.)
In level 1 gibberish, the output has the same distribution of single characters as the input. For example, the probability of seeing characters like "e" or "z" or "." will be approximately the same in the output as in the input. In level 2 gibberish, the output has the same distribution of character pairs as the input. For example, the probability of seeing a pair like "th" or "te" or "t." will be approximately the same in the output as in the input.
In general, in level n gibberish, the output has the same distribution of groups of n characters (n-tuples) as the input.
Just like encryption, it renders the data meaningless
Should be used with role-based access to data, i.e. when you want to selectively shield data based on the role or purpose of the individual
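The level-n idea described above can be sketched in Python. This is one possible approach, not a reference implementation: it assumes a simple character-level Markov chain in which each new character is drawn according to the characters that followed the previous n-1 characters in the input. The `gibberish` function and the sample text are hypothetical.

```python
import random
from collections import defaultdict

def gibberish(text: str, level: int, length: int, rng: random.Random) -> str:
    """Generate level-`level` gibberish whose n-character groups mimic those of `text`."""
    if level < 1 or len(text) < level:
        raise ValueError("level must be >= 1 and no longer than the input text")
    if level == 1:
        # Level 1: draw single characters with the same frequencies as the input.
        return "".join(rng.choice(text) for _ in range(length))

    # Level n: record which character follows each (n-1)-character context.
    followers = defaultdict(list)
    for i in range(len(text) - level + 1):
        context, nxt = text[i:i + level - 1], text[i + level - 1]
        followers[context].append(nxt)

    start = rng.randrange(len(text) - level + 2)
    out = list(text[start:start + level - 1])
    while len(out) < length:
        context = "".join(out[-(level - 1):])
        choices = followers.get(context)
        if not choices:
            # Dead end (context seen only at the very end of the input): restart elsewhere.
            start = rng.randrange(len(text) - level + 2)
            out.extend(text[start:start + level - 1])
            continue
        out.append(rng.choice(choices))
    return "".join(out[:length])

sample = "the quick brown fox jumps over the lazy dog. the theory of the thing."
level1 = gibberish(sample, level=1, length=40, rng=random.Random(1))
level2 = gibberish(sample, level=2, length=40, rng=random.Random(1))
print(level1)
print(level2)
```

The match to the input distribution is only approximate for short outputs, and a restart after a dead end can create a character group that never occurred in the input.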
Currently there are two main approaches to data masking in non-production environments:
EML (Extract, Mask and Load): Data is extracted from the production database, masked, and then loaded onto the pre-production server. It is useful when loading large amounts of data.
IPM (In-Place Masking): Data is loaded directly into the non-production database, where specific columns are masked before the data is released to QA and developers. It is useful when the data volume is small and the sensitive data to protect is well defined.
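The difference between the two approaches can be sketched with Python's built-in sqlite3 module, using in-memory databases as stand-ins for the production and non-production servers. The table layout, the toy `mask_ssn` rule and the function names are all assumptions made for illustration.

```python
import sqlite3

def mask_ssn(ssn: str) -> str:
    """Toy masking rule for illustration: hide all but the last four characters."""
    return "XXX-XX-" + ssn[-4:]

def make_db(rows=()):
    """Create an in-memory database with a customers table, optionally pre-populated."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, ssn TEXT)")
    db.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    db.commit()
    return db

def extract_mask_load(prod, target):
    """EML: rows are masked in flight, so unmasked data never reaches the target."""
    for id_, name, ssn in prod.execute("SELECT id, name, ssn FROM customers"):
        target.execute("INSERT INTO customers VALUES (?, ?, ?)", (id_, name, mask_ssn(ssn)))
    target.commit()

def mask_in_place(db):
    """IPM: the data is already loaded; overwrite the sensitive column before release."""
    rows = db.execute("SELECT id, ssn FROM customers").fetchall()
    db.executemany("UPDATE customers SET ssn = ? WHERE id = ?",
                   [(mask_ssn(ssn), id_) for id_, ssn in rows])
    db.commit()

prod_rows = [(1, "John Doe", "123-45-6789"), (2, "Jane Roe", "987-65-4321")]
prod = make_db(prod_rows)

eml_target = make_db()            # empty pre-production database
extract_mask_load(prod, eml_target)

ipm_target = make_db(prod_rows)   # raw copy of production data
mask_in_place(ipm_target)

print(eml_target.execute("SELECT * FROM customers ORDER BY id").fetchall())
print(ipm_target.execute("SELECT * FROM customers ORDER BY id").fetchall())
```

Both routes end with the same masked table. The practical difference is that with in-place masking the unmasked copy exists in the non-production database until the update has run.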
With data masking, one can ensure that data retains its properties and can still be used for analytical purposes. Given that almost 70% of all data thefts are internal, some form of data masking should also be employed within the organization to guard against internal threats.
Guidelines for Effective Data Masking
Start with fewer applications. As discussed above, financial firms need to protect almost 85% of their entire data. If a firm can protect 15% of its data each year, that is a good achievement.
As discussed under the various methods above, a single method is generally not sufficient for data protection; each has its pros and cons.
Ideally, data masking should be combined with role-based access to (or exposure of) data to provide double protection.
The IT side is just one aspect of data masking. The first thing firms should do is define a firm-wide data protection policy. Without defining KRIs and effective ways to measure them, no operational risk initiative can succeed.
To save costs, firms should consider third-party software vendors for IT products and solutions around data masking. The policies and procedures should be maintained in-house.
Abhishek Dhall, Headstrong, August 2009