Full products, DLP lite, content analysis
Data loss prevention (DLP) may seem confusing and complex, but that is mostly the result of a relatively generic-sounding term combined with vendor marketing programs that latch onto any term they think will sell a product.
The data loss prevention technology itself is straightforward once it’s broken down, even taking into account the differences between dedicated tools and DLP-lite features. Let’s take a look at DLP technology and its features.
Our definition of full DLP reads as follows:
“Products that, based on central policies, identify, monitor and protect data at rest, in motion and in use, through deep content analysis.”
This sums up the three defining characteristics of the technology:
- In-depth content analysis
- Wide content coverage across multiple platforms and locations
- Centralized policy management
Partial-suite tools include deep content analysis and central policy (and incident) management, but only on a single platform (such as endpoints). DLP-lite tools include basic content analysis, but generally lack dedicated workflow and sacrifice wide coverage. It is easiest to describe a complete DLP tool, because that gives you the knowledge you need to evaluate the other options, which are subsets of it.
Content analysis is the defining characteristic of DLP. If a tool doesn’t include content analysis, even basic analysis, it’s not DLP. Content analysis is a three-part process:
- You first capture the data,
- Then you crack the file format or reconstruct the traffic, and
- Finally, you perform analysis using one or more techniques to identify policy violations.
Almost all DLP tools also capture contextual data, such as source and destination, and include it in the analysis. Monitoring and capturing data with collectors is a technical architecture topic, so let’s start with file cracking.
File cracking is an unofficial term for parsing the text out of a source so it can be passed to the content analysis engine. Analysis engines need to work with text, and many of the file and data formats we use on a daily basis, such as Office documents or PDFs, are binary data. The file cracker takes a file, determines the format, and then uses a parser to extract all the text. Some tools can handle hundreds of file types, including complex situations like documents embedded in other documents and then bundled into a .zip file. It is the collector’s job to assemble the file and pass it on for cracking. This can be as simple as handing over a stored file, or as complex as extracting a document that is being streamed to a cloud service over HTTP.
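To make the file-cracking step concrete, here is a minimal Python sketch of the idea, not how any particular DLP product does it. The names `detect_format`, `crack` and `build_sample` are hypothetical, the format table only knows three cases, and PDF parsing is left as a stub because it needs a real parser. It uses only the standard library and shows the part that matters: identify the format, then recurse into containers until you reach text.

```python
import io
import zipfile


def detect_format(data: bytes) -> str:
    """Guess the format from leading magic bytes (a deliberately tiny table)."""
    if data.startswith(b"PK\x03\x04"):
        return "zip"  # also the container format behind .docx and .xlsx
    if data.startswith(b"%PDF-"):
        return "pdf"
    return "text"


def crack(data: bytes, name: str = "<root>", depth: int = 0, max_depth: int = 5):
    """Yield (path, text) pairs for every piece of text found inside data."""
    if depth > max_depth:
        return  # guard against archives nested absurdly deep
    fmt = detect_format(data)
    if fmt == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                member = archive.read(info)
                yield from crack(member, f"{name}/{info.filename}", depth + 1, max_depth)
    elif fmt == "pdf":
        # Assumption: a real cracker would call a PDF parser here.
        yield name, ""
    else:
        yield name, data.decode("utf-8", errors="replace")


def build_sample() -> bytes:
    """Build an in-memory .zip that contains a text file and a nested .zip."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w") as archive:
        archive.writestr("notes.txt", "card 4111 1111 1111 1111")
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as archive:
        archive.writestr("readme.txt", "hello")
        archive.writestr("inner.zip", inner.getvalue())
    return outer.getvalue()
```

Calling `list(crack(build_sample(), "upload.zip"))` walks both levels of the archive and returns the text of `readme.txt` and of `notes.txt` inside the nested archive, each tagged with its path so an incident can point at the exact embedded file.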
Once the file is cracked, the content analysis engine evaluates the text and searches for policy matches. On occasion, tools will look for a binary match, as opposed to a text match, for data like audio and video files, but text analysis is where the real innovation lies.
Seven content analysis techniques are commonly available:
- Rules / regular expressions use text analysis to find matching patterns, such as the structure of a credit card or Social Security number. Some of these rules and regular expressions are quite complex in an effort to reduce false positives. This is the technique we see most often in DLP-lite features. While it can work well, it is still prone to false positives, especially in large environments. You would be surprised how many things match the format of a valid credit card number.
- Database fingerprinting (exact data matching) pulls data from a database and looks only for matches to that specific data. So you can load it with the hash values of your customers’ credit card numbers and stop seeing false positives when your employees order decorative tea cups from a website with their personal cards. Database fingerprinting greatly reduces false positives, but only works when you have a good data source. Depending on the size of the dataset, it typically cannot run on endpoints because of its system requirements.
- Partial document matching takes a source file, parses the text, and then looks for subsets of that text. It usually creates a series of overlapping hashes that let you do things like identify a single paragraph cut from a protected document and pasted into a webmail session. Like database fingerprinting, it may not work well on endpoints, depending on the size of your dataset, because of its performance requirements. However, it can handle very large sets when running on a server or appliance.
- Binary file matching creates a hash of a binary file. This is the technique used to protect non-text data, but it is the most prone to false negatives, because even a minor edit changes the file’s hash so that it no longer matches.
- Statistical analysis is a newer technique that uses machine learning or other methods to analyze a set of known “protected” data and known “clean” data and create rules for near matches. This is similar to antispam filtering (and based on the same math). Most other techniques require that you know exactly what you want to protect. Statistical analysis is the most prone to false positives, but it lets you protect items that look like known sensitive data even when they do not match it directly.
- Conceptual analysis / lexicon uses a combination of dictionaries, rules, and other analysis to protect classes of information that resemble a concept. For example, this could include looking for clues of insider trading or job searches by employees on the company network. This is the weakest of the techniques, due to the less clearly defined nature of a concept.
- Categories are predefined rule sets for common types of data, such as credit card numbers or health data, that many organizations want to protect. They let you jump-start your DLP project without having to build all of your policies by hand, and you can adjust them over time to better meet your needs.
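Three of the techniques above are easy to illustrate in a few lines of Python. The sketch below is a simplified illustration under stated assumptions, not a description of any vendor’s engine: the pattern, the function names, the five-word shingle size and the demo key are all made up for the example. It shows a regular expression tightened with a Luhn checksum, exact matching against keyed hashes of known card numbers, and overlapping hashes for partial document matching.

```python
import hashlib
import hmac
import re

# Assumption: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, the check-digit scheme used by payment card numbers."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Rule/regex technique: pattern match, then drop candidates failing Luhn."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


def fingerprint(value: str, key: bytes) -> str:
    """Keyed hash, so the engine can store fingerprints instead of raw values."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def exact_matches(text: str, known: set[str], key: bytes) -> list[str]:
    """Database fingerprinting: keep only numbers whose hash is in the known set."""
    return [n for n in find_card_numbers(text) if fingerprint(n, key) in known]


def shingle_hashes(text: str, size: int = 5) -> set[str]:
    """Hash every run of `size` consecutive words (overlapping windows)."""
    words = re.findall(r"\w+", text.lower())
    return {
        hashlib.sha256(" ".join(words[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(words) - size + 1)
    }


def shared_shingles(protected: str, candidate: str, size: int = 5) -> int:
    """Partial document matching: count windows the two texts have in common."""
    return len(shingle_hashes(protected, size) & shingle_hashes(candidate, size))
```

With a customer card number loaded as a fingerprint, `find_card_numbers` still reports an employee’s personal card that appears in the same message, while `exact_matches` reports only the customer’s, which is the tea cup scenario described above. `shared_shingles` returns a nonzero count when a sentence fragment from a protected document is pasted into other text, even if the words around it are new.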
A data loss prevention policy combines one or more of these techniques with contextual data and additional rules, such as severity based on the number of matches or requirements specific to a business unit.
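As a rough illustration of how such a policy might hang together, the Python sketch below combines a match count from a content analysis technique with two pieces of context, the destination and the business unit. Everything in it is an assumption made for the example: the `Event` and `Policy` classes, the field names, the thresholds and the three outcomes are hypothetical, and real products expose far richer policy models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """What the analysis engine and collector report for one piece of content."""
    match_count: int          # number of content matches found
    destination_domain: str   # context: where the data is going
    business_unit: str        # context: who is sending it


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: content matches plus context plus severity rules."""
    name: str
    trusted_domains: frozenset[str]
    exempt_units: frozenset[str]
    block_threshold: int      # match count at which severity becomes high


def evaluate(policy: Policy, event: Event) -> str:
    """Return 'allow', 'alert' or 'block' for a single event."""
    if event.match_count == 0:
        return "allow"  # no content match, nothing to enforce
    if event.business_unit in policy.exempt_units:
        return "allow"  # per-business-unit exception
    if event.destination_domain in policy.trusted_domains:
        return "allow"  # context says the destination is approved
    if event.match_count >= policy.block_threshold:
        return "block"  # high severity by count
    return "alert"      # low severity: record an incident for review


card_policy = Policy(
    name="Customer card numbers",
    trusted_domains=frozenset({"processor.example"}),
    exempt_units=frozenset({"fraud-investigation"}),
    block_threshold=10,
)
```

Under this sketch, a single card number sent to an unknown domain raises an alert, fifty of them are blocked, and the same fifty sent to the payment processor or by the exempt team pass through.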
About the Author:
Rich Mogull has nearly 20 years of experience in information security, physical security, and risk management. Prior to founding independent information security consulting firm Securosis, he spent seven years at Gartner Inc., most recently as Vice President, where he advised thousands of clients, wrote dozens of reports and was consistently rated one of Gartner’s top international speakers. He is one of the world’s leading authorities on data security technologies, including DLP, and has covered issues ranging from vulnerabilities and threats to risk management frameworks and major application security.