Data Classification: Complete Guide to Categorizing Sensitive Data
Data classification is the process of organising data assets into defined categories based on their sensitivity, content type, regulatory requirements, or business value. By assigning classification labels to datasets, tables, columns, and files, organisations create a systematic foundation for access control, data protection, privacy compliance, and risk management. Without classification, data governance is essentially guesswork: teams cannot protect what they cannot identify, and they cannot demonstrate compliance with regulations when they do not know which data those regulations cover.
Data classification organises data assets into sensitivity tiers (Public → Internal → Confidential → Restricted) and content type categories (PII, PHI, PCI) so that access controls, data masking, and compliance policies can be applied proportionately and automatically. It is the prerequisite for GDPR compliance, data security, and any governed use of sensitive data.
Why Data Classification Matters
Classification is a prerequisite for almost every downstream data governance activity. Data masking policies depend on knowing which columns contain sensitive data. Access control frameworks depend on knowing which datasets fall under which sensitivity levels. GDPR and CCPA compliance depends on knowing exactly where personal data resides across the organisation. Data retention policies depend on knowing which data is subject to legal hold requirements. Classification provides the metadata foundation that makes all of these policies executable.
The scale of modern data environments makes manual classification impractical. A large organisation might have hundreds of databases, thousands of tables, and millions of columns spread across cloud data warehouses, data lakes, SaaS applications, and on-premises systems. Without automated classification, the data governance team has no systematic way to identify where sensitive data lives, which regulations apply to it, or whether appropriate controls are in place.
Beyond compliance, classification enables proportionate security. Not all data deserves the same level of protection. Applying maximum security controls to every dataset is prohibitively expensive and operationally impractical. Classification allows organisations to apply strong controls where they are genuinely needed — on sensitive PII, financial records, and confidential intellectual property — while maintaining frictionless access to public and internal data that poses little risk if accessed broadly.
Classification Levels: A Standard Framework
Most organisations adopt a tiered classification framework with three to five levels. The most common four-tier model uses the following levels:
- Public — Information intended for unrestricted public access. This includes published marketing materials, public-facing website content, press releases, and open datasets. Public data can be shared freely without restrictions and does not require special handling controls.
- Internal — Information intended for use within the organisation but not for public disclosure. This includes internal policies, operational procedures, employee directories, and internal communications. Internal data should not be shared outside the organisation but does not require the same level of protection as confidential or restricted data.
- Confidential — Sensitive information that could cause significant harm if disclosed inappropriately. This includes customer personal data, financial records, business strategies, legal documents, and employee compensation information. Confidential data requires access controls, encryption in transit and at rest, and audit logging of access.
- Restricted — The most sensitive category, including information whose disclosure would cause severe harm: medical records, payment card data, trade secrets, and information subject to legal privilege. Restricted data requires the strongest available controls, strict need-to-know access, and additional oversight for any movement or processing.
The four-tier model is a starting point, not a mandate. Highly regulated industries such as healthcare, financial services, and government often require more granular frameworks that reflect sector-specific regulatory categories. Organisations should resist the temptation to create overly complex frameworks with many fine-grained levels. A framework that nobody can remember or apply consistently is worse than a simpler framework that is actually used.
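As a concrete illustration, a four-tier framework might be encoded as follows. This is a minimal sketch: the control flags are assumptions chosen for the example, not a canonical policy, and real handling requirements would be defined with legal and security stakeholders.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Four-tier sensitivity framework; ordering reflects increasing controls."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Illustrative minimum controls per tier (assumed values for the sketch).
HANDLING = {
    Sensitivity.PUBLIC:       {"encrypt_at_rest": False, "audit_access": False, "external_sharing": True},
    Sensitivity.INTERNAL:     {"encrypt_at_rest": False, "audit_access": False, "external_sharing": False},
    Sensitivity.CONFIDENTIAL: {"encrypt_at_rest": True,  "audit_access": True,  "external_sharing": False},
    Sensitivity.RESTRICTED:   {"encrypt_at_rest": True,  "audit_access": True,  "external_sharing": False},
}


def required_controls(level: Sensitivity) -> dict:
    """Look up the minimum handling controls for a classification level."""
    return HANDLING[level]
```

An ordered enum makes "proportionate security" mechanical: a policy engine can compare a dataset's level against a threshold (for example, audit everything at CONFIDENTIAL or above) instead of special-casing each dataset.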
Classification by Data Type: PII, PHI, and Financial Data
In addition to sensitivity levels, data is commonly classified by content type, particularly for compliance purposes. Personally Identifiable Information (PII) is any information that can be used to identify a specific individual, either directly or in combination with other data. Direct identifiers include name, email address, phone number, national ID number, passport number, and date of birth. Indirect identifiers include IP addresses, device identifiers, location data, and behavioural data that can be linked back to individuals through re-identification.
Protected Health Information (PHI) under HIPAA covers any health information that can be linked to a specific individual. The HIPAA Safe Harbor de-identification standard defines 18 specific identifiers that must be removed for health data to be considered de-identified, including names, geographic subdivisions smaller than a state, dates (other than year) related to the individual, and full-face photographs.
Payment Card Industry (PCI) data includes primary account numbers (PANs), cardholder names, expiration dates, and card verification values. PCI DSS imposes strict requirements on the storage, transmission, and processing of this data, including prohibition on storing CVV/CVC values after transaction authorisation.
Financial data encompasses a broad range of sensitive information including bank account numbers, tax identifiers, salary information, financial statements, and trading activity. Financial data is subject to multiple overlapping regulatory frameworks depending on jurisdiction and industry sector.
Classification Methods: Manual, Rule-Based, and ML-Based
Manual classification relies on data stewards and domain experts to review and label data assets. It is the most accurate classification method for well-understood, stable datasets — an experienced data steward who knows the business domain will correctly identify sensitive columns that automated tools might miss. However, manual classification does not scale to large, dynamic data environments. It is also inconsistent: different stewards may apply the same classification standards differently, and manual labels quickly become stale as new data sources are added.
Rule-based classification uses pattern matching, regular expressions, and keyword dictionaries to identify sensitive data automatically. Examples include recognising credit card numbers by their 13-19-digit format and Luhn check validity, identifying email addresses using a regex pattern, or flagging columns named "ssn" or "tax_id" as containing national identifier data. Rule-based classification is fast, scalable, and transparent: the rules can be audited, explained, and adjusted. Its limitation is accuracy: simple patterns generate false positives (a 16-digit order number that happens to pass the Luhn check) and false negatives (a card number embedded in a free-text notes field).
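A minimal sketch of such rules in Python, assuming deliberately simplified patterns that a production rule set would refine:

```python
import re

# Simplified illustrative patterns; real rule sets are more thorough.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
CARD_RE = re.compile(r"^\d{13,19}$")
NATIONAL_ID_COLUMNS = {"ssn", "tax_id", "national_id"}


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    digits = [int(d) for d in number]
    odd, even = digits[-1::-2], digits[-2::-2]
    total = sum(odd) + sum(sum(divmod(2 * d, 10)) for d in even)
    return total % 10 == 0


def classify_value(column_name: str, value: str) -> str | None:
    """Return a content-type label for one sampled value, or None."""
    if column_name.lower() in NATIONAL_ID_COLUMNS:
        return "NATIONAL_ID"
    if EMAIL_RE.match(value):
        return "EMAIL"
    digits = value.replace(" ", "").replace("-", "")
    if CARD_RE.match(digits) and luhn_valid(digits):
        return "PAYMENT_CARD"
    return None


print(classify_value("contact", "jane.doe@example.com"))     # EMAIL
print(classify_value("ref_no", "4111 1111 1111 1111"))       # PAYMENT_CARD
```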
ML-based classification uses trained models to identify sensitive data based on column statistics, sample values, column names, and contextual signals from surrounding columns and table names. ML models can identify sensitive data in columns with non-standard names, detect partial matches, and handle multi-lingual data that rule sets struggle with. Modern ML-based classification systems combine named entity recognition (NER) models with metadata signals such as column name patterns and statistical distributions of values.
The best practice is to combine all three approaches: ML-based automated classification at scale, with rule-based checks for well-defined categories like credit card numbers and email addresses, and manual review workflows for low-confidence suggestions and high-risk datasets.
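As a sketch of how that combination might work: the ml_label and ml_score inputs stand in for output from an NER or column-classifier model (not shown), and the auto-apply threshold is an assumed tuning parameter.

```python
from dataclasses import dataclass

AUTO_APPLY_THRESHOLD = 0.90  # assumption: tune per organisation


@dataclass
class Suggestion:
    column: str
    label: str
    confidence: float
    needs_review: bool


def combine(column: str, rule_label: str | None,
            ml_label: str | None, ml_score: float) -> Suggestion | None:
    """Merge rule-based and ML signals into one classification suggestion.

    Deterministic rule hits (e.g. Luhn-valid card numbers) are trusted
    outright; ML-only hits are auto-applied above the threshold and
    routed to a data steward otherwise.
    """
    if rule_label is not None:
        return Suggestion(column, rule_label, 1.0, needs_review=False)
    if ml_label is not None:
        return Suggestion(column, ml_label, ml_score,
                          needs_review=ml_score < AUTO_APPLY_THRESHOLD)
    return None


# An ML-only hit below the threshold goes to steward review:
print(combine("cust_contact", None, "EMAIL", 0.72))
```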
Automated Classification at Scale
Automated classification is the only viable approach for large data estates. The key capabilities that modern automated classification platforms provide include:
- Discovery — automatically scan and inventory all data sources including databases, data warehouses, data lakes, SaaS applications, and file systems to identify all data assets requiring classification
- Detection — apply rule-based patterns and ML models to analyse column samples and metadata, generating classification suggestions with confidence scores
- Review workflows — route low-confidence suggestions to named data stewards for review and approval, with audit trails of all classification decisions
- Propagation — automatically propagate classification labels through data lineage chains so that derived columns inherit the sensitivity of their source columns (see the sketch after this list)
- Drift detection — monitor classified assets for changes in data content or schema that might invalidate existing classifications
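A minimal sketch of label propagation over a lineage graph, as referenced above; the graph itself is hypothetical, and real lineage edges would come from the catalog:

```python
from collections import deque

# Hypothetical lineage edges: source column -> columns derived from it.
LINEAGE = {
    "silver.customers.email": ["gold.reporting.customer_email"],
    "gold.reporting.customer_email": ["gold.exports.contact_list"],
}


def propagate(labels: dict[str, str],
              lineage: dict[str, list[str]]) -> dict[str, str]:
    """Push each label downstream breadth-first so derived columns
    inherit source sensitivity; explicit labels are never overwritten."""
    result = dict(labels)
    queue = deque(labels)
    while queue:
        column = queue.popleft()
        for downstream in lineage.get(column, []):
            if downstream not in result:
                result[downstream] = result[column]
                queue.append(downstream)
    return result


print(propagate({"silver.customers.email": "PII"}, LINEAGE))
# {'silver.customers.email': 'PII', 'gold.reporting.customer_email': 'PII',
#  'gold.exports.contact_list': 'PII'}
```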
Cloud-native classification services include AWS Macie (which uses ML to discover and classify sensitive data in S3), Google Cloud DLP, and Microsoft Purview Information Protection. These services are well integrated with their respective cloud ecosystems and provide out-of-the-box detection for common sensitive data types.
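For example, a minimal sketch of calling Google Cloud DLP's content inspection API from Python via the google-cloud-dlp client library; the project ID is a placeholder, and the two info types shown are a small subset of the built-in detectors:

```python
from google.cloud import dlp_v2  # pip install google-cloud-dlp

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project ID

response = client.inspect_content(
    request={
        "parent": parent,
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"},
                           {"name": "CREDIT_CARD_NUMBER"}],
            "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
            "include_quote": True,
        },
        # One sampled value; in practice you would batch column samples.
        "item": {"value": "Contact jane.doe@example.com, card 4111 1111 1111 1111"},
    }
)

for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood.name, finding.quote)
```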
Integration with Data Catalog
Data classification reaches its full potential when integrated with a data catalog. A catalog provides the metadata context that makes classification actionable: the list of all data assets across the organisation, their technical metadata (schema, data types, row counts), their business metadata (owners, descriptions, glossary links), and their operational metadata (freshness, quality scores, lineage).
When classification labels are stored and managed in a data catalog, data governance teams can answer questions like: "Show me all tables containing PII in our AWS environment that were accessed by users outside the data team in the last 30 days." Or: "Which data products expose restricted financial data, and are they all covered by our masking policy?" These queries require the combination of classification labels with lineage, access logs, and governance policy data that only a well-integrated catalog can provide.
Database-native classification features in platforms like Snowflake, Databricks Unity Catalog, and Microsoft SQL Server allow classification tags to be applied directly to tables and columns in the database, where they can be used to enforce access control policies and column masking rules automatically. Data masking driven by classification labels ensures that sensitive columns are automatically protected when accessed by users without the appropriate permissions.
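As an illustration of tag-driven enforcement in Snowflake, the following sketch uses the snowflake-connector-python library to create a classification tag, attach a masking policy to it, and tag a column. All object names, the role check, and the connection parameters are assumptions for the example:

```python
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",  # placeholders
    role="GOVERNANCE_ADMIN", warehouse="ADMIN_WH",
)

statements = [
    # A tag to carry sensitivity labels (illustrative names throughout).
    """CREATE TAG IF NOT EXISTS governance.tags.sensitivity
       ALLOWED_VALUES 'PUBLIC', 'INTERNAL', 'CONFIDENTIAL', 'RESTRICTED'""",
    # A masking policy that reveals values only to an approved role.
    """CREATE MASKING POLICY IF NOT EXISTS governance.policies.mask_string
       AS (val STRING) RETURNS STRING ->
       CASE WHEN CURRENT_ROLE() = 'PII_READER' THEN val
            ELSE '*** MASKED ***' END""",
    # Tag-based masking: every STRING column carrying the tag is masked.
    """ALTER TAG governance.tags.sensitivity
       SET MASKING POLICY governance.policies.mask_string""",
    # Classify a column; the policy now applies to it automatically.
    """ALTER TABLE crm.public.customers MODIFY COLUMN email
       SET TAG governance.tags.sensitivity = 'CONFIDENTIAL'""",
]

with conn.cursor() as cur:
    for stmt in statements:
        cur.execute(stmt)
conn.close()
```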
GDPR and Data Classification
GDPR does not mandate a specific classification framework, but its requirements effectively require one. The regulation requires organisations to maintain a record of processing activities (ROPA) that documents, for each processing activity, what personal data is involved, for what purpose, on what legal basis, with what retention period, and with which third parties it is shared. Without data classification, building and maintaining a ROPA is essentially impossible at any meaningful scale.
GDPR's data minimisation principle — collect only the data that is strictly necessary for the stated purpose — also depends on classification. You cannot minimise collection of data types you have not defined and identified. Similarly, the right to erasure (right to be forgotten) requires the ability to identify and delete all records containing a specific individual's personal data across all systems, which requires comprehensive classification of personal data locations.
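To illustrate why erasure depends on classification, here is a hypothetical sketch that turns a catalog export (tables mapped to their columns classified as email identifiers) into a parameterised deletion plan. The index, table names, and narrow scope are all assumptions made for the example:

```python
# Hypothetical catalog export: table -> columns classified as EMAIL
# identifiers. A real index would come from the data catalog's API.
CLASSIFICATION_INDEX = {
    "crm.public.customers": ["email", "recovery_email"],
    "billing.public.invoices": ["customer_email"],
}


def erasure_plan(subject_email: str) -> list[tuple[str, dict]]:
    """Build parameterised DELETE statements for every table holding the
    data subject's email. Legal-hold checks, soft deletes, and actual
    execution are deliberately out of scope for this sketch."""
    plan = []
    for table, columns in CLASSIFICATION_INDEX.items():
        predicate = " OR ".join(f"{col} = %(subject)s" for col in columns)
        plan.append((f"DELETE FROM {table} WHERE {predicate}",
                     {"subject": subject_email}))
    return plan


for sql, params in erasure_plan("jane.doe@example.com"):
    print(sql, params)
```

Without the classification index, the loop has nothing to iterate over: any table whose email columns were never labelled silently escapes the erasure request.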
Special categories of personal data under GDPR — including health data, genetic data, racial or ethnic origin, political opinions, religious beliefs, and sexual orientation — attract heightened protection requirements and require explicit identification through classification. Organisations processing these categories must implement appropriate technical and organisational measures that go beyond standard personal data protections.
A successful GDPR data classification programme covers four phases: inventory (discover all data sources and assets); framework definition (establish classification levels and content type categories with legal and compliance stakeholders); initial classification (apply automated tools and human review); and operationalisation (integrate labels into access control, masking, monitoring, and governance workflows).
Data Classification with Dawiso
Dawiso provides automated data classification capabilities integrated directly into its metadata management platform. When connected to data sources, Dawiso scans column samples and metadata to detect sensitive data types — PII, financial data, health information — and suggests classification labels that data stewards can review and approve.
Approved labels flow through Dawiso's lineage graph, so if a PII column in a Silver layer table is transformed into a derived column in a Gold reporting table, the classification propagates automatically to the derived column, ensuring that sensitivity is not lost through transformation chains. This automated propagation dramatically reduces the manual effort required to keep classifications current in fast-moving data environments.
Dawiso's integration of classification with its data catalog allows governance teams to search and filter the entire data estate by classification level, making it straightforward to answer audit questions like "show all columns containing payment card data" or "list all datasets with restricted classification accessed in the last quarter." These capabilities make classification not just a metadata label but a working governance control that drives real protection outcomes across the organisation's data platform.