Implementing envelope encryption

Type: Explanation
Created:
Team: Security

draft

Three ways to scope the keys

Envelope encryption gives you one decisive lever: how many DEKs you have and what each one protects — the granularity introduced in Envelope encryption. The crypto itself is identical in every case: the same AES-GCM over the data, the same KEK in KMS wrapping the DEKs. What changes is the scope of a key, and that single choice decides two things — your blast radius if a key leaks, and whether you can erase a data subject simply by destroying a key.

Three scopes come up in practice: one key for everything, a key per user, and a key per tenant.

One key for everything

A single DEK encrypts every field across all data, and one KEK in KMS wraps that DEK. The DEK is unwrapped once at startup and held in memory; encryption is then a plain encrypt(value) / decrypt(value) against that one key.

What you get. All the key-management benefits envelope encryption exists for: no plaintext master key sitting in a secret store, central rotation of the KEK, and a CloudTrail record of every unwrap. It's the simplest of the three to build — the code that encrypts a field never needs to know who owns the row.
The trade-off. One key means one blast radius — every record shares it. And there's no crypto-shredding: because the same key protects everyone, you can't make one person's data unreadable without affecting all of it.
GDPR fit. Satisfies the "encrypted at rest" expectation of Article 32 (security of processing). It gives you no shortcut for Article 17 erasure, though — with one shared key, deleting a subject still means finding and removing the actual data everywhere it lives (see What GDPR actually requires below).

A key per user

Each data subject gets their own DEK, and you select the right key per row.

How it works. Each user's DEK is wrapped by the KEK and stored in a key table keyed by scope — for example (USER, <id>) → wrapped DEK plus a version. A key service resolves the DEK for a row: check an in-memory cache (short TTL, bounded), otherwise load the wrapped DEK and ask KMS to unwrap it, then cache the plaintext. The first write for a new user generates a fresh DEK.
A self-describing value format. Because different rows use different keys, each stored value has to declare which key it needs. A small header does that: [version][key reference][IV][ciphertext + GCM tag]. Use AES-GCM with a random IV per value, so each value is independently decryptable and tamper-evident.
Where the crypto has to live — the real design decision. Per-user selection needs to know which user owns the row before it can pick a key. A stateless field-level hook never sees that — it only ever sees the field value in isolation. So the encryption has to move to where the whole entity is visible: ORM lifecycle listeners (on insert, update, and load) or the service layer. This relocation, not the crypto, is the bulk of the work in adopting per-user keys.
What you get. Crypto-shredding. To erase a user you destroy their key row, and the ciphertext becomes permanently unreadable everywhere it exists — primary, replicas, and backups — without hunting it down. Erasure is a one-row delete plus a cache eviction: effectively instant, and it lands across every copy at the same moment.
The trade-off. The most moving parts of the three — a key table, a key service, and per-row key resolution on the hot path. Adopting it over an existing single-key store also means re-encrypting the existing data under the new per-user keys. The usual containment is to scope per-user DEKs to only the fields that genuinely need erasure — special-category data such as national ID, salary, or health — and leave everything else on the single-key model, which keeps frequently-queried columns out of the per-row-key machinery.
GDPR fit. This is what makes Article 17 erasure ("the right to be forgotten") practical at scale — destroying one key is cleaner and more complete than chasing copies of the data through every replica and backup.
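
As a rough illustration of the self-describing value format and the key table described above, here is a minimal sketch. The byte layout (one version byte, one length byte, the key reference, a 12-byte IV, then ciphertext plus tag), the `USER:42:v1` reference syntax, and the names `pack_value`, `unpack_value`, `KeyTable`, and `KeyShredded` are all assumptions made for the example, not a format the source prescribes. The AES-GCM step itself is left out: the ciphertext is treated as opaque bytes, and the table holds wrapped DEKs without calling KMS.

```python
import struct
from dataclasses import dataclass

FORMAT_VERSION = 1
IV_LEN = 12  # 96-bit IV, the size commonly used with AES-GCM


def make_key_ref(scope: str, subject_id: str, version: int) -> str:
    # Hypothetical reference syntax, e.g. "USER:42:v1".
    return f"{scope}:{subject_id}:v{version}"


def pack_value(key_ref: str, iv: bytes, ciphertext_and_tag: bytes) -> bytes:
    """Build [version][key reference][IV][ciphertext + GCM tag]."""
    ref = key_ref.encode("utf-8")
    if len(iv) != IV_LEN or len(ref) > 255:
        raise ValueError("unexpected IV length or key reference too long")
    # One byte for the format version, one for the key-reference length.
    return struct.pack("!BB", FORMAT_VERSION, len(ref)) + ref + iv + ciphertext_and_tag


@dataclass(frozen=True)
class ParsedValue:
    key_ref: str
    iv: bytes
    ciphertext_and_tag: bytes


def unpack_value(blob: bytes) -> ParsedValue:
    """Split a stored value back into its header fields and ciphertext."""
    if len(blob) < 2:
        raise ValueError("truncated value")
    version, ref_len = struct.unpack_from("!BB", blob, 0)
    if version != FORMAT_VERSION:
        raise ValueError(f"unsupported format version {version}")
    ref_end = 2 + ref_len
    iv_end = ref_end + IV_LEN
    if len(blob) < iv_end:
        raise ValueError("truncated value")
    return ParsedValue(blob[2:ref_end].decode("utf-8"), blob[ref_end:iv_end], blob[iv_end:])


class KeyShredded(Exception):
    """Raised when a value references a key row that no longer exists."""


class KeyTable:
    """In-memory stand-in for the key table: (scope, id) -> {version: wrapped DEK}."""

    def __init__(self):
        self._rows = {}

    def put(self, scope, subject_id, version, wrapped_dek):
        self._rows.setdefault((scope, subject_id), {})[version] = wrapped_dek

    def lookup(self, key_ref):
        scope, rest = key_ref.split(":", 1)
        subject_id, version = rest.rsplit(":v", 1)
        try:
            return self._rows[(scope, subject_id)][int(version)]
        except KeyError:
            raise KeyShredded(key_ref) from None

    def destroy(self, scope, subject_id):
        # Crypto-shredding: once the row is gone, no stored value that
        # references it can be decrypted again.
        self._rows.pop((scope, subject_id), None)
```

On read, the caller would parse the header, resolve the wrapped DEK through the key reference, and unwrap it; after `destroy("USER", "42")` that lookup raises `KeyShredded` for every value the user ever wrote. A real deployment would also evict the plaintext DEK from any in-memory cache at the same time.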

tip

Don't hand-roll the message format and data-key caching. The AWS Encryption SDK implements envelope encryption, data-key caching, and a self-describing message format for you — turning much of this plumbing into configuration. Evaluate it before writing your own key service.

A key per tenant

Each tenant — each customer — gets its own DEK, and all of that tenant's data is encrypted under it.

Why this is cheaper than per-user. Recall the hinge from per-user keys: a stateless field-level hook can't select a key because it never sees who owns the row. At the tenant level that problem disappears. A multi-tenant system almost always resolves the tenant at the very start of each request (it has to, in order to route to the right database or schema), so the tenant identity is already sitting in an ambient, request-scoped context. The encryption hook can simply read the current tenant from that context and ask for that tenant's key. No lifecycle listeners, no moving crypto into the service layer, no per-entity lookups: the simple field-level hook stays. That single fact is why per-tenant is a small step where per-user was a project.

What it takes.

A key registry — one wrapped DEK per tenant. Its natural home is wherever tenant metadata already lives (a registry or control database), or a config row inside each tenant's own database.
A key service that resolves the DEK for the current tenant — read the tenant from the ambient context, fetch that tenant's wrapped DEK, unwrap it via KMS, and cache the plaintext in a bounded cache keyed by tenant. You then hold one DEK per active tenant in memory instead of one global key.
The encryption hook reads the key from the key service (current tenant) instead of a single global key.
A provisioning step — when a tenant is created, generate a DEK and store its wrapped form in the registry.
Migration is the pleasant part: you re-encrypt one tenant at a time, rolling it out client by client with a low blast radius, instead of one big risky backfill.
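
The key service in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `current_tenant`, `TenantKeyService`, the registry dictionary, and `fake_unwrap` are hypothetical names, the registry lookup and the KMS unwrap are injected as plain callables rather than real client calls, and the TTL and cache size are placeholder values.

```python
import contextvars
import time
from collections import OrderedDict

# Ambient, request-scoped tenant identity. In a real system the request
# middleware that routes to the tenant's database would set this (assumed).
current_tenant = contextvars.ContextVar("current_tenant")


class TenantKeyService:
    """Resolves the plaintext DEK for the current tenant, with a bounded TTL cache."""

    def __init__(self, load_wrapped_dek, unwrap, ttl_seconds=300.0,
                 max_entries=1000, clock=time.monotonic):
        self._load = load_wrapped_dek   # tenant id -> wrapped DEK (registry read)
        self._unwrap = unwrap           # wrapped DEK -> plaintext DEK (KMS call)
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._cache = OrderedDict()     # tenant id -> (expires_at, plaintext DEK)

    def current_dek(self):
        tenant = current_tenant.get()   # raises LookupError outside a request
        now = self._clock()
        hit = self._cache.get(tenant)
        if hit is not None and hit[0] > now:
            self._cache.move_to_end(tenant)
            return hit[1]
        dek = self._unwrap(self._load(tenant))
        self._cache[tenant] = (now + self._ttl, dek)
        self._cache.move_to_end(tenant)
        while len(self._cache) > self._max:
            self._cache.popitem(last=False)  # drop the least recently used tenant
        return dek

    def evict(self, tenant):
        self._cache.pop(tenant, None)


# Usage with hypothetical stand-ins for the registry and the KMS unwrap call.
registry = {"acme": b"wrapped:acme-dek", "globex": b"wrapped:globex-dek"}
kms_calls = []


def fake_unwrap(wrapped):
    kms_calls.append(wrapped)
    return wrapped[len(b"wrapped:"):]


service = TenantKeyService(registry.__getitem__, fake_unwrap)
current_tenant.set("acme")  # normally done once per request by middleware
dek = service.current_dek()
```

The field-level hook would call `service.current_dek()` instead of reading a global key; it never needs to know which row it is encrypting. Repeated calls for the same tenant are served from the cache until the TTL expires, which is what keeps KMS traffic low.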

The cost profile matches the single-key model — still one KEK, DEKs are cheap to store, and caching keeps KMS calls down.

The honest GDPR caveat. Be clear-eyed here: tenant-level isolation does not improve your individual-erasure story. An Article 17 request is almost always for one employee inside a tenant, not the whole client — and destroying a tenant's DEK would nuke that entire client's data, useless for "delete this one person." So for individual erasure you're still on the delete + backup-retention-window + suppression-log approach described under Erasure under a single key below. Per-tenant keys don't change that.

What per-tenant isolation genuinely buys you is two other things:

Defense in depth and a smaller blast radius. A leaked or mishandled DEK exposes one tenant, not all of them — and a breach correspondingly narrows to one client's data subjects (Article 34) rather than your entire customer base. Combined with already-separate per-tenant databases, that's a clean isolation story for security questionnaires and audits, and the per-tenant unwrap trail in CloudTrail is tidy Article 30 accountability evidence. The benefit is sharpest against a leaked key, though — an attacker who fully owns the running app can often request decrypts across tenants regardless.
Whole-tenant offboarding. When a client leaves and contractually demands "destroy all our data, including backups, now," destroying that tenant's DEK makes every copy encrypted under it (live, replica, and backup) instantly unrecoverable, which a backup-retention window cannot do. That is the processor's Article 28 duty (delete or return a client's data at the end of the relationship) executed cleanly, and in B2B SaaS it tends to come up more often than individual erasure. It is also the coarsest scope at which crypto-shredding works at all: a single shared DEK can't shred anyone, because the one key protects every tenant, so per-tenant is the minimum granularity the technique needs. One caveat carries forward: "every copy" only holds if the tenant's backups are encrypted wholesale under that key, not just its sensitive fields (see What crypto-shredding actually erases).

tip

Frame per-tenant keys as security isolation and clean client offboarding, not as individual GDPR erasure — because individual erasure is the one gap they don't close.

What GDPR actually requires

From a GDPR standpoint, erasure is the only place these scopes meaningfully differ. Everything else lines up the same:

Security and confidentiality (Article 32) — identical. With one key or a million, the database holds only ciphertext and the key lives in application memory, unwrapped from KMS. A database-only breach yields useless ciphertext either way.
Breach-notification exemption (Article 34) — identical. Encrypted, unintelligible data qualifies for the exemption regardless of how many keys protected it; the regulator doesn't count keys.

What per-user keys genuinely add is cryptographic isolation between subjects. The practical payoff is clean erasure, plus a slightly tighter blast radius if a key ever leaks. For the pure confidentiality story, all three scopes are a wash.

note

Crypto-shredding is a technique, not a GDPR requirement. Article 17 only says the data must become irrecoverable; it never mandates per-user keys. A single-key design can be fully compliant for erasure; it just satisfies the obligation differently, by actually deleting the data rather than destroying a key.

Erasure under a single key

Without per-user keys you can't crypto-shred, so erasure is handled the conventional way. It splits into two parts — one easy, one not.

The live database is the easy part. You hard-delete (or anonymize) the subject's rows, and the DELETE propagates to read replicas automatically. That alone satisfies erasure for your active data, with no crypto tricks at all.

Backups are the real friction — the same problem crypto-shredding was elegantly dodging. You can't reach into a backup snapshot to surgically scrub one user. Regulators have addressed this pragmatically: the accepted approach (the UK ICO's guidance on deleting from backups is the usual reference point, and the EU position is consistent) is to put backup data beyond use rather than instantly purge it — provided you:

Don't restore it into live processing while it still contains the erased user.
Have a defined retention window — snapshots age out on a schedule (say 30 or 90 days), so the data is genuinely gone after that period.
Re-apply the erasure on any restore — keep a record so a restored backup has the user re-deleted before it is used again.

That third point is the one piece of real engineering: maintain a small suppression log — a list of erased subject IDs (or hashes) — that your restore procedure consults to re-delete them. It's cheap, it's auditable (good Article 30 accountability evidence), and it's the standard answer for teams running automated database snapshots.

So the single-key erasure design is essentially: hard-delete from the live database → backups cycle out on their retention schedule → the suppression log guarantees re-deletion on any restore.
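
The suppression log and the re-delete step on restore might look something like the sketch below. It is an illustration only: `SuppressionLog`, `reapply_erasures`, and the `subject_id` row field are hypothetical names, the log is held in memory rather than in a durable store, and the choice of a keyed hash (HMAC-SHA-256) over a plain hash is an assumption made so the log itself does not become a list of raw subject identifiers.

```python
import hashlib
import hmac


class SuppressionLog:
    """Records erased subjects as keyed hashes so a restore can re-delete them."""

    def __init__(self, secret: bytes):
        self._secret = secret
        self._erased = set()

    def _digest(self, subject_id: str) -> str:
        return hmac.new(self._secret, subject_id.encode("utf-8"), hashlib.sha256).hexdigest()

    def record_erasure(self, subject_id: str) -> None:
        self._erased.add(self._digest(subject_id))

    def is_erased(self, subject_id: str) -> bool:
        return self._digest(subject_id) in self._erased


def reapply_erasures(restored_rows, log):
    """Return the restored rows with every previously erased subject removed."""
    return [row for row in restored_rows if not log.is_erased(row["subject_id"])]
```

The restore procedure would run `reapply_erasures` (or the equivalent DELETE statements) before the restored database is put back into service, so an erased subject never reappears in live processing.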

Two more options worth folding in:

Anonymization instead of deletion. You don't always have to delete. If you strip the identifying fields so a record can no longer be tied to a person, it falls outside GDPR entirely — often cleaner than a hard delete for data where you need to keep transactional totals (invoicing, aggregate HR figures) but not the person.
Legal-retention exemptions. Erasure isn't absolute — some records must be kept regardless of a request. That's a classification decision, covered in Retention versus erasure below.

What crypto-shredding actually erases

Crypto-shredding has one hard limit worth internalizing: destroying a key renders unreadable only the data that was actually encrypted under that key. The encryption boundary is therefore what decides how complete a shred is.

Field-level encryption shreds only the fields. If you encrypt a subset of columns — the sensitive ones — the rest of the row (names, emails, IDs, dates) stays plaintext in the database, and so stays plaintext in every backup. Destroying the key kills the encrypted columns and leaves everything else readable in the snapshot. Field-level crypto-shredding is not whole-footprint erasure, however tempting it is to assume otherwise.
To shred a whole footprint, the key has to wrap the whole store. Where you control the backup format — logical, per-tenant dumps — you can envelope-encrypt the entire dump under the tenant key rather than its individual fields. The backup becomes one opaque blob, and destroying the key makes all of it unrecoverable at once. That gives the tenant key two jobs: field-level protection inside the live database, and whole-blob protection of the backup files.
Managed snapshots are the awkward case. Automated snapshots (RDS-style) are encrypted under the instance's own KMS key, which every tenant on a shared instance also shares — so you can't shred it for one tenant without destroying everyone's. Per-tenant instances with their own keys solve it but cost more to run, and most teams don't go there. For that residue you fall back to the conventional playbook: a short, documented retention window, no restore into live use meanwhile, and the suppression log to re-delete on any restore (see Erasure under a single key).

The practical caution follows directly: adopt per-tenant keys only for field encryption and leave the logical backups unencrypted as a whole, and you take on all the key-management work without actually buying the offboarding benefit — the value lives at the backup-encryption layer, not the field layer. So the realistic offboarding design is a combination: drop the tenant's live database, crypto-shred the wholesale-encrypted logical backups, and let the managed snapshots age out under the retention window. The honest claim to a departing client is then not "every byte vanishes the instant we press the button" but "live data and all controlled backups destroyed immediately; the only residue is time-limited system snapshots that are never restored without re-deletion and expire within a defined window" — a strong, auditable statement, and the genuine ceiling short of giving every tenant its own database instance and KMS key.

Retention versus erasure

Erasure is not absolute, and the two obligations — "erase it on request" and "retain it to satisfy the law" — pull against each other by design. Article 17 itself carves out the exception: the right to erasure does not apply where processing is necessary to comply with a legal retention obligation. So the goal is never "delete everything" — it is delete what you can, retain what you are legally bound to, and don't conflate the two.

The mechanism that resolves it is classification. Sort personal data into buckets, each with a documented retention period and a defined end:

Erase-on-request / erase-on-offboard — the bulk of operational data, with no statutory retention. This is what crypto-shred and hard-delete handle.
Must-retain-for-period — the narrow slice a specific law requires (payroll figures, tax-relevant invoices, some employment records). Kept for the mandated window, then purged when the clock runs out.
Audit and log data — access logs, change history, and security trails kept for your own Article 32 and accountability needs. Its own retention logic again.

When an erasure or offboarding event fires, the flow is therefore: carve off the must-retain slice into a separate store — minimized to only the fields the obligation requires, on its own retention clock, with its legal basis recorded — then shred or delete everything else. Retention stops blocking erasure, because erasure becomes "destroy all of it except the carved-off subset, which lives on its own timer."
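
A minimal sketch of that carve-off flow, under stated assumptions: `RetentionRule`, `RetainedRecord`, `carve_and_erase`, and the `kind` field are hypothetical names, and the rule table is a placeholder. Real retention periods and legal bases come from the compliance classification, not from code.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionRule:
    fields: tuple        # the minimized set of fields the obligation requires
    retention_days: int  # placeholder value; set by compliance, not engineering
    legal_basis: str


@dataclass
class RetainedRecord:
    kind: str
    data: dict
    legal_basis: str
    purge_after: date


def carve_and_erase(records, rules, today):
    """Split a subject's records into a retained slice and a list of kinds to erase.

    Record kinds with a rule are minimized to the rule's fields and given their
    own purge date; every other kind is reported for shredding or deletion.
    """
    retained, erased = [], []
    for record in records:
        rule = rules.get(record["kind"])
        if rule is None:
            erased.append(record["kind"])
            continue
        minimized = {name: record[name] for name in rule.fields if name in record}
        retained.append(RetainedRecord(
            kind=record["kind"],
            data=minimized,
            legal_basis=rule.legal_basis,
            purge_after=today + timedelta(days=rule.retention_days),
        ))
    return retained, erased
```

The retained slice would be written to its own store before the shred or delete runs, and a scheduled job would purge each `RetainedRecord` once `purge_after` has passed.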

This is the same separation that keeps whole-tenant offboarding clean. At tenant exit the retention duty for employee data usually belongs to the departing client (the controller), not you (the processor, whose Article 28 duty is to delete or return). And your own must-keep business records — invoices, contracts, payment history — live in your finance systems under your own keys, never under a tenant DEK, so a shred never touches them.

The trap to avoid is treating "we might need it for audit" as licence to keep full personal records indefinitely. That fails twice over — on storage limitation (no defined end) and purpose limitation (audit is not why the data was collected). The defensible posture is always specific: which data, under which legal basis, for how long, and what happens when the period expires.

note

Which records carry mandatory retention, and for how long, is a compliance and DPO call, not a technical one — it turns on the applicable tax, payroll, and employment law. The architecture's job is only to honour whatever periods they set; the classification table — what is retained, why, and for how long — is the artifact that makes both the erasure and the retention defensible at an audit. (This maps the technical shape, not legal advice.)
