TL;DR
- PII is personally identifiable information: any data that can identify a specific person, on its own or in combination with other data.
- It splits into direct identifiers (name, SSN, email, phone, passport number) and quasi-identifiers (date of birth, ZIP code, gender) that single out a person when combined.
- Yes, an email address is PII. So are IP addresses, device IDs, and biometrics in most modern frameworks.
- PII vs PHI: PHI is health PII held by HIPAA covered entities. HIPAA names 18 identifiers. All PHI is PII; not all PII is PHI.
- PII vs personal data: personal data is the broader GDPR term. It covers anything relating to an identifiable person, including online identifiers.
- For LLMs, a prompt that contains PII is a disclosure. The fix is to keep PII out of the prompt: redact it before the model and reveal real values only inside your trust boundary.
01 What is PII? The definition
PII is personally identifiable information: any data that can be used to identify a specific person. The most cited definition comes from NIST SP 800-122, which describes PII as any information about an individual that can be used to distinguish or trace that person's identity, either alone or when combined with other data that is linked or linkable to them.
Two phrases in that definition do the real work. "Distinguish or trace an identity" means single out one person from a group. "Alone or combined with other data" means you do not need a single magic field. A pile of harmless-looking attributes can be PII together even when none of them is PII alone. That second clause is why PII is wider than most people assume.
02 Direct vs indirect identifiers
PII comes in two shapes. Direct identifiers point at one person by themselves. A Social Security number, a passport number, or a personal email address names a single human with no extra context. Indirect identifiers, also called quasi-identifiers or linkable data, do not single anyone out on their own but do so in combination.
The classic example: date of birth, ZIP code, and gender together. None of the three identifies you alone. Millions of people share your birthday, and thousands share your ZIP code. But the intersection of all three points to a very small number of people, often exactly one. Latanya Sweeney's analysis of 1990 US census data estimated that about 87% of the US population is uniquely identified by 5-digit ZIP code, gender, and full date of birth. That is why ZIP plus date of birth plus gender is treated as PII even though each field looks innocent. The lesson for any system handling data: you cannot decide a field is safe just by looking at it in isolation.
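To make the combination effect concrete, here is a minimal sketch that counts how many rows in a dataset are unique on a given set of fields. The records and field names are invented for illustration; no single field isolates anyone, but the three together isolate most rows.

```python
from collections import Counter

# Hypothetical toy records; field names and values are assumptions for illustration.
records = [
    {"dob": "1990-04-12", "zip": "02139", "gender": "F"},
    {"dob": "1990-04-12", "zip": "94103", "gender": "M"},
    {"dob": "1985-11-03", "zip": "02139", "gender": "M"},
    {"dob": "1985-11-03", "zip": "94103", "gender": "F"},
    {"dob": "1990-04-12", "zip": "02139", "gender": "M"},
    {"dob": "1990-04-12", "zip": "02139", "gender": "M"},
]

def unique_rows(rows, fields):
    """Count rows whose combination of `fields` appears exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1)

for fields in (["dob"], ["zip"], ["gender"], ["dob", "zip", "gender"]):
    print(fields, unique_rows(records, fields))
```

In this toy data each field on its own leaves zero rows unique, while the three-field combination leaves four of the six rows unique. Running the same count over a real table is a quick way to see which column combinations behave like identifiers.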
03 Examples of PII
Here are concrete types of PII, split into direct identifiers and quasi-identifiers. This is not exhaustive, but it covers what shows up in real prompts, logs, and tickets.
| Direct identifiers | Quasi / linkable identifiers |
|---|---|
| Full name | Date of birth |
| Email address | ZIP or postal code |
| Phone number | Gender |
| Social Security number | Job title plus employer |
| Passport or driver's license number | City and approximate age |
| Home or mailing address | Race or ethnicity |
| Bank account or credit card number | Education history |
| IP address | Purchase history |
| Device identifiers (MAC, IMEI, advertising ID) | Browser or device fingerprint |
| Biometrics (fingerprint, face, voiceprint) | Geolocation traces |
Whether a quasi-identifier counts as PII depends on context and on what else it can be joined with. Treat the right column as PII when it can be linked back to one person.
04 Is an email address PII?
Yes. An email address is PII. A personal address like jane.doe@gmail.com is a direct identifier: it points at one person. Even a less obvious address becomes PII the moment it can be linked to an individual, which is almost always. Under GDPR an email address is personal data without much argument. Under US frameworks like NIST SP 800-122 it is a textbook example of PII.
The same logic answers the other "is X PII?" questions. Is an IP address PII? Yes, in most modern interpretations, because it can be linked to a household or a person. Are device IDs PII? Yes, advertising IDs and hardware identifiers are treated as PII because they track an individual across sessions. When in doubt, ask whether the value can be tied back to a person. If it can, treat it as PII.
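As a rough illustration of treating these values as PII in practice, the sketch below flags email addresses and IPv4 addresses in free text. The regular expressions are deliberately simple assumptions, not a complete detector: they will miss unusual formats, and real detection usually combines patterns with other signals. The function name is hypothetical.

```python
import ipaddress
import re

# Simplified patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_CANDIDATE_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def find_pii_candidates(text):
    """Return (kind, value) pairs for emails and valid IPv4 addresses in text."""
    found = [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]
    for m in IPV4_CANDIDATE_RE.finditer(text):
        try:
            # Reject dotted numbers that are not real addresses, e.g. 999.1.1.1.
            ipaddress.IPv4Address(m.group())
        except ValueError:
            continue
        found.append(("IP_ADDRESS", m.group()))
    return found

sample = "Contact jane.doe@gmail.com from 203.0.113.7, build 999.1.1.1"
print(find_pii_candidates(sample))
```

The validation step matters: a version string can look like an IP address, so the sketch checks each candidate with the standard library's `ipaddress` module before flagging it.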
05 Sensitive PII
Not all PII carries the same risk. Sensitive PII is the subset whose exposure can cause real harm: identity theft, fraud, discrimination, or physical danger. Think Social Security numbers, financial account numbers, passport numbers, biometrics, health data, and login credentials. Many frameworks also treat data revealing race, religion, political views, sexual orientation, or precise geolocation as sensitive.
The distinction matters operationally. A leaked first name is awkward. A leaked SSN with a date of birth is a fraud kit. Sensitive PII deserves stricter handling: tighter access, encryption at rest, and the strongest case for keeping it out of any system you do not fully control, including a third-party model API.
06 PII vs PHI
PHI, protected health information, is a subset of PII. It is PII that relates to a person's health, care, or payment for care, and that is created or held by a HIPAA covered entity (providers, health plans, clearinghouses) or a business associate. The slogan: all PHI is PII, but not all PII is PHI. Your email address in a marketing list is PII, not PHI. The same email in your hospital's patient record, attached to a diagnosis, is PHI.
HIPAA's Safe Harbor method names 18 identifiers that must be removed to de-identify health data. They include names, all geographic subdivisions smaller than a state, all dates more specific than a year (birth, admission, discharge), phone and fax numbers, email addresses, Social Security numbers, medical record numbers, health plan numbers, account numbers, certificate and license numbers, vehicle and device identifiers, URLs, IP addresses, biometric identifiers, full-face photos, and any other unique identifying code. Once all 18 are removed, and the covered entity has no actual knowledge that the remaining data could identify someone, the data is no longer PHI under Safe Harbor.
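A partial sketch of what that removal looks like on a single record follows. It handles only a handful of the 18 identifier types, the field names are hypothetical, and it drops ZIP codes outright as a conservative simplification. Treat it as an illustration of the idea, not a compliance tool.

```python
from datetime import date

# Hypothetical field names for illustration; a real schema will differ.
DROP_FIELDS = {"name", "email", "phone", "ssn", "medical_record_number", "ip_address", "zip"}
DATE_FIELDS = {"birth_date", "admission_date", "discharge_date"}

def reduce_record(record):
    """Drop listed identifier fields and coarsen ISO dates to a year."""
    reduced = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # Identifier fields are removed entirely.
        if field in DATE_FIELDS:
            # Keep only the year, since more specific dates count as identifiers.
            reduced[field] = date.fromisoformat(value).year
        else:
            reduced[field] = value
    return reduced

patient = {
    "name": "Jane Doe",
    "email": "jane.doe@gmail.com",
    "zip": "02139",
    "state": "MA",
    "birth_date": "1984-06-30",
    "admission_date": "2023-02-14",
    "diagnosis_code": "E11.9",
}
reduced = reduce_record(patient)
print(reduced)
```

An allow-list of fields known to be safe is usually a sturdier design than a drop-list like this one, because a new column added upstream is then excluded by default.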
| | PII | PHI |
|---|---|---|
| Scope | Any data identifying a person | Health-related PII, a subset |
| Governing rule (US) | Sector and state laws; NIST guidance | HIPAA |
| Who holds it | Anyone | Covered entities and business associates |
| Identifier list | Open-ended (direct + quasi) | 18 named identifiers (Safe Harbor) |
If you build with health data and LLMs, the HIPAA angle deserves its own read. See our note on HIPAA and LLMs.
07 PII vs personal data (GDPR)
"PII" is mostly a US term. The EU's GDPR uses the broader phrase personal data: any information relating to an identified or identifiable natural person. That definition is wider than typical US PII because it explicitly includes online identifiers like cookies, IP addresses, and device IDs, and because "relating to" is read generously.
GDPR also has its own bucket for the sensitive kind. It calls health data, biometrics, genetic data, and information on race, religion, politics, and sexual orientation special category data, with stricter rules. The practical takeaway: if your data touches EU residents, design for the wider GDPR definition, not the narrower US one. For the LLM-specific obligations, see GDPR and LLMs.
08 Why PII matters for LLMs
Here is the part that catches teams off guard. A prompt that contains PII is a disclosure. When you paste a customer's name, email, and account history into a prompt, that PII leaves your trust boundary and lands in a third party's systems. It may be logged, retained, reviewed by humans, or used to train future models, depending on the provider and the plan. The legal status of that data does not change just because it is inside a prompt. It is still PII, and sending it is still a transfer.
The same risk shows up on the way back. Models can echo PII from context into their responses, into your logs, and into downstream tools. Agents make it worse: they chain calls, pass context between steps, and write to places you did not plan for. Every hop is another chance for PII to leak.
The fix is not "never use LLMs with real data." The fix is to keep PII out of the prompt in the first place. Replace each personal value with a stable placeholder token before the text reaches the model, send only tokens, and map them back to real values only inside your own infrastructure. The model does useful work on <PERSON_1> and <EMAIL_2> without ever seeing Jane Doe or her address.
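The placeholder round trip can be sketched in a few lines. This is a toy illustration under stated assumptions, not Anonde's API: `TokenVault` and its methods are hypothetical names, the vault lives in memory, emails are found with a simplified regex, and names come from a caller-supplied list because real name detection needs far more than string matching.

```python
import re

# Simplified patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TOKEN_RE = re.compile(r"<[A-Z]+_\d+>")

class TokenVault:
    """Hypothetical in-memory vault mapping placeholder tokens to real values."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def token_for(self, kind, value):
        # Reuse an existing token so the same value always gets the same placeholder.
        if value not in self._token_by_value:
            token = f"<{kind}_{len(self._token_by_value) + 1}>"
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def redact(self, text, known_names=()):
        # Replace caller-supplied names first, then any email addresses.
        for name in known_names:
            text = text.replace(name, self.token_for("PERSON", name))
        return EMAIL_RE.sub(lambda m: self.token_for("EMAIL", m.group()), text)

    def reveal(self, text):
        # Swap known tokens back to real values; unknown tokens are left as-is.
        return TOKEN_RE.sub(lambda m: self._value_by_token.get(m.group(), m.group()), text)

vault = TokenVault()
prompt = "Jane Doe (jane.doe@gmail.com) asked for a refund. Email Jane Doe a reply."
redacted = vault.redact(prompt, known_names=["Jane Doe"])
print(redacted)  # <PERSON_1> (<EMAIL_2>) asked for a refund. Email <PERSON_1> a reply.

# Stand-in for a model response that only ever saw placeholders.
model_reply = "Reply to <PERSON_1> at <EMAIL_2>."
print(vault.reveal(model_reply))  # Reply to Jane Doe at jane.doe@gmail.com.
```

Only `redacted` would cross the trust boundary. The mapping stays in the vault, so `reveal` must run on your side, after the response comes back.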
That is exactly what Anonde does. It is an open-source, self-hosted PII boundary for LLMs and agents: tokenize personal data and secrets in your network, call the model with placeholders, then reveal the originals inside your trust boundary using a vault that maps each token to its value. See the deeper how-to in redacting PII before the LLM, or watch it run in the live demo.
09 FAQ
What is PII?
PII is personally identifiable information: any data that can identify a specific person, alone or combined with other data. It is commonly split into direct identifiers like name, SSN, email, and phone, and quasi-identifiers like date of birth, ZIP code, and gender that single out a person in combination.
Is an email address PII?
Yes. An email address identifies or links to a specific person, so it is a direct identifier. Under GDPR it is personal data, and under US frameworks like NIST SP 800-122 it is a standard example of PII.
What is the difference between PII and PHI?
PHI is a subset of PII: health-related PII held by a HIPAA covered entity or business associate. HIPAA's Safe Harbor method names 18 identifiers that must be removed to de-identify health data. All PHI is PII, but not all PII is PHI.
What are examples of PII?
Direct: full name, SSN, passport number, email, phone, home address, IP address, device IDs, account numbers, and biometrics. Quasi-identifiers: date of birth, ZIP code, and gender, which together can single out one person.
Keep PII out of your prompts
See how it works, try the live demo, or read the quickstart to run Anonde on your own infrastructure.