HIPAA and LLMs: handling PHI safely

The moment a prompt carries a patient name, a record number, or a date of birth tied to a condition, you are moving protected health information. Send it to a third-party model and that is a disclosure. Here is what HIPAA actually requires, and how to keep PHI out of the model without breaking the workflow.

TL;DR

  • A prompt with a patient identifier tied to health data is PHI. Sending it to a hosted model is a disclosure under the HIPAA Privacy Rule.
  • If a model provider handles PHI for you, they are a business associate: you need a business associate agreement (BAA) before any PHI flows.
  • HIPAA gives two ways to de-identify: Safe Harbor (strip 18 identifiers) and Expert Determination (a qualified expert's documented finding that re-identification risk is very small). De-identified data is no longer PHI.
  • Replacing identifiers with tokens before the model, then revealing real values only inside your boundary, cuts what the provider ever sees while keeping the workflow usable.
  • The hard part is free-text notes that hide identifiers, and re-identification from quasi-identifiers. This is a technical control, not legal cover, and not legal advice.

01 What counts as PHI?

Protected health information is individually identifiable health information held or transmitted by a covered entity or its business associate. Plainly: health data plus something that ties it to a specific person. The diagnosis alone is health data. The diagnosis next to a name, a medical record number, or an address is PHI.

Covered entities are health plans, healthcare clearinghouses, and most healthcare providers. A business associate is anyone who creates, receives, maintains, or transmits PHI on a covered entity's behalf. If you build software that processes PHI for a hospital or insurer, you are almost certainly a business associate, and so is any vendor you pass that data to.

02 The 18 HIPAA identifiers

The Privacy Rule's Safe Harbor method lists 18 categories of identifiers that must be removed, whether they belong to the individual or to the individual's relatives, household members, or employers (45 CFR 164.514(b)(2)):

  1. Names
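  2. Geographic subdivisions smaller than a state (street address, city, county, precinct, ZIP code and equivalent geocodes); the first three digits of a ZIP code may stay if that three-digit area holds more than 20,000 people
  3. All date elements except year that relate to an individual, plus all ages over 89 and any date element that reveals such an age (these may be grouped as 90 or older)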
  4. Telephone numbers
  5. Fax numbers
  6. Email addresses
  7. Social Security numbers
  8. Medical record numbers
  9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate and license numbers
  12. Vehicle identifiers and serial numbers, including license plates
  13. Device identifiers and serial numbers
  14. Web URLs
  15. IP addresses
  16. Biometric identifiers, including finger and voice prints
  17. Full-face photographs and comparable images
  18. Any other unique identifying number, characteristic, or code

Note the catch-all at number 18. The list is necessary but not sufficient: removing the named fields does not help if a free-text note still spells out who the patient is.
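A short sketch shows why pattern matching alone falls short. It assumes a handful of illustrative regular expressions (the PATTERNS table and find_identifiers are hypothetical names, not part of any product) and covers only four of the 18 categories.

```python
import re

# Illustrative patterns only. Names, addresses, and contextual
# descriptions cannot be caught this way.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_identifiers(text):
    """Return (label, matched_text, offset) tuples in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


note = "Call pt at 555-867-5309 or jane.doe@example.org. Patient is the mayor's wife."
for label, value, offset in find_identifiers(note):
    print(label, value, offset)
```

The phone number and email are flagged. "The mayor's wife" is not, even though it identifies the patient, which is the gap the catch-all category is there to cover.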

03 Why a prompt to a third-party model is a disclosure

When your application sends PHI to a hosted model, that PHI leaves your systems and enters someone else's. Under HIPAA that is a disclosure, and the model provider is acting as your business associate. Two consequences follow.

You need a BAA first. A covered entity or business associate must have a written contract meeting the requirements at 45 CFR 164.504(e) before disclosing PHI to a downstream party. The BAA binds the provider to permitted uses, safeguards, and breach notification. No BAA, no PHI. Not every model API offers one, and a consumer endpoint almost never does.

Provider logging is a real exposure. Prompts and completions can land in request logs, abuse-monitoring pipelines, or a fine-tuning corpus depending on the terms. Every one of those is PHI sitting somewhere you do not control. A BAA constrains what the provider may do with it, but it does not make the data disappear.
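"No BAA, no PHI" can be enforced in code as well as in policy. The sketch below is one possible outbound guard, under loud assumptions: the Endpoint record and its has_baa flag are hypothetical and would be maintained by whoever tracks your contracts, and the SUSPECT pattern is deliberately narrow (SSN-like and MRN-like strings only), so a clean pass does not mean the prompt is free of PHI.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    has_baa: bool  # hypothetical flag: is a signed BAA on file for this endpoint?


# Narrow illustrative check: SSN-like strings and "MRN 12345"-style references.
SUSPECT = re.compile(r"\b\d{3}-\d{2}-\d{4}\b|\bMRN[:\s]*\d+\b", re.IGNORECASE)


class PHIBlocked(Exception):
    """Raised when a suspected identifier is headed to an endpoint without a BAA."""


def check_outbound(prompt: str, endpoint: Endpoint) -> str:
    """Return the prompt unchanged, or raise if it should not be sent."""
    if not endpoint.has_baa and SUSPECT.search(prompt):
        raise PHIBlocked(f"suspected identifier in prompt to {endpoint.name}")
    return prompt
```

A guard like this fails closed for endpoints without a BAA. It does nothing about provider-side logging where a BAA exists, which is why reducing what goes into the prompt still matters.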

04 The two HIPAA de-identification methods

HIPAA gives you a clean exit: de-identified data is no longer PHI, and the Privacy Rule stops restricting it. There are two recognized methods, both at 45 CFR 164.514(b).

Safe Harbor (164.514(b)(2)). Remove all 18 identifiers above, and have no actual knowledge that the remaining information could identify the person alone or in combination with other data. It is rules-based and auditable, which makes it easy to document, as long as you apply it consistently across every field, note, image, and piece of embedded metadata.
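Safe Harbor is mechanical enough to sketch. The rule permits three generalizations in place of outright removal: keep only the year of a date, fold ages over 89 into a single 90-or-older bucket, and keep the first three digits of a ZIP code unless that prefix covers 20,000 or fewer people, in which case it becomes 000. The function names and the LOW_POPULATION_ZIP3 set below are hypothetical, the set's contents are placeholders that must be replaced with values from current Census data, and the sketch leaves out birth years that would themselves reveal an age over 89.

```python
from datetime import date

# Placeholder prefixes for illustration. Build the real set from
# current Census data before relying on it.
LOW_POPULATION_ZIP3 = {"036", "059", "102"}


def generalize_date(d: date) -> str:
    """Keep the year only; month and day are dropped."""
    return str(d.year)


def generalize_age(age: int) -> str:
    """Ages over 89 collapse into one bucket."""
    return "90+" if age > 89 else str(age)


def generalize_zip(zip_code: str) -> str:
    """Keep the 3-digit prefix unless it maps to a low-population area."""
    prefix = zip_code[:3]
    return "000" if prefix in LOW_POPULATION_ZIP3 else prefix


print(generalize_date(date(1984, 3, 17)))
print(generalize_age(93))
print(generalize_zip("94110"))
```

Rules like these only cover structured fields. The same values buried in a free-text note still need detection before they can be generalized.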

Expert Determination (164.514(b)(1)). A person with appropriate statistical and scientific knowledge determines, and documents, that the risk of re-identification is very small. This path keeps more analytic value (for example, full dates or finer geography) at the cost of a documented expert assessment. It is the route when Safe Harbor would strip data you genuinely need.

05 Tokenize before the model, reveal on return

This is Anonde's view, not legal advice: the cleanest way to shrink HIPAA exposure with an LLM is to keep identifiable data out of the prompt in the first place. The pattern is de-identify going out, reveal coming back.

Replace each identifier with a stable placeholder token before the text leaves your infrastructure. The model reasons over the tokens. When the response returns, you swap the real values back inside your own trust boundary. A clinical message becomes:

Patient [PATIENT_001], MRN [MRN_7F2A], DOB [DATE_004],
seen at [CLINIC_002] for follow-up on [CONDITION_001].

The provider never sees the name, the record number, or the date. The placeholders are stable, so the model can still reason about "the same patient" across a conversation, and the answer comes back coherent once you restore the values. The mapping from token to real value is the re-identification key. It stays with you, never in the prompt.
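The round trip can be sketched in a few lines. This is a minimal illustration of the pattern, not Anonde's implementation: the Tokenizer class, its method names, and the numbered token format are all assumptions, and the identifier spans are passed in by hand where a real pipeline would run a detector.

```python
import re


class Tokenizer:
    """Holds the token-to-value mapping in memory, inside the caller's boundary."""

    def __init__(self):
        self._token_by_value = {}  # (kind, value) -> token
        self._value_by_token = {}  # token -> value

    def token_for(self, kind, value):
        """Return the same token every time the same value is seen."""
        key = (kind, value)
        if key not in self._token_by_value:
            n = sum(1 for k in self._token_by_value if k[0] == kind) + 1
            token = f"[{kind}_{n:03d}]"
            self._token_by_value[key] = token
            self._value_by_token[token] = value
        return self._token_by_value[key]

    def tokenize(self, text, spans):
        """spans is a list of (kind, value) pairs found by some detector."""
        # Longest values first, so a short value cannot split a longer one.
        for kind, value in sorted(spans, key=lambda s: -len(s[1])):
            text = text.replace(value, self.token_for(kind, value))
        return text

    def reveal(self, text):
        """Swap known tokens back to real values; unknown tokens are left alone."""
        return re.sub(
            r"\[[A-Z]+_\d+\]",
            lambda m: self._value_by_token.get(m.group(), m.group()),
            text,
        )


tok = Tokenizer()
message = "Patient Jane Roe, MRN 448-1932, seen for follow-up. Jane Roe reports improvement."
safe = tok.tokenize(message, [("PATIENT", "Jane Roe"), ("MRN", "448-1932")])
print(safe)  # this is all the model would receive
print(tok.reveal("[PATIENT_001] should return in two weeks."))
```

Both mentions of the patient map to the same token, which is what lets the model keep track of who is who. The two dictionaries are the re-identification key, so in practice they would live in a store protected like any other PHI.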

Anonde is an open-source, self-hosted boundary that does exactly this for text, JSON, PDFs, and logs. As an engineering measure it lines up with what HIPAA's de-identification standard asks for: identifiers removed before disclosure, at a single point in the pipeline you can audit. It does not make you "HIPAA compliant," it is not a certification, and it does not replace a BAA or qualified counsel. It reduces the PHI that ever reaches someone else's model. The same minimization logic shows up under GDPR and the EU AI Act.

06 Pitfalls: free text and re-identification

  • Free-text notes hide identifiers. Structured fields are easy to strip. A nurse's note that reads "patient is the mayor's wife" or drops a phone number mid-sentence is where redaction fails. You need detection over the prose, not just the columns.
  • Quasi-identifiers re-identify. Each value can pass Safe Harbor on its own, yet a rare diagnosis plus a three-digit ZIP prefix plus an admission year can single one person out. This is what the "actual knowledge" condition in Safe Harbor guards against, and the risk Expert Determination is meant to measure.
  • The key is PHI. The token-to-value mapping can restore real identities, so the original record and that mapping remain PHI. Keep them inside your boundary with the same controls as any PHI store.
  • Consistency across formats. An identifier scrubbed from a field can survive in an attached PDF, an image, or log metadata. De-identification has to cover every format the model might see.
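The quasi-identifier pitfall can be screened for, roughly, by counting how many records share each combination of values. The sketch below is a simple k-anonymity-style check, not an Expert Determination: small_groups, the field names, and the threshold k are all assumptions, and HIPAA does not prescribe a value for k.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


records = [
    {"zip3": "941", "dx": "asthma", "admit_year": "2023"},
    {"zip3": "941", "dx": "asthma", "admit_year": "2023"},
    {"zip3": "941", "dx": "asthma", "admit_year": "2023"},
    {"zip3": "059", "dx": "Erdheim-Chester disease", "admit_year": "2023"},
]

for combo, n in small_groups(records, ["zip3", "dx", "admit_year"], k=3).items():
    print(combo, n)
```

The rare-diagnosis record stands alone in its group, so it would need further generalization or suppression before leaving the boundary. A check like this only sees the dataset in hand and says nothing about what an outsider could link it to.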

07 FAQ

Is sending PHI to an LLM a HIPAA concern?

Yes. A prompt with a patient identifier tied to health data is PHI, and sending it to a third-party model is a disclosure. If the provider handles PHI for you they are a business associate, so you need a BAA before any PHI flows.

What are the two HIPAA de-identification methods?

Safe Harbor (164.514(b)(2)) removes 18 identifiers and requires no actual knowledge of residual re-identification risk. Expert Determination (164.514(b)(1)) has a qualified expert document that the risk is very small. De-identified data is no longer PHI.

Does de-identifying before the prompt remove the need for a BAA?

If the data the model receives is genuinely de-identified, that disclosure is not regulated by the Privacy Rule. The catch is completeness: free-text notes often hide identifiers. Many teams keep a BAA as a safety net rather than betting on perfect redaction.

What is the re-identification risk with tokenized PHI?

The token-to-value mapping is the re-identification key and must stay inside your boundary. The model sees only tokens. Residual risk comes from identifiers the redactor missed and from quasi-identifiers that single someone out in combination.

Sources

This article is general information, not legal advice. Anonde is a technical control that supports de-identification; it is not a HIPAA certification. Confirm your obligations with qualified counsel.