AI and data privacy: using LLMs on personal data

Q: Is it safe to put personal data into AI tools?

It depends on the tier and the data. Consumer chat tiers may use prompts to improve models unless you opt out, and your text is retained for a period. Enterprise and API tiers usually do not train on your inputs by default and offer shorter retention. The safe pattern, regardless of tier, is to keep the personal data out of the prompt: anonymize it to placeholder tokens before the call, then reveal the real values only inside your own trust boundary.

TL;DR

Is it safe to put personal data into AI? It depends on the tier and the data. Treat consumer chat and enterprise/API tiers as different risk profiles.
Data you send a hosted model can be retained, logged, seen by sub-processors, and on some tiers used to improve models. Policies differ by vendor and plan, and they change.
Consumer tiers often train on prompts unless you opt out. Enterprise and API tiers generally do not train by default and keep shorter retention. Always check the vendor's current docs.
The reliable fix is not a checkbox. It is to keep PII out of the prompt: anonymize to tokens like [PERSON_1] before the call, then reveal real values only inside your own trust boundary.
Anonde is an open-source, self-hosted boundary that does exactly this for any model.

01 The real question people are asking

"Is it safe to put personal data into AI?" is the question behind most AI and privacy worry. People paste customer emails into ChatGPT to draft a reply. Engineers feed logs with real names into a coding assistant. Support teams summarize tickets full of addresses and account numbers.

The instinct is right. Personal data sent to a model leaves your control the moment it crosses the network. AI privacy is just data privacy applied to that moment: what the model sees, what gets stored, and who else touches it. So the useful answer is not yes or no. It is "here is what happens to the data, and here is how to make the answer not matter."

02 What actually happens to data you send a model

When you send text to a hosted LLM, several things can happen to it. The exact mix depends on the vendor and your plan. In general terms:

Training. On some tiers the provider may use your inputs and outputs to improve models. On others it does not by default. This is the most discussed part of ChatGPT data privacy.
Retention. Your text is stored for some window, often to run the service, detect abuse, and meet legal needs. Windows range from days to longer, and differ by tier.
Logging. Requests and responses can be logged for abuse monitoring and debugging, sometimes with human review of flagged content.
Sub-processors. The vendor may use cloud infrastructure and other providers that also process the data under contract.

None of this is inherently sinister. It is how most SaaS works. But it means your personal data exists in places you do not directly control, governed by policies you did not write. That is the core of LLM data privacy.

03 Consumer vs enterprise and API tiers

The single most important distinction is the tier. They are not the same product from a privacy standpoint.

Consumer chat tiers are tuned for individuals. On consumer ChatGPT, for example, content can be used to improve models unless you turn that off in the data controls, per OpenAI's own settings and docs. Free and personal tiers from other vendors often work the same way. The defaults lean toward product improvement.

Enterprise and API tiers are tuned for organizations. OpenAI states that it does not train on data submitted through the API or its business and enterprise products by default. Anthropic similarly describes commercial terms where inputs and outputs are not used to train its models by default. These tiers usually add shorter retention, data processing agreements, and admin controls.

Two cautions. First, "by default" and "unless you turn that off" are doing real work in those sentences: a default is only a starting state, and either you or the vendor can change it. Second, policies change. Read the vendor's current data usage page before you rely on it, and link the exact policy in your own records.

04 The practical risks

Strip away the marketing and three concrete risks remain when personal data sits in a prompt.

Leakage. Data in a prompt can surface in logs, in a misconfigured integration, in a future training set, or in another user's output if a model memorized it. The blast radius is whatever you sent.
Compliance. Sending personal data to a third party can be a cross-border transfer and a new processing purpose. Under regimes like GDPR that needs a lawful basis, a data processing agreement, and records of the processing. For the detail, see our note on GDPR and LLMs. This is not legal advice; talk to your own counsel.
Retention. Even on a no-training tier, your text may live in the provider's systems for a window. If that data should not exist outside your walls, retention alone is a problem.

All three risks share one root cause: the real personal data was in the prompt. Remove that and the risks shrink to placeholders.

05 The engineering answer: keep PII out of the prompt

Here is the move. Do not ask whether the model is trustworthy. Make the question irrelevant by never sending the model anything sensitive in the first place.

Put a privacy boundary in front of the model. Detect personal data and secrets in the prompt, replace each value with a stable placeholder token, send only the tokens to the model, then map the tokens in the response back to the real values inside your own infrastructure. The model reasons over structure, not identities.

Concretely, before and after:

Before: Email Jane Doe at jane@acme.com about invoice 4471 for her account in Berlin.
After (sent to the model): Email [PERSON_1] at [EMAIL_1] about invoice [INVOICE_1] for her account in [LOCATION_1].

The model drafts a reply using [PERSON_1] and [EMAIL_1]. On the way back, inside your trust boundary, those tokens become Jane Doe and her real email again. Training, retention, logs, and sub-processors all act on placeholders. There is no real personal data on the other side to leak. For a deeper walkthrough, see how to redact PII before an LLM.
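The tokenize step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Anonde's API: `tokenize`, `PATTERNS`, and the two regexes are hypothetical names chosen for the example, and detection here covers only emails and invoice numbers. Names and locations such as Jane Doe or Berlin generally need a named-entity detector rather than a regex, so they are left out.

```python
import re

# Illustrative patterns only (an assumption for this sketch).
# A real detector covers many more entity types, including names.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "INVOICE": re.compile(r"(?<=invoice )\d+"),
}


def tokenize(text: str, vault: dict[str, str]) -> str:
    """Replace detected values with stable placeholders.

    The vault maps token -> real value and must stay inside
    your own trust boundary. The same value always gets the
    same token, so the model can still follow references.
    """
    by_value = {value: token for token, value in vault.items()}

    def placeholder(label: str, value: str) -> str:
        if value not in by_value:
            # Number tokens per label: [EMAIL_1], [EMAIL_2], ...
            n = sum(1 for t in vault if t.startswith(f"[{label}_")) + 1
            token = f"[{label}_{n}]"
            vault[token] = value
            by_value[value] = token
        return by_value[value]

    for label, pattern in PATTERNS.items():
        text = pattern.sub(
            lambda m, label=label: placeholder(label, m.group(0)), text
        )
    return text


vault: dict[str, str] = {}
safe = tokenize(
    "Email jane@acme.com about invoice 4471. Cc jane@acme.com.", vault
)
print(safe)
# Email [EMAIL_1] about invoice [INVOICE_1]. Cc [EMAIL_1].
```

The repeated address maps to the same [EMAIL_1] token, which is what "stable" means here: the model can tell that both mentions refer to one mailbox without ever seeing it.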

06 How Anonde does it

Anonde is an open-source, self-hosted PII boundary for LLMs and agents. It is written in Go and licensed Apache 2.0. It runs as a library or a small container image inside your own network, so the mapping between tokens and real values never leaves your control.

The flow is three steps:

Tokenize before the model. Personal data and secrets in text, JSON, PDFs, and logs become stable tokens like [PERSON_1] or [EMAIL_1].
Call the model with tokens. The provider, on any tier, only ever sees placeholders. This works the same for Claude, ChatGPT, Cursor, or a local model.
Reveal on return. Anonde maps the tokens in the response back to the originals using a vault that lives inside your trust boundary.
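The call and reveal steps above might look like the following Python sketch. It does not use Anonde's actual interface: `reveal`, `call_with_boundary`, and `fake_model` are hypothetical names, the vault is a plain dict built by hand, and the model is a stub standing in for any hosted or local LLM.

```python
import re
from typing import Callable

# Matches placeholders such as [PERSON_1] or [EMAIL_2].
TOKEN = re.compile(r"\[[A-Z][A-Z_]*_\d+\]")


def reveal(text: str, vault: dict[str, str]) -> str:
    """Swap known placeholders back to real values.

    Tokens the vault does not know (for example one the model
    made up) are left untouched rather than guessed at.
    """
    return TOKEN.sub(lambda m: vault.get(m.group(0), m.group(0)), text)


def call_with_boundary(
    prompt: str, vault: dict[str, str], model: Callable[[str], str]
) -> str:
    """Refuse prompts that still contain raw vault values,
    call the model with tokens only, then reveal on return."""
    if any(value in prompt for value in vault.values()):
        raise ValueError("prompt still contains raw values; tokenize first")
    return reveal(model(prompt), vault)


def fake_model(prompt: str) -> str:
    # Stand-in for a real provider call; it only ever sees placeholders.
    return "Hi [PERSON_1], I sent the invoice to [EMAIL_1]. Regards, [PERSON_2]"


vault = {"[PERSON_1]": "Jane Doe", "[EMAIL_1]": "jane@acme.com"}
reply = call_with_boundary(
    "Draft a note to [PERSON_1] at [EMAIL_1].", vault, fake_model
)
print(reply)
# Hi Jane Doe, I sent the invoice to jane@acme.com. Regards, [PERSON_2]
```

Leaving the unknown [PERSON_2] token in place is a deliberate choice in this sketch: a visible placeholder in the output is easier to catch than a silently wrong substitution.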

Because the real values stay home, the consumer-vs-enterprise tier question and the training question both lose their teeth. You can try the round trip in the live demo, or run it yourself from the quickstart.

07 FAQ

Is it safe to put personal data into AI tools?

It depends on the tier and the data. Consumer tiers may use prompts to improve models unless you opt out, and they retain your text for a period. Enterprise and API tiers usually do not train by default and keep shorter retention. The safe pattern on any tier is to keep the personal data out of the prompt: anonymize to tokens before the call, reveal real values inside your own trust boundary.

Does ChatGPT train on my data?

It varies by tier and your settings. On consumer ChatGPT, content can be used to improve models unless you turn that off in data controls. On the OpenAI API and on business or enterprise plans, OpenAI states it does not train on your data by default. Check the current docs, since policies change. If you never send the real data, the training question stops being your main risk.

How do you use an LLM on private data safely?

Move the privacy boundary in front of the model. Detect personal data and secrets, replace each value with a stable token like [PERSON_1], send only the tokens, then map them back to real values inside your own infrastructure. The model reasons over structure, not identities.

What is the difference between AI privacy and data privacy?

Data privacy is the broad practice of controlling how personal data is collected, used, shared, and retained. AI privacy applies that to AI systems: what a model sees, whether inputs are logged or used for training, which sub-processors touch the data, and how outputs might expose someone. AI privacy is a subset of data privacy with its own failure modes.

Try Anonde

See how it works, try the live demo, or read the quickstart to run Anonde on your own infrastructure and keep PII out of every prompt.