Anonde · concepts & FAQ

01 What is Anonde?

Anonde is an open-source, self-hosted toolkit that anonymizes personal data and secrets in text, JSON, PDFs, and logs before they reach an LLM, then lets you reveal the real values inside your own trust boundary. You run it as a Go library or a Docker image on your own infrastructure. There is no SaaS in the loop.

It exists because LLMs leak. Anything you put in a prompt can land in a model provider's logs, a fine-tuning corpus, or another tenant's context window. You need a programmable boundary that sits between your app and the model, replaces sensitive spans with stable placeholders, and reverses them only on the way back to the user. Anonde is that boundary.

02 What is PII?

PII (personally identifiable information) is any data that can be tied back to a specific person, plus the secrets that protect their accounts. It splits into three rough buckets: people-data (names, emails, phone numbers, addresses, dates of birth), structured identifiers (IBAN, SSN, national ID, credit-card numbers, tax IDs), and secrets (API keys, JWTs, OAuth tokens, private keys).

The list is jurisdiction-dependent. A German IBAN, a US Social Security Number, and a UK National Insurance number are all PII, but they look nothing alike on the wire. That's why Anonde ships 52 region-aware recognizers across 12+ jurisdictions instead of one “id_number” regex.

03 What is a recognizer?

A recognizer is one of the small, deterministic building blocks that find structured PII. Most are a regex with a checksum or a context check stapled on: an IBAN recognizer matches the country-prefixed format and validates the mod-97 checksum, so it doesn't fire on a random number string that happens to look IBAN-shaped.
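To make the checksum step concrete, here is a minimal sketch of an IBAN check in Go. It is not Anonde's recognizer: the function name `validIBAN` and the loose shape regex are assumptions for illustration, and a production recognizer would also enforce per-country lengths. The mod-97 arithmetic itself follows the standard IBAN validation procedure.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ibanShape is a loose, illustrative format check: two letters, two check
// digits, then 11 to 30 alphanumerics. Per-country lengths are not checked.
var ibanShape = regexp.MustCompile(`^[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}$`)

// validIBAN (hypothetical name) reports whether s looks like an IBAN and
// passes the mod-97 check.
func validIBAN(s string) bool {
	s = strings.ToUpper(strings.ReplaceAll(s, " ", ""))
	if !ibanShape.MatchString(s) {
		return false
	}
	// Move the country code and check digits to the end.
	rearranged := s[4:] + s[:4]
	// Fold the long number modulo 97 one character at a time.
	// Letters expand to two digits: A=10, B=11, ..., Z=35.
	rem := 0
	for _, c := range rearranged {
		if c >= 'A' && c <= 'Z' {
			rem = (rem*100 + int(c-'A') + 10) % 97
		} else {
			rem = (rem*10 + int(c-'0')) % 97
		}
	}
	return rem == 1
}

func main() {
	fmt.Println(validIBAN("DE89 3704 0044 0532 0130 00")) // true
	fmt.Println(validIBAN("DE89 3704 0044 0532 0130 01")) // false: one digit changed
}
```

The second call has the right shape but fails the checksum, which is exactly the case a bare regex would get wrong.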

They are region-aware because the same concept looks different everywhere. A German phone number, a French SIRET, a US SSN, a UK NI number, and a Brazilian CPF all need their own recognizer. They run in parallel inside the analyzer and each emits typed spans with a confidence score.
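The parallel fan-out can be sketched as follows. The `Span` and `Recognizer` types, the `analyze` function, and the two sample regexes are hypothetical and simplified; Anonde's real interfaces and patterns may look different.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"sync"
)

// Span is a hypothetical finding: a typed byte range with a confidence score.
type Span struct {
	Type       string
	Start, End int
	Score      float64
}

// Recognizer is a hypothetical interface, not Anonde's actual API.
type Recognizer interface {
	Find(text string) []Span
}

// regexRecognizer emits one span per regex match with a fixed score.
type regexRecognizer struct {
	typ   string
	re    *regexp.Regexp
	score float64
}

func (r regexRecognizer) Find(text string) []Span {
	var out []Span
	for _, m := range r.re.FindAllStringIndex(text, -1) {
		out = append(out, Span{Type: r.typ, Start: m[0], End: m[1], Score: r.score})
	}
	return out
}

// defaultRecognizers returns two simplified sample recognizers.
func defaultRecognizers() []Recognizer {
	return []Recognizer{
		regexRecognizer{"EMAIL_ADDRESS", regexp.MustCompile(`[\w.+-]+@[\w-]+\.[\w.]+`), 0.9},
		regexRecognizer{"US_SSN", regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`), 0.8},
	}
}

// analyze runs every recognizer in its own goroutine and returns all spans
// sorted by start offset so the result is deterministic.
func analyze(text string, recs []Recognizer) []Span {
	results := make([][]Span, len(recs))
	var wg sync.WaitGroup
	for i, r := range recs {
		wg.Add(1)
		go func(i int, r Recognizer) {
			defer wg.Done()
			results[i] = r.Find(text) // each goroutine writes only its own slot
		}(i, r)
	}
	wg.Wait()
	var all []Span
	for _, spans := range results {
		all = append(all, spans...)
	}
	sort.Slice(all, func(a, b int) bool { return all[a].Start < all[b].Start })
	return all
}

func main() {
	text := "SSN 078-05-1120, mail john@example.com"
	for _, s := range analyze(text, defaultRecognizers()) {
		fmt.Printf("%s [%d,%d) score=%.1f\n", s.Type, s.Start, s.End, s.Score)
	}
}
```

Giving each goroutine its own result slot avoids a mutex around a shared slice, and the final sort hides goroutine scheduling order from the caller.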

04 What is NER (Named Entity Recognition)?

NER is an ML model that reads a sentence and labels spans with categories like PERSON, ORG, or LOCATION. Instead of asking “does this string match a pattern?”, it asks “in this context, is this a person's name?”. That matters because most real PII isn't structured. No regex will reliably pick the two names and the city out of “Dr. Schmidt told Maria to call the clinic in Köln”.

Anonde uses NER as the second leg of the analyzer. Patterns catch the structured things; NER catches the free-form things. The two streams are merged and de-conflicted before anything is anonymized.

05 What is GLiNER?

GLiNER is an open-set NER model: you hand it the labels you care about at request time (PERSON, ORG, MEDICAL_CONDITION, whatever) and it finds spans matching those labels. There is no fixed vocabulary baked into the weights, which is why one model generalizes across languages, domains, and entity types far better than a classic fixed-tag tagger like a fine-tuned XLM-R.

Anonde's NER variant ships with knowledgator/gliner-pii-base-v1.0 at threshold 0.40, running through ONNX Runtime. We picked it after benchmarking against fixed-vocab alternatives: it had materially higher recall on German clinical text and held up across the other languages we tested.

06 Patterns vs NER: when does each win?

Patterns win for structured types: IBAN, credit-card, email, phone, postal code, date, URL. They have checksums or rigid grammars, so a regex hit is almost always a true positive. NER on the same span is noisier and would lower precision for no gain.

NER wins for unstructured types: PERSON, ORG, LOCATION, AGE, PROFESSION, NRP (nationality / religion / political affiliation). No regex can enumerate every possible name or company. Anonde's conflict resolver encodes this rule directly: for the structured types, the pattern hit beats the NER hit regardless of model score; for the unstructured types, NER wins.
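The precedence rule can be sketched as a small function. The `Finding` type, the `resolve` function, and most of the type labels are hypothetical (only IBAN, EMAIL_ADDRESS, and PERSON appear as token names in this document). The score tie-break for cases the rule does not settle is an assumption, not documented Anonde behavior.

```go
package main

import "fmt"

// Finding is a hypothetical detection result for one span.
type Finding struct {
	Type   string
	Source string // "pattern" or "ner"
	Score  float64
}

// structured lists the types where a pattern hit outranks an NER hit.
// The label spellings are assumptions for illustration.
var structured = map[string]bool{
	"IBAN": true, "CREDIT_CARD": true, "EMAIL_ADDRESS": true,
	"PHONE_NUMBER": true, "POSTAL_CODE": true, "DATE": true, "URL": true,
}

// preferredSource returns which detector is trusted for a given type.
func preferredSource(typ string) string {
	if structured[typ] {
		return "pattern"
	}
	return "ner"
}

// resolve picks the winner between two overlapping findings.
func resolve(a, b Finding) Finding {
	aPref := a.Source == preferredSource(a.Type)
	bPref := b.Source == preferredSource(b.Type)
	switch {
	case aPref && !bPref:
		return a
	case bPref && !aPref:
		return b
	case a.Score >= b.Score: // assumed tie-break when the rule does not decide
		return a
	default:
		return b
	}
}

func main() {
	// Structured type: the pattern hit wins despite the lower score.
	w := resolve(
		Finding{"EMAIL_ADDRESS", "pattern", 0.60},
		Finding{"EMAIL_ADDRESS", "ner", 0.95},
	)
	fmt.Println(w.Type, w.Source) // EMAIL_ADDRESS pattern

	// Unstructured type: the NER hit wins.
	w = resolve(
		Finding{"PERSON", "pattern", 0.90},
		Finding{"PERSON", "ner", 0.50},
	)
	fmt.Println(w.Type, w.Source) // PERSON ner
}
```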

07 What is the vault?

The vault is the local store that holds the reversible mapping between tokens and the real values they replaced. It's the only thing in the system that can de-anonymize, so it stays on your infrastructure and never leaves your trust boundary.

Two backends ship in the box: an in-memory store (ephemeral, fine for stateless request/response flows) and a bbolt store (a single embedded file, persistent, no external dependency). Both speak the same interface, so you can start with in-memory and upgrade to bbolt without changing your code.
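The shared-interface idea can be illustrated with a minimal sketch. The `Vault` interface, its method names, and `MemoryVault` are hypothetical stand-ins rather than Anonde's actual types. A bbolt-backed implementation would satisfy the same interface and is omitted here to keep the example free of external dependencies.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNotFound is returned when a token has no stored value.
var ErrNotFound = errors.New("token not found")

// Vault is a hypothetical interface for the token-to-value store.
type Vault interface {
	Put(tenant, doc, token, value string) error
	Get(tenant, doc, token string) (string, error)
}

// key scopes every mapping to one (tenant, doc) pair.
type key struct{ tenant, doc, token string }

// MemoryVault is an ephemeral backend guarded by a read-write mutex.
type MemoryVault struct {
	mu sync.RWMutex
	m  map[key]string
}

func NewMemoryVault() *MemoryVault {
	return &MemoryVault{m: make(map[key]string)}
}

func (v *MemoryVault) Put(tenant, doc, token, value string) error {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.m[key{tenant, doc, token}] = value
	return nil
}

func (v *MemoryVault) Get(tenant, doc, token string) (string, error) {
	v.mu.RLock()
	defer v.mu.RUnlock()
	val, ok := v.m[key{tenant, doc, token}]
	if !ok {
		return "", ErrNotFound
	}
	return val, nil
}

func main() {
	// Callers depend only on the interface, so the backend can be swapped
	// for a persistent one without touching this code.
	var v Vault = NewMemoryVault()
	if err := v.Put("acme", "doc-1", "<EMAIL_ADDRESS_1>", "john@example.com"); err != nil {
		panic(err)
	}
	val, err := v.Get("acme", "doc-1", "<EMAIL_ADDRESS_1>")
	fmt.Println(val, err) // john@example.com <nil>
}
```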

08 What are tokens?

Tokens are the placeholders that replace real values in the anonymized output: <PERSON_1>, <EMAIL_ADDRESS_1>, <IBAN_2>, and so on. They are stable per (tenant, doc): every mention of “John” in the same document becomes the same <PERSON_1>, so the LLM can still reason about coreference without ever seeing the real name.
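A minimal sketch of stable token assignment within one (tenant, doc) scope. The `Tokenizer` type and its method are hypothetical; the sketch only shows how a repeated value can map back to the same placeholder, and Anonde's own numbering scheme may differ.

```go
package main

import "fmt"

// Tokenizer (hypothetical) hands out placeholders for a single
// (tenant, doc) scope. Create one instance per scope.
type Tokenizer struct {
	next   map[string]int    // per-type counter
	tokens map[string]string // type + NUL + value -> token
}

func NewTokenizer() *Tokenizer {
	return &Tokenizer{next: map[string]int{}, tokens: map[string]string{}}
}

// Token returns the existing placeholder for (typ, value) or allocates
// the next number for that type.
func (t *Tokenizer) Token(typ, value string) string {
	k := typ + "\x00" + value
	if tok, ok := t.tokens[k]; ok {
		return tok
	}
	t.next[typ]++
	tok := fmt.Sprintf("<%s_%d>", typ, t.next[typ])
	t.tokens[k] = tok
	return tok
}

func main() {
	tk := NewTokenizer()
	fmt.Println(tk.Token("PERSON", "John"))                    // <PERSON_1>
	fmt.Println(tk.Token("PERSON", "Maria"))                   // <PERSON_2>
	fmt.Println(tk.Token("PERSON", "John"))                    // <PERSON_1>
	fmt.Println(tk.Token("EMAIL_ADDRESS", "john@example.com")) // <EMAIL_ADDRESS_1>
}
```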

Tokens are opaque on their own. The only way to map <EMAIL_ADDRESS_1> back to john@example.com is the vault, so leaking the anonymized text does not expose the values behind the tokens.

09 What does “reveal” mean?

Reveal is the inverse of anonymize: it walks the vault and substitutes each token with the real value, so the user sees “John” and john@example.com instead of <PERSON_1> and <EMAIL_ADDRESS_1>. It happens entirely on your infrastructure.
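A sketch of a reveal step that substitutes tokens and records who asked and why. The `Revealer` and `AuditEntry` types are hypothetical, a plain map stands in for the vault, and the audit log is an in-memory slice. Rejecting calls without an actor or purpose, and leaving unknown tokens untouched, are assumptions of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"time"
)

// tokenRE matches placeholders such as <PERSON_1> or <EMAIL_ADDRESS_1>.
var tokenRE = regexp.MustCompile(`<[A-Z_]+_\d+>`)

// AuditEntry (hypothetical) records one reveal operation.
type AuditEntry struct {
	Actor, Purpose string
	Tokens         []string
	At             time.Time
}

// Revealer (hypothetical) uses a plain map as a stand-in for the vault.
type Revealer struct {
	vault map[string]string // token -> real value
	Log   []AuditEntry
}

// Reveal replaces known tokens with their real values and appends an
// audit entry. Calls without an actor or purpose are rejected.
func (r *Revealer) Reveal(text, actor, purpose string) (string, error) {
	if actor == "" || purpose == "" {
		return "", errors.New("reveal requires an actor and a purpose")
	}
	var revealed []string
	out := tokenRE.ReplaceAllStringFunc(text, func(tok string) string {
		val, ok := r.vault[tok]
		if !ok {
			return tok // unknown tokens are left as they are
		}
		revealed = append(revealed, tok)
		return val
	})
	r.Log = append(r.Log, AuditEntry{
		Actor: actor, Purpose: purpose, Tokens: revealed, At: time.Now(),
	})
	return out, nil
}

func main() {
	r := &Revealer{vault: map[string]string{
		"<PERSON_1>":        "John",
		"<EMAIL_ADDRESS_1>": "john@example.com",
	}}
	out, err := r.Reveal("Hi <PERSON_1>, we emailed <EMAIL_ADDRESS_1>.", "support-agent-7", "ticket reply")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // Hi John, we emailed john@example.com.
	fmt.Println(r.Log[0].Actor, r.Log[0].Purpose, r.Log[0].Tokens)
}
```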

Every reveal call carries an actor (who is asking) and a purpose (why). The pair is recorded with the operation, so you have an audit trail of every de-anonymization. That trail is the evidence a GDPR or HIPAA access review, or an internal access policy, typically asks for.

10 Patterns-only vs NER image: which do I want?

The patterns-only image is around 12 MB, contains only the regex + checksum recognizers, and is the right pick for structured data: logs, API payloads, form submissions, database exports. It's fast, it has no ML dependency, and it's easy to run anywhere.

The NER image is around 770 MB (about 530 MB with INT8 quantization) and bakes in GLiNER plus the ONNX runtime. Use it when the input is free-form text: chat transcripts, support tickets, clinical notes, contracts. If you've deployed the NER image but want to skip NER for a particular call, set disable_ner: true on the request and it falls back to patterns-only for that one call.
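As an illustration, a request body with the per-call opt-out might be built like this. Only the `disable_ner` field name comes from the text above. The `AnonymizeRequest` struct, the `text` field, and the `buildRequest` helper are assumptions, and the endpoint to send the body to is deliberately left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AnonymizeRequest is a hypothetical request body. Only the disable_ner
// field name is taken from the documentation; "text" is an assumed field.
type AnonymizeRequest struct {
	Text       string `json:"text"`
	DisableNER bool   `json:"disable_ner,omitempty"`
}

// buildRequest (hypothetical helper) serializes the body for one call.
func buildRequest(text string, disableNER bool) ([]byte, error) {
	return json.Marshal(AnonymizeRequest{Text: text, DisableNER: disableNER})
}

func main() {
	// A structured log line needs no NER, so skip it for this call only.
	body, err := buildRequest("user=42 iban=DE89370400440532013000", true)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	// {"text":"user=42 iban=DE89370400440532013000","disable_ner":true}
}
```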

11 The analyzer pipeline, in one paragraph

Input goes into internal/core/service.go, the transport-agnostic orchestrator. The analyzer fans out the 52 pattern recognizers in parallel and, if enabled, runs the NER backend over the same text. Findings are de-conflicted (patterns win for structured types, NER wins for unstructured types), the anonymizer applies the configured transformation to each span, adjacent spans of the same type are merged, the vault records the token↔value mapping, and the anonymized payload is returned. Reveal walks the same vault in reverse. The same pipeline backs REST, Connect, and native gRPC on a single port.
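One step of the pipeline, merging adjacent spans of the same type, can be sketched as follows. The `Span` type and `mergeAdjacent` function are hypothetical, and the `maxGap` parameter (how many bytes may separate two spans that still count as adjacent) is an assumption; the text above does not define adjacency precisely.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a hypothetical typed byte range.
type Span struct {
	Type       string
	Start, End int
}

// mergeAdjacent joins same-type spans that overlap, touch, or are
// separated by at most maxGap bytes. The input slice is not modified.
func mergeAdjacent(spans []Span, maxGap int) []Span {
	sorted := append([]Span(nil), spans...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start < sorted[j].Start })

	var out []Span
	for _, s := range sorted {
		n := len(out)
		if n > 0 && out[n-1].Type == s.Type && s.Start-out[n-1].End <= maxGap {
			// Extend the previous span instead of emitting a new one.
			if s.End > out[n-1].End {
				out[n-1].End = s.End
			}
			continue
		}
		out = append(out, s)
	}
	return out
}

func main() {
	// Byte offsets for "Maria Schmidt lives in Köln".
	spans := []Span{
		{"PERSON", 0, 5},     // Maria
		{"PERSON", 6, 13},    // Schmidt
		{"LOCATION", 23, 28}, // Köln ("ö" is two bytes in UTF-8)
	}
	fmt.Println(mergeAdjacent(spans, 1)) // [{PERSON 0 13} {LOCATION 23 28}]
}
```

Merging first and last name into one span means the full name becomes a single token rather than two.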

12 Why local-first / self-hosted?

Because the data you're trying to protect and the vault that can un-protect it should never leave your network. If Anonde lived in someone else's cloud, you'd be trusting a third party with the exact thing you were trying to keep private, which defeats the point.

Self-hosted also makes audits tractable: there's one process, one vault file, one set of logs, all on infrastructure your security team already controls. And there's no vendor lock-in to worry about: the library and the image are open source, the formats are documented, and you can walk away with your vault any time.

13 Anonde vs Microsoft Presidio?

Same shape, different defaults. Presidio is the reference design for programmable PII anonymization, and Anonde takes its core ideas (recognizers, reversible tokens) and rebuilds them with different tradeoffs. Anonde is Go-native, so deployment is a single static binary or a distroless image; there's no Python runtime to manage. The NER backend (GLiNER) is part of the project, not a bring-your-own component, so out-of-the-box detection on free-form text is much stronger.

On the gold corpora we benchmark against, Anonde has a lower leak rate than Presidio on 5 of 6, with the gap largest on German clinical and legal text. And it's designed to sit transparently in front of any LLM API rather than being something you bolt into a specific app: same library, same image, same payloads whether you're guarding OpenAI, Anthropic, a local Llama, or your own fine-tune.

Ready to try it? Read the quickstart or star on GitHub.
