API Reference

This page details the Private AI REST API.

GET /healthz

Check the health of the container.

Response Body

The API returns a JSON object containing the following fields:

  • success: bool

    Whether the healthz request succeeded

  • last_auth_call_successful: bool

    Whether the attempted call to the Private AI authentication servers was successful. Note that this value defaults to false on startup, until the first deid API call has been made successfully.

Status Code

200 on success. Any 4xx or 5xx code should be interpreted as a failed health check.

Example command:
 $ curl -X GET localhost:8080/healthz
{
   "last_auth_call_successful": false,
   "success": true
}
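
In an orchestrated deployment it can be convenient to poll this endpoint from code before routing traffic to the container. The following is a minimal sketch (not part of the API itself) using Python's requests library, assuming the container is reachable at localhost:8080 as in the example above:

import time
import requests

def wait_until_healthy(base_url="http://localhost:8080", timeout_s=60):
    # Poll GET /healthz until the container reports success or the timeout expires.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            body = requests.get(f"{base_url}/healthz", timeout=5).json()
            if body.get("success"):
                return body
        except requests.RequestException:
            pass  # the container may still be starting up
        time.sleep(1)
    raise RuntimeError("Container did not become healthy within the timeout")

print(wait_until_healthy())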

POST /get_usage

Return the number of units used in the current month.

Request Body

JSON object containing the following field:

  • key: string (mandatory)

    License key provided to you by Private AI.

Minimal example:

{
   "key": "<customer key>"
}

Response Body

The API returns a JSON object containing the following field:

  • calls_made: int

The number of API credits used, rounded to the nearest usage metering block (the Private AI authentication system meters credits in blocks, to minimize network traffic). A credit is defined as 100 words, and a word is a whitespace-separated piece of text.

Example command:
$ curl -X POST localhost:8080/get_usage -H 'content-type: application/json' -d '{ "key": "<customer key>"}'
{
  "calls_made": 26600
}
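
Because a credit corresponds to 100 whitespace-separated words, approximate usage for a given input can be estimated client-side before sending it. The snippet below is an illustrative sketch only (the helper name and rounding behaviour are assumptions, not part of the API); the calls_made and api_calls_used values returned by the container remain authoritative:

import math

def estimate_credits(text: str) -> int:
    # A word is a whitespace-separated piece of text; one credit covers up to 100 words.
    words = len(text.split())
    return max(1, math.ceil(words / 100))

print(estimate_credits("Hello Paul, how are you?"))  # 5 words -> 1 credit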

Processing Text

Once the Docker container is running, you can make requests to process text. This is a POST request to the deidentify_text route with a JSON body containing the fields described below.

POST /deidentify_text

Remove identifiers from a string or multiple strings.

Request Body

JSON object containing the following fields:

  • text: string or array(string) (mandatory)

    UTF-8 encoded message(s) to de-identify.

  • key: string (mandatory)

    License key provided to you by Private AI.

  • accuracy_mode: string (default: “high”)

Selects the model used to identify PII in the input text. By default, the “high” accuracy model is used. Whilst the models used by the Private AI solution are highly optimised (~25X faster than a reference transformer implementation), in high-throughput cases it is possible to trade accuracy for speed by selecting either the “standard” or “standard_high” accuracy modes. Multilingual support can be enabled by using one of the multilingual models, namely “standard_high_multilingual” (GPU container only) and “high_multilingual”. The multilingual models process all supported languages including English, without the need to specify a language. It is advisable to use the English-only models where possible, as they perform slightly better on English.

  • link_batch: bool (default: false)

    When set to true, batch inputs will be joined together internally in the Private AI inference engine, to share context between the different inputs. This is useful when processing a sequence of short inputs, such as an SMS chat log.

  • enabled_classes: array(string) (defaults to all classes)

    Controls which types of PII are removed. See Supported Entity Types for the list of possible entities.

  • unique_pii_markers: bool (default: true)

    Specifies whether PII markers in the text should uniquely identify PII.

  • marker_format: string or null (default: “[CLASS_NAME]”)

Specify a custom redaction marker format. The format must always contain ‘CLASS_NAME’, which will be replaced by the entity class, e.g. “<<CLASS_NAME>>” or “-CLASS_NAME-”. It is also possible to set this option via an environment variable; see Environment Variables.

  • allow_list: array(string) (default: [ ])

Any detected entities matching an item in this list will be ignored, i.e. left in the text rather than redacted. Note that this feature does not support regexes and that matching is case-insensitive: if the allow list is [“maxim”, “Kandeep”], matches such as “maxim”, “MAxim”, “MAXIM”, “kandeep” and “kANdeep” will all be passed through. It is also possible to set this option via an environment variable; see Environment Variables.

  • block_list: object (default: None)

    The block list feature allows you to extend the functionality of the Private AI models by using regular expressions. This way, you can define a Python regex pattern that will be used to identify additional tokens with the given PII label.

The block list feature supports multiple regex patterns. These are passed as a JSON object whose keys are labels and whose values are regex patterns, for example { “CUSTOM_LABEL”: “custom” }. It is possible to pass multiple LABEL-REGEX pairs in the same object, so the following is also valid: { “CUSTOM1”: “custom”, “CUSTOM2”: “other” }.

Since this feature uses regex patterns, you can pass either a plain word (e.g. “the”, “word”, “custom”) or a valid Python regex pattern. It is important to note that regex patterns may require escaping when embedded in JSON: for example, to send the regex pattern \b\w{4}\b (which matches every four-character word), it must be written as “\\b\\w{4}\\b” in the JSON body (see also the Python sketch after the minimal examples below). The complete JSON grammar can be found here: https://www.json.org/json-en.html. More information on how to write a Python regex can be found here: https://docs.python.org/3/library/re.html

Note also that only non-overlapping matches are returned.

Lastly, for supported labels, if you would like the model to pick up only the tokens from the block list, you can use the enabled classes feature together with the block list feature. This is done by defining a list of enabled classes that omits the supported label you are adding to the block list. For example, if you would like the label “ORGANIZATION” to only pick up Microsoft, define the enabled classes as [“NAME”, “LOCATION”, “AGE”, …] (omitting ORGANIZATION) and the block list as {“ORGANIZATION”: “Microsoft”}.

  • fake_entity_accuracy_mode (beta): string (default: None)

    Enable fake entity generation using the specified model. Currently this feature is in beta and only supports mode “standard”.

  • preserve_relationships (beta): bool (default: true)

Specifies whether multiple instances of the same entity should receive the same generated fake entity. For example, with preserved relationships: “Hi John and Rosha, John nice to meet you” -> “Hi Harry and Alev, Harry nice to meet you”. Without preserved relationships: “Hi John and Rosha, John nice to meet you” -> “Hi Harry and Alev, Sulav nice to meet you”. This field has no effect when fake_entity_accuracy_mode is not set.

Minimal example:

{
   "text": "Hello Paul, how are you?",
   "key": "<customer key>"
}

Minimal example with batched de-identification:

{
   "text": [
      "Hello Paul, how are you?",
      "My address is 123 Example Street."
   ],
   "key": "<customer key>"
}
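
To show how several of the optional fields fit together, including the JSON escaping of a block_list regex, below is an illustrative Python sketch (the field values are examples only, not a recommended configuration). json.dumps handles the backslash escaping described under block_list above:

import json
import requests

# The raw regex is \b\w{4}\b; when serialised to JSON it is sent as "\\b\\w{4}\\b".
payload = {
    "text": "Hello Paul, how are you?",
    "key": "<customer key>",
    "marker_format": "<<CLASS_NAME>>",
    "block_list": {"CUSTOM_LABEL": r"\b\w{4}\b"},
}
print(json.dumps(payload))  # inspect the escaped request body
response = requests.post("http://localhost:8080/deidentify_text", json=payload)
print(response.json()["result"])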

Response Body

The API returns a JSON object containing the following fields:

  • result: string or array(string)

    The de-identified string(s).

  • result_fake (beta): string or array(string)

    The pseudonymized (fake) string(s) with each entity found replaced by a generated entity.

  • pii: array(object)

    A list of all entities found in the text. Each PII entry has the following fields:

    • marker: string

      The corresponding marker in the de-identified text (result field), where the entity exists

    • text: string

      The entity text

    • best_label: string

      The entity label with the highest likelihood

    • stt_idx: int

      Start character index of the entity, in the original text

    • end_idx: int

      End character index of the entity, in the original text

    • labels: dictionary

      A dictionary of all possible labels, together with associated likelihoods. Note that these are not strictly probabilities and do not sum to 1, as a word can belong to multiple classes. The scores have also been thresholded, so no additional thresholding is necessary.

    • fake_text (beta): string

      The fake entity that was generated to replace the original

    • fake_stt_idx (beta): int

      Start character index of the fake entity, in the pseudonymized/fake text

    • fake_end_idx (beta): int

      End character index of the fake entity, in the pseudonymized/fake text

  • api_calls_used: int

The number of API credits used to process a request, where a credit is defined as 100 words and a word is a whitespace-separated piece of text.

  • output_checks_passed: bool

    Reports whether the output validity checks passed or not. These checks test whether:

    1. replacing each entity marker with the corresponding information matches the input

    2. every entity marker is bounded by whitespace or punctuation
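
A simple client-side sanity check, related in spirit to check 1 above, is to confirm that each pii entry's stt_idx/end_idx slice of the original input matches the reported entity text. The sketch below is illustrative only; the output_checks_passed flag returned by the container remains the authoritative signal:

def offsets_consistent(original_text: str, response: dict) -> bool:
    # Each entity's character span in the original text should match its reported text.
    return all(
        original_text[entity["stt_idx"]:entity["end_idx"]] == entity["text"]
        for entity in response["pii"]
    )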

Sample Commands

Below are some sample commands and corresponding outputs illustrating the different options. It is possible to test the API without installing the container by using Private AI’s demo server; please contact Private AI for the endpoint address. For example usage in Python & JavaScript, please refer to our example repo here: https://github.com/privateai/deid-examples
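
The same requests can also be issued from Python; below is a minimal illustrative sketch using the requests library, assuming a container running on localhost:8080 as in the curl commands that follow:

import requests

payload = {
    "text": "Hello Paul, how are you?",
    "key": "<customer key>",
}
response = requests.post("http://localhost:8080/deidentify_text", json=payload)
response.raise_for_status()
print(response.json()["result"])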

A minimal example requires two inputs: the text to process and an API key. The default behaviour is to create unique PII markers:
$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "Hi John, my name is Grace. John, could you pass me the salt please?", "key": "<customer key>"}'
{
  "result": "Hi [NAME_1], my name is [NAME_2]. [NAME_1], could you pass me the salt please?",
  "result_fake": null,
  "pii": [
    {
      "marker": "NAME_1",
      "text": "John",
      "best_label": "NAME",
      "stt_idx": 3,
      "end_idx": 7,
      "labels": {
        "NAME": 0.8332
      }
    },
    {
      "marker": "NAME_2",
      "text": "Grace",
      "best_label": "NAME",
      "stt_idx": 20,
      "end_idx": 25,
      "labels": {
        "NAME": 0.8306
      }
    },
    {
      "marker": "NAME_1",
      "text": "John",
      "best_label": "NAME",
      "stt_idx": 27,
      "end_idx": 31,
      "labels": {
        "NAME": 0.8325
      }
    }
  ],
  "api_calls_used": 1,
  "output_checks_passed": true
}
Text in any supported language (English, French, Spanish, Italian, German, Portuguese & Korean) can be processed by setting the accuracy_mode to a multilingual model. There is no need to specify the input language:
$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "Hallo Günther, ich bin Pieter", "key": "<customer key>", "accuracy_mode": "high_multilingual"}'
{
  "result": "Hallo [NAME_1], ich bin [NAME_2]",
  "result_fake": null,
  "pii": [
    {
      "marker": "NAME_1",
      "text": "Günther",
      "best_label": "NAME",
      "stt_idx": 6,
      "end_idx": 13,
      "labels": {
        "NAME": 0.8303
      }
    },
    {
      "marker": "NAME_2",
      "text": "Pieter",
      "best_label": "NAME",
      "stt_idx": 23,
      "end_idx": 29,
      "labels": {
        "NAME": 0.8384
      }
    }
  ],
  "api_calls_used": 1,
  "output_checks_passed": true
}
Non-unique PII markers can be created using the “unique_pii_markers” option. Note that every piece of PII still has a unique entry in the PII list:
$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "My name is John and my friend is Grace", "unique_pii_markers": false, "key": "<customer key>"}'
{
  "result": "My name is [NAME] and my friend is [NAME]",
  "pii": [
     {
        "marker": "NAME",
        "text": "John",
        "best_label": "NAME",
        "stt_idx": 11,
        "end_idx": 15,
        "labels": {"NAME": 0.923}
     },
     {
        "marker": "NAME",
        "text": "Grace",
        "best_label": "NAME",
        "stt_idx": 33,
        "end_idx": 38,
        "labels": {"NAME": 0.9135}
     }
  ],
  "api_calls_used": 1,
  "output_checks_passed": true
}
It is possible to restrict the PII classes that the Private AI API looks for by passing a list of desired classes:
$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "My name is John and my friend is Grace and we live in Barcelona", "key": "<customer key>", "enabled_classes": ["AGE", "LOCATION"]}'
{
  "result": "My name is John and my friend is Grace and we live in [LOCATION_1]",
  "pii": [
     {
        "marker": "LOCATION_1",
        "text": "Barcelona",
        "best_label": "LOCATION",
        "stt_idx": 54,
        "end_idx": 63,
        "labels": {"LOCATION": 0.9211}
     }
  ],
  "api_calls_used": 1,
  "output_checks_passed": true
}
Multiple inputs can be processed using a single POST call, simply by passing a list of strings instead of a single string:
$ curl -X POST http://localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": ["My password is: 4XDX63F8O1", "My password is: 33LMVLLDHNasdfsda"], "key": "<customer key>"}'
 [
   {
      "result":"My password is: [PASSWORD_1]",
      "result_fake":null,
      "pii":[
         {
            "marker":"PASSWORD_1",
            "text":"4XDX63F8O1",
            "best_label":"PASSWORD",
            "stt_idx":16,
            "end_idx":26,
            "labels":{"PASSWORD":0.9346}
         }
      ],
      "api_calls_used":1,
      "output_checks_passed":true
   },
   {
      "result":"My password is: [PASSWORD_1]",
      "result_fake":null,
      "pii":[
         {
            "marker":"PASSWORD_1",
            "text":"33LMVLLDHNasdfsda",
            "best_label":"PASSWORD",
            "stt_idx":16,
            "end_idx":33,
            "labels":{"PASSWORD":0.9312}
         }
      ],
      "api_calls_used":1,
      "output_checks_passed":true
   }
]
Batching with link_batch enabled pools context between the different inputs. This comes in handy, for example, when processing chat logs or transcripts. In the example below, the identifiers are in the 1st and 3rd messages, whilst the PII resides in the 2nd and 4th messages. Enabling link_batch allows the model to see the full context of the chat and classify the PII accordingly.
$ curl -X POST http://localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": ["Hi, my name is Penelope, could you tell me your phone number please?", "Sure, x234", "and your DOB please?", "fourth of Feb nineteen 86"], "link_batch": true, "key": "<customer key>"}'
[
  {
    "result": "Hi, my name is [NAME_1], could you tell me your phone number please?",
    "result_fake": null,
    "pii": [
      {
        "marker": "NAME_1",
        "text": "Penelope",
        "best_label": "NAME",
        "stt_idx": 15,
        "end_idx": 23,
        "labels": {
          "NAME": 0.8282
        }
      }
    ],
    "api_calls_used": 1,
    "output_checks_passed": true
  },
  {
    "result": "Sure, [PHONE_NUMBER_1]",
    "result_fake": null,
    "pii": [
      {
        "marker": "PHONE_NUMBER_1",
        "text": "x234",
        "best_label": "PHONE_NUMBER",
        "stt_idx": 6,
        "end_idx": 10,
        "labels": {
          "PHONE_NUMBER": 0.8424
        }
      }
    ],
    "api_calls_used": 1,
    "output_checks_passed": true
  },
  {
    "result": "and your DOB please?",
    "result_fake": null,
    "pii": [],
    "api_calls_used": 1,
    "output_checks_passed": true
  },
  {
    "result": "[DOB_1]",
    "result_fake": null,
    "pii": [
      {
        "marker": "DOB_1",
        "text": "fourth of Feb nineteen 86",
        "best_label": "DOB",
        "stt_idx": 0,
        "end_idx": 25,
        "labels": {
          "DOB": 0.8794
        }
      }
    ],
    "api_calls_used": 1,
    "output_checks_passed": true
  }
]
Specifying allow_list allows certain entities to be passed through. This is handy when dealing with common terms that are not sensitive, such as a customer name that is a matter of public record:
$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "Hello Xavier, Rudolph here. Did you see Jane?", "allow_list": ["Xavier", "Rudolph"], "key": "<customer key>"}'
 {
   "result": "Hello Xavier, Rudolph here. Did you see [NAME_1]?",
   "result_fake": null,
   "pii": [
     {
       "marker": "NAME_1",
       "text": "Jane",
       "best_label": "NAME",
       "stt_idx": 40,
       "end_idx": 44,
       "labels": {
         "NAME": 0.8311
       }
     }
   ],
   "api_calls_used": 1,
   "output_checks_passed": true
 }
Similar to the allow list, Private AI’s API also supports blocking specified entities using Python regexes:
$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "Hello Xavier, Rudolph here. Did you see Jane?", "block_list": {"CUSTOMER_NAME": "Rudolph"}, "key": "<customer key>"}'
 {
   "result": "Hello [NAME_1], [CUSTOMER_NAME_1] here. Did you see [NAME_2]?",
   "result_fake": null,
   "pii": [
     {
       "marker": "NAME_1",
       "text": "Xavier",
       "best_label": "NAME",
       "stt_idx": 6,
       "end_idx": 12,
       "labels": {
         "NAME": 0.8324
       }
     },
     {
       "marker": "CUSTOMER_NAME_1",
       "text": "Rudolph",
       "best_label": "CUSTOMER_NAME",
       "stt_idx": 14,
       "end_idx": 21,
       "labels": {
         "CUSTOMER_NAME": 1
       }
     },
     {
       "marker": "NAME_2",
       "text": "Jane",
       "best_label": "NAME",
       "stt_idx": 40,
       "end_idx": 44,
       "labels": {
         "NAME": 0.8311
       }
     }
   ],
   "api_calls_used": 1,
   "output_checks_passed": true
 }
Generating synthetic PII can easily be done by setting the fake_entity_accuracy_mode:
$ curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "My name is John and my friend is Grace and we live in Barcelona", "key": "<customer key>", "fake_entity_accuracy_mode": "standard"}'
{
  "result": "My name is [NAME_1] and my friend is [NAME_2] and we live in [LOCATION_1]",
  "result_fake": "My name is Sarah and my friend is Sarah and we live in California",
  "pii": [
     {
        "marker": "NAME_1",
        "text": "John",
        "best_label": "NAME",
        "stt_idx": 11,
        "end_idx": 15,
        "labels": {"NAME":0.9061},
        "fake_text": ["Sarah"],
        "fake_stt_idx": 11,
        "fake_end_idx": 16
     },
     {
        "marker": "NAME_2",
        "text": "Grace",
        "best_label": "NAME",
        "stt_idx": 33,
        "end_idx": 38,
        "labels": {"NAME": 0.9032},
        "fake_text": ["Sarah"],
        "fake_stt_idx": 34,
        "fake_end_idx": 39
     },
     {
        "marker": "LOCATION_1",
        "text": "Barcelona",
        "best_label": "LOCATION",
        "stt_idx": 54,
        "end_idx": 63,
        "labels": {"LOCATION": 0.8985},
        "fake_text": ["California"],
        "fake_stt_idx": 55,
        "fake_end_idx": 65
     }
  ],
  "api_calls_used": 1,
  "output_checks_passed": true
}

Performance Tips

Private AI’s solution uses AI to detect PII based on context. Therefore, for best performance it is advisable to send text through in the largest possible chunks that still meet latency requirements. For example, the following chat log should be sent through in one call with link_batch enabled, as opposed to line-by-line:

“Hi John, how are you?”

“I’m good thanks”

“Great, hope Atlanta is treating you well”
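
As a concrete illustration of the advice above, the chat log can be sent as a single batched request with link_batch enabled. This is a minimal sketch assuming a local container:

import requests

chat_log = [
    "Hi John, how are you?",
    "I'm good thanks",
    "Great, hope Atlanta is treating you well",
]
payload = {"text": chat_log, "link_batch": True, "key": "<customer key>"}
response = requests.post("http://localhost:8080/deidentify_text", json=payload)
# One de-identified result is returned per input message, with context shared across them.
for item in response.json():
    print(item["result"])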

Similarly, text documents should be sent through in a single request, rather than by paragraph or sentence. In addition to improving accuracy, this will minimize the number of API calls made.

When processing audio transcripts, it is recommended to use the following input format:

“<speaker id>: <message>, <speaker id>: <message>, …”
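
The sketch below shows one way to assemble speaker turns into the recommended single-string format before sending it; the turn structure and helper name are assumptions for illustration only:

def format_transcript(turns):
    # turns: a list of (speaker_id, message) pairs.
    # Produces "<speaker id>: <message>, <speaker id>: <message>, " as recommended above.
    return "".join(f"{speaker}: {message}, " for speaker, message in turns)

transcript = format_transcript([
    ("speaker 1", "Hi John, how are you?"),
    ("speaker 2", "I'm good thanks"),
])
payload = {"text": transcript, "key": "<customer key>"}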

Finally, the AI model has also been optimised for normal English capitalization, e.g. “Robert is from Sydney, Australia. Muhab is from Wales”. If this is not the case for your data, please contact Private AI so that we can provide you with the optimal model for your use case. Our solution will still work, but some performance will be lost.