Data anonymization with AI
06 Aug 2023

Data anonymization is the process of protecting private or sensitive information by erasing or encrypting identifiers that link an individual to stored data. This method is often used in situations where privacy is necessary, such as when sharing data or making it publicly available. The goal of data anonymization is to make it impossible (or at least very difficult) to identify individuals from the data, while still allowing the data to be useful for analysis and research purposes.
I have decided to create a library which helps to anonymize data simply and with high performance. That's why I have written it in Rust. The library uses three anonymization algorithms. The first, the Named Entity Recognition method, enables the library to identify and anonymize sensitive named entities in your data, such as names, organizations, locations, and other personal identifiers.
Here you can use existing models from HuggingFace for different languages, for example:
- dslim/bert-base-NER for English
- clarin-pl/FastPDN for Polish
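As a quick illustration of what such a model detects, you can run one of them with the transformers pipeline. This is a minimal sketch, independent of the anonymization library itself:
from transformers import pipeline

# Quick look at what dslim/bert-base-NER detects; this is only an illustration
# and is not part of the anonymization library.
ner = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner("My name is Wolfgang and I live in Berlin"))
# e.g. [{'entity_group': 'PER', 'word': 'Wolfgang', ...}, {'entity_group': 'LOC', 'word': 'Berlin', ...}]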
The models are based on external libraries like PyTorch. To avoid such external dependencies I have used the tract library, an ONNX inference implementation written in Rust.
To use the models we need to convert them to the ONNX format using the transformers library:
import os
from pathlib import Path

import transformers
from transformers import AutoModelForTokenClassification, AutoTokenizer
from transformers.onnx import FeaturesManager

# Load the NER model and its tokenizer from HuggingFace
model_id = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Pick the ONNX config matching the token-classification task
feature = "token-classification"
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature=feature)
onnx_config = model_onnx_config(model.config)

output_dir = "./dslim"
os.makedirs(output_dir, exist_ok=True)

# Export the model to ONNX
onnx_inputs, onnx_outputs = transformers.onnx.export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=13,
    output=Path(output_dir + "/model.onnx"),
)
print(onnx_inputs)
print(onnx_outputs)

# Save the tokenizer (tokenizer.json) next to the exported model
tokenizer.save_pretrained(output_dir)
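Optionally, we can sanity-check the exported model before handing it to the Rust library. A minimal sketch using the onnxruntime package (which is not required by the library itself):
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the exported ONNX model and the saved tokenizer
tokenizer = AutoTokenizer.from_pretrained("./dslim")
session = ort.InferenceSession("./dslim/model.onnx")

# Run a single sentence through the exported model
encoded = tokenizer("My name is Wolfgang and I live in Berlin", return_tensors="np")
logits = session.run(None, dict(encoded))[0]
print(logits.shape)  # (1, sequence_length, number_of_labels)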
Now we are ready to use the NER algorithm. We can simply run the Docker image with a YAML configuration file where we define the anonymization pipeline:
pipeline:
  - kind: ner
    model_path: ./dslim/model.onnx
    tokenizer_path: ./dslim/tokenizer.json
    token_type_ids_included: true
    id2label:
      "0": ["O", false]
      "1": ["B-MISC", true]
      "2": ["I-MISC", true]
      "3": ["B-PER", true]
      "4": ["I-PER", true]
      "5": ["B-ORG", true]
      "6": ["I-ORG", true]
      "7": ["B-LOC", true]
      "8": ["I-LOC", true]
docker run -it -v $(pwd):/app/ -p 8080:8080 qooba/anonymize-rs server --host 0.0.0.0 --port 8080 --config config.yaml
For the NER algorithm we can configure whether each predicted entity label will be replaced or not (the boolean flag next to each label in id2label). For the example request below we will receive the anonymized text together with the replaced items.
curl -X GET "http://localhost:8080/api/anonymize?text=I like to eat apples and bananas and plums" -H "accept: application/json" -H "Content-Type: application/json"
Response:
{
    "text": "I like to eat FRUIT_FLASH0 and FRUIT_FLASH1 and FRUIT_REGEX0",
    "items": {
        "FRUIT_FLASH0": "apples",
        "FRUIT_FLASH1": "bananas",
        "FRUIT_REGEX0": "plums"
    }
}
If needed we can deanonymize the data using a separate endpoint.
curl -X POST "http://localhost:8080/api/deanonymize" -H "accept: application/json" -H "Content-Type: application/json" -d '{
    "text": "I like to eat FRUIT_FLASH0 and FRUIT_FLASH1 and FRUIT_REGEX0",
    "items": {
        "FRUIT_FLASH0": "apples",
        "FRUIT_FLASH1": "bananas",
        "FRUIT_REGEX0": "plums"
    }
}'
Response:
{
    "text": "I like to eat apples and bananas and plums"
}
If we prefer, we can use the library directly from Python code; in this case we simply install the package and import it in Python.
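Since the server also exposes plain REST endpoints, the same calls can be made from Python over HTTP. A minimal sketch using the requests package, assuming the server from the Docker example is running on localhost:8080:
import requests

# Anonymize a sentence via the running anonymization server
response = requests.get(
    "http://localhost:8080/api/anonymize",
    params={"text": "I like to eat apples and bananas and plums"},
)
anonymized = response.json()
print(anonymized["text"], anonymized["items"])

# Restore the original text with the returned replacement items
restored = requests.post("http://localhost:8080/api/deanonymize", json=anonymized)
print(restored.json()["text"])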
We have discussed the first anonymization algorithm, but what if it is not enough? There are two additional methods. The first is the FlashText algorithm, a fast method for searching and replacing words in large datasets, used here to anonymize predefined sensitive keywords. For FlashText we can define the keywords either in a separate file, with one keyword per line, or directly in the keywords section of the configuration (both options appear in the combined pipeline below).
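The same algorithm is available in Python as the flashtext package, so a conceptual sketch of what the replacement step does looks like this (the library implements this in Rust, so this is only an illustration):
from flashtext import KeywordProcessor  # pip install flashtext

# Conceptual illustration of keyword-based replacement; the anonymization
# library implements the same FlashText algorithm internally in Rust.
processor = KeywordProcessor()
processor.add_keyword("apples", "FRUIT_FLASH0")
processor.add_keyword("bananas", "FRUIT_FLASH1")
print(processor.replace_keywords("I like to eat apples and bananas and plums"))
# I like to eat FRUIT_FLASH0 and FRUIT_FLASH1 and plums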
The last method is a simple regex stage where we can define patterns whose matches will be replaced.
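Conceptually this works like an ordinary regular-expression substitution that also keeps track of what was replaced. A rough sketch in Python, not the library's actual implementation:
import re

# Rough illustration of regex-based anonymization with reversible replacements;
# the real library does this in Rust as part of the pipeline.
text = "I like to eat apples and bananas and plums"
patterns = [r"\bplum\w*\b"]
items = {}

def replace(match):
    key = f"FRUIT_REGEX{len(items)}"
    items[key] = match.group(0)
    return key

for pattern in patterns:
    text = re.sub(pattern, replace, text)

print(text)   # I like to eat apples and bananas and FRUIT_REGEX0
print(items)  # {'FRUIT_REGEX0': 'plums'}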
We can combine several methods and build an anonymization pipeline:
pipeline:
  - kind: ner
    model_path: ./dslim/model.onnx
    tokenizer_path: ./dslim/tokenizer.json
    token_type_ids_included: true
    id2label:
      "0": ["O", false]
      "1": ["B-MISC", true]
      "2": ["I-MISC", true]
      "3": ["B-PER", true]
      "4": ["I-PER", true]
      "5": ["B-ORG", true]
      "6": ["I-ORG", true]
      "7": ["B-LOC", true]
      "8": ["I-LOC", true]
  - kind: flashText
    name: FRUIT_FLASH
    file: ./tests/config/fruits.txt
    keywords:
      - apple
      - banana
      - plum
  - kind: regex
    name: FRUIT_REGEX
    file: ./tests/config/fruits_regex.txt
    patterns:
      - \bapple\w*\b
      - \bbanana\w*\b
      - \bplum\w*\b
Remember that the library relies on automated detection mechanisms, and there is no guarantee that it will find all sensitive information. You should always ensure that your data protection measures are comprehensive and multi-layered.