At a very high level, the process for improving the detection of Personally Identifiable Information (PII) starts with annotating training data using Inception, as shown here.
Get training data
Once you have set up and annotated your texts, you can easily fetch the project you’ve set up in Inception and then retrieve your annotations.
from blackbar import inception_list_projects
projects = inception_list_projects()
projects
To get the annotations from Inception, we use the Inception API.
The following code gives you the list of documents which are on Inception, the annotations performed, as well as the labelled entities for a specific set of projects.
project_id doc_id ... state label
0 21 10722 ... ANNOTATION-IN-PROGRESS voorbeeld.txt
[1 rows x 5 columns]
anno["anno"]
project_id ... cas
0 21 ... b'PK\x03\x04\x14\x00\x08\x08\x08\x00\xd8v\xc6Z...
[1 rows x 9 columns]
#anno["entities"]
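The documents table printed above carries the annotation state per document. A common next step is to keep only documents whose annotation is complete before building training data. The sketch below illustrates this with a small stand-in DataFrame; the state value "ANNOTATION-FINISHED" is an assumption — check which state strings your Inception instance actually reports.

```python
import pandas as pd

# Small stand-in for the documents table shown above; the real table
# has more columns (hence the "..." in the printed output).
documents = pd.DataFrame({
    "project_id": [21, 21],
    "doc_id": [10722, 10723],
    "state": ["ANNOTATION-IN-PROGRESS", "ANNOTATION-FINISHED"],
    "label": ["voorbeeld.txt", "ander-voorbeeld.txt"],
})

# Keep only documents whose annotation is finished ("ANNOTATION-FINISHED"
# is an assumed state value, not taken from the blackbar documentation).
finished = documents[documents["state"] == "ANNOTATION-FINISHED"]
print(finished["doc_id"].tolist())  # → [10723]
```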
From these annotations, you can construct a dataset which can be used for modelling.
This is done with the function blackbar_traindata, which splits the annotations into a training and a test dataset, uses the spaCy base model specific to your language to take care of the tokenization, and combines all the annotations in DocBins.
modeldata = blackbar_traindata(anno, base = "nl_core_news_md", train_size = 0.7)
modeldata["train"]
<spacy.tokens._serialize.DocBin object at 0x7fda22bfe890>
modeldata["test"]
<spacy.tokens._serialize.DocBin object at 0x7fda22bfc8e0>
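Conceptually, the train_size = 0.7 argument amounts to shuffling the annotated documents and cutting the list at 70%. The stdlib sketch below illustrates that idea only — the function name and details are illustrative, not the actual blackbar_traindata implementation (which also tokenizes and serializes to DocBins).

```python
import random

def split_traintest(docs, train_size=0.7, seed=42):
    """Shuffle the annotated documents and cut them into train/test sets."""
    docs = list(docs)
    rng = random.Random(seed)   # fixed seed for a reproducible split
    rng.shuffle(docs)
    cutoff = int(len(docs) * train_size)
    return {"train": docs[:cutoff], "test": docs[cutoff:]}

split = split_traintest([f"doc_{i}" for i in range(10)])
print(len(split["train"]), len(split["test"]))  # → 7 3
```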
Train the model
Based on these DocBins, you can either
Build a custom spaCy NER model or
Use some pretrained configurations to quickly build a Named Entity Recognition model. Predefined configurations are available for
a Convolutional Neural Network (CNN)
a Bidirectional LSTM model (BiLSTM)
a Transformer model (Transformer)
The configurations are included in the blackbar-py package: https://github.com/bnosac/blackbar-py/blob/main/src/blackbar/models/config and changes to the parameters of these configurations can be passed on via the options argument.
The model with the best tradeoff between speed and accuracy for running on local infrastructure is the BiLSTM model.
If you want to customize the training of the model, you can pass to deid, via the options argument, a set of training options which will override the default configuration settings of the spaCy model. You can also pass to type the path to a spaCy model configuration file.
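One way to picture how options overrides the default configuration is a plain dictionary merge in which user-supplied keys win. The sketch below is illustrative only: the helper name and the default values are made up for this example, not taken from the blackbar configuration files.

```python
def merge_options(defaults, options=None):
    """Return a copy of defaults with user-supplied options overriding them."""
    merged = dict(defaults)
    merged.update(options or {})
    return merged

# Hypothetical training defaults, not the actual blackbar/spaCy values.
defaults = {"max_epochs": 20, "dropout": 0.1, "eval_frequency": 200}

# Override a single setting; everything else keeps its default.
config = merge_options(defaults, {"max_epochs": 50})
print(config["max_epochs"], config["dropout"])  # → 50 0.1
```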
Run the training as a Docker container
You can also build the model by launching the docker container blackbar-modelling.
docker pull registry.datatailor.be/blackbar-modelling:latest
docker run --rm registry.datatailor.be/blackbar-modelling --help
docker run --rm --env-file .env registry.datatailor.be/blackbar-modelling --help
The following commands will run the CNN/BiLSTM model on all data available in Inception and will save the model on S3.
At a very high level, the process for improving the detection of Personally Identifiable Information (PII) involves the following steps:
Come up with a dataset with labels for PII entities. This is done in Inception.
Get the training data, train one of the default blackbar models and see how good the F2 score is, a metric which combines recall and precision
Inspect how the model behaves on the data which was not used as part of the training run
Inspect how the model behaves (false positives / false negatives) on other texts
Decide whether results could be improved by tweaking the hyperparameters of the model
Collect more training data or generate predictions with the model and send these to Inception or the frontend for evaluation
Possibly use other transfer learning techniques (build embedding models locally and plug these in) or customize the configuration / training.
Iterate
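The F2 score mentioned in the steps above is the F-beta score with beta = 2, which weights recall more heavily than precision — sensible for PII removal, where a missed entity (false negative) is worse than a false alarm. It can be computed from true positives, false positives and false negatives:

```python
def fbeta_score(tp, fp, fn, beta=2.0):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example: 80 correctly found PII entities, 10 false alarms, 20 missed.
print(round(fbeta_score(tp=80, fp=10, fn=20), 3))  # → 0.816
```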
Note
If you are running this locally, make sure to correctly install the SSL certificates of the MinIO server as well as of the server where Inception runs. We use Python to connect to these services, which requires the environment variables REQUESTS_CA_BUNDLE / SSL_CERT_FILE to be set.
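The note above can also be handled from within Python by pointing both variables at the CA bundle before any HTTPS connection is opened; the path below is a placeholder for wherever you stored the certificates on your machine.

```python
import os

# Placeholder path: point this at the CA bundle that contains the
# certificates of your MinIO server and your Inception server.
ca_bundle = "/etc/ssl/certs/ca-certificates.crt"

# requests honours REQUESTS_CA_BUNDLE and the ssl module honours
# SSL_CERT_FILE; set both before opening any connection.
os.environ["REQUESTS_CA_BUNDLE"] = ca_bundle
os.environ["SSL_CERT_FILE"] = ca_bundle
```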