Training

At a very high level, the process for improving the detection of Personally Identifiable Information (PII) involves annotating training data using Inception, as shown here

Get training data

  • Once you have set up and annotated your texts, you can easily fetch the project you’ve set up in Inception and next get your annotations.
from blackbar import inception_list_projects
projects = inception_list_projects()
projects
    project_id                              project_name
0            1                               testproject
1            2                     penningkohieren-aalst
2            3  recommendation-and-active-learning-examp
3            4                            test-sentiment
4            8                             getuigenissen
5           14                             multipleroles
6           19                        getuigenissen-brat
7           20           getuigenissen---brat-annotaties
8           21                                   testuzb
9           37                      verslagen-radiologie
10         115                    verslagen-radiologie-1
11         158                           blackbar-pseudo
12         165                   blackbar-example-pseudo
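  • For instance, to look up the project you want to work with, you can filter this listing; a minimal sketch, assuming projects behaves like a regular pandas DataFrame as the output above suggests
# select the 'testuzb' project shown in the listing above
my_project = projects[projects["project_name"] == "testuzb"]
my_project["project_id"]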
  • In order to get the annotations from Inception, we use the Inception API
  • The following code will give you the list of documents that are on Inception, the annotations performed, as well as the labelled entities for a specific set of projects
from blackbar import blackbar_inception_annotations, blackbar_traindata, deid
#anno = blackbar_inception_annotations(projects = ['getuigenissen'], annotation_state = ["COMPLETE", "IN-PROGRESS"])
anno = blackbar_inception_annotations(projects = ['testuzb'], annotation_state = ["COMPLETE", "IN-PROGRESS"])
anno["docs"]
   project_id doc_id  ...                   state          label
0          21  10722  ...  ANNOTATION-IN-PROGRESS  voorbeeld.txt

[1 rows x 5 columns]
anno["anno"]
   project_id  ...                                                cas
0          21  ...  b'PK\x03\x04\x14\x00\x08\x08\x08\x00\xd8v\xc6Z...

[1 rows x 9 columns]
#anno["entities"]
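  • To get a feel for how much annotated data you have, you can inspect these results; a minimal sketch, assuming anno["docs"] behaves like a regular pandas DataFrame with the columns shown above
# number of documents per project and annotation state
anno["docs"].groupby(["project_id", "state"]).size()
# overview of the documents and their annotation state
anno["docs"][["doc_id", "label", "state"]]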
  • From these annotations, you can construct a dataset which can be used for modelling.
  • This is done with the function blackbar_traindata, which splits the annotations into a training and a test dataset, uses the spaCy base model specific to your language to take care of the tokenization, and combines all annotations into DocBins
modeldata = blackbar_traindata(anno, base = "nl_core_news_md", train_size = 0.7)  
modeldata["train"]
<spacy.tokens._serialize.DocBin object at 0x7fda22bfe890>
modeldata["test"]
<spacy.tokens._serialize.DocBin object at 0x7fda22bfc8e0>
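  • You can inspect what ended up in these DocBins with plain spaCy; a minimal sketch, assuming the nl_core_news_md base model is installed locally
import spacy
nlp_base = spacy.load("nl_core_news_md")
docs = list(modeldata["train"].get_docs(nlp_base.vocab))
len(docs)
# entities annotated in the first training document
[(ent.text, ent.label_) for ent in docs[0].ents]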

Train the model

Based on these DocBins, you can either

  • Build a custom Spacy NER model or

  • Use some pretrained configurations to quickly build a Named Entity Recognition model. Predefined configurations are available for

    • a Convolutional Neural Network (CNN)
    • a Bidirectional LSTM model (BiLSTM)
    • a Transformer model (Transformer)
  • The configurations are shipped with the blackbar-py package: https://github.com/bnosac/blackbar-py/blob/main/src/blackbar/models/config. Changes to the configuration parameters can be passed on via the options argument.

  • The model with the best tradeoff between speed and accuracy for running on local infrastructure is the BiLSTM model

# choose one of the predefined model types: BiLSTM or CNN
model = deid(data = modeldata, type = "BiLSTM")
model = deid(data = modeldata, type = "CNN")
# train it, either with custom training options or with the defaults
nlp = model.train(language = "nl", options = {"training.max_steps": 500}, output_path = "mymodel")
nlp = model.train(language = "nl", output_path = "mymodel")
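
After training, you can check how well the model performs on the held-out test DocBin (see also the general advice below); a minimal sketch using plain spaCy, assuming nlp is the trained pipeline returned by model.train

from spacy.training import Example

# compare the predictions of the trained pipeline with the gold annotations
gold_docs = list(modeldata["test"].get_docs(nlp.vocab))
examples = [Example(nlp(doc.text), doc) for doc in gold_docs]
scores = nlp.evaluate(examples)
scores["ents_p"], scores["ents_r"], scores["ents_f"]   # entity precision / recall / F1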

Save

Once you have the model, save it for deployment under a name of your choice in a bucket on your configured S3/Minio location.

from blackbar import blackbar_s3_upload
blackbar_s3_upload(nlp, "my-model-run-cnn", bucket = "blackbar-models")

This model can then be used in the automation flows or to test out the model on local data, for instance as sketched below.
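
To try out the trained pipeline on local texts, apply it as a regular spaCy pipeline; a minimal sketch with a made-up example sentence

# made-up example text containing PII-like elements
text = "Patiënt Jan Janssens werd op 12/03/2024 opgenomen in het ziekenhuis te Gent."
doc = nlp(text)
[(ent.text, ent.label_) for ent in doc.ents]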

If you want to customize the training of the model, you can pass a set of training options via the options argument; these will override the default configuration settings of the spaCy model. You can also pass the path to a spaCy model configuration file to the type argument of deid.
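
For example, a minimal sketch of both options; the extra hyperparameter shown (training.dropout) is an assumption based on standard spaCy configuration keys and the configuration path is a placeholder

# override some spaCy training settings through options
model = deid(data = modeldata, type = "BiLSTM")
nlp = model.train(language = "nl", options = {"training.max_steps": 1000, "training.dropout": 0.2}, output_path = "mymodel")

# or start from your own spaCy model configuration file
model = deid(data = modeldata, type = "path/to/my-config.cfg")
nlp = model.train(language = "nl", output_path = "mymodel")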

Run the training as a docker container

You can also build the model by launching the docker container blackbar-modelling.

docker pull registry.datatailor.be/blackbar-modelling:latest
docker run --rm registry.datatailor.be/blackbar-modelling --help
docker run --rm --env-file .env registry.datatailor.be/blackbar-modelling --help

The following commands will run the CNN/BiLSTM model training on all data available in Inception and will save the model on S3.

docker run --rm --env-file .env registry.datatailor.be/blackbar-modelling --model_type CNN    --model_base nl_core_news_md --model_language nl --name mymodelname_cnn    --bucket bnosac 
docker run --rm --env-file .env registry.datatailor.be/blackbar-modelling --model_type BiLSTM --model_base nl_core_news_md --model_language nl --name mymodelname_bilstm --bucket bnosac

You can also build models on specific Inception projects and store the dataset on S3 instead of fetching the data from the Inception API each time.

docker run --rm --env-file .env registry.datatailor.be/blackbar-modelling --project_name spoed cardiologie --project_dataset testci.pickle --model_type CNN --model_base nl_core_news_md --model_language nl --name mymodelname --bucket bnosac

General advice

At a very high level, the process for improving the detection of Personally Identifiable Information (PII) involves

  1. Come up with a dataset with labels for PII entities. This is done in Inception.
  2. Get the training data, train one of the default blackbar models and see how good the F2 score is, which combines recall and precision with more weight on recall (see the short sketch after this list)
  3. Inspect how the model behaves on the data which was not used as part of the training run
  4. Inspect how the model behaves (false positives / false negatives) on other texts
  5. Decide whether results could be improved by tweaking the hyperparameters of the model
  6. Collect more training data or generate predictions with the model and send these to Inception or the frontend for evaluation
  7. Possibly use other transfer learning techniques (build embedding models locally and plug these in) or customize the configuration / training.
  8. Iterate
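
As a reminder, the F2 score mentioned in step 2 is the F-beta score with beta = 2, which puts more weight on recall than on precision; a minimal sketch of the formula

def fbeta(precision, recall, beta = 2):
    # F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 2 favours recall
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

fbeta(precision = 0.90, recall = 0.95)   # approximately 0.94
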
Note

If you are running this locally, make sure to correctly install the SSL certificates of the Minio server as well as the server where Inception runs. We use Python to connect to these, which requires the environment variables REQUESTS_CA_BUNDLE / SSL_CERT_FILE to be set.

E.g. as follows

cp cert/* /usr/local/share/ca-certificates/
rm /etc/ssl/certs/ca-certificates.crt && update-ca-certificates
export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt
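
Alternatively, you can set these variables from within your Python session before connecting; a minimal sketch, assuming the same certificate bundle location as above

import os
os.environ["REQUESTS_CA_BUNDLE"] = "/etc/ssl/certs/ca-certificates.crt"
os.environ["SSL_CERT_FILE"] = "/etc/ssl/certs/ca-certificates.crt"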