Get started - anonymization

This section continues from the previous get-started steps and shows how to anonymize text. Make sure you have set the correct paths and credentials for your database and the S3 bucket, and that you have downloaded a model.

from rlike import *
from blackbar import blackbar_s3_download
info = blackbar_s3_download(name = "deid_v2", bucket = "blackbar-models")

5. Anonymize your text

  • Once the model is downloaded, load it with the Blackbar function and anonymize text by replacing each detected entity with _, X or the entity label
  • The result contains the raw text, the anonymized text and the exact locations of the detected entities, based on the deep learning model only
from blackbar import Blackbar, blackbar_example, blackbar_s3_download
deid = Blackbar(info)
text = blackbar_example(type = "document")
print(text)
dagnota : 10/4/2021

Opname D12
RvO/ SAB op onderliggend aneurysma a. communicans anterior - 30/03 coiling gehad.

A/
Geen bijzonderheden.

O/
114/60 - 36.1C - 60/min - 98% zonder O2
Tele zonder bijzonderheden

Labo/
Geen bijzonderheden

TCD 09/04: geen vaatspasmen

P/
- Monitoring tot en met maandag (14 dagen), dan naar A480
 | Uitvoerder: Dr. Jan Janssens |
 | Verantwoordelijke: Dr. Jan Janssens |
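  • Use the deid model to anonymize your text. The type argument selects the replacement: the entity label, _ or X
# Each call below overwrites anno, so only the result of the last call (type = "X") is printed
anno = deid.anonymize(text, type = "entity_label")
anno = deid.anonymize(text, type = "_")
anno = deid.anonymize(text, type = "X", as_data_frame = True)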
print(anno["text"])
dagnota : XXXXXXXXX

Opname D12
RvO/ SAB op onderliggend aneurysma a. communicans anterior - XXXXX coiling gehad.

A/
Geen bijzonderheden.

O/
114/60 - 36.1C - 60/min - 98% zonder O2
Tele zonder bijzonderheden

Labo/
Geen bijzonderheden

TCD XXXXX: geen vaatspasmen

P/
- Monitoring tot en met maandag (14 dagen), dan naar A480
 | Uitvoerder: XXXXXXXXXXXXXXXX |
 | Verantwoordelijke: XXXXXXXXXXXXXXXX |
anno["entities"]
   doc_id  label_    term_search  term_detected     similarity  start  end  length
0  None    03_Datum  None         10/4/2021         None        10     18   9
1  None    03_Datum  None         30/03             None        93     97   5
2  None    03_Datum  None         09/04             None        242    246  5
3  None    01_Naam   None         Dr. Jan Janssens  None        343    358  16
4  None    01_Naam   None         Dr. Jan Janssens  None        384    399  16
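The start and end columns above appear to be 0-based character offsets with an inclusive end (for the first entity, 18 - 10 + 1 = 9, which equals length). As a minimal sketch under that assumption, the snippet below masks spans in a string the same way the X output does. mask_entities is a hypothetical helper written for illustration; it is not part of blackbar.

```python
def mask_entities(text, spans, fill="X"):
    """Replace every character inside each (start, end) span with `fill`.

    Assumes 0-based offsets with an inclusive end, as the example table suggests.
    """
    chars = list(text)
    for start, end in spans:
        for i in range(start, end + 1):
            chars[i] = fill
    return "".join(chars)


# First lines of the example document and the offsets of its first two dates
text = ("dagnota : 10/4/2021\n\nOpname D12\n"
        "RvO/ SAB op onderliggend aneurysma a. communicans anterior - 30/03 coiling gehad.")
spans = [(10, 18), (93, 97)]

masked = mask_entities(text, spans)
print(masked)
```

Because the mask has the same length as the original span, the offsets of later entities stay valid after earlier ones are replaced.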
  • You can also anonymize a pandas data frame that holds a single document. The data frame must have at least the columns doc_id and text.
from blackbar import Blackbar, blackbar_example, deid_anonymize
import pandas as pd
deid = Blackbar(info)
text = blackbar_example(type = "document")
docs = pd.DataFrame({"doc_id": ["abc"], "text": [text]})
anno = deid.anonymize_dataframe(docs, type = "entity_label")
anno = deid.anonymize_dataframe(docs, type = "_")
anno = deid.anonymize_dataframe(docs, type = "X")
anno["text"]
anno["entities"]
   doc_id  model  label_    term_detected     start  end  length
0  abc     nlp    03_Datum  10/4/2021         10     18   9
1  abc     nlp    03_Datum  30/03             93     97   5
2  abc     nlp    03_Datum  09/04             242    246  5
3  abc     nlp    01_Naam   Dr. Jan Janssens  343    358  16
4  abc     nlp    01_Naam   Dr. Jan Janssens  384    399  16
  • If you have several documents, use deid_anonymize to anonymize the whole set. The results are put in a new column called textCvt, which contains a JSON string with all the entities and the anonymized text.
# If you have a data frame with more than one record, use deid_anonymize to anonymize these
docs = pd.DataFrame({"doc_id": [1, 2], "text": [text, text]})
anno = deid_anonymize(deid, docs, type = "_", extended = False)
anno
   doc_id  text                                               textCvt
0  1       dagnota : 10/4/2021\n\nOpname D12\nRvO/ SAB op...  {"text_raw": "dagnota : 10/4/2021\n\nOpname D1...
1  2       dagnota : 10/4/2021\n\nOpname D12\nRvO/ SAB op...  {"text_raw": "dagnota : 10/4/2021\n\nOpname D1...
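Since textCvt holds a JSON string, each cell can be decoded with the standard json module. The sketch below uses a stand-in string rather than real blackbar output: only the text_raw key is taken from the output above, and the remaining fields are truncated there, so the example lists the available keys instead of assuming their names.

```python
import json

# Stand-in for one cell of anno["textCvt"]; only the "text_raw" key is known
# from the output above, the real JSON holds more fields that are truncated there.
text_cvt = json.dumps({"text_raw": "dagnota : 10/4/2021\n\nOpname D12"})

record = json.loads(text_cvt)      # JSON string -> Python dict
available_keys = sorted(record)    # inspect which fields are present
first_line = record["text_raw"].splitlines()[0]

print(available_keys)
print(first_line)
```

On a real result, the same decoding could be applied to the whole column, for example with anno["textCvt"].apply(json.loads).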

6. Anonymize your text in the database

  • If your data is in the format explained in the database tutorial, you can extend the anonymization by also detecting known names/addresses in these documents with Smith-Waterman alignment. This applies the deep learning model as well as the Smith-Waterman alignment, and this hybrid approach improves the detection of the entities.
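To give an idea of why Smith-Waterman helps with known names, the sketch below implements the textbook local-alignment score in plain Python. It is not blackbar's implementation: the function name, the scoring values (match 2, mismatch -1, gap -1) and the lower-casing of the inputs are all assumptions chosen for illustration.

```python
def smith_waterman(query, text, match=2, mismatch=-1, gap=-1):
    """Return (best_score, end) of the best local alignment of query in text.

    `end` is the exclusive 0-based position in `text` where the alignment stops.
    Scoring values are illustrative assumptions, not blackbar's settings.
    """
    cols = len(text) + 1
    prev = [0] * cols
    best, best_end = 0, 0
    for i in range(1, len(query) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            step = match if query[i - 1] == text[j - 1] else mismatch
            # Local alignment: a cell never drops below 0
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best, best_end = curr[j], j
        prev = curr
    return best, best_end


known_name = "janssens"
exact_score, exact_end = smith_waterman(known_name, "dr. jan janssens")
typo_score, typo_end = smith_waterman(known_name, "dr. jan jansens")
print(exact_score, exact_end)
print(typo_score, typo_end)
```

An exact occurrence reaches the maximum score of 2 × len(known_name), while a misspelled occurrence still scores close to it, so a threshold on the score can catch name variants that an exact string search would miss.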

Example with query on the database

  • Connect to the database
  • Get the documents, together with the physicians the patient has been in contact with, using read_documents with type "deid"
from blackbar import BlackbarDB, Blackbar, blackbar_s3_download, deid_anonymize
from rlike import *
db   = BlackbarDB('test')
docs = db.read_documents(ids = [1, 2, 3], type = "deid")
  • Get the model + anonymize the documents
info = blackbar_s3_download(name = "deid_v2", bucket = "blackbar-models")
deid = Blackbar(info)
anno = deid_anonymize(deid, docs, type = "_", extended = True)
anno[["ID", "text", "textCvt"]]
   ID  text                                               textCvt
0  1   dagnota : 10/4/2021\n\nOpname D12\nRvO/ SAB op...  {"doc_id": "1", "text_raw": "dagnota : 10/4/20...
1  2   dagnota : 10/4/2021\n\nOpname D12\nRvO/ SAB op...  {"doc_id": "2", "text_raw": "dagnota : 10/4/20...
2  3   \n\n\n\n\n\n\nDR. GAME KEEPERS\n\nElektronisch...  {"doc_id": "3", "text_raw": "\n\n\n\n\n\n\nDR....
  • Store the results in the database in the textCvt column with textCvtStatus set to 1
db.update_anonimisation(anno, status = 1, type = "anonymization")

Example with test data showing the logic of integrating the available information

The resulting dataset contains:

  • the text
  • the known names/addresses of the patient
  • the known names/addresses of the physicians the patient has ever been in contact with

from blackbar import BlackbarDB, Blackbar, blackbar_s3_download, blackbar_example, deid_anonymize, deid_enrich_identifiers
from rlike import *
## Get example documents & physicians which the patient got into contact with
docs      = blackbar_example('documents')
docs      = deid_enrich_identifiers(docs, type = "patients")
physician = blackbar_example('patients_physicians')
physician = deid_enrich_identifiers(physician, type = "patients_physicians")
docs      = docs.merge(physician, how = "left", left_on = "patientId", right_on = "pat_dos_nr")
## Get the model + Anonymize the documents as follows 
info = blackbar_s3_download(name = "deid_v2", bucket = "blackbar-models")
deid = Blackbar(info)
anno = deid_anonymize(deid, docs, type = "_", extended = True)
anno
   ID  doc_id  patientId        performingPhysicianId  responsiblePhysicianId  pat_vnaam  pat_fam_naam  pat_adres         pat_post_nr  pat_gemeente  ...
0  1   1       Y871129di17X     JANSSENS, JAN          None                    Mehdi      Olek          Grote Beek 9      1000         Brussel       ...
1  2   2       Y871129di17X     PEETERS, PIET          JANSSENS, JAN           Jos        Vermeulen     Laarbeeklaan 99   1090         Jette         ...
2  3   3       A948023ZRXYZ98M  Mr. T. De Groote       Prof. Rodelbaan         Roodel     Bahn          Rue du spoed 123  6543         Wemmel        ...

   pat_anvn       pat_vnan       pat_vnanafk   pat_adresgegevens             pat_postcode_gemeente  pat_dos_nr
0  Olek Mehdi     Mehdi Olek     M. Olek       Grote Beek 9 1000 Brussel     1000 Brussel           Y871129di17X
1  Vermeulen Jos  Jos Vermeulen  J. Vermeulen  Laarbeeklaan 99 1090 Jette    1090 Jette             Y871129di17X
2  Bahn Roodel    Roodel Bahn    R. Bahn       Rue du spoed 123 6543 Wemmel  6543 Wemmel            A948023ZRXYZ98M

   physician_adres                                    physician_naam_kort           physician_naam_lang                                textCvt
0  [Stormy Daneelsstraat 125, Stationsstraat 321,...  [Janssens, Peeters]           [J. Janssens, P. Peeters, Janssens Jan, Peeter...  {"doc_id": "1", "text_raw": "dagnota : 10/4/20...
1  [Stormy Daneelsstraat 125, Stationsstraat 321,...  [Janssens, Peeters]           [J. Janssens, P. Peeters, Janssens Jan, Peeter...  {"doc_id": "2", "text_raw": "dagnota : 10/4/20...
2  [BRUSSELSESTEENWEG 123, BRUSSELSESTEENWEG 321,...  [De Groote, Rodel, Janssens]  [T. De Groote, B. Rodel, J. Janssens, De Groot...  {"doc_id": "3", "text_raw": "\n\n\n\n\n\n\nDR....

3 rows × 32 columns
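The merge step in the example above attaches the physician information to every document of the same patient. The sketch below reproduces that left join on two tiny hand-made data frames; the column names patientId, pat_dos_nr and physician_naam_kort come from the output above, while the values P1 and P2 are made up for illustration.

```python
import pandas as pd

# Hypothetical documents: two for patient P1, one for patient P2
docs = pd.DataFrame({
    "doc_id": [1, 2, 3],
    "patientId": ["P1", "P1", "P2"],
})

# Hypothetical physician lookup: one row per patient, known names as a list
physician = pd.DataFrame({
    "pat_dos_nr": ["P1"],
    "physician_naam_kort": [["Janssens", "Peeters"]],
})

# Left join: every document is kept, physician columns are filled where a match exists
merged = docs.merge(physician, how="left", left_on="patientId", right_on="pat_dos_nr")
print(merged)
```

With how = "left", a document whose patient has no physician record is kept with missing values in the physician columns, so it can still be anonymized by the model alone.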