Get started - anonymization

This section continues from the previous steps and shows how to anonymize text. Make sure you have set the correct paths and credentials for your database and the S3 bucket, and that you have downloaded a model.

from rlike import *
from blackbar import blackbar_s3_download
info = blackbar_s3_download(name = "deid_v2", bucket = "blackbar-models")

5. Anonymize your text

Once the model is downloaded, load it with the function Blackbar and anonymize text by replacing each detected entity with an _, an X or the entity label.
The result contains the raw text, the anonymized text and the exact locations of the detected entities. At this stage the detection is based on the deep learning model only.

from blackbar import Blackbar, blackbar_example, blackbar_s3_download
deid = Blackbar(info)
text = blackbar_example(type = "document")
print(text)

dagnota : 10/4/2021

Opname D12
RvO/ SAB op onderliggend aneurysma a. communicans anterior - 30/03 coiling gehad.

A/
Geen bijzonderheden.

O/
114/60 - 36.1C - 60/min - 98% zonder O2
Tele zonder bijzonderheden

Labo/
Geen bijzonderheden

TCD 09/04: geen vaatspasmen

P/
- Monitoring tot en met maandag (14 dagen), dan naar A480
 | Uitvoerder: Dr. Jan Janssens |
 | Verantwoordelijke: Dr. Jan Janssens |

Use the deid model to anonymize your text. The type argument selects what each detected entity is replaced with: the entity label, an _ or an X.

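# type selects the replacement: the entity label, "_" or "X".
# Each call overwrites anno, so the output below comes from the last call.
anno = deid.anonymize(text, type = "entity_label")
anno = deid.anonymize(text, type = "_")
anno = deid.anonymize(text, type = "X", as_data_frame = True)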
print(anno["text"])

dagnota : XXXXXXXXX

Opname D12
RvO/ SAB op onderliggend aneurysma a. communicans anterior - XXXXX coiling gehad.

A/
Geen bijzonderheden.

O/
114/60 - 36.1C - 60/min - 98% zonder O2
Tele zonder bijzonderheden

Labo/
Geen bijzonderheden

TCD XXXXX: geen vaatspasmen

P/
- Monitoring tot en met maandag (14 dagen), dan naar A480
 | Uitvoerder: XXXXXXXXXXXXXXXX |
 | Verantwoordelijke: XXXXXXXXXXXXXXXX |

anno["entities"]

	doc_id	label_	term_search	term_detected	similarity	start	end	length
0	None	03_Datum	None	10/4/2021	None	10	18	9
1	None	03_Datum	None	30/03	None	93	97	5
2	None	03_Datum	None	09/04	None	242	246	5
3	None	01_Naam	None	Dr. Jan Janssens	None	343	358	16
4	None	01_Naam	None	Dr. Jan Janssens	None	384	399	16
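The start and end columns in the table above appear to be 0-based character offsets with an inclusive end, so that length = end - start + 1. The following minimal sketch shows how such offsets can be turned into a masked text. It is an illustration only, not the blackbar implementation: the helper mask_entities and its replace_with parameter are hypothetical names.

```python
def mask_entities(text, entities, replace_with="X"):
    """Mask entity spans given as 0-based offsets with an inclusive end (assumed convention)."""
    out = text
    # Work from the last entity to the first so earlier offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        start, stop = ent["start"], ent["end"] + 1
        if replace_with == "entity_label":
            replacement = ent["label_"]
        else:
            # One replacement character per original character
            replacement = replace_with * (stop - start)
        out = out[:start] + replacement + out[stop:]
    return out

text = "dagnota : 10/4/2021"
entities = [{"label_": "03_Datum", "start": 10, "end": 18}]

masked = mask_entities(text, entities, replace_with="X")
labelled = mask_entities(text, entities, replace_with="entity_label")
print(masked)
print(labelled)
```

With the first entity of the table, the masked line matches the first line of the anonymized output shown earlier. The exact form blackbar uses when inserting an entity label is not shown in this section, so the labelled variant is a guess.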

You can also anonymize a pandas data frame that contains a single document. The data frame must have at least the columns doc_id and text.

from blackbar import Blackbar, blackbar_example, deid_anonymize
import pandas as pd
deid = Blackbar(info)
text = blackbar_example(type = "document")
docs = pd.DataFrame({"doc_id": ["abc"], "text": [text]})
anno = deid.anonymize_dataframe(docs, type = "entity_label")
anno = deid.anonymize_dataframe(docs, type = "_")
anno = deid.anonymize_dataframe(docs, type = "X")
anno["text"]
anno["entities"]

	doc_id	model	label_	term_detected	start	end	length
0	abc	nlp	03_Datum	10/4/2021	10	18	9
1	abc	nlp	03_Datum	30/03	93	97	5
2	abc	nlp	03_Datum	09/04	242	246	5
3	abc	nlp	01_Naam	Dr. Jan Janssens	343	358	16
4	abc	nlp	01_Naam	Dr. Jan Janssens	384	399	16

If you have several documents, use deid_anonymize to anonymize the whole set. The results are stored in a new column called textCvt, which holds a JSON string with all the entities and the anonymized text.

# If you have a data frame with more than one record, use deid_anonymize to anonymize these
docs = pd.DataFrame({"doc_id": [1, 2], "text": [text, text]})
anno = deid_anonymize(deid, docs, type = "_", extended = False)
anno

	doc_id	text	textCvt
0	1	dagnota : 10/4/2021\n\nOpname D12\nRvO/ SAB op...	{"text_raw": "dagnota : 10/4/2021\n\nOpname D1...
1	2	dagnota : 10/4/2021\n\nOpname D12\nRvO/ SAB op...	{"text_raw": "dagnota : 10/4/2021\n\nOpname D1...
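Because textCvt holds a JSON string, each cell can be decoded with the standard json module. The sketch below uses a hand-written stand-in for one cell. Only the text_raw key is visible in the output above; the text and entities keys and their layout are assumptions made for illustration, so inspect the keys of a real record before relying on them.

```python
import json

# Stand-in for one textCvt cell. "text_raw" appears in the output above;
# the "text" and "entities" keys are assumptions for illustration.
text_cvt = json.dumps({
    "text_raw": "dagnota : 10/4/2021",
    "text": "dagnota : XXXXXXXXX",
    "entities": [{"label_": "03_Datum", "start": 10, "end": 18}],
})

record = json.loads(text_cvt)
keys = sorted(record.keys())
print(keys)

# Recover each detected term from the raw text (inclusive end offset assumed)
detected = [
    (ent["label_"], record["text_raw"][ent["start"]:ent["end"] + 1])
    for ent in record["entities"]
]
print(detected)
```

On a real result, the same json.loads call would be applied to a cell such as anno["textCvt"][0].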

6. Anonymize your text in the database

If you have data in the format explained in the database tutorial, you can extend the anonymization by also detecting known names and addresses in these documents with Smith-Waterman alignment. The deep learning model and the Smith-Waterman alignment are then applied together. This hybrid approach improves the detection of the entities.
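To give a feel for the Smith-Waterman step, here is a minimal sketch of local alignment scoring between a known name and a text fragment. It is not the blackbar implementation: the function name smith_waterman, the scoring parameters (match, mismatch, gap), the lower-casing and the normalization by the maximum score are all assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a against b, plus the end offset in b (exclusive)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    score = [[0] * cols for _ in range(rows)]
    best, best_end = 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_end = score[i][j], j
    return best, best_end

known_name = "janssens"  # hypothetical known identifier, lower-cased
exact_score, exact_end = smith_waterman(known_name, "dr. jan janssens")
typo_score, typo_end = smith_waterman(known_name, "dr. jan jansens")

# Normalize by the score of a perfect match to get a value between 0 and 1
max_score = 2 * len(known_name)
print(exact_score / max_score, typo_score / max_score)
```

A score close to the maximum suggests that the known name occurs in the text, even when it is misspelled, which is something an exact string search would miss. How blackbar scores, normalizes or thresholds the alignment is not covered in this section.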

Example with query on the database

- Connect to the database
- Get the documents and the physicians the patient came into contact with, using read_documents with type = "deid"

from blackbar import BlackbarDB, Blackbar, blackbar_s3_download, deid_anonymize
from rlike import *
db   = BlackbarDB('test')
docs = db.read_documents(ids = [1, 2, 3], type = "deid")

Get the model + anonymize the documents

info = blackbar_s3_download(name = "deid_v2", bucket = "blackbar-models")
deid = Blackbar(info)
anno = deid_anonymize(deid, docs, type = "_", extended = True)
anno[["ID", "text", "textCvt"]]

	ID	text	textCvt
0	1	dagnota : 10/4/2021\n\nOpname D12\nRvO/ SAB op...	{"doc_id": "1", "text_raw": "dagnota : 10/4/20...
1	2	dagnota : 10/4/2021\n\nOpname D12\nRvO/ SAB op...	{"doc_id": "2", "text_raw": "dagnota : 10/4/20...
2	3	\n\n\n\n\n\n\nDR. GAME KEEPERS\n\nElektronisch...	{"doc_id": "3", "text_raw": "\n\n\n\n\n\n\nDR....

Store the results in the database in the textCvt column with textCvtStatus set to 1

db.update_anonimisation(anno, status = 1, type = "anonymization")

Example with test data showing the logic of integrating the available information

The resulting dataset contains:
- the text
- the known names/addresses of the patient
- the known names/addresses of the physicians the patient has ever been in contact with

from blackbar import BlackbarDB, Blackbar, blackbar_s3_download, blackbar_example, deid_anonymize, deid_enrich_identifiers
from rlike import *
## Get example documents & physicians which the patient got into contact with
docs      = blackbar_example('documents')
docs      = deid_enrich_identifiers(docs, type = "patients")
physician = blackbar_example('patients_physicians')
physician = deid_enrich_identifiers(physician, type = "patients_physicians")
docs      = docs.merge(physician, how = "left", left_on = "patientId", right_on = "pat_dos_nr")
## Get the model + Anonymize the documents as follows 
info = blackbar_s3_download(name = "deid_v2", bucket = "blackbar-models")
deid = Blackbar(info)
anno = deid_anonymize(deid, docs, type = "_", extended = True)
anno

	ID	doc_id	patientId	performingPhysicianId	responsiblePhysicianId	pat_vnaam	pat_fam_naam	pat_adres	pat_post_nr	pat_gemeente	...	pat_anvn	pat_vnan	pat_vnanafk	pat_adresgegevens	pat_postcode_gemeente	pat_dos_nr	physician_adres	physician_naam_kort	physician_naam_lang	textCvt
0	1	1	Y871129di17X	JANSSENS, JAN	None	Mehdi	Olek	Grote Beek 9	1000	Brussel	...	Olek Mehdi	Mehdi Olek	M. Olek	Grote Beek 9 1000 Brussel	1000 Brussel	Y871129di17X	[Stormy Daneelsstraat 125, Stationsstraat 321,...	[Janssens, Peeters]	[J. Janssens, P. Peeters, Janssens Jan, Peeter...	{"doc_id": "1", "text_raw": "dagnota : 10/4/20...
1	2	2	Y871129di17X	PEETERS, PIET	JANSSENS, JAN	Jos	Vermeulen	Laarbeeklaan 99	1090	Jette	...	Vermeulen Jos	Jos Vermeulen	J. Vermeulen	Laarbeeklaan 99 1090 Jette	1090 Jette	Y871129di17X	[Stormy Daneelsstraat 125, Stationsstraat 321,...	[Janssens, Peeters]	[J. Janssens, P. Peeters, Janssens Jan, Peeter...	{"doc_id": "2", "text_raw": "dagnota : 10/4/20...
2	3	3	A948023ZRXYZ98M	Mr. T. De Groote	Prof. Rodelbaan	Roodel	Bahn	Rue du spoed 123	6543	Wemmel	...	Bahn Roodel	Roodel Bahn	R. Bahn	Rue du spoed 123 6543 Wemmel	6543 Wemmel	A948023ZRXYZ98M	[BRUSSELSESTEENWEG 123, BRUSSELSESTEENWEG 321,...	[De Groote, Rodel, Janssens]	[T. De Groote, B. Rodel, J. Janssens, De Groot...	{"doc_id": "3", "text_raw": "\n\n\n\n\n\n\nDR....

3 rows × 32 columns
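Judging only from the columns in the output above, the enrichment step seems to derive several spelling variants of the patient's name and address, which can then serve as search terms. The sketch below reproduces the values of row 0. The helper name_variants and its logic are assumptions inferred from that output, not the actual deid_enrich_identifiers code.

```python
def name_variants(first_name, last_name, street, postal_code, city):
    """Build identifier variants similar to the enriched columns shown above (assumed logic)."""
    return {
        "pat_anvn": f"{last_name} {first_name}",          # family name first
        "pat_vnan": f"{first_name} {last_name}",          # first name first
        "pat_vnanafk": f"{first_name[0]}. {last_name}",   # abbreviated first name
        "pat_adresgegevens": f"{street} {postal_code} {city}",
        "pat_postcode_gemeente": f"{postal_code} {city}",
    }

variants = name_variants("Mehdi", "Olek", "Grote Beek 9", "1000", "Brussel")
for column, value in variants.items():
    print(column, "->", value)
```

Each variant is a candidate term for the alignment step, so a document that mentions "M. Olek" or "Olek Mehdi" can be matched even though the database stores the first name and family name separately.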