Anonymization and Pseudonymization
High-level functionality for anonymization:
- class Blackbar and the methods anonimise, anonimise_extended
- deid_anonimise_dataframe
High-level functionality for pseudonymization:
- Pseudonymization
- PseudoGenerator (generate pseudo text)
- anonimisation_entities (extract the anonymized entities)
- pseudo_replacements, txt_mimic_readability (replacement functions that stick as closely as possible to the original layout)
- Utility functions which help with the pseudonymization
  - txt_insert, txt_freq, txt_leading_trailing
  - txt_n_capital, txt_contains, txt_contains_lot_of_capitals, txt_n_newlines
from rlike import *
from blackbar.s3 import blackbar_s3_download
from blackbar.data import blackbar_example
from blackbar.uzb import Blackbar, deid_anonimise_dataframe
from blackbar.pseudonymisation import PseudoGenerator, anonimisation_entities
from blackbar.pseudotexts import pseudo_replacements
from blackbar.database import parse_anonimisation_json
Sys_setenv({
"BLACKBAR_S3_ENDPOINT": "blackbar.datatailor.be",
"BLACKBAR_S3_ACCESS_KEY_ID": "XXXXXXXXXX",
"BLACKBAR_S3_SECRET_ACCESS_KEY": "XXXXXXXXXX",
"BLACKBAR_S3_MODEL_BUCKET": "blackbar-models",
"BLACKBAR_S3_MODEL_NAME": "deid_v2"
})
info = blackbar_s3_download(name = Sys_getenv("BLACKBAR_S3_MODEL_NAME"), bucket = Sys_getenv("BLACKBAR_S3_MODEL_BUCKET"), folder = tempdir())
deid = Blackbar(info["folder"])
text = blackbar_example("example.txt")
anno = deid.anonimise(text, type = "_", as_data_frame = True)
entities = anno["entities"]
anno["text"]
Anonymization
blackbar.deid_smith_waterman(data, log=False, alignment_method='default')
Extract names/addresses based on Smith-Waterman, either for patient names or for names of physicians that the patient got into contact with.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Series | a selection of 1 row from a pandas dataframe that combines the enriched patient identifiers as extracted with deid_enrich_identifiers(type = 'patients') and the identifiers of the physicians the patient got into contact with as extracted with deid_enrich_identifiers(type = 'patients_physicians') | required |
log | bool | whether to print information when the alignment fails | False |
alignment_method | str | either 'default' or 'biopython', passed on to Smith_Waterman | 'default' |
Returns:
Type | Description |
---|---|
A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similariy, start, end, length with the detected entities and the exact positions in the text |
Examples:
>>> from blackbar import deid_enrich_identifiers, blackbar_example, deid_smith_waterman
>>> from rlike import *
>>> x = blackbar_example('patients_physicians')
>>> pats = deid_enrich_identifiers(x, type = "patients_physicians")
>>> docs = blackbar_example('documents')
>>> docs = deid_enrich_identifiers(docs, type = "patients")
>>> docs = docs.merge(pats, how = "left", left_on = "patientId", right_on = "pat_dos_nr")
>>> sw = deid_smith_waterman(docs.iloc[0], alignment_method = "default")
>>> sw = deid_smith_waterman(docs.iloc[0], alignment_method = "biopython")
>>> sw = deid_smith_waterman(docs.iloc[1], alignment_method = "biopython")
>>> data = docs.iloc[0]
>>> ents = deid_smith_waterman(data, alignment_method = "biopython")
>>> ents["term_detected_start_end"] = substr(data["text"], list(ents["start"]), list(ents["end"]))
>>> sum(list(ents["term_detected_start_end"] != ents["term_detected"]))
0
>>> sum(nchar(ents["term_detected"]) != ents["length"])
0
>>> ents = deid_smith_waterman(data, alignment_method = "default")
>>> ents["term_detected_start_end"] = substr(data["text"], list(ents["start"]), list(ents["end"]))
>>> sum(list(ents["term_detected_start_end"] != ents["term_detected"]))
0
>>> sum(nchar(ents["term_detected"]) != ents["length"])
0
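For readers unfamiliar with the technique: Smith-Waterman is a local sequence alignment algorithm, which is why a known name can still be located in a text when it is spelled slightly differently. The sketch below is a textbook character-level version and is not blackbar's implementation; the function name local_align and the scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
def local_align(query, text, match=2, mismatch=-1, gap=-1):
    """Return (score, start, end) of the best local alignment of query in text.

    start/end are Python slice positions into text (end is exclusive).
    Hypothetical helper for illustration; comparison is case-insensitive.
    """
    rows, cols = len(query) + 1, len(text) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # Substitution score for query[i-1] against text[j-1]
        return match if query[i - 1].lower() == text[j - 1].lower() else mismatch

    # Fill the scoring matrix; scores never drop below 0 (local alignment)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub(i, j),  # match or mismatch
                score[i - 1][j] + gap,            # character missing in text
                score[i][j - 1] + gap,            # extra character in text
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches 0
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


text = "Patient Jan Jansens werd gezien."
score, start, end = local_align("Jan Janssens", text)
detected = text[start:end]  # the misspelled name as it occurs in the text
```

Here the search term 'Jan Janssens' is matched against the misspelled 'Jan Jansens', and start/end give the exact position of the detected term, similar in spirit to the start and end columns described above.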
blackbar.deid_anonymize(model, x=None, type='X', extended=False, log=False, output_structure='v3')
Anonymize the text in a pandas dataframe containing
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | Blackbar | an object of type Blackbar as returned by Blackbar | required |
x | DataFrame | a pandas dataframe with at least the columns doc_id and text. If extended is True, this dataframe should also include the following columns, as obtained by merging the data returned by deid_enrich_identifiers for type 'patients' and 'patients_physicians': patientId; pat_anvn, pat_vnan, pat_vnanafk, pat_vnaam, pat_fam_naam; pat_adresgegevens, pat_postcode_gemeente, pat_adres; performingPhysicianId, performing_dr_anvn, performing_dr_vnan, performing_dr_vnanafk, performing_dr_an, performing_dr_vn; responsiblePhysicianId, responsible_dr_anvn, responsible_dr_vnan, responsible_dr_vnanafk, responsible_dr_an, responsible_dr_vn | None |
extended | bool | whether to perform extended anonymization: True uses the NLP model plus Smith-Waterman, False uses only the NLP model | False |
log | bool | whether to print an error when smith_waterman fails for an unknown reason | False |
output_structure | str | type of json output structure. Possible values are 'v3' or 'v4', where 'v4' does not include the dictionary element text | 'v3' |
Returns:
Type | Description |
---|---|
the pandas dataframe x with an extra column called textCvt, which contains the anonymized information put in a json structure. |
This json structure has the following elements:
- doc_id (if extended = True)
- text_raw (original text), text (anonymized text)
- entities (detected entities), n (number of detected entities)
- smith_waterman (entities found by Smith-Waterman)
- nlp (entities found by the deep learning model)
- logic (either smith_waterman|nlp or nlp)
- version (version of the output structure)
- model_name (the name of the nlp model used)
- timing (dictionary with start/end/duration indicating the time it took to perform the anonymization)
Examples:
>>> from blackbar import blackbar_example, deid_anonymize, deid_enrich_identifiers, Blackbar, blackbar_s3_download
>>> ## Get example documents & physicians which the patient got into contact with
>>> docs = blackbar_example('documents')
>>> docs = deid_enrich_identifiers(docs, type = "patients")
>>> physician = blackbar_example('patients_physicians')
>>> physician = deid_enrich_identifiers(physician, type = "patients_physicians")
>>> docs = docs.merge(physician, how = "left", left_on = "patientId", right_on = "pat_dos_nr")
>>> ## Get the model + Anonymize the documents as follows
>>> info = blackbar_s3_download(name = "deid_v2", bucket = "blackbar-models")
>>> deid = Blackbar(info)
>>> anno = deid_anonymize(deid, docs, type = "_", extended = True)
>>> anno = anno[["text", "textCvt"]]
>>> anno = deid_anonymize(deid, docs, type = "_", extended = True, output_structure = "v4")
>>> anno = anno[["text", "textCvt"]]
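The elements of the textCvt structure can be read back with the standard json module. The sketch below assumes that a textCvt cell holds a JSON string (a dict is also accepted); the record is made up for illustration and only uses element names from the Returns description, and summarize_cvt is a hypothetical helper, not part of blackbar.

```python
import json

# Hypothetical record: the element names come from the Returns description,
# the values are invented for illustration only.
example_cvt = json.dumps({
    "doc_id": "doc-1",
    "text_raw": "Patient Jan Janssens werd gezien.",
    "text": "Patient ____________ werd gezien.",
    "n": 1,
    "logic": "smith_waterman|nlp",
    "version": "v3",
})


def summarize_cvt(cvt):
    """Pick a few elements out of one textCvt value (JSON string or dict)."""
    parsed = json.loads(cvt) if isinstance(cvt, str) else dict(cvt)
    return {
        "n": parsed.get("n", 0),
        "logic": parsed.get("logic"),
        # 'text' is absent when output_structure = 'v4', hence .get()
        "text": parsed.get("text"),
    }


summary = summarize_cvt(example_cvt)
```

Under the same assumption, something like anno["textCvt"].apply(summarize_cvt) would give one summary per document.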
blackbar.deid_enrich_identifiers(data, type='patients')
Create combinations of name/address identifiers which can be used for lookup with Smith-Waterman, either for patient names or for names of physicians that the patient got into contact with.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | DataFrame | a pandas dataframe with the following fields. If type is 'patients_physicians': pat_dos_nr, hcact_vnaam, hcact_fam_naam, hcact_adres, hcact_post_nr, hcact_gemeente. If type is 'patients': performingPhysicianId, responsiblePhysicianId and pat_vnaam, pat_fam_naam, pat_adres, pat_post_nr, pat_gemeente | required |
type | str | either 'patients' or 'patients_physicians' | 'patients' |
Returns:
Type | Description |
---|---|
If type is 'patients_physicians', returns a pandas dataframe with 1 row per patient and columns pat_dos_nr, physician_adres, physician_naam_kort, physician_naam_lang.
- The columns physician_adres, physician_naam_kort, physician_naam_lang each contain a list covering all physicians the patient got into contact with, their addresses and the different ways of writing these.
- The address variants are constructed as:
  - hcact_adres + hcact_post_nr + hcact_gemeente
  - hcact_post_nr + hcact_gemeente
- The name variants are constructed as (note: the first name hcact_vnaam on its own is not used):
  - hcact_fam_naam
  - hcact_vnaam + hcact_fam_naam
  - hcact_fam_naam + hcact_vnaam
  - first letter of hcact_vnaam. + hcact_fam_naam (e.g. J. Janssens)

If type is 'patients', returns the pandas DataFrame data with the following extra columns added/changed.
- Physician info:
  - performingPhysicianId, responsiblePhysicianId: harmonised by removing the 'Prof/Dr/Mevr/Mej/AP' prefixes people put before their names
  - performing_dr_anvn, performing_dr_vnan, performing_dr_vnanafk, performing_dr_an, performing_dr_vn (combining first name / last name based on information in performingPhysicianId)
  - responsible_dr_anvn, responsible_dr_vnan, responsible_dr_vnanafk, responsible_dr_an, responsible_dr_vn (combining first name / last name based on information in responsiblePhysicianId)
- Patient info:
  - pat_anvn, pat_vnan, pat_vnanafk, pat_vnaam, pat_fam_naam (first name / last name combinations and J. Janssens)
  - pat_adresgegevens, pat_postcode_gemeente, pat_adres
Examples:
>>> from blackbar import deid_enrich_identifiers, PseudoGenerator, blackbar_example
>>> import pandas as pd
>>> pseudo = PseudoGenerator()
>>> x = pd.DataFrame({"pat_dos_nr": pseudo.generate(type = "ID_Patient", n = 1), "hcact_vnaam": ["Jan", "Piet"], "hcact_fam_naam": ["Janssens", "Peeters"], "hcact_adres": ["Stormy Daneelsstraat 125", "Stationsstraat 321"], "hcact_post_nr": ["1000", "1090"], "hcact_gemeente": ["Brussel", "Jette"]})
>>> x = blackbar_example('patients_physicians')
>>> d = deid_enrich_identifiers(x, type = "patients_physicians")
>>> d = d[d["pat_dos_nr"] == "Y871129di17X"].reset_index()
>>> d["physician_naam_lang"][0]
['J. Janssens', 'P. Peeters', 'Janssens Jan', 'Peeters Piet', 'Jan Janssens', 'Piet Peeters']
>>> d["physician_adres"][0]
['Stormy Daneelsstraat 125', 'Stationsstraat 321', 'Stormy Daneelsstraat 125 1000 Brussel', 'Stationsstraat 321 1090 Jette', '1000 Brussel', '1090 Jette']
>>> x = pd.DataFrame({"patientId": pseudo.generate(type = "ID_Patient", n = 1), "performingPhysicianId": ["PROF.DR. JANSSENS, JAN", "DR. PEETERS, PIET"], "responsiblePhysicianId": [None, "PROF.DR. JANSSENS, JAN"], "pat_vnaam": ["Mehdi", "Jos"], "pat_fam_naam": ["Olek", "Vermeulen"], "pat_adres": ["Grote Beek 9", "Laarbeeklaan 99"], "pat_post_nr": ["1000", "1090"], "pat_gemeente": ["Brussel", "Jette"]})
>>> x = blackbar_example('documents')
>>> d = deid_enrich_identifiers(x, type = "patients")
>>> d = d[d["patientId"] == "Y871129di17X"].reset_index()
>>> d["performingPhysicianId"][0]
'JANSSENS, JAN'
>>> d["pat_anvn"][0]
'Olek Mehdi'
>>> d["pat_vnanafk"][0]
'M. Olek'
>>> d["pat_adresgegevens"][0]
'Grote Beek 9 1000 Brussel'
>>> d["pat_postcode_gemeente"][0]
'1000 Brussel'
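The name and address combinations described above can be pictured with a few lines of plain Python. This is a minimal sketch of the idea only, not the library's code: name_variants and address_variants are hypothetical helpers, and the dictionary keys merely mirror the column suffixes used in the output (anvn, vnan, vnanafk).

```python
def name_variants(first_name, last_name):
    """Build the different ways of writing one person's name."""
    first_name, last_name = first_name.strip(), last_name.strip()
    variants = {
        "an": last_name,                    # last name only
        "vnan": f"{first_name} {last_name}",  # first name + last name
        "anvn": f"{last_name} {first_name}",  # last name + first name
    }
    if first_name:
        # Abbreviated first name, e.g. 'J. Janssens'
        variants["vnanafk"] = f"{first_name[0]}. {last_name}"
    return variants


def address_variants(street, postal_code, city):
    """Build the full address and the postal code + city combination."""
    return {
        "adresgegevens": f"{street} {postal_code} {city}",
        "postcode_gemeente": f"{postal_code} {city}",
    }


names = name_variants("Mehdi", "Olek")
address = address_variants("Grote Beek 9", "1000", "Brussel")
```

Each variant is a separate search term, so a document that writes the patient as 'M. Olek' is still matched even though the source data only stores first and last name.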
blackbar.Blackbar
Pseudonymization
blackbar.deid_pseudonymize(data, pseudo=None, locale='nl_BE', dateshift={'years': -3, 'months': -3, 'days': -167, 'weeks': 4, 'days': 23}, failure_strategy=None, existing=pd.DataFrame(columns=['patient_id', 'entity_type', 'entity_text', 'entity_text_replacement', 'blackbar_comment']), model=None)
blackbar.deid_anonymization_entities(data)
Create pseudo PII texts
blackbar.pseudonymisation.PseudoGenerator
Generator of Pseudo elements
Parameters:
Name | Type | Description | Default |
---|---|---|---|
locale | str | 'nl_BE' or 'fr_BE' | 'nl_BE' |
type_unknown | str | either 'warning', 'error' or 'pass', indicating what to do if no implementation is provided yet for a specific type | 'warning' |
Returns:
Type | Description |
---|---|
a str with a fake element or a list of these fake elements |
Examples:
>>> from blackbar import *
>>> pseudo = PseudoGenerator(locale = "nl_BE")
>>> x = pseudo.generate(type = "ID_Rijksregister", n = 1)
>>> x = pseudo.generate(type = "ID_Rijksregister", n = 100)
>>> x = pseudo.generate(type = "ID_Patient", n = 100)
>>> x = pseudo.generate(type = "ID_Patient", text = ["B731918KN19P", "M061503GU21C"])
>>> x = pseudo.generate(type = "01_Naam", text = ['Jan Janssens', 'Dhr Jan Janssens', 'Mevr Angelina Jolie'])
>>> x = pseudo.generate(type = "01_Naam_Dokter", text = ['Jan Janssens', 'Prof Jan Janssens', 'Dr Jan Janssens'])
>>> x = pseudo.generate(type = "01_Naam_Dokter", text = ['Jan Janssens', 'Prof Jan Janssens', 'Dr Jan Janssens'], gender = "male")
>>> x = pseudo.generate(type = "01_Naam_Dokter", text = ['Jan Janssens', 'Prof Jan Janssens', 'Dr Jan Janssens'], gender = "female")
>>> x = pseudo.generate(type = "01_Naam_Patient", text = ['Jan Janssens', 'Dhr Jan Janssens', 'Mevr Angelina Jolie'])
>>> x = pseudo.generate(type = "02_Adres_Locatie", n = 100)
>>> x = pseudo.generate(type = "02_Adres_Locatie", text = ['Dorpstraat 10, 1090 Brussel', 'Dorpstraat 10\n1090 Brussel', 'DORP 10\n1090 BRUSSEL'])
>>> x = pseudo.generate(type = "05_Beroep", n = 100)
>>> x = pseudo.generate(type = "Beroep", n = 100)
>>> x = pseudo.generate(type = "Organisatie", n = 100)
>>> x = pseudo.generate(type = "medical_title", n = 100)
>>> pseudo = PseudoGenerator(locale = "fr_BE")
>>> x = pseudo.generate(type = "ID_Rijksregister", n = 1)
>>> x = pseudo.generate(type = "ID_Rijksregister", n = 100)
>>> x = pseudo.generate(type = "ID_Patient", n = 100)
>>> x = pseudo.generate(type = "01_Naam_Dokter", text = ['Jan Janssens', 'Prof Jan Janssens', 'Dr Jan Janssens'])
>>> x = pseudo.generate(type = "01_Naam_Dokter", text = ['Jan Janssens', 'Prof Jan Janssens', 'Dr Jan Janssens'], gender = "male")
>>> x = pseudo.generate(type = "01_Naam_Dokter", text = ['Jan Janssens', 'Prof Jan Janssens', 'Dr Jan Janssens'], gender = "female")
>>> x = pseudo.generate(type = "01_Naam_Patient", text = ['Jan Janssens', 'Dhr Jan Janssens', 'Mevr Angelina Jolie'])
>>> x = pseudo.generate(type = "02_Adres_Locatie", n = 100)
>>> x = pseudo.generate(type = "02_Adres_Locatie", text = ['Rue du Village 10, 1090 Bruxelles', 'Rue du Village 10\n1090 Bruxelles', 'VILLAGE 10\n1090 BRUXELLES'])
>>> x = pseudo.generate(type = "05_Beroep", n = 100)
>>> x = pseudo.generate(type = "Beroep", n = 100)
>>> x = pseudo.generate(type = "Organisatie", n = 100)
>>> x = pseudo.generate(type = "medical_title", n = 100)
Utilities to help find suitable replacements
blackbar.pseudonymisation.is_hospital(x)
blackbar.pseudonymisation.is_whitelisted(x, type='name')
blackbar.pseudonymisation.is_title_cased(x)
blackbar.pseudonymisation.is_upper_cased(x)
blackbar.pseudonymisation.is_hour_only(x)
blackbar.pseudonymisation.txt_is_numeric_only(x)
blackbar.pseudonymisation.txt_is_exact_date(x, remove=None, min_n_slash=1, max_n_slash=2, min_n_backslash=1, max_n_backslash=2, min_n_dot=1, max_n_dot=2, min_n_dash=2, max_n_dash=2)
blackbar.pseudonymisation.txt_is_exact_datetime(x)
blackbar.pseudonymisation.txt_extract_numbers(x, type='first')
blackbar.pseudonymisation.txt_remove_spaces(x, whitespace='\\s+')
blackbar.pseudonymisation.txt_contains_item(x, type='elderly_house')
blackbar.pseudonymisation.txt_contains_medical_title(x)
blackbar.pseudonymisation.txt_contains_formaltitle(x, gender=None)
blackbar.pseudonymisation.txt_is_email(x)
blackbar.pseudonymisation.txt_is_url(x)
blackbar.pseudonymisation.txt_is_phone_number(x)
blackbar.pseudonymisation.txt_replace_random_digits(x)
blackbar.pseudonymisation.format_date(x, text, faker)
blackbar.pseudonymisation.format_datetime(x, text, faker)
Insert pseudo PII in texts
blackbar.pseudotexts.txt_mimic_readability(x, y, titleCase=True)
blackbar.pseudotexts.pseudo_replacements(x, pseudonimizer, dateshift={'years': -3, 'months': -3, 'days': -167, 'weeks': 4, 'days': 23}, failure_strategy=None, model=None)
blackbar.utils.txt_insert(x, replacement, start, end=None, reverse=True)
Insert another text in a string
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | str | A text string | required |
replacement | Union[str, List[str]] | A text replacement | required |
start | Union[int, List[int]] | start position in x where to put the replacement | required |
end | Union[int, List[int], None] | end position in x where to end the insertion of the replacement | None |
reverse | bool | logical indicating to do the replacement starting from the back | True |
Returns:
Type | Description |
---|---|
x where the specified sections are replaced |
Examples:
>>> from blackbar import *
>>> x = 'Kung Warrior'
>>> txt_insert(x, replacement = 'Fu ', start = 5)
'Kung Fu Warrior'
>>> x = 'Kung ___ Warrior'
>>> txt_insert(x, replacement = 'Fu', start = 5, end = 5 + 3)
'Kung Fu Warrior'
>>> x = 'Kung _ Warrior'
>>> txt_insert(x, replacement = 'Fu', start = 5, end = 5 + 1)
'Kung Fu Warrior'
>>> x = 'My name is _NAME_ and I work in _LOC_. I am _AGE_ years old.'
>>> txt_insert(x, replacement = ['Jos', 'Dorpstraat 40, 1000 Brussel', '43'], start = [11, 32, 44], end = [16+1, 36+1, 48+1])
'My name is Jos and I work in Dorpstraat 40, 1000 Brussel. I am 43 years old.'
>>> #import re
>>> #[loc.start() for loc in re.finditer('_', x)]
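Replacing from the back matters because each replacement may change the length of the string: working from the last position to the first keeps the earlier start/end positions valid. The sketch below reproduces the behaviour shown in the examples using plain slicing; insert_spans is a hypothetical stand-in written for illustration, not the implementation of txt_insert.

```python
def insert_spans(x, replacement, start, end=None):
    """Replace x[start:end] by replacement, for one or several spans.

    When end is None the replacement is inserted at start without
    removing anything. Spans are assumed not to overlap.
    """
    replacements = [replacement] if isinstance(replacement, str) else list(replacement)
    starts = [start] if isinstance(start, int) else list(start)
    if end is None:
        ends = starts
    else:
        ends = [end] if isinstance(end, int) else list(end)

    # Process the spans from the back of the string to the front so that
    # positions of spans not yet handled are unaffected by length changes.
    spans = sorted(zip(starts, ends, replacements), key=lambda s: s[0], reverse=True)
    for s, e, r in spans:
        x = x[:s] + r + x[e:]
    return x


template = 'My name is _NAME_ and I work in _LOC_. I am _AGE_ years old.'
filled = insert_spans(
    template,
    replacement=['Jos', 'Dorpstraat 40, 1000 Brussel', '43'],
    start=[11, 32, 44],
    end=[17, 37, 49],
)
```

Processing the same spans front to back would shift '_LOC_' and '_AGE_' after 'Jos' replaces the longer '_NAME_', so the later positions would no longer point at the placeholders.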
blackbar.utils.txt_sample(x, n=1)
Sample from a list
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | List | A list of elements to sample from | required |
Returns:
Type | Description |
---|---|
List | A list with a sample of n elements from the List x |
Examples:
>>> from blackbar import *
>>> x = txt_sample(['a', 'b', 'c'], n = 2)