Anonymization and Pseudonymization

High-level functionality for anonymization

- class Blackbar with the methods anonimise and anonimise_extended
- deid_anonimise_dataframe

High-level functionality for pseudonymization

Pseudonymization
- PseudoGenerator (generate pseudo text)
- anonimisation_entities (extract the anonymized entities)
- pseudo_replacements, txt_mimic_readability (replacement functions while sticking as close to the original layout as possible)
- Utility functions which help with the pseudonymization
  - txt_insert, txt_freq, txt_leading_trailing
  - txt_n_capital, txt_contains, txt_contains_lot_of_capitals, txt_n_newlines

from rlike import *
from blackbar.s3 import blackbar_s3_download
from blackbar.data import blackbar_example
from blackbar.uzb import Blackbar, deid_anonimise_dataframe
from blackbar.pseudonymisation import PseudoGenerator, anonimisation_entities
from blackbar.pseudotexts import pseudo_replacements  
from blackbar.database import parse_anonimisation_json
Sys_setenv({
    "BLACKBAR_S3_ENDPOINT": "blackbar.datatailor.be",
    "BLACKBAR_S3_ACCESS_KEY_ID": "XXXXXXXXXX",
    "BLACKBAR_S3_SECRET_ACCESS_KEY": "XXXXXXXXXX",
    "BLACKBAR_S3_MODEL_BUCKET": "blackbar-models",
    "BLACKBAR_S3_MODEL_NAME": "deid_v2"
})
info = blackbar_s3_download(name = Sys_getenv("BLACKBAR_S3_MODEL_NAME"), bucket = Sys_getenv("BLACKBAR_S3_MODEL_BUCKET"), folder = tempdir())
deid = Blackbar(info["folder"])
text = blackbar_example("example.txt")
anno = deid.anonimise(text, type = "_", as_data_frame = True)
entities = anno["entities"]
anno["text"]

Anonymization

`blackbar.deid_smith_waterman(data, log=False, alignment_method='default')`

Extract names/addresses based on Smith-Waterman alignment, either for patient names or for names of physicians that the patient got into contact with.

Parameters:

Name	Type	Description	Default
`data`	`Series`	the selection of 1 row from a pandas dataframe which combines the enriched patient identifiers as extracted with deid_enrich_identifiers(type = 'patients') and the identifiers of the physicians the patient got into contact with as extracted with deid_enrich_identifiers(type = 'patients_physicians')	required
`log`	`bool`	bool indicating whether to print information when the alignment fails	`False`
`alignment_method`	`str`	either 'default' or 'biopython', passed on to Smith_Waterman	`'default'`

Returns:

Type	Description
	A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similariy, start, end, length with the detected entities and the exact positions in the text

Examples:

>>> from blackbar import deid_enrich_identifiers, blackbar_example, deid_smith_waterman
>>> from rlike import *
>>> x = blackbar_example('patients_physicians')
>>> pats = deid_enrich_identifiers(x, type = "patients_physicians")
>>> docs = blackbar_example('documents')
>>> docs = deid_enrich_identifiers(docs, type = "patients")
>>> docs = docs.merge(pats, how = "left", left_on = "patientId", right_on = "pat_dos_nr")
>>> sw = deid_smith_waterman(docs.iloc[0], alignment_method = "default")
>>> sw = deid_smith_waterman(docs.iloc[0], alignment_method = "biopython")
>>> sw = deid_smith_waterman(docs.iloc[1], alignment_method = "biopython")
>>> data = docs.iloc[0]
>>> ents = deid_smith_waterman(data, alignment_method = "biopython")
>>> ents["term_detected_start_end"] = substr(data["text"], list(ents["start"]), list(ents["end"]))
>>> sum(list(ents["term_detected_start_end"] != ents["term_detected"]))
0
>>> sum(nchar(ents["term_detected"]) != ents["length"])
0
>>> ents = deid_smith_waterman(data, alignment_method = "default")
>>> ents["term_detected_start_end"] = substr(data["text"], list(ents["start"]), list(ents["end"]))
>>> sum(list(ents["term_detected_start_end"] != ents["term_detected"]))
0
>>> sum(nchar(ents["term_detected"]) != ents["length"])
0
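
The examples above check that start and end point at the detected term in the text. As a rough illustration of how a Smith-Waterman local alignment can locate a search term at exact positions, here is a minimal pure-Python sketch. It is not the blackbar implementation: the function name `local_align` and the scoring values (match 2, mismatch -1, gap -1) are assumptions chosen for illustration only.

```python
def local_align(query, text, match=2, mismatch=-1, gap=-1):
    """Return (score, start, end) of the best local alignment of query in text.

    Hypothetical helper: start/end are 0-based positions in text, end exclusive.
    """
    rows, cols = len(query) + 1, len(text) + 1
    score = [[0] * cols for _ in range(rows)]
    # origin[i][j]: position in text where the alignment ending at (i, j) began
    origin = [[0] * cols for _ in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            same = query[i - 1] == text[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap      # query character left unmatched
            left = score[i][j - 1] + gap    # text character left unmatched
            s = max(0, diag, up, left)      # 0 restarts the local alignment
            score[i][j] = s
            if s == 0:
                continue
            if s == diag:
                # A fresh alignment starts at this text character
                fresh = score[i - 1][j - 1] == 0
                origin[i][j] = j - 1 if fresh else origin[i - 1][j - 1]
            elif s == up:
                origin[i][j] = origin[i - 1][j]
            else:
                origin[i][j] = origin[i][j - 1]
            if s > best[0]:
                best = (s, origin[i][j], j)
    return best


text = "Consult bij Dr. Janssens Jan op de raadpleging."
score, start, end = local_align("Janssens Jan", text)
print(score, text[start:end])

# A spelling variation in the text is still located, with a lower score
typo = "Dhr Jansens Jan werd gezien"
score_typo, start_typo, end_typo = local_align("Janssens Jan", typo)
print(score_typo, typo[start_typo:end_typo])
```

Because the alignment is local, the search term does not need to match the text exactly, which is what makes this family of algorithms useful for finding names that are spelled slightly differently in a document.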

`blackbar.deid_anonymize(model, x=None, type='X', extended=False, log=False, output_structure='v3')`

Anonymize the text in a pandas dataframe containing at least the columns doc_id and text.

Parameters:

Name	Type	Description	Default
`model`	`Blackbar`	an object of type Blackbar as returned by Blackbar	required
`x`	`DataFrame`	a pandas dataframe with at least the columns doc_id and text. If extended is True, this dataframe should also include the following columns, as obtained by merging the data returned by deid_enrich_identifiers for type 'patients' and 'patients_physicians': patientId; pat_anvn, pat_vnan, pat_vnanafk, pat_vnaam, pat_fam_naam; pat_adresgegevens, pat_postcode_gemeente, pat_adres; performingPhysicianId, performing_dr_anvn, performing_dr_vnan, performing_dr_vnanafk, performing_dr_an, performing_dr_vn; responsiblePhysicianId, responsible_dr_anvn, responsible_dr_vnan, responsible_dr_vnanafk, responsible_dr_an, responsible_dr_vn	`None`
`extended`	`bool`	bool indicating whether to perform extended anonymization (the NLP model plus Smith-Waterman) or to use only the NLP model. Defaults to False.	`False`
`log`	`bool`	bool indicating to print out an error in case smith_waterman fails for an unknown reason. Defaults to False.	`False`
`output_structure`	`str`	type of json output structure. possible values are v3 or v4 where v4 does not include the dictionary element text. Defaults to 'v3'	`'v3'`

Returns:

Type	Description
	the pandas dataframe x with extra column called textCvt which contains the anonymized information put in a json structure.
	This json structure shows elements doc_id (if extended = True)
	text_raw (original text), text (anonymized text),
	entities (detected entities), n (number of detected entities),
	smith_waterman (entities found by Smith-Waterman)
	nlp (entities found by the deep learning model)
	logic (either smith_waterman\|nlp or nlp),
	version (version of the output structure),
	model_name (the name of the nlp model used),
	timing (dictionary with start/end/duration indicating the time it took to perform the anonymization)

Examples:

>>> from blackbar import blackbar_example, deid_anonymize, deid_enrich_identifiers, Blackbar, blackbar_s3_download
>>> ## Get example documents & physicians which the patient got into contact with
>>> docs = blackbar_example('documents')
>>> docs = deid_enrich_identifiers(docs, type = "patients")
>>> physician = blackbar_example('patients_physicians')
>>> physician = deid_enrich_identifiers(physician, type = "patients_physicians")
>>> docs = docs.merge(physician, how = "left", left_on = "patientId", right_on = "pat_dos_nr")
>>> ## Get the model + Anonymize the documents as follows 
>>> info = blackbar_s3_download(name = "deid_v2", bucket = "blackbar-models")
>>> deid = Blackbar(info)
>>> anno = deid_anonymize(deid, docs, type = "_", extended = True)
>>> anno = anno[["text", "textCvt"]]
>>> anno = deid_anonymize(deid, docs, type = "_", extended = True, output_structure = "v4")
>>> anno = anno[["text", "textCvt"]]
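
The textCvt column holds one JSON string per document. The sketch below shows one way such a string could be inspected with the standard library. The record is hand-written for illustration and is not real blackbar output: the keys follow the element names listed under Returns, but the layout of each entity (label_, term_detected, start, end) and the underscore masking are assumptions, as is the helper name `summarise_text_cvt`.

```python
import json
from collections import Counter


def summarise_text_cvt(text_cvt):
    """Summarise one textCvt JSON string (hypothetical helper)."""
    record = json.loads(text_cvt)
    entities = record.get("entities", [])
    return {
        "n_entities": len(entities),
        "labels": dict(Counter(e["label_"] for e in entities)),
        # Per the description above, the v4 structure omits the element text
        "has_text": "text" in record,
        "logic": record.get("logic"),
    }


raw = "Patient Jan Janssens werd gezien."
# Hand-written stand-in for a single textCvt value (assumed layout)
example = json.dumps({
    "text_raw": raw,
    "text": raw[:8] + "_" * 12 + raw[20:],
    "entities": [
        {"label_": "01_Naam_Patient", "term_detected": "Jan Janssens",
         "start": 8, "end": 20}
    ],
    "n": 1,
    "logic": "nlp",
    "version": "v3",
})

summary = summarise_text_cvt(example)
print(summary)
```

On a real result, the same helper could be applied to every row with `anno["textCvt"].apply(summarise_text_cvt)`, provided the actual entity layout matches the assumption made here.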

`blackbar.deid_enrich_identifiers(data, type='patients')`

Create combinations of name/address identifiers which can be used for lookup with Smith-Waterman, either for patient names or for names of physicians that the patient got into contact with.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	a pandas dataframe with the following fields. If type is 'patients_physicians': pat_dos_nr, hcact_vnaam, hcact_fam_naam, hcact_adres, hcact_post_nr, hcact_gemeente. If type is 'patients': performingPhysicianId, responsiblePhysicianId and pat_vnaam, pat_fam_naam, pat_adres, pat_post_nr, pat_gemeente	required
`type`	`str`	either 'patients' or 'patients_physicians'	`'patients'`

Returns:

Type	Description

If type is 'patients_physicians', returns a pandas dataframe with 1 row per patient and the columns pat_dos_nr, physician_adres, physician_naam_kort, physician_naam_lang.
- The columns physician_adres, physician_naam_kort and physician_naam_lang each contain a list covering all physicians the patient got into contact with, their addresses, and the different ways of writing these.
- The address variants are constructed as
  - hcact_adres + hcact_post_nr + hcact_gemeente
  - hcact_post_nr + hcact_gemeente
- The name variants are constructed as follows (note that the first name on its own, hcact_vnaam, is not used)
  - hcact_fam_naam
  - hcact_vnaam + hcact_fam_naam
  - hcact_fam_naam + hcact_vnaam
  - first letter of hcact_vnaam. + hcact_fam_naam (e.g. J. Janssens)

If type is 'patients', returns the pandas DataFrame data with the following extra columns added/changed.
- Physician info
  - performingPhysicianId, responsiblePhysicianId: harmonised by removing the 'Prof/Dr/Mevr/Mej/AP' prefixes people put before their names
  - performing_dr_anvn, performing_dr_vnan, performing_dr_vnanafk, performing_dr_an, performing_dr_vn (combining first name / last name based on information in performingPhysicianId)
  - responsible_dr_anvn, responsible_dr_vnan, responsible_dr_vnanafk, responsible_dr_an, responsible_dr_vn (combining first name / last name based on information in responsiblePhysicianId)
- Patient info
  - pat_anvn, pat_vnan, pat_vnanafk, pat_vnaam, pat_fam_naam (first name / last name combinations and the abbreviated form, e.g. J. Janssens)
  - pat_adresgegevens, pat_postcode_gemeente, pat_adres

Examples:

>>> from blackbar import deid_enrich_identifiers, PseudoGenerator, blackbar_example
>>> import pandas as pd
>>> pseudo = PseudoGenerator()
>>> x = pd.DataFrame({"pat_dos_nr": pseudo.generate(type = "ID_Patient", n = 1), "hcact_vnaam": ["Jan", "Piet"], "hcact_fam_naam": ["Janssens", "Peeters"], "hcact_adres": ["Stormy Daneelsstraat 125", "Stationsstraat 321"], "hcact_post_nr": ["1000", "1090"], "hcact_gemeente": ["Brussel", "Jette"]})
>>> x = blackbar_example('patients_physicians')
>>> d = deid_enrich_identifiers(x, type = "patients_physicians")
>>> d = d[d["pat_dos_nr"] == "Y871129di17X"].reset_index()
>>> d["physician_naam_lang"][0]
['J. Janssens', 'P. Peeters', 'Janssens Jan', 'Peeters Piet', 'Jan Janssens', 'Piet Peeters']
>>> d["physician_adres"][0]
['Stormy Daneelsstraat 125', 'Stationsstraat 321', 'Stormy Daneelsstraat 125 1000 Brussel', 'Stationsstraat 321 1090 Jette', '1000 Brussel', '1090 Jette']
>>> x = pd.DataFrame({"patientId": pseudo.generate(type = "ID_Patient", n = 1), "performingPhysicianId": ["PROF.DR. JANSSENS, JAN", "DR. PEETERS, PIET"], "responsiblePhysicianId": [None, "PROF.DR. JANSSENS, JAN"], "pat_vnaam": ["Mehdi", "Jos"], "pat_fam_naam": ["Olek", "Vermeulen"], "pat_adres": ["Grote Beek 9", "Laarbeeklaan 99"], "pat_post_nr": ["1000", "1090"], "pat_gemeente": ["Brussel", "Jette"]})
>>> x = blackbar_example('documents')
>>> d = deid_enrich_identifiers(x, type = "patients")
>>> d = d[d["patientId"] == "Y871129di17X"].reset_index()
>>> d["performingPhysicianId"][0]
'JANSSENS, JAN'
>>> d["pat_anvn"][0]
'Olek Mehdi'
>>> d["pat_vnanafk"][0]
'M. Olek'
>>> d["pat_adresgegevens"][0]
'Grote Beek 9 1000 Brussel'
>>> d["pat_postcode_gemeente"][0]
'1000 Brussel'
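
The name and address combinations described under Returns can be sketched in a few lines of plain Python. This only illustrates the combination rules as documented. It is not the package's code, and the function names `name_variants` and `address_variants` are hypothetical. The real function groups and orders the variants differently and, judging by the example output above, also keeps the street address on its own.

```python
def name_variants(first_name, family_name):
    """Ways of writing a physician name, following the documented rules."""
    return [
        family_name,                         # hcact_fam_naam
        f"{first_name} {family_name}",       # hcact_vnaam + hcact_fam_naam
        f"{family_name} {first_name}",       # hcact_fam_naam + hcact_vnaam
        f"{first_name[0]}. {family_name}",   # first letter of hcact_vnaam
    ]


def address_variants(street, postcode, city):
    """Ways of writing an address, following the documented rules."""
    return [
        f"{street} {postcode} {city}",  # hcact_adres + hcact_post_nr + hcact_gemeente
        f"{postcode} {city}",           # hcact_post_nr + hcact_gemeente
    ]


names = name_variants("Jan", "Janssens")
addresses = address_variants("Stationsstraat 321", "1090", "Jette")
print(names)
print(addresses)
```

Each variant is a candidate search term, so a longer list of variants increases the chance that a name or address written in an unusual way is still detected in the text.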

`blackbar.Blackbar`

Pseudonymization

`blackbar.deid_pseudonymize(data, pseudo=None, locale='nl_BE', dateshift={'years': -3, 'months': -3, 'days': -167, 'weeks': 4, 'days': 23}, failure_strategy=None, existing=pd.DataFrame(columns=['patient_id', 'entity_type', 'entity_text', 'entity_text_replacement', 'blackbar_comment']), model=None)`

`blackbar.deid_anonymization_entities(data)`

Create pseudo PII texts

`blackbar.pseudonymisation.PseudoGenerator`

Generator of Pseudo elements

Parameters:

Name	Type	Description	Default
`locale`	`str`	'nl_BE' or 'fr_BE'	`'nl_BE'`
`type_unknown`	`str`	either 'warning', 'error' or 'pass', indicating what to do if no implementation is provided yet for a specific type. Defaults to 'warning'.	`'warning'`

Returns:

Type	Description
	a str with a fake element or a list of these fake elements

Examples:

>>> from blackbar import *        
>>> pseudo = PseudoGenerator(locale = "nl_BE")
>>> x = pseudo.generate(type = "ID_Rijksregister", n = 1)
>>> x = pseudo.generate(type = "ID_Rijksregister", n = 100)
>>> x = pseudo.generate(type = "ID_Patient", n = 100)
>>> x = pseudo.generate(type = "ID_Patient", text = ["B731918KN19P", "M061503GU21C"])
>>> x = pseudo.generate(type = "01_Naam", text = ['Jan Janssens', 'Dhr Jan Janssens', 'Mevr Angelina Jolie'])
>>> x = pseudo.generate(type = "01_Naam_Dokter", text = ['Jan Janssens', 'Prof Jan Janssens', 'Dr Jan Janssens'])
>>> x = pseudo.generate(type = "01_Naam_Dokter", text = ['Jan Janssens', 'Prof Jan Janssens', 'Dr Jan Janssens'], gender = "male")
>>> x = pseudo.generate(type = "01_Naam_Dokter", text = ['Jan Janssens', 'Prof Jan Janssens', 'Dr Jan Janssens'], gender = "female")
>>> x = pseudo.generate(type = "01_Naam_Patient", text = ['Jan Janssens', 'Dhr Jan Janssens', 'Mevr Angelina Jolie'])
>>> x = pseudo.generate(type = "02_Adres_Locatie", n = 100)
>>> x = pseudo.generate(type = "02_Adres_Locatie", text = ['Dorpstraat 10, 1090 Brussel', 'Dorpstraat 10\n1090 Brussel', 'DORP 10\n1090 BRUSSEL'])
>>> x = pseudo.generate(type = "05_Beroep", n = 100)
>>> x = pseudo.generate(type = "Beroep", n = 100)
>>> x = pseudo.generate(type = "Organisatie", n = 100)
>>> x = pseudo.generate(type = "medical_title", n = 100)
>>> pseudo = PseudoGenerator(locale = "fr_BE")
>>> x = pseudo.generate(type = "ID_Rijksregister", n = 1)
>>> x = pseudo.generate(type = "ID_Rijksregister", n = 100)
>>> x = pseudo.generate(type = "ID_Patient", n = 100)
>>> x = pseudo.generate(type = "01_Naam_Dokter", text = ['Jan Janssens', 'Prof Jan Janssens', 'Dr Jan Janssens'])
>>> x = pseudo.generate(type = "01_Naam_Dokter", text = ['Jan Janssens', 'Prof Jan Janssens', 'Dr Jan Janssens'], gender = "male")
>>> x = pseudo.generate(type = "01_Naam_Dokter", text = ['Jan Janssens', 'Prof Jan Janssens', 'Dr Jan Janssens'], gender = "female")
>>> x = pseudo.generate(type = "01_Naam_Patient", text = ['Jan Janssens', 'Dhr Jan Janssens', 'Mevr Angelina Jolie'])
>>> x = pseudo.generate(type = "02_Adres_Locatie", n = 100)
>>> x = pseudo.generate(type = "02_Adres_Locatie", text = ['Rue du Village 10, 1090 Bruxelles', 'Rue du Village 10\n1090 Bruxelles', 'VILLAGE 10\n1090 BRUXELLES'])
>>> x = pseudo.generate(type = "05_Beroep", n = 100)
>>> x = pseudo.generate(type = "Beroep", n = 100)
>>> x = pseudo.generate(type = "Organisatie", n = 100)
>>> x = pseudo.generate(type = "medical_title", n = 100)

Utilities to help find suitable replacements

`blackbar.pseudonymisation.is_hospital(x)`

`blackbar.pseudonymisation.is_whitelisted(x, type='name')`

`blackbar.pseudonymisation.is_title_cased(x)`

`blackbar.pseudonymisation.is_upper_cased(x)`

`blackbar.pseudonymisation.is_hour_only(x)`

`blackbar.pseudonymisation.txt_is_numeric_only(x)`

`blackbar.pseudonymisation.txt_is_exact_date(x, remove=None, min_n_slash=1, max_n_slash=2, min_n_backslash=1, max_n_backslash=2, min_n_dot=1, max_n_dot=2, min_n_dash=2, max_n_dash=2)`

`blackbar.pseudonymisation.txt_is_exact_datetime(x)`

`blackbar.pseudonymisation.txt_extract_numbers(x, type='first')`

`blackbar.pseudonymisation.txt_remove_spaces(x, whitespace='\\s+')`

`blackbar.pseudonymisation.txt_contains_item(x, type='elderly_house')`

`blackbar.pseudonymisation.txt_contains_medical_title(x)`

`blackbar.pseudonymisation.txt_contains_formaltitle(x, gender=None)`

`blackbar.pseudonymisation.txt_is_email(x)`

`blackbar.pseudonymisation.txt_is_url(x)`

`blackbar.pseudonymisation.txt_is_phone_number(x)`

`blackbar.pseudonymisation.txt_replace_random_digits(x)`

`blackbar.pseudonymisation.format_date(x, text, faker)`

`blackbar.pseudonymisation.format_datetime(x, text, faker)`

Insert pseudo PII in texts

`blackbar.pseudotexts.txt_mimic_readability(x, y, titleCase=True)`

`blackbar.pseudotexts.pseudo_replacements(x, pseudonimizer, dateshift={'years': -3, 'months': -3, 'days': -167, 'weeks': 4, 'days': 23}, failure_strategy=None, model=None)`

`blackbar.utils.txt_insert(x, replacement, start, end=None, reverse=True)`

Insert another text in a string

Parameters:

Name	Type	Description	Default
`x`	`str`	A text string	required
`replacement`	`Union[str, List[str]]`	A text replacement	required
`start`	`Union[int, List[int]]`	start position in x where to put the replacement	required
`end`	`Union[int, List[int], None]`	end position in x where to end the insertion of the replacement	`None`
`reverse`	`bool`	logical indicating to do the replacement starting from the back. Defaults to True.	`True`

Returns:

Type	Description
	x where the specified sections are replaced

Examples:

>>> from blackbar import *
>>> x = 'Kung Warrior'
>>> txt_insert(x, replacement = 'Fu ', start = 5)
'Kung Fu Warrior'
>>> x = 'Kung ___ Warrior'
>>> txt_insert(x, replacement = 'Fu', start = 5, end = 5 + 3)
'Kung Fu Warrior'
>>> x = 'Kung _ Warrior'
>>> txt_insert(x, replacement = 'Fu', start = 5, end = 5 + 1)
'Kung Fu Warrior'
>>> x = 'My name is _NAME_ and I work in _LOC_. I am _AGE_ years old.'
>>> txt_insert(x, replacement = ['Jos', 'Dorpstraat 40, 1000 Brussel', '43'], start = [11, 32, 44], end = [16+1, 36+1, 48+1])
'My name is Jos and I work in Dorpstraat 40, 1000 Brussel. I am 43 years old.'
>>> #import re     
>>> #[loc.start() for loc in re.finditer('_', x)]
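
The reverse argument matters when several replacements are made at once: replacing from the back of the string keeps the start/end positions of the earlier sections valid, because text before a replaced section never shifts. The following minimal sketch reproduces the examples above under that assumption. The function `insert_spans` is hypothetical and is not the source of txt_insert. It treats end as exclusive, which is what the examples imply.

```python
def insert_spans(x, replacement, start, end=None):
    """Replace x[start:end] by replacement, for one or several sections.

    Hypothetical re-implementation for illustration; always works back to front.
    """
    replacements = [replacement] if isinstance(replacement, str) else list(replacement)
    starts = [start] if isinstance(start, int) else list(start)
    if end is None:
        ends = starts  # pure insertion: nothing is removed
    else:
        ends = [end] if isinstance(end, int) else list(end)
    # Handle the right-most section first so that the positions of the
    # sections to its left are still correct when their turn comes.
    for s, e, r in sorted(zip(starts, ends, replacements), reverse=True):
        x = x[:s] + r + x[e:]
    return x


single = insert_spans("Kung Warrior", "Fu ", start=5)
template = "My name is _NAME_ and I work in _LOC_. I am _AGE_ years old."
filled = insert_spans(
    template,
    ["Jos", "Dorpstraat 40, 1000 Brussel", "43"],
    start=[11, 32, 44],
    end=[17, 37, 49],
)
print(single)
print(filled)
```

Replacing front to back would require adjusting every later position by the length difference of each earlier replacement, which is easy to get wrong.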

`blackbar.utils.txt_sample(x, n=1)`

Sample from a list

Parameters:

Name	Type	Description	Default
`x`	`List`	A list to sample from	required

Returns:

Type	Description
`List`	A list with a sample of n elements from the List x

Examples:

>>> from blackbar import *
>>> x = txt_sample(['a', 'b', 'c'], n = 2)