Utilities

Custom functionalities for specific hospitals

blackbar.uzb.uzb_identify_chunks(x, type='patientid')

Identify with regular expressions whether a text contains a patientid, a date, or a rijksregister number

Parameters:

Name Type Description Default
x str

a text

required

Returns:

Type Description
bool

bool indicating if the type is found

Examples:

>>> from blackbar import uzb_identify_chunks
>>> x = uzb_identify_chunks('hello world A930523DR00L/RVdV ok works', type = 'patientid')
>>> x = uzb_identify_chunks('hello world 123456789', type = 'patientid')
>>> x = uzb_identify_chunks('hello world 30/12/1978 ok works', type = 'date')
>>> x = uzb_identify_chunks('hello world 30-12-1978 ok works', type = 'date')
>>> x = uzb_identify_chunks('hello world 30.12.1978 ok works', type = 'date')
>>> x = uzb_identify_chunks('hello world 1978.12.30 ok works', type = 'date')
>>> x = uzb_identify_chunks('hello world 78.12.30-014.53 ok works', type = 'rijksregister')
>>> x = uzb_identify_chunks('hello world 78.12.30-014 not ok', type = 'rijksregister')
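The regular expressions themselves are internal to blackbar; as an illustration, a minimal sketch of comparable checks for the formats shown in the examples (the pattern table and the helper name below are hypothetical, not blackbar's actual implementation):

```python
import re

# Hypothetical stand-ins for blackbar's internal patterns; the real ones may differ.
PATTERNS = {
    # Belgian rijksregister number: YY.MM.DD-XXX.CC
    'rijksregister': re.compile(r'\b\d{2}\.\d{2}\.\d{2}-\d{3}\.\d{2}\b'),
    # dates such as 30/12/1978, 30-12-1978, 30.12.1978 or 1978.12.30
    'date': re.compile(r'\b(\d{1,2}[/\-.]\d{1,2}[/\-.]\d{4}'
                       r'|\d{4}[/\-.]\d{1,2}[/\-.]\d{1,2})\b'),
}

def identify_chunks_sketch(x, type='rijksregister'):
    # True when the requested pattern occurs anywhere in the text
    return PATTERNS[type].search(x) is not None
```

This reproduces the behaviour shown above: `78.12.30-014.53` is recognised as a rijksregister number, while the truncated `78.12.30-014` is not.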

blackbar.uzb.uzb_harmonize_physician(x, flags=re.IGNORECASE)

Harmonise the names of physicians in the database by removing the Prof/Dr/Mevr/Mej/AP prefixes people put before their names and identifiers such as (123456789)

Parameters:

Name Type Description Default
x str

a text

required

Returns:

Type Description

the str where prefixes are removed

Examples:

>>> from blackbar import uzb_harmonize_physician
>>> uzb_harmonize_physician('PROF.DR. Janssens Jan')
'Janssens Jan'
>>> uzb_harmonize_physician('    MEVR. Linda Wittevrongel')
'Linda Wittevrongel'
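The prefix stripping can be sketched as follows (the regex and function name are illustrative; blackbar's actual pattern may differ):

```python
import re

# Illustrative pattern: the titles named above, optionally followed by dots/spaces.
_PREFIX = re.compile(r'^\s*(?:prof|dr|mevr|mej|ap)\b[.\s]*', flags=re.IGNORECASE)

def harmonize_physician_sketch(x):
    # strip title prefixes repeatedly, e.g. 'PROF.DR. ' is two prefixes in a row
    previous = None
    while previous != x:
        previous = x
        x = _PREFIX.sub('', x)
    return x.strip()
```

The loop matters because prefixes stack ('PROF.DR. ' strips in two passes), and the `\b` keeps the pattern from eating names that merely start with the same letters.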

blackbar.uzb.uzb_vn_achternaam(x, collapse=' ')

Combine first name + family name, keeping only the first letter of the first name. E.g. ["JAN", "JANSSENS"] becomes "J. JANSSENS"

Parameters:

Name Type Description Default
x List[str]

a list with 2 elements: first name & family name

required
collapse str

string used to collapse the first-name initial & family name together

' '

Returns:

Type Description

the combined str (e.g. 'J. JANSSENS')

Examples:

>>> from blackbar import uzb_vn_achternaam
>>> x = ["JAN", "JANSSENS"]
>>> uzb_vn_achternaam(x)
'J. JANSSENS'
>>> x = ["Linda", "Wittevrongel"]
>>> uzb_vn_achternaam(x)
'L. Wittevrongel'
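The combination logic reduces to a one-liner; a minimal sketch (the function name is illustrative):

```python
def vn_achternaam_sketch(x, collapse=' '):
    # x is [first name, family name]; keep only the initial of the first name
    first, family = x
    return f"{first[0]}.{collapse}{family}"
```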

blackbar.uzb.uzb_txt_contains(x, type)

Check based on predefined regular expressions if a string contains a date, a date symbol, a day of the week, an age, an hour or a street indication. TODO: 'maandag' is still detected as a month. TODO: '01 januari' is also still detected as an age.

Parameters:

Name Type Description Default
x str

a text

required
type str

the type of element to look for. Possible values are 'date', 'datesymbol', 'dow', 'age', 'hour', 'streetindication'

required

Returns:

Type Description
bool

a boolean indicating the pattern has been found

Examples:

>>> from blackbar import uzb_txt_contains
>>> text = 'Op maandag 01 januari 2022 15:15 kwam u langs. Het bleek een mooie dag te zijn'
>>> text = 'Op dinsdag 01 februari 2022 15:15 kwam u langs. Het bleek een mooie dag te zijn'
>>> uzb_txt_contains(text, type = 'date')
True
>>> uzb_txt_contains(text, type = 'datesymbol')
False
>>> uzb_txt_contains(text, type = 'dow')
True
>>> uzb_txt_contains(text, type = 'age')
False
>>> uzb_txt_contains(text, type = 'hour')
True
>>> uzb_txt_contains(text, type = 'streetindication')
False
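For a few of these types, the checks can be sketched with simple patterns (these are hypothetical stand-ins; blackbar's real regular expressions may differ, e.g. to address the TODOs above):

```python
import re

# Hypothetical stand-ins for the internal patterns.
PATTERNS = {
    'dow': re.compile(r'\b(maandag|dinsdag|woensdag|donderdag|'
                      r'vrijdag|zaterdag|zondag)\b', re.IGNORECASE),
    'hour': re.compile(r'\b\d{1,2}:\d{2}\b'),
    'date': re.compile(r'\b\d{1,2}\s+(januari|februari|maart|april|mei|juni|juli|'
                       r'augustus|september|oktober|november|december)\s+\d{4}\b',
                       re.IGNORECASE),
}

def txt_contains_sketch(x, type):
    return PATTERNS[type].search(x) is not None
```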

blackbar.uzb.uzb_detect_smith_waterman(data, log=False, alignment_method='default')

Extract names/addresses based on Smith-Waterman alignment, either for patient names or for names of physicians that the patient got into contact with.

Parameters:

Name Type Description Default
data Series

the selection of 1 row from a pandas dataframe with the combination of enriched patient identifiers as extracted with deid_enrich_identifiers(type = 'patients') and identifiers of the physicians the patient got into contact with as extracted with deid_enrich_identifiers(type = 'patients_physicians')

required
log bool

bool indicating to print the information in case of failure of the alignment

False
alignment_method str

either 'default' or 'biopython', passed on to Smith_Waterman

'default'

Returns:

Type Description

A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similarity, start, end, length with the detected entities and their exact positions in the text

Examples:

>>> from blackbar import deid_enrich_identifiers, blackbar_example, deid_smith_waterman
>>> from rlike import *
>>> x = blackbar_example('patients_physicians')
>>> pats = deid_enrich_identifiers(x, type = "patients_physicians")
>>> docs = blackbar_example('documents')
>>> docs = deid_enrich_identifiers(docs, type = "patients")
>>> docs = docs.merge(pats, how = "left", left_on = "patientId", right_on = "pat_dos_nr")
>>> sw = deid_smith_waterman(docs.iloc[0], alignment_method = "default")
>>> sw = deid_smith_waterman(docs.iloc[0], alignment_method = "biopython")
>>> sw = deid_smith_waterman(docs.iloc[1], alignment_method = "biopython")
>>> data = docs.iloc[0]
>>> ents = deid_smith_waterman(data, alignment_method = "biopython")
>>> ents["term_detected_start_end"] = substr(data["text"], list(ents["start"]), list(ents["end"]))
>>> sum(list(ents["term_detected_start_end"] != ents["term_detected"]))
0
>>> sum(nchar(ents["term_detected"]) != ents["length"])
0
>>> ents = deid_smith_waterman(data, alignment_method = "default")
>>> ents["term_detected_start_end"] = substr(data["text"], list(ents["start"]), list(ents["end"]))
>>> sum(list(ents["term_detected_start_end"] != ents["term_detected"]))
0
>>> sum(nchar(ents["term_detected"]) != ents["length"])
0
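Smith-Waterman itself is the standard local-alignment algorithm; a minimal score-only sketch (blackbar's implementation additionally recovers the aligned positions and similarity) looks like:

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    # dynamic programming over a (len(a)+1) x (len(b)+1) matrix;
    # cells are floored at 0, which is what makes the alignment local
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

A fully matching substring of length 8 scores 8 × match = 16, while a completely dissimilar pair scores 0; thresholds on this score are what make fuzzy name detection tolerant of typos.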

blackbar.uzb.uzb_enrich_identifiers(data, type='patients')

Create combinations of name/address identifiers which can be used for lookup with Smith-Waterman, either for patient names or for names of physicians that the patient got into contact with.

Parameters:

Name Type Description Default
data DataFrame

a pandas dataframe with the following fields:
- If type is 'patients_physicians': pat_dos_nr, hcact_vnaam, hcact_fam_naam, hcact_adres, hcact_post_nr, hcact_gemeente
- If type is 'patients': performingPhysicianId, responsiblePhysicianId and pat_vnaam, pat_fam_naam, pat_adres, pat_post_nr, pat_gemeente

required
type str

either 'patients' or 'patients_physicians'

'patients'

Returns:

Type Description

If type is 'patients_physicians' returns:
- a pandas dataframe with 1 row per patient with columns pat_dos_nr, physician_adres, physician_naam_kort, physician_naam_lang
- the columns physician_adres, physician_naam_kort, physician_naam_lang contain a list of all possible physicians the patient got into contact with, their address and different ways of writing these
- the address variants are constructed as:
  - hcact_adres + hcact_post_nr + hcact_gemeente
  - hcact_post_nr + hcact_gemeente
- the name variants are constructed as (note only first name: hcact_vnaam is not used):
  - hcact_fam_naam
  - hcact_vnaam + hcact_fam_naam
  - hcact_fam_naam + hcact_vnaam
  - first letter of hcact_vnaam. + hcact_fam_naam (e.g. J. Janssens)

If type is 'patients' returns the pandas DataFrame data with the following extra columns added/changed:
- Physician info:
  - performingPhysicianId, responsiblePhysicianId: harmonised by removing the 'Prof/Dr/Mevr/Mej/AP' prefixes people put before their names
  - performing_dr_anvn, performing_dr_vnan, performing_dr_vnanafk, performing_dr_an, performing_dr_vn (combining first name / last name based on information in performingPhysicianId)
  - responsible_dr_anvn, responsible_dr_vnan, responsible_dr_vnanafk, responsible_dr_an, responsible_dr_vn (combining first name / last name based on information in responsiblePhysicianId)
- Patient info:
  - pat_anvn, pat_vnan, pat_vnanafk, pat_vnaam, pat_fam_naam (first name / last name combinations and J. Janssens)
  - pat_adresgegevens, pat_postcode_gemeente, pat_adres

Examples:

>>> from blackbar import deid_enrich_identifiers, PseudoGenerator, blackbar_example
>>> import pandas as pd
>>> pseudo = PseudoGenerator()
>>> x = pd.DataFrame({"pat_dos_nr": pseudo.generate(type = "ID_Patient", n = 1), "hcact_vnaam": ["Jan", "Piet"], "hcact_fam_naam": ["Janssens", "Peeters"], "hcact_adres": ["Stormy Daneelsstraat 125", "Stationsstraat 321"], "hcact_post_nr": ["1000", "1090"], "hcact_gemeente": ["Brussel", "Jette"]})
>>> x = blackbar_example('patients_physicians')
>>> d = deid_enrich_identifiers(x, type = "patients_physicians")
>>> d = d[d["pat_dos_nr"] == "Y871129di17X"].reset_index()
>>> d["physician_naam_lang"][0]
['J. Janssens', 'P. Peeters', 'Janssens Jan', 'Peeters Piet', 'Jan Janssens', 'Piet Peeters']
>>> d["physician_adres"][0]
['Stormy Daneelsstraat 125', 'Stationsstraat 321', 'Stormy Daneelsstraat 125 1000 Brussel', 'Stationsstraat 321 1090 Jette', '1000 Brussel', '1090 Jette']
>>> x = pd.DataFrame({"patientId": pseudo.generate(type = "ID_Patient", n = 1), "performingPhysicianId": ["PROF.DR. JANSSENS, JAN", "DR. PEETERS, PIET"], "responsiblePhysicianId": [None, "PROF.DR. JANSSENS, JAN"], "pat_vnaam": ["Mehdi", "Jos"], "pat_fam_naam": ["Olek", "Vermeulen"], "pat_adres": ["Grote Beek 9", "Laarbeeklaan 99"], "pat_post_nr": ["1000", "1090"], "pat_gemeente": ["Brussel", "Jette"]})
>>> x = blackbar_example('documents')
>>> d = deid_enrich_identifiers(x, type = "patients")
>>> d = d[d["patientId"] == "Y871129di17X"].reset_index()
>>> d["performingPhysicianId"][0]
'JANSSENS, JAN'
>>> d["pat_anvn"][0]
'Olek Mehdi'
>>> d["pat_vnanafk"][0]
'M. Olek'
>>> d["pat_adresgegevens"][0]
'Grote Beek 9 1000 Brussel'
>>> d["pat_postcode_gemeente"][0]
'1000 Brussel'
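The name variants enumerated above can be sketched in a few lines (the helper name is illustrative, not part of blackbar's API):

```python
def physician_name_variants(vnaam, fam_naam):
    # the writing variants described above, built from first name and family name
    return [
        fam_naam,                   # Janssens
        f"{vnaam} {fam_naam}",      # Jan Janssens
        f"{fam_naam} {vnaam}",      # Janssens Jan
        f"{vnaam[0]}. {fam_naam}",  # J. Janssens
    ]
```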

blackbar.uzb.combine_chunkranges(chunks_sw, chunks_nlp, text)

Chunks / spans

blackbar.blackbar.merge_chunkranges(data, text, include_end=False)

Combine overlapping chunk ranges by finding overlapping ranges. We first order the ranges by their start positions. If the start value of a chunk falls within the previous chunk range (it is smaller than the previous end value, so it overlaps), we extend the previous end value with the new end value and keep the longest search string (term_search), as we assume the detected chunk ranges come from a Smith-Waterman alignment.

Parameters:

Name Type Description Default
data DataFrame

A data frame with columns doc_id, label_, term_search, start, end, similarity indicating the location of detected entities, where the end value is the end position in the text + 1

required
text str

a text string based on which the entities in data were found

required
include_end bool

bool indicating whether to include the end position

False

Returns:

Type Description

A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similarity, start, end, length with the merged chunk ranges and their exact positions in the text

Examples:

>>> from pandas import DataFrame
>>> from rlike import substr
>>> from blackbar.data import blackbar_example
>>> text = blackbar_example()
>>> ent = {'doc_id': text, 'similarity': 0, 'term_search': ['A', 'AA', 'B', 'B', 'C', 'CC', 'D', 'DD'], 'start': [10, 10+5, 93, 242, 343 + 4, 343, 384, 384+4], 'end': [19, 19, 98, 247, 343 + 7, 359, 400, 400+2], 'label_': ['Date', 'Date', 'Date', 'Date', 'Name', 'Name', 'Name', 'Name']} 
>>> ent = DataFrame(ent)
>>> substr(ent["doc_id"], start = list(ent["start"]), end = list(ent["end"] - 1))
['10/4/2021', '2021', '30/03', '09/04', 'Jan', 'Dr. Jan Janssens', 'Dr. Jan Janssens', 'Jan Janssens |']
>>> chunkranges = merge_chunkranges(ent, text)
>>> list(chunkranges["term_detected"])
['10/4/2021', '30/03', '09/04', 'Dr. Jan Janssens', 'Dr. Jan Janssens |']
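The core of the merging step is ordinary interval merging; a minimal sketch on bare (start, end) pairs (end exclusive, as in merge_chunkranges, and without the term_search bookkeeping):

```python
def merge_ranges(ranges):
    # ranges are (start, end) pairs with end exclusive
    merged = []
    for start, end in sorted(ranges):
        if merged and start < merged[-1][1]:
            # overlaps the previous range: extend its end
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```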

General

blackbar.utils.na_exclude(x)

Exclude elements with missing data from a list-like array

Parameters:

Name Type Description Default
x

A list

required

Returns:

Type Description

A list where missing data are removed

Examples:

>>> from blackbar import *
>>> import pandas as pd
>>> import numpy as np
>>> x = [1, np.nan, 3, 4, pd.NA, 6, None, 99]
>>> na_exclude(x) 
[1, 3, 4, 6, 99]
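Filtering out the whole zoo of missing values (None, float NaN, pandas.NA) can be sketched without depending on pandas (this is an illustration, not blackbar's implementation):

```python
def na_exclude_sketch(x):
    out = []
    for v in x:
        if v is None:
            continue
        try:
            if v != v:        # NaN-like values are not equal to themselves
                continue
        except TypeError:     # pandas.NA raises when forced into a boolean context
            continue
        out.append(v)
    return out
```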

blackbar.utils.chunk(x, n=2)

Split a list into n chunks

Parameters:

Name Type Description Default
x List

A list

required
n int

integer with the number of chunks. Defaults to 2

2

Returns:

Type Description

an iterator yielding the list split into n chunks (the chunks interleave elements, see the example)

Examples:

>>> from blackbar import *
>>> x = range(14)
>>> x = list(x)
>>> it = chunk(x, n = 2)
>>> next(it)
[0, 2, 4, 6, 8, 10, 12]
>>> next(it)
[1, 3, 5, 7, 9, 11, 13]
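The interleaved output above is what striding with step n produces; a minimal sketch consistent with the example (illustrative, not necessarily blackbar's implementation):

```python
def chunk_sketch(x, n=2):
    # yields n chunks built by striding: x[0::n], x[1::n], ...
    for i in range(n):
        yield x[i::n]
```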

General text features

blackbar.utils.txt_n_capital(x)

Count the number of capitalised letters

Parameters:

Name Type Description Default
x str

A str

required

Returns:

Type Description
int

integer with the number of capitalised letters in x

Examples:

>>> from blackbar import *
>>> txt_n_capital('Hello World')
2

blackbar.utils.txt_n_newlines(x)

Count the number of newlines

Parameters:

Name Type Description Default
x str

A str

required

Returns:

Type Description
int

integer with the number of newlines in x

Examples:

>>> from blackbar import *
>>> txt_n_newlines('Hello World')
0
>>> txt_n_newlines(['Hello World', None])
[0, None]

blackbar.utils.txt_contains_lot_of_capitals(x, threshold=0.5, min_length=2)

Test if a string contains a lot of capitals by looking at how much of the string is capitalised, subject to a minimum length of the string

Parameters:

Name Type Description Default
x str

A str

required
threshold float

Percent of the number of letters which should have a capital

0.5
min_length int

Minimum length of the number of letters in x

2

Returns:

Type Description
bool

bool indicating if x contains a lot of letters in capital case

Examples:

>>> from blackbar import *
>>> x = 'HELLO There'
>>> txt_contains_lot_of_capitals('HELLO There', threshold = 0.9)
False
>>> txt_contains_lot_of_capitals('HELLO There', threshold = 0.5)
True
>>> pct = txt_n_capital(x) / len(x)
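The threshold test can be sketched as follows, counting only letters so punctuation and spaces do not dilute the percentage (an illustration; blackbar's exact counting may differ):

```python
def contains_lot_of_capitals_sketch(x, threshold=0.5, min_length=2):
    letters = [c for c in x if c.isalpha()]
    if len(letters) < min_length:
        return False
    return sum(c.isupper() for c in letters) / len(letters) >= threshold
```

For 'HELLO There' the capital ratio over letters is 6/10 = 0.6, which explains the False/True pair in the examples above.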

blackbar.utils.txt_contains(x, pattern, is_regex=False, flags=re.IGNORECASE)

Check if a string contains a certain pattern

Parameters:

Name Type Description Default
x str

A str

required
pattern str

the pattern to look up in x

required
is_regex bool

boolean indicating if pattern is a regular expression

False
flags

passed on to the flags argument of re.search in case is_regex is True

IGNORECASE

Returns:

Type Description
bool

bool indicating if the string x contains the pattern

Examples:

>>> from blackbar import *
>>> txt_contains('Hello World', pattern = 'wo', is_regex = True)
True
>>> txt_contains('Hello World', pattern = 'wo', is_regex = False)
False
>>> txt_contains('Hello World', pattern = 'World')
True

blackbar.utils.txt_leading_trailing(x, type='leading')

Get the leading or trailing whitespace

Parameters:

Name Type Description Default
x Union[str, List[str]]

a str, a list or an list-like object with text

required
type str

either 'leading' or 'trailing'

'leading'

Returns:

Type Description

a list of the same length as x or a str

Examples:

>>> from blackbar import *
>>> x = ' \n Hello world   \r\n '
>>> txt_leading_trailing(x, type = 'leading')
' \n '
>>> txt_leading_trailing(x, type = 'trailing')
'   \r\n '
>>> x = ['  Hello world    ', '  ABCDEF']
>>> txt_leading_trailing(x, type = 'leading')
['  ', '  ']
>>> txt_leading_trailing(x, type = 'trailing')
['    ', '']
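For a single string, the leading/trailing whitespace is exactly what lstrip()/rstrip() would remove; a minimal sketch of that idea (illustrative only):

```python
def leading_trailing_sketch(s, type='leading'):
    if type == 'leading':
        # the prefix that lstrip() would remove
        return s[:len(s) - len(s.lstrip())]
    # the suffix that rstrip() would remove
    return s[len(s.rstrip()):]
```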

blackbar.utils.txt_trailing_spaces(x)

General text processing

blackbar.utils.txt_sample(x, n=1)

Sample from a list

Parameters:

Name Type Description Default
x List

A list

required
n int

the number of elements to sample. Defaults to 1

1

Returns:

Type Description
List

A list with a sample of n elements from the List x

Examples:

>>> from blackbar import *
>>> x = txt_sample(['a', 'b', 'c'], n = 2)

blackbar.utils.txt_paste(*lists, sep=' ', collapse=None)

Paste text together while removing None values

Parameters:

Name Type Description Default
lists str | List[str]

A str or a list of str

()
collapse Union[str, None]

A str indicating how to collapse a list together

None
sep str

A str indicating how to paste several elements together

' '

Returns:

Type Description

A list

Examples:

>>> from blackbar import *
>>> txt_paste(["a", "b", "c"], collapse = ", ") 
'a, b, c'
>>> txt_paste(["a", None, "b"], collapse = ", ") 
'a, b'
>>> txt_paste(["a", "b", "c"], ["1", "2", "3"], sep = "-") 
['a-1', 'b-2', 'c-3']
>>> txt_paste(["a", "b", None], ["1", "2", "3"], sep = "-")
['a-1', 'b-2', '3']

blackbar.utils.txt_insert(x, replacement, start, end=None, reverse=True)

Insert replacement text into a string

Parameters:

Name Type Description Default
x str

A text string

required
replacement Union[str, List[str]]

A text replacement

required
start Union[int, List[int]]

start position in x where to put the replacement

required
end Union[int, List[int], None]

end position in x where to end the insertion of the replacement

None
reverse bool

bool indicating whether to do the replacements starting from the back of the string, so that earlier positions are not shifted by the edits. Defaults to True.

True

Returns:

Type Description

x where the specified sections are replaced

Examples:

>>> from blackbar import *
>>> x = 'Kung Warrior'
>>> txt_insert(x, replacement = 'Fu ', start = 5)
'Kung Fu Warrior'
>>> x = 'Kung ___ Warrior'
>>> txt_insert(x, replacement = 'Fu', start = 5, end = 5 + 3)
'Kung Fu Warrior'
>>> x = 'Kung _ Warrior'
>>> txt_insert(x, replacement = 'Fu', start = 5, end = 5 + 1)
'Kung Fu Warrior'
>>> x = 'My name is _NAME_ and I work in _LOC_. I am _AGE_ years old.'
>>> txt_insert(x, replacement = ['Jos', 'Dorpstraat 40, 1000 Brussel', '43'], start = [11, 32, 44], end = [16+1, 36+1, 48+1])
'My name is Jos and I work in Dorpstraat 40, 1000 Brussel. I am 43 years old.'
>>> #import re     
>>> #[loc.start() for loc in re.finditer('_', x)]
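The reason for the back-to-front default is that each splice changes the string length, so applying the edits from the highest start position downward keeps the remaining offsets valid. A minimal sketch of that strategy (the helper name is illustrative, not blackbar's API):

```python
def insert_spans(x, replacement, start, end=None):
    # normalise scalars to one-element lists
    if isinstance(replacement, str):
        replacement, start, end = [replacement], [start], [end]
    if end is None:
        end = [None] * len(start)
    # apply from the back so earlier offsets are not shifted by the edits
    for rep, s, e in sorted(zip(replacement, start, end),
                            key=lambda t: t[1], reverse=True):
        e = s if e is None else e   # no end means pure insertion at s
        x = x[:s] + rep + x[e:]
    return x
```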

blackbar.utils.txt_freq(x, sort=True)

Univariate frequencies

Parameters:

Name Type Description Default
x list

a list or an list-like object

required
sort bool

boolean indicating to sort by frequency. Defaults to true.

True

Returns:

Type Description

a pandas DataFrame with columns key, freq, freq_pct indicating the frequencies of x

Examples:

>>> from blackbar import *
>>> x = ["a", "b", "b", "a", "b"]
>>> freq = txt_freq(x)

Text cleaning

blackbar.utils.txt_clean_word2vec(text, ascii=True, lower=True)

Clean text (ASCII conversion / lowercasing) and split it into words so word2vec can be applied. Words are extracted by splitting on punctuation symbols and stripping leading/trailing spaces.

Parameters:

Name Type Description Default
text str

A str

required

Returns:

Type Description
List[str]

a List[str] where the text is converted to ASCII and lowercased

Examples:

>>> from blackbar import *
>>> text = u'Dziennik zak‚óceƒ'
>>> txt_clean_word2vec(text)
['dziennik', 'zakoce']
>>> text = 'González, M.'
>>> txt_clean_word2vec(text)
['gonzalez', 'm']

blackbar.utils.ascii_translit(text)

Convert text to ASCII

Parameters:

Name Type Description Default
text str

A str

required

Returns:

Type Description
str

a str where the text is converted to ASCII

Examples:

>>> from blackbar import *
>>> text = u'Dziennik zak‚óceƒ'
>>> ascii_translit(text)
'Dziennik zakoce'
>>> text = 'González, M.'
>>> ascii_translit(text)
'Gonzalez, M.'
>>> ascii_translit('éêöà')
'eeoa'
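This kind of transliteration can be sketched with the standard library: NFKD normalisation decomposes accented characters into a base letter plus combining marks, and the ASCII encode drops the marks (an illustration, not necessarily blackbar's approach):

```python
import unicodedata

def ascii_translit_sketch(text):
    # decompose accented characters (NFKD), then drop the non-ASCII combining marks
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('ascii'))
```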

Tokenization

blackbar.utils.tokenize_letters(x)

Splits text into a list of characters

Parameters:

Name Type Description Default
x str

A str

required

Returns:

Type Description
List[str]

a List[str]

Examples:

>>> from blackbar import *
>>> tokenize_letters('Hello World')
['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']

blackbar.utils.tokenize_spaces_punct(x)

Splits text based on spaces and punctuation

Parameters:

Name Type Description Default
x str

A str

required

Returns:

Type Description
List[str]

a List[str]

Examples:

>>> from blackbar import *
>>> tokenize_spaces_punct('Hello World. You want more?')
['Hello', 'World', 'You', 'want', 'more']

blackbar.utils.tokenize_lines(x)

Splits text into lines

Parameters:

Name Type Description Default
x str

A str

required

Returns:

Type Description
List[str]

a List[str]

Examples:

>>> from blackbar import *
>>> tokenize_lines('Hello World.\nYou want more? \n\nYeah baby!')
['Hello World.', 'You want more? ', 'Yeah baby!']