Utilities
Custom functionalities for specific hospitals
blackbar.uzb.uzb_identify_chunks(x, type='patientid')
Identify with regular expressions whether a text contains a patient ID (patientid), a date, or a rijksregister (Belgian national register) number
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | a text | required
type | str | the type of element to look for: 'patientid', 'date' or 'rijksregister' | 'patientid'

Returns:

Type | Description
---|---
bool | bool indicating if the type is found
Examples:
>>> from blackbar import uzb_identify_chunks
>>> x = uzb_identify_chunks('hello world A930523DR00L/RVdV ok works', type = 'patientid')
>>> x = uzb_identify_chunks('hello world 123456789', type = 'patientid')
>>> x = uzb_identify_chunks('hello world 30/12/1978 ok works', type = 'date')
>>> x = uzb_identify_chunks('hello world 30-12-1978 ok works', type = 'date')
>>> x = uzb_identify_chunks('hello world 30.12.1978 ok works', type = 'date')
>>> x = uzb_identify_chunks('hello world 1978.12.30 ok works', type = 'date')
>>> x = uzb_identify_chunks('hello world 78.12.30-014.53 ok works', type = 'rijksregister')
>>> x = uzb_identify_chunks('hello world 78.12.30-014 not ok', type = 'rijksregister')
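The checks above can be sketched with plain regular expressions. The patterns below are illustrative assumptions (notably the rijksregister format `dd.mm.yy-nnn.cc` and a single date layout), not the exact expressions blackbar uses:

```python
import re

# Hypothetical patterns illustrating the kind of checks uzb_identify_chunks
# performs; the actual expressions in blackbar may be broader.
PATTERNS = {
    # Belgian national register number, e.g. 78.12.30-014.53
    "rijksregister": re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}\.\d{2}\b"),
    # dd/mm/yyyy, dd-mm-yyyy or dd.mm.yyyy
    "date": re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{4}\b"),
}

def identify_chunks(x, type="date"):
    """Return True when the text contains a match for the given type."""
    return PATTERNS[type].search(x) is not None
```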
blackbar.uzb.uzb_harmonize_physician(x, flags=re.IGNORECASE)
Harmonise names of physicians in the database by removing the Prof/Dr/Mevr/Mej/AP prefixes people put before their names, as well as numeric identifiers such as (123456789)
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | a text | required
flags | | regular expression flags used when matching the prefixes | IGNORECASE

Returns:

Type | Description
---|---
str | the str where the prefixes are removed
Examples:
>>> from blackbar import uzb_harmonize_physician
>>> uzb_harmonize_physician('PROF.DR. Janssens Jan')
'Janssens Jan'
>>> uzb_harmonize_physician(' MEVR. Linda Wittevrongel')
'Linda Wittevrongel'
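A minimal sketch of this prefix stripping, assuming a simple anchored regular expression applied repeatedly (the real prefix list and the handling of trailing numeric identifiers may differ):

```python
import re

# Illustrative prefix pattern; blackbar's actual expression may be broader.
PREFIX = re.compile(r"^\s*(?:prof|dr|mevr|mej|ap)\b\.?\s*", flags=re.IGNORECASE)

def harmonize_physician(x):
    """Strip leading Prof/Dr/Mevr/Mej/AP prefixes, repeating for stacked prefixes."""
    previous = None
    while previous != x:
        previous = x
        x = PREFIX.sub("", x)
    return x.strip()
```

The loop matters for stacked prefixes such as 'PROF.DR.', where a single substitution only removes the first one.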
blackbar.uzb.uzb_vn_achternaam(x, collapse=' ')
Combine first name + family name, keeping only the first letter of the first name. E.g. ["JAN", "JANSSENS"] becomes "J. JANSSENS"
Parameters:

Name | Type | Description | Default
---|---|---|---
x | List[str] | a list with 2 elements: first name & family name | required
collapse | str | string used to collapse the abbreviated first name & family name together | ' '

Returns:

Type | Description
---|---
str | the combined name where the first name is abbreviated to its first letter
Examples:
>>> from blackbar import uzb_vn_achternaam
>>> x = ["JAN", "JANSSENS"]
>>> uzb_vn_achternaam(x)
'J. JANSSENS'
>>> x = ["Linda", "Wittevrongel"]
>>> uzb_vn_achternaam(x)
'L. Wittevrongel'
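The combination rule can be sketched in a few lines (a hypothetical re-implementation, shown only to illustrate the behaviour):

```python
def vn_achternaam(x, collapse=" "):
    # Keep only the first letter of the first name, followed by a dot,
    # then glue it to the family name with the collapse string.
    voornaam, achternaam = x
    return collapse.join([voornaam[0] + ".", achternaam])
```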
blackbar.uzb.uzb_txt_contains(x, type)
Check, based on predefined regular expressions, if a string contains a date, a date symbol, a day of the week, an age, an hour or a street indication. TODO: 'maandag' (Monday) is still detected as a month. TODO: '01 januari' is still detected as an age as well.
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | a text | required
type | str | the type of element to look for. Possible values are 'date', 'datesymbol', 'dow', 'age', 'hour', 'streetindication' | required

Returns:

Type | Description
---|---
bool | a boolean indicating the pattern has been found
Examples:
>>> from blackbar import uzb_txt_contains
>>> text = 'Op maandag 01 januari 2022 15:15 kwam u langs. Het bleek een mooie dag te zijn'
>>> text = 'Op dinsdag 01 februari 2022 15:15 kwam u langs. Het bleek een mooie dag te zijn'
>>> uzb_txt_contains(text, type = 'date')
True
>>> uzb_txt_contains(text, type = 'datesymbol')
False
>>> uzb_txt_contains(text, type = 'dow')
True
>>> uzb_txt_contains(text, type = 'age')
False
>>> uzb_txt_contains(text, type = 'hour')
True
>>> uzb_txt_contains(text, type = 'streetindication')
False
blackbar.uzb.uzb_detect_smith_waterman(data, log=False, alignment_method='default')
Extract names/addresses based on Smith-Waterman alignment, either for patient names or for names of physicians that the patient got into contact with.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | Series | the selection of 1 row from a pandas DataFrame with the combination of enriched patient identifiers as extracted with deid_enrich_identifiers(type = 'patients') and identifiers of the physicians the patient got into contact with as extracted with deid_enrich_identifiers(type = 'patients_physicians') | required
log | bool | bool indicating whether to print the information in case the alignment fails | False
alignment_method | str | either 'default' or 'biopython', passed on to Smith_Waterman | 'default'

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similarity, start, end, length with the detected entities and their exact positions in the text
Examples:
>>> from blackbar import deid_enrich_identifiers, blackbar_example, deid_smith_waterman
>>> from rlike import *
>>> x = blackbar_example('patients_physicians')
>>> pats = deid_enrich_identifiers(x, type = "patients_physicians")
>>> docs = blackbar_example('documents')
>>> docs = deid_enrich_identifiers(docs, type = "patients")
>>> docs = docs.merge(pats, how = "left", left_on = "patientId", right_on = "pat_dos_nr")
>>> sw = deid_smith_waterman(docs.iloc[0], alignment_method = "default")
>>> sw = deid_smith_waterman(docs.iloc[0], alignment_method = "biopython")
>>> sw = deid_smith_waterman(docs.iloc[1], alignment_method = "biopython")
>>> data = docs.iloc[0]
>>> ents = deid_smith_waterman(data, alignment_method = "biopython")
>>> ents["term_detected_start_end"] = substr(data["text"], list(ents["start"]), list(ents["end"]))
>>> sum(list(ents["term_detected_start_end"] != ents["term_detected"]))
0
>>> sum(nchar(ents["term_detected"]) != ents["length"])
0
>>> ents = deid_smith_waterman(data, alignment_method = "default")
>>> ents["term_detected_start_end"] = substr(data["text"], list(ents["start"]), list(ents["end"]))
>>> sum(list(ents["term_detected_start_end"] != ents["term_detected"]))
0
>>> sum(nchar(ents["term_detected"]) != ents["length"])
0
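For reference, the core of a Smith-Waterman local alignment score can be sketched as below. The scoring scheme (match=2, mismatch=-1, gap=-1) is an assumption for illustration; blackbar's 'default' and 'biopython' alignment methods may score differently and additionally recover the matched positions:

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of an alignment ending at a[i-1], b[j-1];
    # scores are floored at 0, which is what makes the alignment local.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Because the score is local, a short search term such as a name embedded anywhere in a long text still reaches its maximal score.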
blackbar.uzb.uzb_enrich_identifiers(data, type='patients')
Create combinations of name/address identifiers which can be used as lookup terms for Smith-Waterman alignment, either for patient names or for names of physicians that the patient got into contact with.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | DataFrame | a pandas DataFrame with the following fields. If type is 'patients_physicians': pat_dos_nr, hcact_vnaam, hcact_fam_naam, hcact_adres, hcact_post_nr, hcact_gemeente. If type is 'patients': performingPhysicianId, responsiblePhysicianId and pat_vnaam, pat_fam_naam, pat_adres, pat_post_nr, pat_gemeente | required
type | str | either 'patients' or 'patients_physicians' | 'patients'

Returns:

If type is 'patients_physicians', returns a pandas DataFrame with 1 row per patient with columns pat_dos_nr, physician_adres, physician_naam_kort, physician_naam_lang:

- the columns physician_adres, physician_naam_kort, physician_naam_lang contain a list of all possible physicians the patient got into contact with, their addresses and different ways of writing these
- the address variants are constructed as
  - hcact_adres + hcact_post_nr + hcact_gemeente
  - hcact_post_nr + hcact_gemeente
- the name variants are constructed as (note: the first name hcact_vnaam on its own is not used)
  - hcact_fam_naam
  - hcact_vnaam + hcact_fam_naam
  - hcact_fam_naam + hcact_vnaam
  - first letter of hcact_vnaam + '. ' + hcact_fam_naam (e.g. J. Janssens)

If type is 'patients', returns the pandas DataFrame data with the following extra columns added/changed:

- Physician info:
  - performingPhysicianId, responsiblePhysicianId: harmonised by removing the 'Prof/Dr/Mevr/Mej/AP' prefixes people put before their names
  - performing_dr_anvn, performing_dr_vnan, performing_dr_vnanafk, performing_dr_an, performing_dr_vn (combining first name / last name based on the information in performingPhysicianId)
  - responsible_dr_anvn, responsible_dr_vnan, responsible_dr_vnanafk, responsible_dr_an, responsible_dr_vn (combining first name / last name based on the information in responsiblePhysicianId)
- Patient info:
  - pat_anvn, pat_vnan, pat_vnanafk, pat_vnaam, pat_fam_naam (first name / last name combinations and the J. Janssens abbreviation)
  - pat_adresgegevens, pat_postcode_gemeente, pat_adres
Examples:
>>> from blackbar import deid_enrich_identifiers, PseudoGenerator, blackbar_example
>>> import pandas as pd
>>> pseudo = PseudoGenerator()
>>> x = pd.DataFrame({"pat_dos_nr": pseudo.generate(type = "ID_Patient", n = 1), "hcact_vnaam": ["Jan", "Piet"], "hcact_fam_naam": ["Janssens", "Peeters"], "hcact_adres": ["Stormy Daneelsstraat 125", "Stationsstraat 321"], "hcact_post_nr": ["1000", "1090"], "hcact_gemeente": ["Brussel", "Jette"]})
>>> x = blackbar_example('patients_physicians')
>>> d = deid_enrich_identifiers(x, type = "patients_physicians")
>>> d = d[d["pat_dos_nr"] == "Y871129di17X"].reset_index()
>>> d["physician_naam_lang"][0]
['J. Janssens', 'P. Peeters', 'Janssens Jan', 'Peeters Piet', 'Jan Janssens', 'Piet Peeters']
>>> d["physician_adres"][0]
['Stormy Daneelsstraat 125', 'Stationsstraat 321', 'Stormy Daneelsstraat 125 1000 Brussel', 'Stationsstraat 321 1090 Jette', '1000 Brussel', '1090 Jette']
>>> x = pd.DataFrame({"patientId": pseudo.generate(type = "ID_Patient", n = 1), "performingPhysicianId": ["PROF.DR. JANSSENS, JAN", "DR. PEETERS, PIET"], "responsiblePhysicianId": [None, "PROF.DR. JANSSENS, JAN"], "pat_vnaam": ["Mehdi", "Jos"], "pat_fam_naam": ["Olek", "Vermeulen"], "pat_adres": ["Grote Beek 9", "Laarbeeklaan 99"], "pat_post_nr": ["1000", "1090"], "pat_gemeente": ["Brussel", "Jette"]})
>>> x = blackbar_example('documents')
>>> d = deid_enrich_identifiers(x, type = "patients")
>>> d = d[d["patientId"] == "Y871129di17X"].reset_index()
>>> d["performingPhysicianId"][0]
'JANSSENS, JAN'
>>> d["pat_anvn"][0]
'Olek Mehdi'
>>> d["pat_vnanafk"][0]
'M. Olek'
>>> d["pat_adresgegevens"][0]
'Grote Beek 9 1000 Brussel'
>>> d["pat_postcode_gemeente"][0]
'1000 Brussel'
blackbar.uzb.combine_chunkranges(chunks_sw, chunks_nlp, text)
Chunks / spans
blackbar.blackbar.merge_chunkranges(data, text, include_end=False)
Combine overlapping chunk ranges. Overlapping ranges are handled as follows: we first order the ranges by starting position. If the start value of a chunk falls within the previous chunk range (it is smaller than the previous end value, so it overlaps), we extend the previous end value with the new end value and keep the longest search string (term_search), as we assume the detected chunk ranges come from a Smith-Waterman alignment.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | DataFrame | A data frame with columns doc_id, label_, term_search, start, end, similarity indicating the location of detected entities, where the end value is the end position in the text + 1 | required
text | str | the text string in which the entities in data were found | required
include_end | bool | bool indicating whether to include the end position | False

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similarity, start, end, length with the merged chunk ranges
Examples:
>>> from pandas import DataFrame
>>> from rlike import substr
>>> from blackbar.data import blackbar_example
>>> text = blackbar_example()
>>> ent = {'doc_id': text, 'similarity': 0, 'term_search': ['A', 'AA', 'B', 'B', 'C', 'CC', 'D', 'DD'], 'start': [10, 10+5, 93, 242, 343 + 4, 343, 384, 384+4], 'end': [19, 19, 98, 247, 343 + 7, 359, 400, 400+2], 'label_': ['Date', 'Date', 'Date', 'Date', 'Name', 'Name', 'Name', 'Name']}
>>> ent = DataFrame(ent)
>>> substr(ent["doc_id"], start = list(ent["start"]), end = list(ent["end"] - 1))
['10/4/2021', '2021', '30/03', '09/04', 'Jan', 'Dr. Jan Janssens', 'Dr. Jan Janssens', 'Jan Janssens |']
>>> chunkranges = merge_chunkranges(ent, text)
>>> list(chunkranges["term_detected"])
['10/4/2021', '30/03', '09/04', 'Dr. Jan Janssens', 'Dr. Jan Janssens |']
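The merge rule described above (sort by start, extend the previous range when the next one starts inside it) can be sketched on bare (start, end) tuples, where end is exclusive as in the convention above:

```python
def merge_ranges(ranges):
    """Merge overlapping (start, end) ranges; end is exclusive."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start < merged[-1][1]:
            # Overlap: extend the previous range rather than adding a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]
```

The real merge_chunkranges additionally carries along label_, term_search and similarity, keeping the longest term_search of the merged chunks.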
General
blackbar.utils.na_exclude(x)
Exclude elements with missing data from a list-like array
Parameters:

Name | Type | Description | Default
---|---|---|---
x | list | a list-like array | required

Returns:

Type | Description
---|---
list | A list where the missing data are removed
Examples:
>>> from blackbar import *
>>> import pandas as pd
>>> import numpy as np
>>> x = [1, np.nan, 3, 4, pd.NA, 6, None, 99]
>>> na_exclude(x)
[1, 3, 4, 6, 99]
blackbar.utils.chunk(x, n=2)
Split a list into n chunks
Parameters:

Name | Type | Description | Default
---|---|---|---
x | List | A list | required
n | int | integer with the number of chunks. Defaults to 2 | 2

Returns:

Type | Description
---|---
Iterator | a generator yielding the n chunks of the list
Examples:
>>> from blackbar import *
>>> x = range(14)
>>> x = list(x)
>>> it = chunk(x, n = 2)
>>> next(it)
[0, 2, 4, 6, 8, 10, 12]
>>> next(it)
[1, 3, 5, 7, 9, 11, 13]
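Judging from the example output, the chunks are built by striding over the list (every n-th element) rather than by consecutive slices; that behaviour can be sketched as:

```python
def chunk(x, n=2):
    # Yield n chunks, where chunk i holds elements i, i+n, i+2n, ...
    for i in range(n):
        yield x[i::n]
```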
General text features
blackbar.utils.txt_n_capital(x)
Count the number of capitalised letters
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required

Returns:

Type | Description
---|---
int | integer with the number of capitalised letters in x
Examples:
>>> from blackbar import *
>>> txt_n_capital('Hello World')
2
blackbar.utils.txt_n_newlines(x)
Count the number of newlines
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required

Returns:

Type | Description
---|---
int | integer with the number of newlines in x
Examples:
>>> from blackbar import *
>>> txt_n_newlines('Hello World')
0
>>> txt_n_newlines(['Hello World', None])
[0, None]
blackbar.utils.txt_contains_lot_of_capitals(x, threshold=0.5, min_length=2)
Test if a string contains a lot of capitals by looking at the fraction of the string that is capitalised, subject to a minimum string length
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required
threshold | float | fraction of the letters which should be capitalised | 0.5
min_length | int | minimum number of letters required in x | 2

Returns:

Type | Description
---|---
bool | bool indicating if x contains a lot of letters in capital case
Examples:
>>> from blackbar import *
>>> x = 'HELLO There'
>>> txt_contains_lot_of_capitals('HELLO There', threshold = 0.9)
False
>>> txt_contains_lot_of_capitals('HELLO There', threshold = 0.5)
True
>>> pct = txt_n_capital(x) / len(x)
blackbar.utils.txt_contains(x, pattern, is_regex=False, flags=re.IGNORECASE)
Check if a string contains a certain pattern
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required
pattern | str | the pattern to look up in x | required
is_regex | bool | boolean indicating if pattern is a regular expression | False
flags | | passed on to the flags argument of re.search in case is_regex is True | IGNORECASE

Returns:

Type | Description
---|---
bool | bool indicating if the string x contains the pattern
Examples:
>>> from blackbar import *
>>> txt_contains('Hello World', pattern = 'wo', is_regex = True)
True
>>> txt_contains('Hello World', pattern = 'wo', is_regex = False)
False
>>> txt_contains('Hello World', pattern = 'World')
True
blackbar.utils.txt_leading_trailing(x, type='leading')
Get the leading or trailing whitespace
Parameters:

Name | Type | Description | Default
---|---|---|---
x | Union[str, List[str]] | a str, a list or a list-like object with text | required
type | str | either 'leading' or 'trailing' | 'leading'

Returns:

Type | Description
---|---
Union[str, List[str]] | a list of the same length as x, or a str
Examples:
>>> from blackbar import *
>>> x = ' \n Hello world \r\n '
>>> txt_leading_trailing(x, type = 'leading')
' \n '
>>> txt_leading_trailing(x, type = 'trailing')
' \r\n '
>>> x = [' Hello world ', ' ABCDEF']
>>> txt_leading_trailing(x, type = 'leading')
[' ', ' ']
>>> txt_leading_trailing(x, type = 'trailing')
[' ', '']
blackbar.utils.txt_trailing_spaces(x)
General text processing
blackbar.utils.txt_sample(x, n=1)
Sample from a list
Parameters:

Name | Type | Description | Default
---|---|---|---
x | List | A list | required
n | int | the number of elements to sample | 1

Returns:

Type | Description
---|---
List | A list with a sample of n elements from the List x
Examples:
>>> from blackbar import *
>>> x = txt_sample(['a', 'b', 'c'], n = 2)
blackbar.utils.txt_paste(*lists, sep=' ', collapse=None)
Paste text together while removing None values
Parameters:

Name | Type | Description | Default
---|---|---|---
lists | str \| List[str] | A str or a list of str | ()
collapse | Union[str, None] | A str indicating how to collapse a list together | None
sep | str | A str indicating how to paste several elements together | ' '

Returns:

Type | Description
---|---
 | A list of str, or a single str if collapse is not None
Examples:
>>> from blackbar import *
>>> txt_paste(["a", "b", "c"], collapse = ", ")
'a, b, c'
>>> txt_paste(["a", None, "b"], collapse = ", ")
'a, b'
>>> txt_paste(["a", "b", "c"], ["1", "2", "3"], sep = "-")
['a-1', 'b-2', 'c-3']
>>> txt_paste(["a", "b", None], ["1", "2", "3"], sep = "-")
['a-1', 'b-2', '3']
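The None-dropping paste semantics shown in the examples can be sketched as follows (a hypothetical re-implementation, assuming all positional arguments are lists of equal length):

```python
def paste(*lists, sep=" ", collapse=None):
    # Zip the lists element-wise, drop None values, then join each row with sep.
    rows = [[v for v in row if v is not None] for row in zip(*lists)]
    out = [sep.join(row) for row in rows]
    if collapse is not None:
        # Collapse to a single string, skipping rows that became empty.
        return collapse.join(v for v in out if v)
    return out
```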
blackbar.utils.txt_insert(x, replacement, start, end=None, reverse=True)
Insert another text in a string
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A text string | required
replacement | Union[str, List[str]] | A text replacement | required
start | Union[int, List[int]] | start position in x where to put the replacement | required
end | Union[int, List[int], None] | end position in x where to end the insertion of the replacement | None
reverse | bool | logical indicating to do the replacement starting from the back. Defaults to True | True

Returns:

Type | Description
---|---
str | x where the specified sections are replaced
Examples:
>>> from blackbar import *
>>> x = 'Kung Warrior'
>>> txt_insert(x, replacement = 'Fu ', start = 5)
'Kung Fu Warrior'
>>> x = 'Kung ___ Warrior'
>>> txt_insert(x, replacement = 'Fu', start = 5, end = 5 + 3)
'Kung Fu Warrior'
>>> x = 'Kung _ Warrior'
>>> txt_insert(x, replacement = 'Fu', start = 5, end = 5 + 1)
'Kung Fu Warrior'
>>> x = 'My name is _NAME_ and I work in _LOC_. I am _AGE_ years old.'
>>> txt_insert(x, replacement = ['Jos', 'Dorpstraat 40, 1000 Brussel', '43'], start = [11, 32, 44], end = [16+1, 36+1, 48+1])
'My name is Jos and I work in Dorpstraat 40, 1000 Brussel. I am 43 years old.'
>>> #import re
>>> #[loc.start() for loc in re.finditer('_', x)]
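The reason for reverse=True is that applying the replacements back to front keeps the earlier start/end offsets valid while the string grows or shrinks; a minimal sketch of that idea (a hypothetical helper, not blackbar's implementation):

```python
def insert_all(x, replacements, starts, ends):
    """Replace x[start:end] with each replacement, applied from the back."""
    # Sort by start position, descending, so earlier offsets stay correct
    # even when a replacement changes the length of the string.
    for repl, start, end in sorted(zip(replacements, starts, ends),
                                   key=lambda t: t[1], reverse=True):
        x = x[:start] + repl + x[end:]
    return x
```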
blackbar.utils.txt_freq(x, sort=True)
Univariate frequencies
Parameters:

Name | Type | Description | Default
---|---|---|---
x | list | a list or a list-like object | required
sort | bool | boolean indicating to sort by frequency. Defaults to True | True

Returns:

Type | Description
---|---
DataFrame | a pandas DataFrame with columns key, freq, freq_pct indicating the frequencies of the elements of x
Examples:
>>> from blackbar import *
>>> x = ["a", "b", "b", "a", "b"]
>>> freq = txt_freq(x)
Text cleaning
blackbar.utils.txt_clean_word2vec(text, ascii=True, lower=True)
Clean text (ASCII transliteration/lowercasing) and split it into words so that word2vec can be applied. Words are extracted by splitting on punctuation symbols and stripping leading/trailing spaces.
Parameters:

Name | Type | Description | Default
---|---|---|---
text | str | A str | required
ascii | bool | whether to transliterate the text to ASCII | True
lower | bool | whether to lowercase the text | True

Returns:

Type | Description
---|---
List[str] | a List[str] where the text is converted to ASCII and lowercased
Examples:
>>> from blackbar import *
>>> text = u'Dziennik zak‚óceƒ'
>>> txt_clean_word2vec(text)
['dziennik', 'zakoce']
>>> text = 'González, M.'
>>> txt_clean_word2vec(text)
['gonzalez', 'm']
blackbar.utils.ascii_translit(text)
Convert text to ASCII
Parameters:

Name | Type | Description | Default
---|---|---|---
text | str | A str | required

Returns:

Type | Description
---|---
str | a str where the text is converted to ASCII
Examples:
>>> from blackbar import *
>>> text = u'Dziennik zak‚óceƒ'
>>> ascii_translit(text)
'Dziennik zakoce'
>>> text = 'González, M.'
>>> ascii_translit(text)
'Gonzalez, M.'
>>> ascii_translit('éêöà')
'eeoa'
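A common way to implement such transliteration is Unicode NFKD decomposition followed by dropping the combining marks; whether blackbar's ascii_translit does exactly this is an assumption:

```python
import unicodedata

def to_ascii(text):
    # NFKD splits accented characters into a base letter plus combining marks;
    # encoding to ASCII with errors='ignore' then drops the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```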
Tokenization
blackbar.utils.tokenize_letters(x)
Splits text into a list of individual characters
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required

Returns:

Type | Description
---|---
List[str] | a List[str]
Examples:
>>> from blackbar import *
>>> tokenize_letters('Hello World')
['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']
blackbar.utils.tokenize_spaces_punct(x)
Splits text based on spaces and punctuation
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required

Returns:

Type | Description
---|---
List[str] | a List[str]
Examples:
>>> from blackbar import *
>>> tokenize_spaces_punct('Hello World. You want more?')
['Hello', 'World', 'You', 'want', 'more']
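The example output matches a simple word-character scan; a minimal sketch (the actual implementation may split differently, e.g. keeping numbers or underscores apart):

```python
import re

def tokenize(x):
    # Runs of word characters become tokens; punctuation and spaces are dropped.
    return re.findall(r"\w+", x)
```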
blackbar.utils.tokenize_lines(x)
Splits text in lines
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required

Returns:

Type | Description
---|---
List[str] | a List[str]
Examples:
>>> from blackbar import *
>>> tokenize_lines('Hello World.\nYou want more? \n\nYeah baby!')
['Hello World.', 'You want more? ', 'Yeah baby!']