Utilities
Custom functionalities for specific hospitals
blackbar.uzb.uzb_identify_chunks(x, type='patientid')
Identify with regular expressions whether a text contains a patient ID (patientid), a date, or a rijksregister (Belgian national register) number
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | a text | required
type | str | the type of element to look for: 'patientid', 'date' or 'rijksregister' | 'patientid'

Returns:

Type | Description
---|---
bool | bool indicating if the type is found
Examples:
>>> from blackbar import uzb_identify_chunks
>>> x = uzb_identify_chunks('hello world A930523DR00L/RVdV ok works', type = 'patientid')
>>> x = uzb_identify_chunks('hello world 123456789', type = 'patientid')
>>> x = uzb_identify_chunks('hello world 30/12/1978 ok works', type = 'date')
>>> x = uzb_identify_chunks('hello world 30-12-1978 ok works', type = 'date')
>>> x = uzb_identify_chunks('hello world 30.12.1978 ok works', type = 'date')
>>> x = uzb_identify_chunks('hello world 1978.12.30 ok works', type = 'date')
>>> x = uzb_identify_chunks('hello world 78.12.30-014.53 ok works', type = 'rijksregister')
>>> x = uzb_identify_chunks('hello world 78.12.30-014 not ok', type = 'rijksregister')
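The checks above can be sketched with plain regular expressions. The patterns below are illustrative assumptions (notably the rijksregister format `dd.mm.yy-nnn.cc` and a single date layout), not the exact expressions blackbar uses:

```python
import re

# Hypothetical patterns illustrating the kind of checks uzb_identify_chunks
# performs; the actual expressions in blackbar may be broader.
PATTERNS = {
    # Belgian national register number, e.g. 78.12.30-014.53
    "rijksregister": re.compile(r"\b\d{2}\.\d{2}\.\d{2}-\d{3}\.\d{2}\b"),
    # dd/mm/yyyy, dd-mm-yyyy or dd.mm.yyyy
    "date": re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{4}\b"),
}

def identify_chunks(x, type="date"):
    """Return True when the text contains a match for the given type."""
    return PATTERNS[type].search(x) is not None
```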
blackbar.uzb.uzb_harmonize_physician(x, flags=re.IGNORECASE)
Harmonise names of physicians in the database by removing the Prof/Dr/Mevr/Mej/AP prefixes people put before their names, as well as numeric identifiers such as (123456789)
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | a text | required
flags | | regular expression flags used when matching the prefixes | IGNORECASE

Returns:

Type | Description
---|---
str | the str where the prefixes are removed
Examples:
>>> from blackbar import uzb_harmonize_physician
>>> uzb_harmonize_physician('PROF.DR. Janssens Jan')
'Janssens Jan'
>>> uzb_harmonize_physician(' MEVR. Linda Wittevrongel')
'Linda Wittevrongel'
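A minimal sketch of this prefix stripping, assuming a simple anchored regular expression applied repeatedly (the real prefix list and the handling of trailing numeric identifiers may differ):

```python
import re

# Illustrative prefix pattern; blackbar's actual expression may be broader.
PREFIX = re.compile(r"^\s*(?:prof|dr|mevr|mej|ap)\b\.?\s*", flags=re.IGNORECASE)

def harmonize_physician(x):
    """Strip leading Prof/Dr/Mevr/Mej/AP prefixes, repeating for stacked prefixes."""
    previous = None
    while previous != x:
        previous = x
        x = PREFIX.sub("", x)
    return x.strip()
```

The loop matters for stacked prefixes such as 'PROF.DR.', where a single substitution only removes the first one.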
blackbar.uzb.uzb_vn_achternaam(x, collapse=' ')
Combine first name + family name, keeping only the first letter of the first name. E.g. ["JAN", "JANSSENS"] becomes "J. JANSSENS"
Parameters:

Name | Type | Description | Default
---|---|---|---
x | List[str] | a list with 2 elements: first name & family name | required
collapse | str | string used to collapse the abbreviated first name & family name together | ' '

Returns:

Type | Description
---|---
str | the combined name where the first name is abbreviated to its first letter
Examples:
>>> from blackbar import uzb_vn_achternaam
>>> x = ["JAN", "JANSSENS"]
>>> uzb_vn_achternaam(x)
'J. JANSSENS'
>>> x = ["Linda", "Wittevrongel"]
>>> uzb_vn_achternaam(x)
'L. Wittevrongel'
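The combination rule can be sketched in a few lines (a hypothetical re-implementation, shown only to illustrate the behaviour):

```python
def vn_achternaam(x, collapse=" "):
    # Keep only the first letter of the first name, followed by a dot,
    # then glue it to the family name with the collapse string.
    voornaam, achternaam = x
    return collapse.join([voornaam[0] + ".", achternaam])
```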
blackbar.uzb.uzb_txt_contains(x, type)
Check, based on predefined regular expressions, if a string contains a date, a date symbol, a day of the week, an age, an hour or a street indication. TODO: 'maandag' (Monday) is still detected as a month. TODO: '01 januari' is still detected as an age as well.
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | a text | required
type | str | the type of element to look for. Possible values are 'date', 'datesymbol', 'dow', 'age', 'hour', 'streetindication' | required

Returns:

Type | Description
---|---
bool | a boolean indicating the pattern has been found
Examples:
>>> from blackbar import uzb_txt_contains
>>> text = 'Op maandag 01 januari 2022 15:15 kwam u langs. Het bleek een mooie dag te zijn'
>>> text = 'Op dinsdag 01 februari 2022 15:15 kwam u langs. Het bleek een mooie dag te zijn'
>>> uzb_txt_contains(text, type = 'date')
True
>>> uzb_txt_contains(text, type = 'datesymbol')
False
>>> uzb_txt_contains(text, type = 'dow')
True
>>> uzb_txt_contains(text, type = 'age')
False
>>> uzb_txt_contains(text, type = 'hour')
True
>>> uzb_txt_contains(text, type = 'streetindication')
False
blackbar.uzb.uzb_detect_smith_waterman(data, log=False, alignment_method='default')
Extract names/addresses based on Smith-Waterman alignment, either for patient names or for names of physicians that the patient got into contact with.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | Series | the selection of 1 row from a pandas DataFrame with the combination of enriched patient identifiers as extracted with deid_enrich_identifiers(type = 'patients') and identifiers of the physicians the patient got into contact with as extracted with deid_enrich_identifiers(type = 'patients_physicians') | required
log | bool | bool indicating whether to print the information in case the alignment fails | False
alignment_method | str | either 'default' or 'biopython', passed on to Smith_Waterman | 'default'

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similarity, start, end, length with the detected entities and their exact positions in the text
Examples:
>>> from blackbar import deid_enrich_identifiers, blackbar_example, deid_smith_waterman
>>> from rlike import *
>>> x = blackbar_example('patients_physicians')
>>> pats = deid_enrich_identifiers(x, type = "patients_physicians")
>>> docs = blackbar_example('documents')
>>> docs = deid_enrich_identifiers(docs, type = "patients")
>>> docs = docs.merge(pats, how = "left", left_on = "patientId", right_on = "pat_dos_nr")
>>> sw = deid_smith_waterman(docs.iloc[0], alignment_method = "default")
>>> sw = deid_smith_waterman(docs.iloc[0], alignment_method = "biopython")
>>> sw = deid_smith_waterman(docs.iloc[1], alignment_method = "biopython")
>>> data = docs.iloc[0]
>>> ents = deid_smith_waterman(data, alignment_method = "biopython")
>>> ents["term_detected_start_end"] = substr(data["text"], list(ents["start"]), list(ents["end"]))
>>> sum(list(ents["term_detected_start_end"] != ents["term_detected"]))
0
>>> sum(nchar(ents["term_detected"]) != ents["length"])
0
>>> ents = deid_smith_waterman(data, alignment_method = "default")
>>> ents["term_detected_start_end"] = substr(data["text"], list(ents["start"]), list(ents["end"]))
>>> sum(list(ents["term_detected_start_end"] != ents["term_detected"]))
0
>>> sum(nchar(ents["term_detected"]) != ents["length"])
0
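For reference, the core of a Smith-Waterman local alignment score can be sketched as below. The scoring scheme (match=2, mismatch=-1, gap=-1) is an assumption for illustration; blackbar's 'default' and 'biopython' alignment methods may score differently and additionally recover the matched positions:

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of an alignment ending at a[i-1], b[j-1];
    # scores are floored at 0, which is what makes the alignment local.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Because the score is local, a short search term such as a name embedded anywhere in a long text still reaches its maximal score.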
blackbar.uzb.uzb_enrich_identifiers(data, type='patients')
Create combinations of name/address identifiers which can be used as lookup terms for Smith-Waterman alignment, either for patient names or for names of physicians that the patient got into contact with.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | DataFrame | a pandas DataFrame with the following fields. If type is 'patients_physicians': pat_dos_nr, hcact_vnaam, hcact_fam_naam, hcact_adres, hcact_post_nr, hcact_gemeente. If type is 'patients': performingPhysicianId, responsiblePhysicianId and pat_vnaam, pat_fam_naam, pat_adres, pat_post_nr, pat_gemeente | required
type | str | either 'patients' or 'patients_physicians' | 'patients'

Returns:

If type is 'patients_physicians', returns a pandas DataFrame with 1 row per patient with columns pat_dos_nr, physician_adres, physician_naam_kort, physician_naam_lang:

- the columns physician_adres, physician_naam_kort, physician_naam_lang contain a list of all possible physicians the patient got into contact with, their addresses and different ways of writing these
- the address variants are constructed as
  - hcact_adres + hcact_post_nr + hcact_gemeente
  - hcact_post_nr + hcact_gemeente
- the name variants are constructed as (note: the first name hcact_vnaam on its own is not used)
  - hcact_fam_naam
  - hcact_vnaam + hcact_fam_naam
  - hcact_fam_naam + hcact_vnaam
  - first letter of hcact_vnaam + '. ' + hcact_fam_naam (e.g. J. Janssens)

If type is 'patients', returns the pandas DataFrame data with the following extra columns added/changed:

- Physician info:
  - performingPhysicianId, responsiblePhysicianId: harmonised by removing the 'Prof/Dr/Mevr/Mej/AP' prefixes people put before their names
  - performing_dr_anvn, performing_dr_vnan, performing_dr_vnanafk, performing_dr_an, performing_dr_vn (combining first name / last name based on the information in performingPhysicianId)
  - responsible_dr_anvn, responsible_dr_vnan, responsible_dr_vnanafk, responsible_dr_an, responsible_dr_vn (combining first name / last name based on the information in responsiblePhysicianId)
- Patient info:
  - pat_anvn, pat_vnan, pat_vnanafk, pat_vnaam, pat_fam_naam (first name / last name combinations and the J. Janssens abbreviation)
  - pat_adresgegevens, pat_postcode_gemeente, pat_adres
Examples:
>>> from blackbar import deid_enrich_identifiers, PseudoGenerator, blackbar_example
>>> import pandas as pd
>>> pseudo = PseudoGenerator()
>>> x = pd.DataFrame({"pat_dos_nr": pseudo.generate(type = "ID_Patient", n = 1), "hcact_vnaam": ["Jan", "Piet"], "hcact_fam_naam": ["Janssens", "Peeters"], "hcact_adres": ["Stormy Daneelsstraat 125", "Stationsstraat 321"], "hcact_post_nr": ["1000", "1090"], "hcact_gemeente": ["Brussel", "Jette"]})
>>> x = blackbar_example('patients_physicians')
>>> d = deid_enrich_identifiers(x, type = "patients_physicians")
>>> d = d[d["pat_dos_nr"] == "Y871129di17X"].reset_index()
>>> d["physician_naam_lang"][0]
['J. Janssens', 'P. Peeters', 'Janssens Jan', 'Peeters Piet', 'Jan Janssens', 'Piet Peeters']
>>> d["physician_adres"][0]
['Stormy Daneelsstraat 125', 'Stationsstraat 321', 'Stormy Daneelsstraat 125 1000 Brussel', 'Stationsstraat 321 1090 Jette', '1000 Brussel', '1090 Jette']
>>> x = pd.DataFrame({"patientId": pseudo.generate(type = "ID_Patient", n = 1), "performingPhysicianId": ["PROF.DR. JANSSENS, JAN", "DR. PEETERS, PIET"], "responsiblePhysicianId": [None, "PROF.DR. JANSSENS, JAN"], "pat_vnaam": ["Mehdi", "Jos"], "pat_fam_naam": ["Olek", "Vermeulen"], "pat_adres": ["Grote Beek 9", "Laarbeeklaan 99"], "pat_post_nr": ["1000", "1090"], "pat_gemeente": ["Brussel", "Jette"]})
>>> x = blackbar_example('documents')
>>> d = deid_enrich_identifiers(x, type = "patients")
>>> d = d[d["patientId"] == "Y871129di17X"].reset_index()
>>> d["performingPhysicianId"][0]
'JANSSENS, JAN'
>>> d["pat_anvn"][0]
'Olek Mehdi'
>>> d["pat_vnanafk"][0]
'M. Olek'
>>> d["pat_adresgegevens"][0]
'Grote Beek 9 1000 Brussel'
>>> d["pat_postcode_gemeente"][0]
'1000 Brussel'
blackbar.uzb.combine_chunkranges(chunks_sw, chunks_nlp, text)
Chunks / spans
blackbar.blackbar.merge_chunkranges(data, text, include_end=False)
Combine overlapping chunk ranges. Overlapping ranges are handled as follows: we first order the ranges by starting position. If the start value of a chunk falls within the previous chunk range (it is smaller than the previous end value, so it overlaps), we extend the previous end value with the new end value and keep the longest search string (term_search), as we assume the detected chunk ranges come from a Smith-Waterman alignment.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | DataFrame | A data frame with columns doc_id, label_, term_search, start, end, similarity indicating the location of detected entities, where the end value is the end position in the text + 1 | required
text | str | the text string in which the entities in data were found | required
include_end | bool | bool indicating whether to include the end position | False

Returns:

Type | Description
---|---
DataFrame | A pandas DataFrame with columns doc_id, label_, term_search, term_detected, similarity, start, end, length with the merged chunk ranges
Examples:
>>> from pandas import DataFrame
>>> from rlike import substr
>>> from blackbar.data import blackbar_example
>>> text = blackbar_example()
>>> ent = {'doc_id': text, 'similarity': 0, 'term_search': ['A', 'AA', 'B', 'B', 'C', 'CC', 'D', 'DD'], 'start': [10, 10+5, 93, 242, 343 + 4, 343, 384, 384+4], 'end': [19, 19, 98, 247, 343 + 7, 359, 400, 400+2], 'label_': ['Date', 'Date', 'Date', 'Date', 'Name', 'Name', 'Name', 'Name']}
>>> ent = DataFrame(ent)
>>> substr(ent["doc_id"], start = list(ent["start"]), end = list(ent["end"] - 1))
['10/4/2021', '2021', '30/03', '09/04', 'Jan', 'Dr. Jan Janssens', 'Dr. Jan Janssens', 'Jan Janssens |']
>>> chunkranges = merge_chunkranges(ent, text)
>>> list(chunkranges["term_detected"])
['10/4/2021', '30/03', '09/04', 'Dr. Jan Janssens', 'Dr. Jan Janssens |']
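The merge rule described above (sort by start, extend the previous range when the next one starts inside it) can be sketched on bare (start, end) tuples, where end is exclusive as in the convention above:

```python
def merge_ranges(ranges):
    """Merge overlapping (start, end) ranges; end is exclusive."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start < merged[-1][1]:
            # Overlap: extend the previous range rather than adding a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]
```

The real merge_chunkranges additionally carries along label_, term_search and similarity, keeping the longest term_search of the merged chunks.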
General
blackbar.utils.na_exclude(x)
Exclude elements with missing data from a list-like array
Parameters:

Name | Type | Description | Default
---|---|---|---
x | list | a list-like array | required

Returns:

Type | Description
---|---
list | A list where the missing data are removed
Examples:
>>> from blackbar import *
>>> import pandas as pd
>>> import numpy as np
>>> x = [1, np.nan, 3, 4, pd.NA, 6, None, 99]
>>> na_exclude(x)
[1, 3, 4, 6, 99]
blackbar.utils.chunk(x, n=2)
Split a list into n chunks
Parameters:

Name | Type | Description | Default
---|---|---|---
x | List | A list | required
n | int | integer with the number of chunks. Defaults to 2 | 2

Returns:

Type | Description
---|---
Iterator | a generator yielding the n chunks of the list
Examples:
>>> from blackbar import *
>>> x = range(14)
>>> x = list(x)
>>> it = chunk(x, n = 2)
>>> next(it)
[0, 2, 4, 6, 8, 10, 12]
>>> next(it)
[1, 3, 5, 7, 9, 11, 13]
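Judging from the example output, the chunks are built by striding over the list (every n-th element) rather than by consecutive slices; that behaviour can be sketched as:

```python
def chunk(x, n=2):
    # Yield n chunks, where chunk i holds elements i, i+n, i+2n, ...
    for i in range(n):
        yield x[i::n]
```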
General text features
blackbar.utils.txt_n_capital(x)
Count the number of capitalised letters
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required

Returns:

Type | Description
---|---
int | integer with the number of capitalised letters in x
Examples:
>>> from blackbar import *
>>> txt_n_capital('Hello World')
2
blackbar.utils.txt_n_newlines(x)
Count the number of newlines
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required

Returns:

Type | Description
---|---
int | integer with the number of newlines in x
Examples:
>>> from blackbar import *
>>> txt_n_newlines('Hello World')
0
>>> txt_n_newlines(['Hello World', None])
[0, None]
blackbar.utils.txt_contains_lot_of_capitals(x, threshold=0.5, min_length=2)
Test if a string contains a lot of capitals by looking at the fraction of the string that is capitalised, subject to a minimum string length
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required
threshold | float | fraction of the letters which should be capitalised | 0.5
min_length | int | minimum number of letters required in x | 2

Returns:

Type | Description
---|---
bool | bool indicating if x contains a lot of letters in capital case
Examples:
>>> from blackbar import *
>>> x = 'HELLO There'
>>> txt_contains_lot_of_capitals('HELLO There', threshold = 0.9)
False
>>> txt_contains_lot_of_capitals('HELLO There', threshold = 0.5)
True
>>> pct = txt_n_capital(x) / len(x)
blackbar.utils.txt_contains(x, pattern, is_regex=False, flags=re.IGNORECASE)
Check if a string contains a certain pattern
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required
pattern | str | the pattern to look up in x | required
is_regex | bool | boolean indicating if pattern is a regular expression | False
flags | | passed on to the flags argument of re.search in case is_regex is True | IGNORECASE

Returns:

Type | Description
---|---
bool | bool indicating if the string x contains the pattern
Examples:
>>> from blackbar import *
>>> txt_contains('Hello World', pattern = 'wo', is_regex = True)
True
>>> txt_contains('Hello World', pattern = 'wo', is_regex = False)
False
>>> txt_contains('Hello World', pattern = 'World')
True
blackbar.utils.txt_leading_trailing(x, type='leading')
Get the leading or trailing whitespace
Parameters:

Name | Type | Description | Default
---|---|---|---
x | Union[str, List[str]] | a str, a list or a list-like object with text | required
type | str | either 'leading' or 'trailing' | 'leading'

Returns:

Type | Description
---|---
Union[str, List[str]] | a list of the same length as x, or a str
Examples:
>>> from blackbar import *
>>> x = ' \n Hello world \r\n '
>>> txt_leading_trailing(x, type = 'leading')
' \n '
>>> txt_leading_trailing(x, type = 'trailing')
' \r\n '
>>> x = [' Hello world ', ' ABCDEF']
>>> txt_leading_trailing(x, type = 'leading')
[' ', ' ']
>>> txt_leading_trailing(x, type = 'trailing')
[' ', '']
blackbar.utils.txt_trailing_spaces(x)
General text processing
blackbar.utils.txt_sample(x, n=1)
Sample from a list
Parameters:

Name | Type | Description | Default
---|---|---|---
x | List | A list | required
n | int | the number of elements to sample | 1

Returns:

Type | Description
---|---
List | A list with a sample of n elements from the List x
Examples:
>>> from blackbar import *
>>> x = txt_sample(['a', 'b', 'c'], n = 2)
blackbar.utils.txt_paste(*lists, sep=' ', collapse=None)
Paste text together while removing None values
Parameters:

Name | Type | Description | Default
---|---|---|---
lists | str \| List[str] | A str or a list of str | ()
collapse | Union[str, None] | A str indicating how to collapse a list together | None
sep | str | A str indicating how to paste several elements together | ' '

Returns:

Type | Description
---|---
 | A list of str, or a single str if collapse is not None
Examples:
>>> from blackbar import *
>>> txt_paste(["a", "b", "c"], collapse = ", ")
'a, b, c'
>>> txt_paste(["a", None, "b"], collapse = ", ")
'a, b'
>>> txt_paste(["a", "b", "c"], ["1", "2", "3"], sep = "-")
['a-1', 'b-2', 'c-3']
>>> txt_paste(["a", "b", None], ["1", "2", "3"], sep = "-")
['a-1', 'b-2', '3']
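The None-dropping paste semantics shown in the examples can be sketched as follows (a hypothetical re-implementation, assuming all positional arguments are lists of equal length):

```python
def paste(*lists, sep=" ", collapse=None):
    # Zip the lists element-wise, drop None values, then join each row with sep.
    rows = [[v for v in row if v is not None] for row in zip(*lists)]
    out = [sep.join(row) for row in rows]
    if collapse is not None:
        # Collapse to a single string, skipping rows that became empty.
        return collapse.join(v for v in out if v)
    return out
```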
blackbar.utils.txt_insert(x, replacement, start, end=None, reverse=True)
Insert another text in a string
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A text string | required
replacement | Union[str, List[str]] | A text replacement | required
start | Union[int, List[int]] | start position in x where to put the replacement | required
end | Union[int, List[int], None] | end position in x where to end the insertion of the replacement | None
reverse | bool | logical indicating to do the replacement starting from the back. Defaults to True | True

Returns:

Type | Description
---|---
str | x where the specified sections are replaced
Examples:
>>> from blackbar import *
>>> x = 'Kung Warrior'
>>> txt_insert(x, replacement = 'Fu ', start = 5)
'Kung Fu Warrior'
>>> x = 'Kung ___ Warrior'
>>> txt_insert(x, replacement = 'Fu', start = 5, end = 5 + 3)
'Kung Fu Warrior'
>>> x = 'Kung _ Warrior'
>>> txt_insert(x, replacement = 'Fu', start = 5, end = 5 + 1)
'Kung Fu Warrior'
>>> x = 'My name is _NAME_ and I work in _LOC_. I am _AGE_ years old.'
>>> txt_insert(x, replacement = ['Jos', 'Dorpstraat 40, 1000 Brussel', '43'], start = [11, 32, 44], end = [16+1, 36+1, 48+1])
'My name is Jos and I work in Dorpstraat 40, 1000 Brussel. I am 43 years old.'
>>> #import re
>>> #[loc.start() for loc in re.finditer('_', x)]
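The reason for reverse=True is that applying the replacements back to front keeps the earlier start/end offsets valid while the string grows or shrinks; a minimal sketch of that idea (a hypothetical helper, not blackbar's implementation):

```python
def insert_all(x, replacements, starts, ends):
    """Replace x[start:end] with each replacement, applied from the back."""
    # Sort by start position, descending, so earlier offsets stay correct
    # even when a replacement changes the length of the string.
    for repl, start, end in sorted(zip(replacements, starts, ends),
                                   key=lambda t: t[1], reverse=True):
        x = x[:start] + repl + x[end:]
    return x
```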
blackbar.utils.txt_freq(x, sort=True)
Univariate frequencies
Parameters:

Name | Type | Description | Default
---|---|---|---
x | list | a list or a list-like object | required
sort | bool | boolean indicating to sort by frequency. Defaults to True | True

Returns:

Type | Description
---|---
DataFrame | a pandas DataFrame with columns key, freq, freq_pct indicating the frequencies of the elements of x
Examples:
>>> from blackbar import *
>>> x = ["a", "b", "b", "a", "b"]
>>> freq = txt_freq(x)
Text cleaning
blackbar.utils.txt_clean_word2vec(text, ascii=True, lower=True)
Clean text (ASCII transliteration/lowercasing) and split it into words so that word2vec can be applied. Words are extracted by splitting on punctuation symbols and stripping leading/trailing spaces.
Parameters:

Name | Type | Description | Default
---|---|---|---
text | str | A str | required
ascii | bool | whether to transliterate the text to ASCII | True
lower | bool | whether to lowercase the text | True

Returns:

Type | Description
---|---
List[str] | a List[str] where the text is converted to ASCII and lowercased
Examples:
>>> from blackbar import *
>>> text = u'Dziennik zak‚óceƒ'
>>> txt_clean_word2vec(text)
['dziennik', 'zakoce']
>>> text = 'González, M.'
>>> txt_clean_word2vec(text)
['gonzalez', 'm']
blackbar.utils.ascii_translit(text)
Convert text to ASCII
Parameters:

Name | Type | Description | Default
---|---|---|---
text | str | A str | required

Returns:

Type | Description
---|---
str | a str where the text is converted to ASCII
Examples:
>>> from blackbar import *
>>> text = u'Dziennik zak‚óceƒ'
>>> ascii_translit(text)
'Dziennik zakoce'
>>> text = 'González, M.'
>>> ascii_translit(text)
'Gonzalez, M.'
>>> ascii_translit('éêöà')
'eeoa'
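A common way to implement such transliteration is Unicode NFKD decomposition followed by dropping the combining marks; whether blackbar's ascii_translit does exactly this is an assumption:

```python
import unicodedata

def to_ascii(text):
    # NFKD splits accented characters into a base letter plus combining marks;
    # encoding to ASCII with errors='ignore' then drops the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```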
Tokenization
blackbar.utils.tokenize_letters(x)
Splits text into a list of individual characters
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required

Returns:

Type | Description
---|---
List[str] | a List[str]
Examples:
>>> from blackbar import *
>>> tokenize_letters('Hello World')
['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']
blackbar.utils.tokenize_spaces_punct(x)
Splits text based on spaces and punctuation
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required

Returns:

Type | Description
---|---
List[str] | a List[str]
Examples:
>>> from blackbar import *
>>> tokenize_spaces_punct('Hello World. You want more?')
['Hello', 'World', 'You', 'want', 'more']
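The example output matches a simple word-character scan; a minimal sketch (the actual implementation may split differently, e.g. keeping numbers or underscores apart):

```python
import re

def tokenize(x):
    # Runs of word characters become tokens; punctuation and spaces are dropped.
    return re.findall(r"\w+", x)
```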
blackbar.utils.tokenize_lines(x)
Splits text in lines
Parameters:

Name | Type | Description | Default
---|---|---|---
x | str | A str | required

Returns:

Type | Description
---|---
List[str] | a List[str]
Examples:
>>> from blackbar import *
>>> tokenize_lines('Hello World.\nYou want more? \n\nYeah baby!')
['Hello World.', 'You want more? ', 'Yeah baby!']