smith_waterman

smith_waterman(
    a,
    b,
    type='characters',
    match=2,
    mismatch=-1,
    gap=-1,
    lower=True,
    tokenizer=None,
    collapse=None,
    edit_mark='#',
    method='default',
    **kwargs,
)

Align two texts using the Smith-Waterman local alignment algorithm.

Parameters

Name Type Description Default
a str string with the first text required
b str string with the second text required
type str either ‘characters’ or ‘tokens’: tokenise by letter, or by spaces and punctuation 'characters'
match int reward score given to a match, defaults to 2 2
mismatch int penalty score given to a mismatch, defaults to -1 -1
gap int penalty score given to a gap, defaults to -1 -1
lower bool whether to lowercase the texts, defaults to True True
tokenizer Callable[[str], List[str]] a callable to which a and b are passed, in case you want to use your own tokenizer instead of the built-in tokenisation by letter or by spaces and punctuation. The tokenizer should return a list of tokens. None
collapse str string used to collapse the tokenized text back into a single string. Defaults to '' for type = ‘characters’ and ' ' for type = ‘tokens’ None
edit_mark str string indicating how to display mismatches and gaps, defaults to ‘#’ '#'
method str implementation to use: either ‘default’ (supports type ‘characters’ or ‘tokens’) or ‘biopython’ (supports type ‘characters’ only) 'default'
kwargs Any optional, passed on to tokenizer {}
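
The tokenizer parameter accepts any callable that maps a string to a list of tokens. As a minimal sketch, the helper below (`word_tokenizer` and its `keep_punct` flag are hypothetical names, not part of the library) shows one shape such a callable could take:

```python
import re
from typing import List


def word_tokenizer(text: str, keep_punct: bool = False) -> List[str]:
    """Hypothetical tokenizer: return word tokens, optionally keeping punctuation."""
    # \w+ captures runs of word characters; [^\w\s] captures single punctuation marks.
    pattern = r"\w+|[^\w\s]" if keep_punct else r"\w+"
    return re.findall(pattern, text)


print(word_tokenizer("Hello world - back in Brussels"))
# ['Hello', 'world', 'back', 'in', 'Brussels']
print(word_tokenizer("Hello world - back in Brussels", keep_punct=True))
# ['Hello', 'world', '-', 'back', 'in', 'Brussels']

# Assumed usage, based on the parameter descriptions (extra keyword
# arguments are documented as being passed on to the tokenizer):
# smith_waterman(a, b, type='tokens', tokenizer=word_tokenizer, keep_punct=True)
```

Whether a custom tokenizer should be combined with type = 'tokens' is an assumption here; the parameter table does not say so explicitly.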

Returns

Name Type Description
type str The alignment type: ‘characters’ or ‘tokens’
edit_mark str The edit_mark
weights dict a dictionary of weights provided to the function: match, mismatch and gap
sw float The Smith-Waterman local alignment score
similarity float Score between 0 and 1, calculated as the Smith-Waterman local alignment score divided by (the number of letters/words in the shortest text times the match weight)
matches int The number of matches found during alignment
mismatches int The number of mismatches found during alignment
a dict A dictionary with alignment information for the text provided in a, with elements:
  - text: The provided character string
  - tokens: The list of tokens of the tokenised text
  - n: The length of tokens
  - similarity: The similarity to this text, calculated as the Smith-Waterman local alignment score / (the number of letters/words in this text times the match weight)
  - alignment: A dictionary with the following elements:
    - text: The aligned text, where gaps/mismatches are filled with the edit_mark symbol
    - tokens: The list of tokens which form the aligned text
    - n: The length of the aligned text
    - gaps: The number of gaps during alignment
    - from: The starting position (zero-based) in the full tokens element where the aligned text is found
    - to: The end position (zero-based, inclusive) in the full tokens element where the aligned text is found
b dict A dictionary with the same structure as a, holding alignment information for the text provided in b
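
To make the sw and similarity fields concrete, here is a minimal pure-Python sketch of a Smith-Waterman score with a linear gap penalty. It is an illustration under stated assumptions, not the library's implementation: `sw_score` is a hypothetical helper, it computes only the score (no traceback), and it assumes character tokens and lowercased input.

```python
from typing import Sequence


def sw_score(a: Sequence[str], b: Sequence[str],
             match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score with a linear gap penalty (illustrative only)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for x in a:
        curr = [0]  # first column is always 0
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # A cell never drops below 0: that is what makes the alignment local.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


a = "Ik ben op zoek naar Gaspard Tournely waar kan ik die vinden".lower()
b = "Gaspard Bourelly".lower()
score = sw_score(a, b)
# similarity = score / (length of the shortest text * match weight)
similarity = score / (min(len(a), len(b)) * 2)
print(score, similarity)  # 25 0.78125
```

Because b is the shorter text here (16 characters), the overall similarity equals the per-text similarity of b: 25 / (16 * 2) = 0.78125.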

Examples

>>> from textalignment import smith_waterman
>>> a = "Hello world - back in Brussels"
>>> b = "Hello zorld - in Brussel"
>>> sw = smith_waterman(a, b, type = 'characters', method = 'default')
>>> sw['similarity']
0.8333333333333334
>>> sw['a']['alignment']['text']
'Hello w#orld - back in Brussel'
>>> sw['b']['alignment']['text']
'Hello #zorld - #####in Brussel'
>>> sw = smith_waterman(a, b, type = 'characters', method = 'biopython', match = 2, mismatch = -1, gap = -1)
>>> sw['similarity']
0.8333333333333334
>>> sw['a']['alignment']['text']
'Hello world - back in Brussel'
>>> sw['b']['alignment']['text']
'Hello zorld -##### in Brussel'
>>> a = "Ik ben op zoek naar Gaspard Tournely waar kan ik die vinden"
>>> b = "Gaspard Bourelly"
>>> sw = smith_waterman(a, b, type = 'characters', method = 'default')
>>> sw['a']['alignment']['text']
'Gaspard T#ournel#y'
>>> sw['b']['alignment']['text']
'Gaspard #Bour#elly'
>>> int(sw['sw'])
25
>>> sw['b']['similarity']
0.78125
>>> float(sw['sw'] / (sw['b']['n'] * 2))
0.78125
>>> sw = smith_waterman(a, b, type = 'characters', method = 'biopython', match = 2, mismatch = -1, gap = -1)
>>> sw['a']['alignment']['text']
'Gaspard Tourne#ly'
>>> sw['b']['alignment']['text']
'Gaspard Bour#elly'
>>> int(sw['sw'])
25
>>> sw['b']['similarity']
0.78125
>>> float(sw['sw'] / (sw['b']['n'] * 2))
0.78125
>>> sw = smith_waterman(a, b, type = 'tokens')
>>> sw['similarity']
0.5
>>> sw['a']['alignment']['text']
'Gaspard'
>>> sw['b']['alignment']['text']
'Gaspard'
>>> # Unit tests to check that the end token corresponds
>>> a = "Ik ben op zoek naar Gaspard Tournely waar kan ik die vinden"
>>> b = "Gaspard Bourelly"
>>> sw = smith_waterman(a, b, type = 'characters', method = 'default')
>>> a[sw["a"]["alignment"]["from"]:(sw["a"]["alignment"]["to"] + 1)]
'Gaspard Tournely'
>>> a[sw["a"]["alignment"]["from"]]
'G'
>>> a[sw["a"]["alignment"]["to"]]
'y'
>>> sw = smith_waterman(a, b, type = 'characters', method = 'biopython')
>>> a[sw["a"]["alignment"]["from"]:(sw["a"]["alignment"]["to"] + 1)]
'Gaspard Tournely'
>>> a[sw["a"]["alignment"]["from"]]
'G'
>>> a[sw["a"]["alignment"]["to"]]
'y'
>>> # Unit test for long set of alignments (no sorting applied)
>>> a = '\n \n\n \n\n \n\nDR. GAME KEEPERS\n\nElektronisch adres\n\n   \n\n \n\n \n\n22 januari 1999\n\n \n\nBETREFT:         JANSSENS Jean\n\n                        A948023ZRXYZ00B\n\nBRUSSELSESTEENWEG 111\n\n                        1785 MERCHTEM                                                      \n\n \n\nGeachte collega\n\n \n\nUw patiënt, de heer Louis Jean JANSSENS, werd op de raadpleging gezien op 18/01/1998.  \n\n \n\nReden van raadpleging: \n 6 weken na IHP rechts. \n Gaat goed, stapt vlot uit de wachtkamer met kruk in de hand. \n Zegt vooral laatste weken sterke verbetering beterschap te voelen.\n\nBevindingen: \n ROM: flexie 100° full extension.\nExo en endo goed en pijnloos.\n Goede straight leg. \n Goede kracht abductoren.\n\nBijkomende onderzoeken: \n RX: goede stand van de prothese.\n\nHouding: \n Kinesitherapie verder te zetten: gangrevalidatie en spierversterking \n\nMet de patiënt, de heer Jean JANSSENS, werd afgesproken hem terug te zien 6 maanden postoperatief met rx .\n\n \n\nMet collegiale groeten\n\n \n\n \n\nDR. RODEL BAAN                                                 DR. DE GROTE TEEN\n\nKliniekhoofd                                                     Assistente\n'
>>> b = 'BRUSSELSESTEENWEG 123 /0012 1785 MERCHTEM'
>>> sw = smith_waterman(a, b, type = 'characters', method = 'biopython')