smith_waterman

smith_waterman(
    a,
    b,
    type='characters',
    match=2,
    mismatch=-1,
    gap=-1,
    lower=True,
    tokenizer=None,
    collapse=None,
    edit_mark='#',
    method='default',
    **kwargs,
)

Align two texts using the Smith-Waterman local alignment algorithm.

Parameters

Name	Type	Description	Default
a	str	string with the first text	required
b	str	string with the second text	required
type	str	either ‘characters’ or ‘tokens’: tokenise by letter, or by spaces and punctuation	`'characters'`
match	int	reward score given to a match, defaults to 2	`2`
mismatch	int	penalty score given to a mismatch, defaults to -1	`-1`
gap	int	penalty score given to a gap, defaults to -1	`-1`
lower	bool	boolean indicating to lowercase the texts, defaults to True	`True`
tokenizer	Callable[[str], List[str]]	a callable that is applied to a and b, for when you want to use your own tokenizer instead of the built-in tokenisation by letter or by spaces and punctuation. The tokenizer should return a list of tokens.	`None`
collapse	str	string used to join the tokenised text back together. Defaults to `''` for type = ‘characters’ and `' '` for type = ‘tokens’	`None`
edit_mark	str	string indicating how to display mismatches and gaps, defaults to ‘#’	`'#'`
method	str	string with the type of implementation, either ‘default’ (for type ‘characters’ or ‘tokens’) or ‘biopython’ (type ‘characters’ only)	`'default'`
kwargs	Any	optional, passed on to tokenizer	`{}`
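
The `tokenizer` and `collapse` parameters can be illustrated with a small standalone sketch. The function name `simple_tokenizer` and its `keep_punctuation` argument are hypothetical and not part of the library; the sketch only shows the contract described in the table (a string goes in, a list of tokens comes out) and how a `collapse` string of `' '` would join tokens back into text.

```python
import re
from typing import List


def simple_tokenizer(text: str, keep_punctuation: bool = False) -> List[str]:
    """Hypothetical tokenizer: split a text on whitespace and punctuation."""
    if keep_punctuation:
        # Keep each punctuation character as a token of its own.
        return re.findall(r"\w+|[^\w\s]", text)
    # Keep only runs of word characters.
    return re.findall(r"\w+", text)


tokens = simple_tokenizer("Hello world - back in Brussels")
print(tokens)            # ['Hello', 'world', 'back', 'in', 'Brussels']
print(" ".join(tokens))  # Hello world back in Brussels
```

Going by the parameter table, such a callable would presumably be passed as `tokenizer=simple_tokenizer` together with `type='tokens'`, and an extra keyword such as `keep_punctuation=True` would be forwarded through `**kwargs`; this usage is an assumption based on the descriptions above and has not been checked against the implementation.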

Returns

Name	Type	Description
type	str	The alignment type: ‘characters’ or ‘tokens’
edit_mark	str	The edit_mark
weights	dict	a dictionary of weights provided to the function: match, mismatch and gap
sw	float	The Smith-Waterman local alignment score
similarity	float	Score between 0 and 1, calculated as the Smith-Waterman local alignment score / (the number of letters/words in the shortest text times the match weight)
matches	int	The number of matches found during alignment
mismatches	int	The number of mismatches found during alignment
a	dict	A dictionary with alignment information for the text provided in a, with elements: text (the provided string); tokens (the list of tokens from the tokenised text); n (the number of tokens); similarity (the Smith-Waterman local alignment score / (the number of letters/words in the text times the match weight)); alignment (a dictionary with elements: text, the aligned text where gaps/mismatches are filled with the edit_mark symbol; tokens, the list of tokens that form the aligned text; n, the length of the aligned text; gaps, the number of gaps in the alignment; from, the start position of the aligned text within the full tokens element; to, the end position, inclusive, of the aligned text within the full tokens element).
b	dict	Same structure as the dictionary returned in a, for the text provided in b
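
To make the `sw` and `similarity` values concrete, the following is a minimal, standalone sketch of the textbook Smith-Waterman score with a linear gap penalty. It is not the library's implementation, and the name `local_alignment_score` is hypothetical. It computes only the score, not the aligned text, and assumes both texts are lowercased beforehand (mirroring `lower=True`).

```python
from typing import Sequence


def local_alignment_score(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,
    mismatch: int = -1,
    gap: int = -1,
) -> int:
    """Best local alignment score between two token sequences (hypothetical helper)."""
    previous = [0] * (len(b) + 1)  # scores of the previous row of the DP matrix
    best = 0
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if token_a == token_b else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


a = "Ik ben op zoek naar Gaspard Tournely waar kan ik die vinden".lower()
b = "Gaspard Bourelly".lower()
score = local_alignment_score(a, b)
# similarity = score / (length of the shortest text * match weight)
similarity = score / (min(len(a), len(b)) * 2)
print(score, similarity)  # 25 0.78125
```

Because the function accepts any sequence, passing `a.split()` and `b.split()` gives a token-level score instead of a character-level one.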

Examples

>>> from textalignment import smith_waterman
>>> a = "Hello world - back in Brussels"
>>> b = "Hello zorld - in Brussel"
>>> sw = smith_waterman(a, b, type = 'characters', method = 'default')
>>> sw['similarity']
0.8333333333333334
>>> sw['a']['alignment']['text']
'Hello w#orld - back in Brussel'
>>> sw['b']['alignment']['text']
'Hello #zorld - #####in Brussel'
>>> sw = smith_waterman(a, b, type = 'characters', method = 'biopython', match = 2, mismatch = -1, gap = -1)
>>> sw['similarity']
0.8333333333333334
>>> sw['a']['alignment']['text']
'Hello world - back in Brussel'
>>> sw['b']['alignment']['text']
'Hello zorld -##### in Brussel'
>>> a = "Ik ben op zoek naar Gaspard Tournely waar kan ik die vinden"
>>> b = "Gaspard Bourelly"
>>> sw = smith_waterman(a, b, type = 'characters', method = 'default')
>>> sw['a']['alignment']['text']
'Gaspard T#ournel#y'
>>> sw['b']['alignment']['text']
'Gaspard #Bour#elly'
>>> int(sw['sw'])
25
>>> sw['b']['similarity']
0.78125
>>> float(sw['sw'] / (sw['b']['n'] * 2))
0.78125
>>> sw = smith_waterman(a, b, type = 'characters', method = 'biopython', match = 2, mismatch = -1, gap = -1)
>>> sw['a']['alignment']['text']
'Gaspard Tourne#ly'
>>> sw['b']['alignment']['text']
'Gaspard Bour#elly'
>>> int(sw['sw'])
25
>>> sw['b']['similarity']
0.78125
>>> float(sw['sw'] / (sw['b']['n'] * 2))
0.78125
>>> sw = smith_waterman(a, b, type = 'tokens')
>>> sw['similarity']
0.5
>>> sw['a']['alignment']['text']
'Gaspard'
>>> sw['b']['alignment']['text']
'Gaspard'
>>> # Unit tests to check that the end token corresponds
>>> a = "Ik ben op zoek naar Gaspard Tournely waar kan ik die vinden"
>>> b = "Gaspard Bourelly"
>>> sw = smith_waterman(a, b, type = 'characters', method = 'default')
>>> a[sw["a"]["alignment"]["from"]:(sw["a"]["alignment"]["to"] + 1)]
'Gaspard Tournely'
>>> a[sw["a"]["alignment"]["from"]]
'G'
>>> a[sw["a"]["alignment"]["to"]]
'y'
>>> sw = smith_waterman(a, b, type = 'characters', method = 'biopython')
>>> a[sw["a"]["alignment"]["from"]:(sw["a"]["alignment"]["to"] + 1)]
'Gaspard Tournely'
>>> a[sw["a"]["alignment"]["from"]]
'G'
>>> a[sw["a"]["alignment"]["to"]]
'y'
>>> # Unit test for long set of alignments (no sorting applied)
>>> a = '\n \n\n \n\n \n\nDR. GAME KEEPERS\n\nElektronisch adres\n\n   \n\n \n\n \n\n22 januari 1999\n\n \n\nBETREFT:         JANSSENS Jean\n\n                        A948023ZRXYZ00B\n\nBRUSSELSESTEENWEG 111\n\n                        1785 MERCHTEM                                                      \n\n \n\nGeachte collega\n\n \n\nUw patiënt, de heer Louis Jean JANSSENS, werd op de raadpleging gezien op 18/01/1998.  \n\n \n\nReden van raadpleging: \n 6 weken na IHP rechts. \n Gaat goed, stapt vlot uit de wachtkamer met kruk in de hand. \n Zegt vooral laatste weken sterke verbetering beterschap te voelen.\n\nBevindingen: \n ROM: flexie 100° full extension.\nExo en endo goed en pijnloos.\n Goede straight leg. \n Goede kracht abductoren.\n\nBijkomende onderzoeken: \n RX: goede stand van de prothese.\n\nHouding: \n Kinesitherapie verder te zetten: gangrevalidatie en spierversterking \n\nMet de patiënt, de heer Jean JANSSENS, werd afgesproken hem terug te zien 6 maanden postoperatief met rx .\n\n \n\nMet collegiale groeten\n\n \n\n \n\nDR. RODEL BAAN                                                 DR. DE GROTE TEEN\n\nKliniekhoofd                                                     Assistente\n'
>>> b = 'BRUSSELSESTEENWEG 123 /0012 1785 MERCHTEM'
>>> sw = smith_waterman(a, b, type = 'characters', method = 'biopython')