either ‘characters’ or ‘tokens’: whether to tokenise by letter or by spaces and punctuation, defaults to ‘characters’
'characters'
match
int
reward score given to a match, defaults to 2
2
mismatch
int
penalty score given to a mismatch, defaults to -1
-1
gap
int
penalty score given to a gap, defaults to -1
-1
lower
bool
boolean indicating whether to lowercase the texts, defaults to True
True
tokenizer
Callable[[str], List[str]]
a callable to which a and b are passed, for when you want to use your own tokenizer instead of the built-in tokenisation by letter or by spaces and punctuation. The tokenizer should return a list of tokens.
None
collapse
str
string used to collapse the tokenized text back into a single string. Defaults to '' (empty string) for type = ‘characters’ and ' ' (a single space) for type = ‘tokens’
None
edit_mark
str
string indicating how to display mismatches and gaps, defaults to ‘#’
'#'
method
str
string with the type of implementation, either ‘default’ (for type ‘characters’ or ‘tokens’) or ‘biopython’ (type ‘characters’ only)
'default'
kwargs
Any
optional, passed on to tokenizer
{}
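As an illustration of the tokenizer and kwargs parameters, here is a minimal sketch of a custom tokenizer matching the `Callable[[str], List[str]]` signature. The name `simple_tokenizer` and its `keep_punct` option are hypothetical and not part of the library; the regular expression is just one possible way to split on spaces and punctuation.

```python
import re
from typing import List


def simple_tokenizer(text: str, keep_punct: bool = True) -> List[str]:
    """Split text into word tokens, optionally keeping punctuation marks.

    Hypothetical helper: `keep_punct` stands in for any extra keyword
    argument you might forward to your own tokenizer.
    """
    # \w+ grabs runs of letters/digits; [^\w\s] grabs one punctuation mark.
    pattern = r"\w+|[^\w\s]" if keep_punct else r"\w+"
    return re.findall(pattern, text)


print(simple_tokenizer("Hello world - back in Brussels"))
# ['Hello', 'world', '-', 'back', 'in', 'Brussels']
```

Since kwargs are described as being passed on to the tokenizer, a call along the lines of `smith_waterman(a, b, tokenizer=simple_tokenizer, keep_punct=False)` would presumably forward `keep_punct`; how a custom tokenizer interacts with type and collapse is not specified here, so treat that call as an assumption.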
Returns
Name
Type
Description
type
str
The alignment type: ‘characters’ or ‘tokens’
edit_mark
str
The edit_mark
weights
dict
a dictionary of weights provided to the function: match, mismatch and gap
sw
float
The Smith-Waterman local alignment score
similarity
float
Score between 0 and 1, calculated as the Smith-Waterman local alignment score / (the number of letters/words in the shortest text times the match weight)
matches
int
The number of matches found during alignment
mismatches
int
The number of mismatches found during alignment
a
dict
A dictionary with alignment information from the text provided in a. With elements
- text: The provided character string of either a or b
- tokens: A list of characters with the tokenised texts of a or b
- n: The length of tokens
- similarity: The similarity to a calculated as the Smith-Waterman local alignment score / (the number of letters/words in the a or b text times the match weight)
- alignment: A dictionary with the following elements
  - text: The aligned text from either a or b where gaps/mismatches are filled up with the edit_mark symbol
  - tokens: The list of tokens which form the aligned text
  - n: The length of the aligned text
  - gaps: The number of gaps during alignment
  - from: The starting position in the full tokenised tokens element from either a or b where the aligned text is found.
  - to: The end position in the full tokenised tokens element from either a or b where the aligned text is found.
b
dict
Similar structure as the dictionary as returned in a
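To make the sw, similarity, from and to fields concrete, the following is a minimal pure-Python sketch of a Smith-Waterman local alignment with a linear gap penalty. It is not the library's implementation: the function name `local_align` is hypothetical, it does no lowercasing or tokenisation, and it only reports positions for the first sequence.

```python
from typing import Dict, Sequence


def local_align(a: Sequence[str], b: Sequence[str],
                match: int = 2, mismatch: int = -1, gap: int = -1) -> Dict[str, float]:
    """Score a local alignment of a and b and locate it inside a."""
    n, m = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, end_i = 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + step,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, end_i, end_j = H[i][j], i, j
    # Trace back from the best cell until the score drops to zero.
    i, j = (end_i, end_j) if best else (0, 0)
    while H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    shortest = min(n, m)
    return {
        "sw": best,
        # score / (length of the shortest text * match weight)
        "similarity": best / (shortest * match) if shortest else 0.0,
        "from": i,          # first aligned position in a (0-based)
        "to": end_i - 1,    # last aligned position in a (inclusive)
    }


a = "Ik ben op zoek naar Gaspard Tournely waar kan ik die vinden"
b = "Gaspard Bourelly"
res = local_align(a, b)
print(res["sw"], res["similarity"])      # 25 0.78125
print(a[res["from"]:res["to"] + 1])      # Gaspard Tournely
```

Because the function accepts any sequence, passing lists of words (for example `a.split()` and `b.split()`) gives a token-level alignment in the same way; here only the token 'Gaspard' matches, which yields a similarity of 0.5.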
Examples
>>> from textalignment import smith_waterman
>>> a = "Hello world - back in Brussels"
>>> b = "Hello zorld - in Brussel"
>>> sw = smith_waterman(a, b, type='characters', method='default')
>>> sw['similarity']
0.8333333333333334
>>> sw['a']['alignment']['text']
'Hello w#orld - back in Brussel'
>>> sw['b']['alignment']['text']
'Hello #zorld - #####in Brussel'
>>> sw = smith_waterman(a, b, type='characters', method='biopython', match=2, mismatch=-1, gap=-1)
>>> sw['similarity']
0.8333333333333334
>>> sw['a']['alignment']['text']
'Hello world - back in Brussel'
>>> sw['b']['alignment']['text']
'Hello zorld -##### in Brussel'
>>> a = "Ik ben op zoek naar Gaspard Tournely waar kan ik die vinden"
>>> b = "Gaspard Bourelly"
>>> sw = smith_waterman(a, b, type='characters', method='default')
>>> sw['a']['alignment']['text']
'Gaspard T#ournel#y'
>>> sw['b']['alignment']['text']
'Gaspard #Bour#elly'
>>> int(sw['sw'])
25
>>> sw['b']['similarity']
0.78125
>>> float(sw['sw'] / (sw['b']['n'] * 2))
0.78125
>>> sw = smith_waterman(a, b, type='characters', method='biopython', match=2, mismatch=-1, gap=-1)
>>> sw['a']['alignment']['text']
'Gaspard Tourne#ly'
>>> sw['b']['alignment']['text']
'Gaspard Bour#elly'
>>> int(sw['sw'])
25
>>> sw['b']['similarity']
0.78125
>>> float(sw['sw'] / (sw['b']['n'] * 2))
0.78125
>>> sw = smith_waterman(a, b, type='tokens')
>>> sw['similarity']
0.5
>>> sw['a']['alignment']['text']
'Gaspard'
>>> sw['b']['alignment']['text']
'Gaspard'
>>> # Unit tests to check that the end token corresponds
>>> a = "Ik ben op zoek naar Gaspard Tournely waar kan ik die vinden"
>>> b = "Gaspard Bourelly"
>>> sw = smith_waterman(a, b, type='characters', method='default')
>>> a[sw["a"]["alignment"]["from"]:(sw["a"]["alignment"]["to"] + 1)]
'Gaspard Tournely'
>>> a[sw["a"]["alignment"]["from"]]
'G'
>>> a[sw["a"]["alignment"]["to"]]
'y'
>>> sw = smith_waterman(a, b, type='characters', method='biopython')
>>> a[sw["a"]["alignment"]["from"]:(sw["a"]["alignment"]["to"] + 1)]
'Gaspard Tournely'
>>> a[sw["a"]["alignment"]["from"]]
'G'
>>> a[sw["a"]["alignment"]["to"]]
'y'
>>> # Unit test for long set of alignments (no sorting applied)
>>> a = '\n\n\n\n\n\n\nDR. GAME KEEPERS\n\nElektronisch adres\n\n\n\n\n\n\n\n22 januari 1999\n\n\n\nBETREFT: JANSSENS Jean\n\n A948023ZRXYZ00B\n\nBRUSSELSESTEENWEG 111\n\n 1785 MERCHTEM \n\n\n\nGeachte collega\n\n\n\nUw patiënt, de heer Louis Jean JANSSENS, werd op de raadpleging gezien op 18/01/1998. \n\n\n\nReden van raadpleging: \n 6 weken na IHP rechts. \n Gaat goed, stapt vlot uit de wachtkamer met kruk in de hand. \n Zegt vooral laatste weken sterke verbetering beterschap te voelen.\n\nBevindingen: \n ROM: flexie 100° full extension.\nExo en endo goed en pijnloos.\n Goede straight leg. \n Goede kracht abductoren.\n\nBijkomende onderzoeken: \n RX: goede stand van de prothese.\n\nHouding: \n Kinesitherapie verder te zetten: gangrevalidatie en spierversterking \n\nMet de patiënt, de heer Jean JANSSENS, werd afgesproken hem terug te zien 6 maanden postoperatief met rx .\n\n\n\nMet collegiale groeten\n\n\n\n\n\nDR. RODEL BAAN DR. DE GROTE TEEN\n\nKliniekhoofd Assistente\n'
>>> b = 'BRUSSELSESTEENWEG 123 /0012 1785 MERCHTEM'
>>> sw = smith_waterman(a, b, type='characters', method='biopython')