Recursively align two texts with Smith-Waterman: a and b are aligned first and, if the similarity of that alignment is above a given threshold, Smith-Waterman is applied again to the text to the left of the match, to the right of it, or on both sides. This makes it possible to find b (a short lookup string) several times in a (the longer text).
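The recursion described above can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: the helper names `local_align` and `find_all` are hypothetical, and the similarity measure (alignment score divided by the best possible score for b) is an assumption, since the exact definition is not stated here.

```python
from typing import List, Optional, Tuple


def local_align(a: str, b: str, match: int = 2, mismatch: int = -1,
                gap: int = -1) -> Optional[Tuple[int, int, int]]:
    """Smith-Waterman local alignment of b against a.

    Returns (score, start, end) so that a[start:end] is the aligned part
    of a, or None when nothing scores above zero.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    if best == 0:
        return None
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end = i
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1  # character of a aligned to a gap
        else:
            j -= 1  # character of b aligned to a gap
    return best, i, end


def find_all(a: str, b: str, threshold: float = 0.5, which: str = "both",
             match: int = 2, offset: int = 0) -> List[Tuple[int, int]]:
    """Collect (start, end) spans of a that align well with b."""
    hit = local_align(a.lower(), b.lower(), match=match)
    if hit is None:
        return []
    score, start, end = hit
    # Assumption: similarity = score relative to a perfect match of b.
    similarity = score / (match * len(b))
    if similarity <= threshold:
        return []
    left = right = []
    if which in ("both", "left"):
        left = find_all(a[:start], b, threshold, which, match, offset)
    if which in ("both", "right"):
        right = find_all(a[end:], b, threshold, which, match, offset + end)
    return left + [(offset + start, offset + end)] + right


a = "I am looking for John McNrow, where can I find John McAndRow?"
b = "John McEnroe"
spans = find_all(a, b)
matches = [a[s:e] for s, e in spans]
```

Each recursive call works on a strictly shorter piece of a, so the search stops once no remaining piece aligns with a similarity above the threshold.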
Parameters
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| a | str | the first text (the longer text to search in) | required |
| b | str | the second text (the short lookup string) | required |
| threshold | float | the recursive step is only performed if the similarity of the alignment is above this threshold | 0.5 |
| which | str | either 'both', 'left' or 'right': whether to search recursively to the left of the match, to the right of it, or in both directions | 'both' |
| type | str | either 'characters' or 'tokens': whether to tokenise by letter or by spaces and punctuation | 'characters' |
| match | int | reward score given to a match | 2 |
| mismatch | int | penalty score given to a mismatch | -1 |
| gap | int | penalty score given to a gap | -1 |
| lower | bool | whether to lowercase the texts before aligning | True |
| tokenizer | Callable[[str], List[str]] | a callable that is applied to a and b, in case you want to use your own tokenizer instead of tokenising by letter or by spaces and punctuation. It should return a list of tokens. | None |
| collapse | str | string used to collapse (join) the tokenized text. If None, '' is used for type = 'characters' and ' ' for type = 'tokens' | None |
| edit_mark | str | string used to display mismatches and gaps | '#' |
| method | str | the implementation to use, either 'default' (for type 'characters' or 'tokens') or 'biopython' (type 'characters' only) | 'default' |
| kwargs | Any | optional, passed on to tokenizer | {} |
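To illustrate how type, lower, tokenizer and collapse relate to each other, here is a minimal sketch. The function `tokenize` and the regular expression used for 'tokens' are assumptions made for illustration; the library's own splitting rules may differ.

```python
import re
from typing import Callable, List, Optional


def tokenize(text: str, type: str = "characters",
             tokenizer: Optional[Callable[[str], List[str]]] = None,
             lower: bool = True) -> List[str]:
    """Hypothetical helper: turn a text into the units that get aligned."""
    if lower:
        text = text.lower()
    if tokenizer is not None:
        # A user-supplied callable takes precedence over the built-in modes.
        return tokenizer(text)
    if type == "characters":
        return list(text)
    # 'tokens': runs of word characters, plus each punctuation mark on its own
    # (an assumed rule, chosen only for this sketch).
    return re.findall(r"\w+|[^\w\s]", text)


chars = tokenize("McEnroe")                        # one unit per letter
words = tokenize("John McEnroe!", type="tokens")   # words and punctuation
custom = tokenize("John McEnroe", tokenizer=str.split)

# collapse joins the units back into text: '' for characters, ' ' for tokens.
chars_text = "".join(chars)
words_text = " ".join(words)
```

Any callable that takes a string and returns a list of strings fits the `Callable[[str], List[str]]` signature, which is why `str.split` works as a tokenizer here.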
Returns
| Name | Type | Description |
| --- | --- | --- |
| | List[dict] | a list of dictionaries, each structured as returned by smith_waterman |
Examples
>>> from textalignment import smith_waterman_recursive
>>> a = "I am looking for John McNrow, where can I find John McAndRow?"
>>> b = "John McEnroe"
>>> sw = smith_waterman_recursive(a, b, type='characters')
>>> len(sw)
2
>>> [element['a']['alignment']['text'] for element in sw]
['John Mc#Nro', 'John McA#ndRo']