smith_waterman_recursive

smith_waterman_recursive(
    a,
    b,
    threshold=0.5,
    which='both',
    type='characters',
    match=2,
    mismatch=-1,
    gap=-1,
    lower=True,
    tokenizer=None,
    collapse=None,
    edit_mark='#',
    method='default',
    **kwargs,
)

Recursively align two texts using Smith-Waterman. The texts are aligned first and, if the similarity of that alignment is above a given threshold, Smith-Waterman is applied again to the text to the left of the match, to the right of it, or in both directions. This makes it possible to find b (a short lookup string) several times in a (the longer text).
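The recursive idea described above can be sketched in plain Python. This is a minimal illustration and not the package's implementation: the helpers `local_align` and `recursive_align` are hypothetical names, the similarity is assumed to be the alignment score divided by the best possible score for b, and only the span of each hit in a is returned rather than the dictionaries produced by smith_waterman.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment; returns (score, start, end) of the hit in a."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end = i
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i, end


def recursive_align(a, b, threshold=0.5, offset=0, match=2, mismatch=-1, gap=-1):
    """Find every region of a that aligns to b with similarity >= threshold."""
    if not a or not b:
        return []
    score, start, end = local_align(a, b, match, mismatch, gap)
    # Assumed similarity measure: score relative to a perfect match of b.
    similarity = score / (match * len(b))
    if score == 0 or similarity < threshold:
        return []
    hit = {"start": offset + start, "end": offset + end, "similarity": similarity}
    # Recurse on the text to the left and to the right of the hit ('both').
    left = recursive_align(a[:start], b, threshold, offset, match, mismatch, gap)
    right = recursive_align(a[end:], b, threshold, offset + end, match, mismatch, gap)
    return left + [hit] + right


hits = recursive_align("xx abc yy abc zz", "abc")
spans = [(h["start"], h["end"]) for h in hits]
```

Restricting the recursion to only the left or only the right call corresponds to the ‘left’ and ‘right’ settings of which; the sketch always searches both sides.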

Parameters

Name Type Description Default
a str string with the first text required
b str string with the second text required
threshold float similarity threshold; the recursive step is only performed if the similarity is above this threshold 0.5
which str either ‘both’, ‘left’ or ‘right’: whether to search recursively in the text to the left of the match, to the right of it, or in both directions 'both'
type str either ‘characters’ or ‘tokens’: whether to tokenise by letter or by spaces and punctuation 'characters'
match int reward score given to a match, defaults to 2 2
mismatch int penalty score given to a mismatch, defaults to -1 -1
gap int penalty score given to a gap, defaults to -1 -1
lower bool boolean indicating to lowercase the texts, defaults to True True
tokenizer Callable[[str], List[str]] a callable to which a and b are passed, in case you want to use your own tokenizer instead of the built-in tokenisation by letter or by spaces and punctuation. The tokenizer should return a list of tokens. None
collapse str string indicating how to collapse the tokenized text. Defaults to ’’ for type = ‘characters’ and ’ ’ for type = ‘tokens’ None
edit_mark str string indicating how to display mismatches and gaps, defaults to ‘#’ '#'
method str string with the type of implementation, either ‘default’ (for type ‘characters’ or ‘tokens’) or ‘biopython’ (type ‘characters’ only) 'default'
kwargs Any optional, passed on to tokenizer {}
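As an illustration of the tokenizer and kwargs parameters, a custom tokenizer could look like the sketch below. The function `word_tokenizer` and its `keep_punctuation` option are hypothetical and only show a callable with the documented shape (a string in, a list of tokens out). The commented call at the end assumes that extra keyword arguments reach the tokenizer as the table describes.

```python
import re
from typing import List


def word_tokenizer(text: str, keep_punctuation: bool = False) -> List[str]:
    """Split text into word tokens, optionally keeping punctuation marks as tokens."""
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    return re.findall(pattern, text)


tokens = word_tokenizer("John McEnroe, hello!")
tokens_punct = word_tokenizer("John McEnroe, hello!", keep_punctuation=True)

# Assumed usage, with keep_punctuation forwarded through **kwargs:
# smith_waterman_recursive(a, b, type='tokens',
#                          tokenizer=word_tokenizer, keep_punctuation=True)
```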

Returns

Name Type Description
List[dict] a list of dictionary elements as returned by smith_waterman

Examples

>>> from textalignment import smith_waterman_recursive
>>> a = "I am looking for John McNrow, where can I find John McAndRow?"
>>> b = "John McEnroe"
>>> sw = smith_waterman_recursive(a, b, type = 'characters')
>>> len(sw)
2
>>> [element['a']['alignment']['text'] for element in sw]
['John Mc#Nro', 'John McA#ndRo']
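The returned list can be post-processed like any list of dictionaries. The sketch below does not call the package; it uses a hand-written stand-in that contains only the keys used in the example above (`'a'`, `'alignment'`, `'text'`), and the helper `summarise` is a hypothetical name. It counts the edit marks in each aligned fragment, assuming the default edit_mark of ‘#’.

```python
# Stand-in for the result of smith_waterman_recursive, reduced to the
# keys shown in the example; real results contain more information.
sw = [
    {"a": {"alignment": {"text": "John Mc#Nro"}}},
    {"a": {"alignment": {"text": "John McA#ndRo"}}},
]


def summarise(results, edit_mark="#"):
    """Return (aligned text of a, number of edit marks) for each alignment."""
    summary = []
    for element in results:
        text = element["a"]["alignment"]["text"]
        summary.append((text, text.count(edit_mark)))
    return summary


summary = summarise(sw)
```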