smith_waterman_recursive

smith_waterman_recursive(
    a,
    b,
    threshold=0.5,
    which='both',
    type='characters',
    match=2,
    mismatch=-1,
    gap=-1,
    lower=True,
    tokenizer=None,
    collapse=None,
    edit_mark='#',
    method='default',
    **kwargs,
)

Recursively align two texts using Smith-Waterman. The texts are first aligned with Smith-Waterman; if the similarity of that alignment is above a given threshold, Smith-Waterman is applied again to the text to the left of the match, to the right of it, or in both directions. This makes it possible to find several occurrences of b (a short lookup string) in a (the longer text).
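The recursive strategy described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names `local_align_sketch` and `find_all_sketch` are hypothetical, and the similarity measure (alignment score divided by the best possible score for b) is an assumption made for this sketch only.

```python
def local_align_sketch(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, a_start, a_end) for the best local alignment of b in a."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end = i
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, i, end


def find_all_sketch(a, b, threshold=0.5, match=2, mismatch=-1, gap=-1, offset=0):
    """Collect (start, end, text) for every region of a that resembles b."""
    if not a or not b:
        return []
    score, start, end = local_align_sketch(a, b, match, mismatch, gap)
    # Assumed similarity: score relative to a perfect match of all of b.
    similarity = score / (match * len(b))
    if score == 0 or similarity <= threshold:
        return []
    # Recurse to the left and to the right of the match (which='both').
    left = find_all_sketch(a[:start], b, threshold, match, mismatch, gap, offset)
    right = find_all_sketch(a[end:], b, threshold, match, mismatch, gap, offset + end)
    return left + [(offset + start, offset + end, a[start:end])] + right


hits = find_all_sketch("abc xx abc", "abc")
```

Restricting the recursion to only the `left` or only the `right` call corresponds to the idea behind the `which` argument.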

Parameters

Name	Type	Description	Default
a	str	string with the first text	required
b	str	string with the second text	required
threshold	float	similarity threshold; the recursive step is performed only if the similarity of the alignment is above this threshold	`0.5`
which	str	one of ‘both’, ‘left’ or ‘right’, indicating whether to keep searching recursively in the text to the left of a match, to the right of it, or in both directions	`'both'`
type	str	either ‘characters’ or ‘tokens’, indicating whether to tokenise by letter or by spaces and punctuation	`'characters'`
match	int	reward score given to a match, defaults to 2	`2`
mismatch	int	penalty score given to a mismatch, defaults to -1	`-1`
gap	int	penalty score given to a gap, defaults to -1	`-1`
lower	bool	boolean indicating to lowercase the texts, defaults to True	`True`
tokenizer	Callable[[str], List[str]]	a callable that a and b are passed to, in case you want to use your own tokenizer instead of the built-in tokenisation by letter or by spaces and punctuation. The tokenizer should return a list of tokens.	`None`
collapse	str	string used to join the tokenized text back together. Defaults to ’’ (empty string) for type = ‘characters’ and ’ ’ (a single space) for type = ‘tokens’	`None`
edit_mark	str	string indicating how to display mismatches and gaps, defaults to ‘#’	`'#'`
method	str	string with the type of implementation, either ‘default’ (for type ‘characters’ or ‘tokens’) or ‘biopython’ (type ‘characters’ only)	`'default'`
kwargs	Any	optional, passed on to tokenizer	`{}`
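As an illustration of the `tokenizer` and `kwargs` parameters, the sketch below defines a callable with the documented shape `Callable[[str], List[str]]`. The function `regex_tokenizer` and its `keep_punctuation` keyword are hypothetical names invented for this example; the commented call at the end only assumes that extra keyword arguments are forwarded to the tokenizer as described in the table, and that `type='tokens'` is the appropriate setting alongside a custom tokenizer.

```python
import re
from typing import List


def regex_tokenizer(text: str, keep_punctuation: bool = True) -> List[str]:
    """Split text into word tokens, optionally keeping punctuation as tokens."""
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    return re.findall(pattern, text)


tokens = regex_tokenizer("Where can I find John McEnroe?")
words_only = regex_tokenizer("Where can I find John McEnroe?", keep_punctuation=False)

# Possible usage (assumption: keep_punctuation reaches the tokenizer via **kwargs):
# sw = smith_waterman_recursive(a, b, type='tokens',
#                               tokenizer=regex_tokenizer, keep_punctuation=False)
```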

Returns

Name	Type	Description
	List[dict]	a list of dictionary elements as returned by smith_waterman

Examples

>>> from textalignment import smith_waterman_recursive
>>> a = "I am looking for John McNrow, where can I find John McAndRow?"
>>> b = "John McEnroe"
>>> sw = smith_waterman_recursive(a, b, type = 'characters')
>>> len(sw)
2
>>> [element['a']['alignment']['text'] for element in sw]
['John Mc#Nro', 'John McA#ndRo']