smithwaterman.Smith_Waterman

smithwaterman.Smith_Waterman(a, b, method='default')

Perform one or more pairwise Smith-Waterman alignments to look up short texts (b) in a longer text (a), for example to search for several names in a document.

Parameters

Name Type Description Default
a str the text to search in required
b str | List[str] the text, or list of texts, to look up in a required
method str either 'default' or 'biopython' 'default'

Attributes

Name Type Description
a (str) the text to search in
b (str | List[str]) the text, or list of texts, to look up in a
n (int) the length of b
match (object) the result of the match between a and b using optimal_alignment or recursive_alignment
method (str) either 'default' or 'biopython'

Examples

>>> from textalignment import Smith_Waterman
>>> import pandas as pd
>>> text = "I am looking for John McNrow, where can I find John McAndRow?"
>>> ##
>>> ## Optimal alignment
>>> ##
>>> sw = Smith_Waterman(a = text, b = ["John McEnroe", "McEnroe John"])
>>> m = sw.optimal_alignment(type = 'characters')
>>> m = sw.optimal_alignment(type = 'characters', method = 'default')
>>> m = sw.optimal_alignment(type = 'characters', method = 'biopython')
>>> similarities = sw.as_data_frame()
>>> sw = Smith_Waterman(a = text, b = "John McEnroe")
>>> m = sw.optimal_alignment(type = 'characters')
>>> m = sw.optimal_alignment(type = 'characters', method = 'default')
>>> m = sw.optimal_alignment(type = 'characters', method = 'biopython')
>>> similarities = sw.as_data_frame()
>>> from textalignment import Smith_Waterman
>>> import pandas as pd
>>> text = "I am looking for John McNrow, where can I find John McAndRow?"    
>>> ##
>>> ## Recursive alignment
>>> ##
>>> sw = Smith_Waterman(a = text, b = "John McEnroe")
>>> m = sw.recursive_alignment(type = 'characters', which = 'both', threshold = 0.5)
>>> m = sw.recursive_alignment(type = 'characters', which = 'both', threshold = 0.5, method = 'default')
>>> m = sw.recursive_alignment(type = 'characters', which = 'both', threshold = 0.5, method = 'biopython')
>>> similarities = sw.as_data_frame(m, threshold = 0.7)
>>> sw = Smith_Waterman(a = text, b = ["John McEnroe", "McEnroe John", None])
>>> m = sw.recursive_alignment(type = 'characters', which = 'both', threshold = 0.5)
>>> m = sw.recursive_alignment(type = 'characters', which = 'both', threshold = 0.5, method = 'default')
>>> m = sw.recursive_alignment(type = 'characters', which = 'both', threshold = 0.5, method = 'biopython')
>>> similarities = sw.as_data_frame(m, threshold = 0.7)
>>> similarities = similarities[['a', 'a_from', 'a_to', 'b_similarity', 'b_aligned', 'a_aligned']]
>>> substring(text, start = list(similarities['a_from']), stop = list(similarities['a_to']))
['John McNro', 'John McAndRo']
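
The substring helper called in the last line is not imported anywhere in the example, so where it comes from is not shown on this page. A minimal stand-in might look like the sketch below. The name substring_sketch is hypothetical, and the sketch assumes that a_from and a_to are 1-based, inclusive character positions, which this page does not confirm.

```python
def substring_sketch(text, start, stop):
    """Cut one slice per (start, stop) pair out of text.

    Assumption: positions are 1-based and inclusive on both ends.
    """
    return [text[s - 1:e] for s, e in zip(start, stop)]


text = "I am looking for John McNrow, where can I find John McAndRow?"
# Positions counted by hand for this illustration, not taken from library output.
pieces = substring_sketch(text, start=[18, 48], stop=[27, 59])
print(pieces)  # ['John McNro', 'John McAndRo']
```

If the library reports 0-based or half-open positions instead, only the slice expression needs to change.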

Methods

Name Description
optimal_alignment Use smith_waterman to find the optimal alignment between a and b
recursive_alignment Use smith_waterman_recursive to find the optimal alignment between a and b in a recursive fashion
as_data_frame Return the alignment results as a pandas data frame, keeping only rows whose similarity (b_similarity) is above the given threshold

optimal_alignment

smithwaterman.Smith_Waterman.optimal_alignment(
    type='characters',
    match=2,
    mismatch=-1,
    gap=-1,
    lower=True,
    tokenizer=None,
    collapse=None,
    edit_mark='#',
    method=None,
    **kwargs,
)

Use smith_waterman to find the optimal alignment between a and b

See Also

smith_waterman : smith_waterman

Returns

Name Type Description
dict | List[dict] a dictionary as returned by smith_waterman, or a list of such dictionaries when b is a list
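
To make the match, mismatch, gap and edit_mark parameters concrete, here is a minimal character-level sketch of Smith-Waterman local alignment with a linear gap penalty. It is not the library's implementation: the function name smith_waterman_sketch is hypothetical, the result keys only borrow column names listed under as_data_frame, and the 1-based inclusive positions and the use of edit_mark as the gap symbol are assumptions.

```python
def smith_waterman_sketch(a, b, match=2, mismatch=-1, gap=-1,
                          lower=True, edit_mark="#"):
    """Best character-level local alignment between a and b (linear gap cost)."""
    # Compare case-insensitively when lower=True but report the original text.
    # Assumes lower-casing keeps string lengths unchanged (true for ASCII).
    ca, cb = (a.lower(), b.lower()) if lower else (a, b)

    def step(i, j):
        return match if ca[i - 1] == cb[j - 1] else mismatch

    # score[i][j]: best score of an alignment ending at a[i-1] and b[j-1],
    # floored at 0 so that an alignment may start anywhere.
    score = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    best, end = 0, (0, 0)
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + step(i, j),  # a[i-1] aligned with b[j-1]
                score[i - 1][j] + gap,             # a[i-1] faces a gap in b
                score[i][j - 1] + gap,             # b[j-1] faces a gap in a
            )
            if score[i][j] > best:
                best, end = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = end
    a_out, b_out = [], []
    while score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + step(i, j):
            a_out.append(a[i - 1])
            b_out.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            a_out.append(a[i - 1])
            b_out.append(edit_mark)
            i -= 1
        else:
            a_out.append(edit_mark)
            b_out.append(b[j - 1])
            j -= 1

    return {
        "sw": best,
        "a_aligned": "".join(reversed(a_out)),
        "b_aligned": "".join(reversed(b_out)),
        "a_from": i + 1, "a_to": end[0],  # 1-based, inclusive (assumed convention)
        "b_from": j + 1, "b_to": end[1],
    }


result = smith_waterman_sketch("abcXdef", "abcdef")
print(result["sw"], result["a_aligned"], result["b_aligned"])  # 11 abcXdef abc#def
```

In the usage line, six matches score 12 and the single gap opposite X costs 1, which gives the local score of 11.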

recursive_alignment

smithwaterman.Smith_Waterman.recursive_alignment(
    threshold=0.5,
    which='both',
    type='characters',
    match=2,
    mismatch=-1,
    gap=-1,
    lower=True,
    tokenizer=None,
    collapse=None,
    edit_mark='#',
    method=None,
    **kwargs,
)

Use smith_waterman_recursive to find the optimal alignment between a and b in a recursive fashion

See Also

smith_waterman_recursive : smith_waterman_recursive

Returns

Name Type Description
dict | List[dict] a dictionary as returned by smith_waterman_recursive, or a list of such dictionaries when b is a list
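
One plausible reading of "in a recursive fashion" is: find the best local alignment, then search again in the text to its left and to its right, for as long as the hits stay above threshold. The sketch below illustrates that reading only. The names best_local and recursive_alignment_sketch are hypothetical, the similarity is normalised here as score / (match * len(b)), which may differ from how the library computes b_similarity, and the 'left' and 'right' values for which are assumed (this page only shows 'both'). Positions in the sketch are 0-based and half-open.

```python
def best_local(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, start, end) so that a[start:end] is the best local match for b."""
    ca, cb = a.lower(), b.lower()
    # Each cell is (score, start): start is the 0-based index in a at which
    # the alignment ending in that cell began.
    prev = [(0, 0)] * (len(cb) + 1)
    best = (0, 0, 0)
    for i in range(1, len(ca) + 1):
        cur = [(0, i)]
        for j in range(1, len(cb) + 1):
            step = match if ca[i - 1] == cb[j - 1] else mismatch
            score, start = max(
                (prev[j - 1][0] + step, prev[j - 1][1]),  # match or mismatch
                (prev[j][0] + gap, prev[j][1]),           # gap in b
                (cur[j - 1][0] + gap, cur[j - 1][1]),     # gap in a
                key=lambda cell: cell[0],
            )
            if score <= 0:
                score, start = 0, i  # restart: a later alignment begins at a[i]
            cur.append((score, start))
            if score > best[0]:
                best = (score, start, i)
        prev = cur
    return best


def recursive_alignment_sketch(a, b, threshold=0.5, which="both",
                               match=2, offset=0):
    """Collect all non-overlapping hits of b in a, ordered from left to right."""
    if not a or not b:
        return []
    score, start, end = best_local(a, b, match=match)
    similarity = score / (match * len(b))  # assumed normalisation
    if score == 0 or similarity < threshold:
        return []
    hit = {"text": a[start:end], "start": offset + start,
           "end": offset + end, "similarity": similarity}
    left, right = [], []
    if which in ("both", "left"):
        left = recursive_alignment_sketch(a[:start], b, threshold, which,
                                          match, offset)
    if which in ("both", "right"):
        right = recursive_alignment_sketch(a[end:], b, threshold, which,
                                           match, offset + end)
    return left + [hit] + right


text = "I am looking for John McNrow, where can I find John McAndRow?"
hits = recursive_alignment_sketch(text, "John McEnroe", threshold=0.5)
print([h["text"] for h in hits])  # ['John McNro', 'John McAndRo']
```

Every recursive call works on a strictly shorter piece of a, so the search always terminates. Raising threshold prunes weaker hits: under the normalisation assumed here the two hits score 19/24 and 18/24.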

as_data_frame

smithwaterman.Smith_Waterman.as_data_frame(data=None, threshold=0)

Return the alignment results as a pandas data frame, keeping only rows whose similarity (b_similarity) is above the given threshold

Returns

Name Type Description
pd.DataFrame a pandas data frame with the results of the alignment(s), containing columns a, b, sw, similarity, matches, mismatches, a_n, a_aligned, a_similarity, a_gaps, a_from, a_to, a_fromto, b_n, b_aligned, b_similarity, b_gaps, b_from, b_to, b_fromto
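
The threshold filter can be pictured as a plain row filter on b_similarity. The sketch below uses a list of dictionaries instead of pandas so that it runs without extra dependencies. The function name filter_by_similarity is hypothetical, the rows are made-up values for illustration, and whether the library's comparison is inclusive (>=) or strict (>) is an assumption.

```python
def filter_by_similarity(rows, threshold=0.0):
    """Keep rows whose b_similarity is at least threshold (inclusive: an assumption)."""
    return [row for row in rows if row["b_similarity"] >= threshold]


# Made-up rows for illustration, not real library output.
rows = [
    {"b": "John McEnroe", "b_similarity": 0.8},
    {"b": "McEnroe John", "b_similarity": 0.4},
]
kept = filter_by_similarity(rows, threshold=0.7)
print([row["b"] for row in kept])  # ['John McEnroe']
```

On a pandas data frame df, the same filter is commonly written as the boolean mask df[df["b_similarity"] >= threshold].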