The Smith-Waterman similarity measure is used in NLP to find the local similarity between two strings. It’s commonly used in bioinformatics to find similar gene sequences, but has also found usage in NLP for tasks like multiworded keyword search and finding similarity in erroneous strings, where conventional approaches like bag-of-words (BOW) might not be effective.
The Smith-Waterman algorithm is a dynamic programming algorithm employed to calculate the Smith-Waterman similarity measure. Its utility is in its flexibility to penalize differences in the strings being compared.
For two strings
Define the similarity score of the algorithm as
Define the gap penalty of the strings by an array
Create a zero matrix
Fill the entries of
Here,
The greatest element of
The time complexity of the Smith-Waterman algorithm of two strings is
Let’s explore the implementation of the Smith-Waterman similarity measure using the py_stringmatching
library.
from py_stringmatching.similarity_measure.smith_waterman import SmithWaterman str1 = "Alexis" str2 = "AliBaba" sw = SmithWaterman(gap_cost=1.5) smith_waterman_similarity = sw.get_raw_score(str1, str2) print("Similarity function: ", sw.get_sim_func()) print("Smith-Waterman Similarity of \"{}\" and \"{}\" = {}".format(str1, str2, smith_waterman_similarity))
Line 1: We import the Smith-Waterman similarity measure class from the py_stringmatching
library.
Line 5: We instantiate an sw
object of the Smith-Waterman similarity measure class.
Line 7: We then measure the Smith-Waterman similarity of the str1
and str2
strings.
Lines 9–10: The similarity function and the Smith-Waterman similarity of the strings are printed.
Free Resources