What is the Smith-Waterman similarity measure in NLP?

The Smith-Waterman similarity measure is used in NLP to find the local similarity between two strings. It’s commonly used in bioinformatics to find similar gene sequences, but has also found usage in NLP for tasks like multiworded keyword search and finding similarity in erroneous strings, where conventional approaches like bag-of-words (BOW) might not be effective.

Smith-Waterman algorithm

The Smith-Waterman algorithm is a dynamic programming algorithm employed to calculate the Smith-Waterman similarity measure. Its utility is in its flexibility to penalize differences in the strings being compared.

For two strings a=a1a2...ama = a_1a_2...a_m and b=b1b2...bnb = b_1b_2...b_n, the Smith-Waterman algorithm is as follows:

  1. Define the similarity score of the algorithm as s(x,y)={s1if x=ys2elses(x, y) = \begin{cases} s_1 & \text{if}~x=y \\ s_2 & \text{else}\end{cases}, where xx and yy are two string characters.

  2. Define the gap penalty of the strings by an array WkW_k, where kk is the length of the gap. By assigning a negative score of our choice to the gaps in the sequences, we can control how the gaps influence the similarity score.

  3. Create a zero matrix MM of size m×nm\times n.

  4. Fill the entries of MM using the following equation:

Here, 1im1\leq i \leq m, 1jn1\leq j \leq n, and kk and ll are the lengths of the gaps at aia_i and bjb_j, respectively.

  1. The greatest element of MM is the Smith-Waterman similarity measure.

The time complexity of the Smith-Waterman algorithm of two strings is O(m2n)O(m^2n) and the space complexity is O(mn)O(mn), where mm and nn are the lengths of the strings.

Code example

Let’s explore the implementation of the Smith-Waterman similarity measure using the py_stringmatching library.

from py_stringmatching.similarity_measure.smith_waterman import SmithWaterman

str1 = "Alexis"
str2 = "AliBaba"
sw = SmithWaterman(gap_cost=1.5)

smith_waterman_similarity = sw.get_raw_score(str1, str2)

print("Similarity function: ", sw.get_sim_func())
print("Smith-Waterman Similarity of \"{}\" and \"{}\" = {}".format(str1, str2, smith_waterman_similarity))
Calculating the Smith-Waterman similarity measure for two strings

Explanation

  • Line 1: We import the Smith-Waterman similarity measure class from the py_stringmatching library.

  • Line 5: We instantiate an sw object of the Smith-Waterman similarity measure class.

  • Line 7: We then measure the Smith-Waterman similarity of the str1 and str2 strings.

  • Lines 9–10: The similarity function and the Smith-Waterman similarity of the strings are printed.

Free Resources

Copyright ©2025 Educative, Inc. All rights reserved