What is the Boyer-Moore algorithm?

Overview

In bioinformatics, pattern searching is an essential problem. We search for strings in different types of sequences using such algorithms.

The Boyer-Moore algorithm is a string-searching algorithm used for finding DNA sequences from a string of DNA. It was developed in 1977 by Robert S. Boyer and J. Strother Moore.

Example

Let's take an example of a sample text and a pattern that is to be detected.

Example of the string and pattern to be detected.
1 of 4

Algorithm

Let's assume the original string is TT and its length is nn. On the other hand, the string to be searched is PP , and its length is mm.

The algorithm begins the alignment at k=mk=m. This means that the start of PP aligns with the start of TT. Characters in TT and PP are compared starting at the index mm in PP and index kk in TT while moving backward.

The strings are compared from the end of PP to the start of PP in a backward manner. This comparison continues till the beginning of PP is reached, indicating whether a match or a mismatch occurs. Upon a mismatch, the alignment is moved forward according to the specified rule. These rules are the bad character rule and the good suffix rule.

Note: To learn more about the rules, visit this Answer.

Comparisons are then performed at the new alignment, which is repeated until the alignment is shifted past the end, indicating no further matches will occur.

Code sample

This code sample uses the bad character rule.

number_of_characters = 256
def bad_character_rule(string, size):
#set all occurences as -1
bad_character = [-1]*number_of_characters
#set the value of the last occurrence
for i in range(size):
bad_character[ord(string[i])] = i;
return bad_character
def search(original_string, search_string):
n = len(original_string)
m = len(search_string)
bad_character = bad_character_rule(search_string, m)
s = 0
while(s <= n-m):
j = m-1
#reduce index of search_string while the serch_string and original_string match
while j>=0 and search_string[j] == original_string[s+j]:
j -= 1
#if search_string is found at s, set j as -1
if j<0:
print("search_string present at index = {}".format(s))
s += (m-bad_character[ord(original_string[s+m])] if s+m<n else 1)
else:
s += max(1, j-bad_character[ord(original_string[s+j])])
def main():
original_string = "ACTGGATGT"
search_string = "TG"
search(original_string, search_string)
if __name__ == '__main__':
main()

Code explanation

In main.py:

  • Lines 3–11: We use the bad_character_rule function to preprocess the bad character rule.

  • Lines 13–32: We use the search function for searching the pattern that makes use of the bad character for the given search_string.

Conclusion

The bad character rule has the runtime O(n/m)O(n/m) in the best case and O(nm)O(nm) in the worst case. The worst case occurs when the character of the original string and the string to be found are the same. The best case occurs when all the characters of the original string and the string to be matched are different.

Free Resources

Copyright ©2025 Educative, Inc. All rights reserved