4 String Matching Algorithms (Part 1): BF, KMP, SUNDAY, SHIFT-AND, Hash

Some classic and unparalleled algorithms related to single pattern matching

Course Content#

Related Definitions

Single Pattern Matching Problem: Finding whether a certain pattern string $t$ appears in the text string $s$

Multi-pattern Matching Problem: Finding multiple pattern strings $t$

Text String / Mother String: The string being searched $s$

Pattern String: The string to be searched $t$

The length of the text string is $m$, and the length of the pattern string is $n$

Brute Force Matching Method#

The cornerstone of all string matching algorithms

Idea#

Align each character of the pattern string with the text string one by one until a complete match is found
This idea is the foundation for understanding all matching algorithms: finding the answer without missing any
- The core of other algorithm optimizations is to avoid redundancy, reducing repeated and impossible matching attempts
Time Complexity: $O(m \times n)$

Provoking Thoughts#

How does brute force matching handle the first mismatch?

It will match the first character of the pattern string with the second character of the mother string
If the match is successful, it means the first character of the pattern string is equal to the second character of the text string
- Since it is known that the second character of the pattern string matches the second character of the text string
- It must be that the second character of the pattern string is equal to the first character
Whether the second character of the pattern string is equal to the first character can actually be known when obtaining the pattern string
For the above image, there is no need to attempt to match the first character of the pattern string with the second character of the mother string
- They definitely do not match
Is it possible to skip positions that definitely do not match? The KMP algorithm has its own ideas.

KMP#

State machine thinking, concept of self-structure repetition

Thought Process#

Continuing from above, focus on positions A, B, C

Assuming a mismatch occurs at the black shadow area, the pattern string $t$ moves forward by the size of ④, at this point the text string $s$'s ② and the pattern string $t$'s ① match successfully
Key: The left boundary of part ③ determines the distance $t$ moves forward [same as the right boundary of part ①]
A: In this vertical column, there must exist ②=③=①, with ② adjacent to the mismatch position 👉 ③ also adjacent to the mismatch position
B: ③ is actually a suffix of $t$, and ① is a prefix of $t$ 👉 ③ equals some prefix of $t$
C: To ensure no matches are missed, ④ should be as small as possible 👉 ③ should be as large as possible
Conclusion: In summary, through ③, the matching process can be accelerated, and ③ is the longest prefix of $t$ adjacent to the mismatch position

Problem Transformation#

Pre-obtaining the size of ③ corresponding to the mismatch also gives the size of ①

Find the ① and ③ of the pattern string, using the green characters after ① to match the orange characters at the mismatch position after ②
Consider each character of the pattern string: because the mismatch position could be any character of the pattern string
The pattern string is known in advance; the text string is unknown and uncontrollable
👉 The $next$ array of the pattern string

Next Array#

$next[i]$: Used when the $i$ position matches successfully, but the $i+1$ position mismatches, as the offset for the next attempt to match the pattern string
State Definition: The index of the longest prefix that can match in the suffix ending at position $i$ [right boundary]
- Only for the pattern string
Transition Formula: Understand by looking at the diagram, focusing on points ① and ② in the diagram
- $next[i+1]=\left{\begin{aligned}&next[i]+1&t[i+1]=t[next[i]+1]\&next[next[i]]+1&t[i+1]\neq t[next[i]+1]\ &&\ t[i+1]=t[next[next[i]]+1]\&next[next[next[...[i]]]]+1&until\ equal\ or\ the\ end\end{aligned}\right.$
- First look at the second and third lines, judging ①
  - If $t[i+1]=t[next[i]+1]$, then the longest prefix corresponding to $i+1$ is $next[i]+1$
  - Otherwise, move $t$ to the right as short as possible [similar to the ④ in the thought process above], which is to find the maximum prefix corresponding to the suffix ending at $j$ in the substring $t[0\to j]$ [this is the definition of the $next$ array]
- Observe the first and second lines, judging ②
  - If $t[i+1]=t[next[j]+1]$, then the longest prefix corresponding to $i+1$ is $next[j]+1$, where $j=next[i]$
  - Otherwise, continue to move $t$ to the right, judging ③④... until the equality condition is satisfied, or $t$ can no longer be moved to the right
Involves the idea of dynamic programming, can refer to "Interview and Written Test Algorithms" — 1 From Recursion to Dynamic Programming (Part 1)
Practice Understanding
- Define a virtual position: -1 [when no prefix matches, the value is -1]
- Practice 1
- Practice 2
- Answer: -1, -1, 0, -1, 0, 1, 2, 3, 4, 5, 6, 1
[PS] Refer to Detailed Explanation of KMP Algorithm — CSDN
- In the text, the $next$ array is defined as: the maximum common length of prefixes and suffixes [excluding the string itself, otherwise the maximum length is always the string itself]
- It can be seen that the definition of the $next$ array is not unique
- The concept of self-structure repetition unfolds, and this idea is used when calculating the $next$ array

Matching Process#

Traverse each character of the text string $s$ one by one
Start matching from the beginning of the pattern string $t$
1. If the current position of the text string $s$ is equal to the current position of the pattern string $t$, then match the next character of the pattern string $t$
2. Otherwise, based on the $next$ array, continuously move the matching position of the pattern string $t$ forward until the equality condition is satisfied, match the next character of the pattern string $t$; or if it can no longer be moved forward, match the next character of the text string $s$
When successfully matching to the end of the pattern string $t$, output the corresponding matching position of the text string $s$; or indicate that it was not found

Summary#

Understand the importance of part ③ in the pattern string [see thought process]
- The core of speeding up the matching process, avoiding a large number of useless matching attempts
The core to ensure no matches are missed: part ③ matches the longest prefix of the pattern string
Can be understood from the perspective of state machines, see code demonstration — State Machine Version of KMP
Time Complexity: $O(m + n)$
[PS] KMP also has some optimization methods

SUNDAY#

Find the tail first, then find the head

Simulation Operation#

Directly simulate the following situation

① When a mismatch occurs, observe the next position of the last character of the pattern string $t$ corresponding to the mother string $s$ — golden alignment point
- [Note] Do not observe the position after the mismatch <don't be misled by the image>
② In the pattern string $t$, find the first position of e from the back
- Found e at the second to last position
③ The second to last position means that the attempt to match the pattern string $t$ with the mother string $s$ moves forward by 2 positions, and then starts matching the mother string $s$ again
- [Personal Understanding] If this right shift is to match successfully, the e position must align, because it will always pass through e; finding the first e is to avoid missing the answer
④ The above image still results in a mismatch, at this point find the first a in the pattern string $t$ from the back, align it, match... until successfully matched or the matching ends
[PS] If the character corresponding to the golden alignment point cannot be found, the right shift offset will be the length of the entire pattern string $t$ + 1

Process#

Preprocess the last occurrence of each character in the pattern string
Simulate the brute force matching algorithm process, when a mismatch occurs, the text string pointer moves back according to the above preprocessing information, then re-matches
Finally, return the index if matching is successful, or indicate that it was not found

Summary#

Core: Understand the golden alignment point [see simulation operation]
- If matching is successful, it must appear in the pattern string
- It should align with the last occurrence of that character in the pattern string
Right shift operation: The position of the character found in the pattern string indicates how many positions to move the text string pointer to the right
Fastest Time Complexity: $O(m / n)$
- Corresponding scenario: If the character to find the golden alignment point is not found in the pattern string, the text string pointer will move back the entire length of the pattern string $n + 1$
[PS]
- Worst-case time complexity: $O(m \times n)$, but this scenario generally has no practical significance
- Usage scenario: Solving simple single pattern matching problems such as "finding a word in an article"
- Compared to KMP
  - SUNDAY is easier to understand and implement
  - But the thinking of the KMP algorithm is more advanced and should be learned [State Machine]

SHIFT-AND#

A very elegant niche algorithm that tests imagination
State machine thinking: NFA, Non-deterministic Finite Automaton

Encoding Array d#

⭐ Convert the pattern string into a kind of encoding, after conversion, the matching problem has nothing to do with the pattern string at all

Starting from position 0, read each character of the pattern string $t$ one by one
When the character 'a' appears at position 0, set the element at index 'a' in the $d$ array to 1
When the character 'e' appears at position 1, set the element at index 'e' in the $d$ array to 1
And so on, obtaining the $d$ array
[PS]
- The index of the $d$ array can correspond to the ASCII code of the character
- The elements of the $d$ array can store int type [32 bits], long long type, or custom types, which determines the maximum length of the pattern string that can be matched
- Note the sequence of characters and the numerical sequence of the $d$ array elements, the character at position 0 corresponds to the lower bits of the latter

State Information p#

Understand the state definition and feel the transition of states

State Definition: $p_i$
- In the binary representation of $p_i$, if the $j$th bit is 1, it means that when ending with the text string $s[i]$, the first $j$ characters of the pattern string can be successfully matched
- Diagram, compare the above definition with the three pairs of circles below
  - Red: $p_4$ represents the current matching condition ending with the text string $s[4]$
  - Orange: The second bit of $p_4$ is 1, indicating that the first 2 characters of the pattern string $t$ can satisfy the above matching condition, i.e., $t[0\to 1]$ matches $s[3\to 4]$
  - Green: The fifth bit of $p_4$ is 1, indicating that the first 5 characters of the pattern string $t$ can satisfy the above matching condition, i.e., $t[0\to 4]$ matches $s[0\to 4]$
  - [PS]
    - The binary length of $p$ depends on what data type is used, like the elements of the $d$ array, such as int: 32 bits
    - Do not focus on how $p_4$ is obtained for now, we will look at the transition process later
- ⭐ It can be seen that
  - $p$ simultaneously stores information about multiple matching results❗
  - When the $n$th bit of $p_i$ is 1, it indicates a successful match, that is
    - $p_i\ &\ (1\ <<\ (n-1))=1$
    - The matching position is: $i-n+1$
Transition Process: $p_i=(p_{i-1}\ <<\ 1\ |\ 1)\ &\ d[s[i]]$
- ⭐ Key Formula ⭐
- At this point, the pattern string $t$ has been completely discarded, and through the previous state of $p$ and the $d$ array, the current value of $p$ can be obtained
- Principle Analysis, compare the key formula with the following diagram ①②
  - Assuming $p_3$ is known, the current matching can match $s[2\to 3]$ and $s[0\to 3]$
  - ① Left shift by 1 + or 1 operation, resulting in the transitional state $p_4^{'}$
    - Left shifting by 1: Attempting to swallow $s[4]$ based on the previous match, i.e., matching $s[2\to 4]$ and $s[0\to 4]$
      - The condition for successful matching lies in the subsequent and operation
      - [Note] The left shift operation appears as a right shift in the diagram because the order of digits is low bits first
    - Or 1 operation: Leave a chance to match $s[4]$ alone, matching successfully represents matching this one character
  - ② And operation
    - The above two operations are just attempts to match, while the actual success depends on the performance of $d[s_4]$
    - If the result of the and operation at the $j$th bit is 1, it indicates a successful match of $s[4-j+1\to 4]$ with $t[0\to j-1]$
      - For example, if $p_4[2]=1$, $p_4[5]=1$, the matching result is as illustrated in the state definition part
[PS]
- The initial value of $p$ is 0, and $p$ is actually a non-deterministic finite automaton
- Habitual Conflict: Generally, the right side of the number sequence is the low bit, but in the diagram, the left side of $p$ is the low bit; the lowest bit of binary numbers is 1, while the lowest bit of strings is 0; the elements of the $d$ array are numbers, so they should correspond to the number sequence, but the above diagram is to reflect the corresponding position relationship with the pattern string, using the string sequence
- Accurately understand the meaning of $p$: The last successful match was 4 bits, and the next time it may successfully match 5 bits
- For matching problems with more than 64 bits, you need to create your own data type
  - Just support left shift, or, and operations
  - Then modify the types of the encoding array $d$ and the state information $p$ accordingly

Summary#

① Encoding Array $d$: Converts the position of each character in the pattern string into corresponding binary encoding
② State Information $p$: Transitions states through the key formula $p_i=(p_{i-1}\ <<\ 1\ |\ 1)\ &\ d[s[i]]$
Time Complexity: $O(m)$
- Does not require back and forth movement of the pattern string
- However, strictly speaking, it is not purely $O(m)$, this is related to the data representation capability of the computer
❗ Can solve regular expression matching problems
- For example: $(a|b|c)&(c|d)&e&(f|a|b)$
- ① Convert the above pattern string into the encoding array $d$
  - For each column of the $d$ array, multiple bits can be set to 1
  - That is, each character of the pattern string can set multiple corresponding binary bits of the $d$ array elements to 1
- ② After that, it is unrelated to the pattern string, solving the problem based on the $d$ array and the key formula
- [PS] Here it can be seen that the reason why the pattern string is called a pattern string is that it not only refers to a string but can also be a regular expression (which can also be seen as a kind of multi-pattern with relevance)

Code Demonstration#

Brute Force Matching Method#

Two versions, learn the techniques of the concise version, flexibly use $flag$
Traverse the text string and the pattern string respectively, comparing each character; if there is a mismatch, move the text string pointer one position to the right
This idea is the foundation of all string matching algorithms

KMP#

2 programming methods, the latter is closer to the essence of the algorithm

① Ordinary coding: Directly obtain the $next$ array, then use the $next$ array for mismatch offset
- Preprocess the $next$ array → Traverse each character of the text string → Match the pattern string [using the $next$ array]
- ❗ Observe: The process of obtaining the $next$ array is very similar to the process of matching the text string, based on the idea of state machines, the two parts can be integrated
② Advanced programming: Based on the state machine idea, transform the process of obtaining the $next$ value into a state change process
- Abstracted a state machine model: $j$ points to the position of the state machine, and the $getNext$ method is equivalent to performing a state transition based on the input character, which is essentially changing the value of $j$
- A way of thinking that is closer to the essence of KMP
PS: The $next$ array should also be freed before returning $i-n+1$ when a match is successful

SUNDAY#

❗ Note the meaning of 256: Used to record the position of each character in the pattern string
The process of preprocessing the $offset$ array: First initialized to indicate that no characters have appeared, $n+1$ indicates that the text string pointer can directly move right by a distance of one pattern string length + 1; during preprocessing, the positions of later identical characters will overwrite earlier ones, which aligns with the intention of SUNDAY
$s[i + n]$: The character corresponding to the golden alignment point
$i + n <= m$: Reasonably utilizes the $n$, $m$ variables, indicating that the text string is still sufficient for the pattern string to match
Compared to KMP, SUNDAY is easier to understand, and the code is easier to implement [the best choice for solving simple single pattern matching problems]

SHIFT-AND#

⭐ Beautiful idea, simple code
Note
- The $d$ array elements related to numbers and state $p$, the lowest bit is 1; while the string corresponds to 0
- But the transition process is an operation between numbers, so there is no need to worry about whether the positions are aligned

In-Class Exercises#

——Hash——#

Common problem-solving routine

HZOJ-275: Rabbits and Rabbits#

Sample Input

aabbaabb
3
1 3 5 7
1 3 6 8
1 2 1 2

Sample Output

Yes
No
Yes

Thought Process
- Note: The number of queries is particularly large, and the query range is very large, brute force matching will definitely time out
- How to compare quickly? 👉 Hash matching algorithm
  - By hashing, represent the string with numbers <hash value>
  - If the hash values of the two strings are different, the two strings must not be equal
    - This eliminates most of the unequal cases without needing to compare bit by bit
  - Otherwise, verify the equality of the strings through brute force matching
    - If the two strings are equal, one match can succeed
  - [PS]
    - The hash value corresponding to long strings may be too large, so generally, the hash value is reduced by taking the modulus
    - If the remainders are different 👉 the strings must be different
- Design a hash function [string -> hash value]
  - $H=(\sum_{k=0}^{n-1}{C_k\times base^k})%P$
  - $n$: Length of the string $s$
  - $C_k$: ASCII code value corresponding to the $k$th character
  - $base$: Weight [prime number]
  - $P$: Modulus [prime number]
  - [PS] Characters are stored in binary numbers at the computer's bottom level
- ⭐ In this problem, the hash value $H_{j\to i}$ of a substring $s[j\to i]$
  - Based on the relationship between the string and the hash value, a prefix sum array based on hash can be preprocessed
    - $HS_i=(\sum_{k=0}^{i}{C_k\times base^k})%P$
  - Thus, the hash value of the substring $s[j\to i]$ based on hash is
    - $HS_{j\to i}=(\sum_{k=j}^{i}{C_k\times base^k})%P$
  - However, the hash value of the substring should not include the positional information $j$, $i$ of the substring in the string, that is, the multiplier factor should start from $base^0$
  - Therefore, the interval sum $HS_{j\to i}$ needs to be divided by $base^j$, but for the modulus operation, division cannot be done directly and requires the use of the inverse element
  - Finally, the hash value of the substring $s[j\to i]$ is converted to
    - $H_{j\to i}=HS_{j\to i}\times(base^j)^{-1}%P=(HS_i-HS_{j-1})\times(base^j)^{-1}%P$
    - $(base^j)^{-1}$: The inverse element of $base^j$
    - For the quick calculation of the inverse element, please refer to: Quick Calculation of [Mod] Inverse Element
- Solution Method ①
  - 1. Preprocess the hash prefix sum for each position in the text string, convenient for later calculating the interval sum
  - 1. Use the interval sum and the inverse element to obtain the hash value of a substring $s[i\to j]$:
    - $H_{j\to i}=HS_{j\to i}\times(base^j)^{-1}%P$
  - 1. Compare the hash values of the two substrings
    - If they are not equal, the substrings must be different
    - Otherwise, verify their equality through brute force matching
  - ❗ The probability of two substrings being different but having equal hash values is $1/P$
- Solution Method ②
  - Based on Method ①, design two hash functions with different $P$ and $base$
  - Compare the two hash values corresponding to the two substrings
    - If both are equal, consider the two substrings the same
    - Otherwise, they must be different
  - ❗ The probability of two substrings being different but having equal hash values is $1/(P_1\times P_2)$
    - The probability of error is extremely low; in fact, many algorithms are probabilistic algorithms
Code
- Method ①: Hash Matching + Brute Force Matching
- ① The value of $P$ must be a prime number; otherwise, some inverse elements will not be meaningful, leading to incorrect hash values
  - This can cause the same substring to correspond to different hash values
  - The sufficient and necessary condition for the existence of the inverse element: being coprime
- ② It is necessary to use long long type, as the intermediate values may be very large
  - This increases the probability of different substrings corresponding to the same hash value, increasing the time consumption [brute force matching]
- [PS]
  - Always remember to take the modulus to prevent numbers from exceeding the representation range, even for long long
  - Ensure that the hash value is positive
    - The hash prefix sum array $H$ is not monotonically increasing because there are modulus operations during summation
  - A small $P$ value may lead to timeouts due to excessive brute force matching
- Method ②: Hash Matching * 2
- Note: The two inverse element arrays should be initialized separately to ensure complete initialization [the range of traversal is different]
  - Or use the same $P$ and different $base$, which can be initialized together
- [PS] For safety, a brute force matching verification can also be added

Quick Calculation of [Mod] Inverse Element#

⭐Derivation Process

Goal: Solve for $x\times x^{-1}\equiv1\ (mod\ P)$ in which $x^{-1}$
- [PS] Only consider the case where $x<P$, because all $x$ greater than $P$ can be converted to less than $P$ by taking the modulus
Let: $P\ %\ x=r,P\ /\ x=k$
Then:
- $\begin{aligned}P&=kx+r\kx+r&\equiv0\ (mod\ P)\kx(x^{-1}r^{-1})+r(x^{-1}r^{-1})&\equiv0\ (mod\ P)\kr^{-1}+x^{-1}&\equiv0\ (mod\ P)\x^{-1}&\equiv-kr^{-1}\ (mod\ P)\\end{aligned}$
Ultimately, the problem of finding $x^{-1}$ is transformed into finding $r^{-1}$
- And $r$ is the remainder of $P$ divided by $x$, so $r$ must be smaller than $x$
A larger problem 👉 a smaller problem [recursion]
To ensure the inverse element is positive, one more process is done
- $x^{-1}_{+}=(-kr^{-1}\ %\ P+P)\ %\ P$
[PS]
- The modular inverse element: The sufficient and necessary condition for the existence of the inverse element is that $p$ and $x$ are coprime
- There are many methods to find the inverse element, refer to Popularization of Inverse Element Knowledge (Elementary Version) — cnblogs

Code

Regardless of the modulus, the inverse element of 1 is 1 [initialization]
Output Result
- Can verify: The original number multiplied by the inverse element modulo is 1

Additional Knowledge Points#

The essential ideas of KMP and SHIFT-AND: State machines
The SUNDAY algorithm is commonly used in daily life, and the Hash matching algorithm is commonly used in problem-solving
DFA and NFA are knowledge from compiler principles
- Main difference: DFA is faster, NFA consumes less memory
- Refer to Difference between DFA and NFA — Stechies

Tips#

Before starting C++, make sure to finish the preparatory course
Becoming an insider in the field of algorithms relies on improving your aesthetic ability
Getting angry is a sign of incompetence; focus on doing the right thing! Improve step by step, without seeking instant success