Rabin–Karp algorithm is a string-searching algorithm using rolling hash. Rolling hash can be made with O(1) by removing a first element from the previous hash and adding a next element to it. There are various hash functions which can do it, but for example, when simply summing up character codes, the hashes will collide just because the same character is contained, so normally, the string is transformed to number as follows. Overflow can be avoided even with long strings by taking the modulo.
#include<bits/stdc++.h>
using namespace std;
class rolling_hash {
const int base = 256;
const int mod = pow(10, 9) + 7; // prime modulus
string str;
int window_length;
long tail_p; // = pow(base,window_length)
long current_hash = 0;
int head_idx = 0;
public:
rolling_hash(string str, int window_length) {
this->str = str;
this->window_length = window_length;
long p = 1;
for (int i = window_length - 1; i>= 0; i--) {
current_hash = (current_hash + ((str[i] * p) % mod)) % mod;
if (i != 0) p = (p * base) % mod;
else this->tail_p = p;
}
}
int get() {
return current_hash;
}
// a_0 * base^n + a_1 * base^n-1 + ... + a_n * base^0
// => a_1 * base^n + ... + a_n * base^1 + a_n+1 * base^0
void roll() {
head_idx++;
current_hash = (current_hash - ((str[head_idx-1] * this->tail_p) % mod)) % mod; // a_1 * base^n-1 + ... + a_n * base^0
int tail_idx = head_idx+window_length-1;
if (tail_idx >= str.length()) throw new out_of_range("the window is out of range");
current_hash = (((current_hash * base) % mod) + str[tail_idx]) % mod; // a_1 * base^n + ... + a_n * base^n-1 + a_n+1 * base^0
}
};
Rabin–Karp algorithm searches for a substring where this hash matches. Hashes can collide so string comparison is needed after that.
int index_of(string str, string substr) {
auto str_hash = rolling_hash(str, substr.length());
auto substr_hash = rolling_hash(substr, substr.length());
for (int i = 0; i < str.length() - substr.length(); i++) {
if(str_hash.get() == substr_hash.get() && str.substr(i, substr.length()) == substr) return i;
str_hash.roll();
}
return -1;
}
int main(void){
cout << index_of("ABCDEF", "CDE") << endl; // 2
cout << index_of("ABEF", "BCD") << endl; // -1
}
Let n be length of whole string, and m be length of substring, this algorithm is O(n+m) on average, but in the worst case where all hashes collide so string comparison is done every time, it is O(nm) same as brute force. When searching for multiple substrings, the execution time does not depend on the number so it is used to detect plagiarism.