Searching Texts — But Fast!
Today I will explain lesson 6 in Algorithms Unplugged book which talks about how to search text in two ways:
- Naive Algorithm
- Boyer — Moore — Horspoll Algorithm
As usual, we will start with the less efficient naive one because it’s easier to understand. Let’s consider we have a text denoted by (t) = “Haystack with a needle” and we want to search the text for a word denoted by (w) = “needle” We want to return the position of this word on the text.
The normal way we will loop through every character in the text and do another loop on the word itself to check if all the characters from the beginning of the text to the end of the word are the same and then we are assured that it happens and we can return the index of the first character in text to be the position in which word located.
This is a simple solution that will tend to a high number of comparisons we can do another simple solution that will make us nearer to the optimal Boyer algorithm by only making our comparisons start from the end of the word not from the beginning.
To simplify it more we will draw an image of how it works and after that, we can start to implement the code.
The first image shows the two strings (t) and (w) as tables each table consists of a character and its index. The pos variable will be used to help us indicate our current position and by the equation in the image, we will get the last index of our current position that has the same length as (w) so in our case (w) has a length of six so we will start with six to be able to compare the two strings from their ends, not their beginning.
Then we will start the other while loop which will compare the current (t) character with the current (s) character if they are equal the loop will continue and in each iterate, we will have a temporary value that holds the length of (w) that will be decreased by one inside this loop. so in our case which is shown in the image the while loop will not be executed at all because (e!== a).
The position will be increased by one and we will get the following which is a case similar to the above image.
We will continue shifting our pos until it is 17 in this case needle is found and our nested while loop will be executed iteratively and decrease our temporary variable that holds the length of (w) until it becomes zero and in this case, we are ensured that we have a match in this position.
So let’s implement the code I think it’s now will be more understandable.
function naiveSearch(w,t){
let pos = 1;
let n = t.length; //length of text we search in
let m = w.length; // length of word we want to found
// loop until the length of word exceeds the remain unsearched text
while(pos <= (n-m+1)){
let j = m; // temp variabe hold the length of word
//loop as long as character in w equal to t
while(j>0 && w[j] === t[pos+j-1]){
j = j - 1 // decrease the temp length of word
}
if(j === 0){
// in case that temp length reach zero then we found the word
console.log(`${w} occurence at position ${pos}`)
}
// update pointer
pos++;
}
}
naiveSearch("needle","Haystack with a needle");
//output : needle occurence at position 17
So if we notice we have done many unnecessary comparisons by doing this algorithm because, from the first comparison, we can shift our pos by 6 and go directly to index 12 and start from there because a in the text is not found by any mean in the word we search for so we can skip any substring in t that contain a and guarantee that we haven’t missed any comparison.
How can we do such a thing well this is the turn of the Boyer Algorithm which tends to store all the words in the text in an object in which the key is the character and the value is how much I can shift the pos if there is any mismatch.
So in case the character in the text is not found in the word you searched for then you can shift pos by the length of the word.
but what if this is not the case and I have a character that is in the word I search for example I compare l in the index of 21 with e it’s not equal but I can’t shift in this case by the length of the word as if I do so I will lose some comparisons that may contain the string so what to do? well, you will shift only the pos by one because if you do so you will guarantee that you make the l in the word string right below the l in the text string.
To calculate how much we will shift for each character you will find that if we have a word like cat. t will make the pos shift by the length of the string so it will be 3 but a will only make it shift by 1 and c will make it shift by 2.
From the previous example, we notice that the last character will have the same shift magnitude as any other character but starting from the next to last we will begin to count normally.
So we will do a for loop at first that will store all the characters in the text and suppose that they all are not found in the word we search for so we will have an object consisting of the following:
D = {
"H": 6,
"a": 6,
"y": 6,
"s": 6,
"t": 6,
"c": 6,
"k": 6,
" ": 6,
"w": 6,
"i": 6,
"h": 6,
"n": 6,
"e": 6,
"d": 6,
"l": 6
}
Then after that, we will have another loop that checks only for the characters in the word I search for and manipulates the object D we have created above with the new values for each character in our word except the last one.
D = {
"H": 6,
"a": 6,
"y": 6,
"s": 6,
"t": 6,
"c": 6,
"k": 6,
" ": 6,
"w": 6,
"i": 6,
"h": 6,
"n": 5,
"e": 3,
"d": 2,
"l": 1
}
The keys that have changed are at the bottom of the object and follow the same concept we discussed before but the difference here is that e which is our last character repeated so it will be updated to be 3 ne(e)dle if you count descendingly from the first character n with m-1 we will have:
n → 5 (m-1) , e → 4 , e → 3 , d → 2 , l →1 therefore e at the end will be 3
Finally, we can implement the code as you can see below:
function boyerMoore(w,t){
let D ={}; //save the table of
let m = w.length;
let n = t.length;
for(let i = 0; i < t.length ; i++){
D[t[i]] = m
}
for(let i = 0 ; i < m-1 ; i++){
D[w[i]] = m-i-1
}
let pos = 1; // pointer detect current position(character)
// loop until the length of word exceeds the remain unsearched text
while(pos <= (n-m+1)){
let j = m; // temp variabe hold the length of word
//loop as long as character in w equal to t
while(j>0 && w[j] === t[pos+j-1]){
j = j - 1 // decrease the temp length of word
}
if(j === 0){
// in case that temp length reach zero then we found the word
console.log(`${w} occurence at position ${pos}`)
}
// update pointer
pos+= D[t[pos+m-1]]
}
}
boyerMoore("needle","Haystack with a needle");
//output : needle occurence at position 17
This approach will make us decrease our comparisons from 24 comparisons with the naive solution to 11 comparisons with the Boyer solution and that’s a huge difference in performance.
So that’s it for today we have discussed how to search a string in an optimal way next lesson is the famous depth-first search algorithm so stay tuned😉.