Searching for Multiple Strings
Suppose you have an application that needs to search for all occurrences of multiple strings in a text stream. For example, you might want to search a large text file for all occurrences of “cycling”, “bicycling”, “bicycle”, “cycle”, “bike”, and other bicycling-related terms. This can be a more difficult task than it first appears.
The simple case
If the file is small, the task isn’t terribly difficult. You can read the entire file as a string into memory and then use standard string searching methods to find each string, as shown here.
private static void FindAllSimple(string textToSearch, string[] searchStrings)
{
// brute force method
foreach (var s in searchStrings)
{
int iPos = 0;
int foundPos;
while (iPos < textToSearch.Length && (foundPos = textToSearch.IndexOf(s, iPos)) != -1)
{
Console.WriteLine(“{0},{1}”, foundPos, s);
iPos = foundPos + 1;
}
}
}
That code walks through the source text calling IndexOf repeatedly for each string to find all occurrences.
Note that when a match is found it increments by just one character rather than by the length of the string.
The reason is that we want all occurrences. So if the text contains “ababababababa” and we’re searching for “aba”, we’ll get:
ababababababa
aba
aba
aba
aba
aba
aba
If we were to increment by the length of the found string, our results would be quite different:
ababababababa
aba
aba
aba
It’s possible to speed things up a little bit by determining how much we could increment for each search string.
For example, if we’re searching for “aba”, then we could add two to the found position because we know that if we found “aba” at position 10, then position 11 can’t be the start of the next occurrence because by definition that position contains the letter “b”. Such an optimization can be very effective if the strings are long and if it’s expensive to search intervening characters. But for searching relatively short strings in memory, the string preprocessing overhead takes more time than the optimization saves.
You might also be able to use regular expressions to solve this simple case. For example, if you were looking for “bike” or “bicycle”, you could construct a regular expression and call the Match method repeatedly to find all occurrences. That seems like a win because you then only have to go through the source string one time. But it has a serious drawback: it misses things.
Consider searching for all occurrences of “ab” and “aba” within the string “ababababababa”. The method shown above works very well in this case, but the regular expression solution wouldn’t work at all. It would find all occurrences of “aba”, but it wouldn’t find any of the “ab”. Granted, you could special case overlapping strings, but I think you’d find it more trouble than it’s worth.
How fast is it?
The time required by the brute force string search is proportional to the size of the source text and the number of strings searched for. If N is the number of strings you’re searching for and M is the time it takes to search for one string, then the time it takes for all strings is N * M. There is some overhead for finding a match, and the length of the strings you’re searching for plays a role, but N * M is a good rough approximation. Finding 100 strings should take 10 times as long as finding 10 strings.
I ran a test on my system using a five megabyte text file, searching for multiple strings in the source text. The results show that N * M is a pretty good indicator.
| Number of strings | Time |
| 10 | 106 ms |
| 100 | 888 ms |
| 200 | 1,735 ms |
| 1,000 | 8,943 ms |
There’s a little bit of initial overhead, but after that it’s pretty close. 200 strings takes almost exactly twice as long as does 100, and 1,000 strings takes 10 times as long.
It’s also true that if you double the length of the text, the time to search doubles. Finding all occurrences of 1,000 strings in 10 megabytes takes almost 18 seconds.
So if N * M holds true, then how long would it take to find all occurrences of 10 strings in a gigabyte? If it takes 106 ms to find those strings in 5 megabytes, then:
106 ms / 5 Mb = ?? ms / 1024 Mb
(1024 Mb) * (106 ms / 5 Mb) = 21,708 ms (~ 22 seconds)
And if you want to find all occurrences of 1,000 strings in that gigabyte, it would take you 22,000 seconds, or about six hours.
Remember, we’re doing all of this in memory. Imagine what the times would be if you were searching through a 10 gigabyte file. Considering that it takes 15 or 20 seconds to read a gigabyte from disk, it would take you close to two minutes per string to find all occurrences. Clearly, we need a more efficient way to search for multiple strings in a large text archive.
