Javascript Aho-Corasick String Search Algorithm

For the new Twitterfall Exclusions filter, we needed something that would efficiently search tweets for multiple substrings. The simplest solution would be to search through the tweet multiple times for each string, but this is dumb and inefficient, especially when you may have several tweets per second falling down. The next solution would be to craft a regular expression and use that, but that seemed inefficient to us and we’re computer scientists and we were sure there would be a better way.

After doing a bit of research, I stumbled across the Aho-Corasick algorithm, which is a string search algorithm that matches all patterns in a dictionary ‘at once’ by constructing a trie with sets of links from each node representing a string to the node corresponding to the longest proper suffix. It sounded like just what we needed.

I couldn’t find a Javascript implementation of the entire string search algorithm, so I combined Dennis Bryne’s Trie implementation and my own code to produce a Javascript implementation of this useful string search algorithm, great for filters. This is the string search function, which you can then combine with the Trie implementation on the afore-mentioned link.

// Aho-Corasick String Search Algorithm
// @author Jalada http://archive.jalada.co.uk
// Arguments:
//   trie - a Trie as per http://is.gd/1Y9FT
//   s - string to be searched
function ahoCorasick(trie, s) {
	// Start at the root.
	var current = trie;
	// Nothing in state at the moment
	var state = "";
	// Split the string
	var split = s.split("");
	var j = 0;
	// We return everywhere in this loop.
	while (1) {
		for (i=j; i<split.length; i++) {
			var r = current.hasChild(split[i]);
			// Does this character exist in the children of where we
			//are in the trie?
			if (r) {
				// If so, append to the state, and traverse to
				// that child
				state += split[i];
				current = r;
				// Have we found a word now?
				if (trie.getWordCount(state) != 0) {
					return true;
				}
			} else {
				// If not, go back to where we started to match,
				//reduce i, and empty the state
				current = trie;
				i = i - state.length;
				state = "";
			}
		}
		// Reached the end of the string
		if (state == "") {
			// Just found nothing
			return false;
		} else {
			// Was in the middle of finding something, so possibly
			//missed something, so go back to check.
			current = trie;
			j = i - state.length + 1;
			state = "";
		}
	}
}

I haven’t had much of a chance to make sure this is the most efficient way of coding this algorithm. But it seems to work fine, and in some rough testing I did it seemed 3-5x faster than a regular expression of the syntax /This|That|The other|Whatever/ in the circumstances I tested. If you use this, and find a way of improving on it, be sure to let me know so we can include it back in Twitterfall 🙂

Feel free to use this Javascript code. I’m releasing it under the MIT License.

This entry was posted in Blog, Noteworthy. Bookmark the permalink.