Loading...

Regex builder

Submitted by: etlgfx
Updated: ; Posted:

One of the changes I made to my parser is how smileys are handled. Whereas the old parser had a character tokenizer which made writing the logic for parsing smileys tedious to say the least, I felt it would be worth it to attempt to build an auto regex builder script.

Basically I wanted to be able to convert this:

array(
        ':)' => 'happy',
        ':(' => 'sad',
        ':/' => 'dunno',
        ':@' => 'yelling',
...

Into:

/(\:(\(|\)|\*|/|@|D|O|P)|;\)|\>\:(\(|D|\[)|B\)|O\:\))/

Since we're dealing with a set of tiny string which, most of which with similar prefixes, my original parsing logic looked a lot like a tree, or state machine if you will. Seeing as a regular expression is basically just a fancy hard to read state machine, I figured it would be a pretty easy way to represent my smileys.

So here's what we ended up with:

  1. Order smiley array so that all prefixes are in order

  2. Create a recursive array definition of the smileys

  3. Collapse the recursive array back into a regex

The intermediate representation looks like so:

array (
  '\\:' => 
  array (
    '\\(' => 
    array (),
    '\\)' => 
    array (),
    '\\*' => 
    array (),
  ),
  ';' => 
  array (
    '\\)' => 
    array (),
  ),
...

You can see the tree structure taking place. Note the extra backslashes though, these are for escaping all PCRE special characters.

And now, for your enjoyment, the full algorithm.

/** @class RegexUtil
 */
class RegexUtil {
	/**
	 * Convert an array of strings to a regular expression by finding common
	 * prefixes
	 * 
	 * @param $array
	 *
	 * @see self::flattenRegexArray()
	 *
	 * @returns string regular expression enclosed in hash marks '#'
	 */
	public static function arrayToRegex(array $array) {
		sort($array);

		$regex = array();

		foreach ($array as $match) {

			$current =& $regex;

			for ($i = 0; $i < strlen($match); $i++) {
				$c = preg_quote($match[$i], '#');

				if (isset($current[$c])) {
					$current =& $current[$c];
				}
				else {
					$current[$c] = array();
					$current =& $current[$c];
				}
			}
		}

		return '#'. self::flattenRegexArray($regex) .'#';
	}

	/**
	 * Flatten an array tree representation of a regular expression into
	 * a regex string
	 *
	 * @param $regex array tree representation of a regular expression
	 *
	 * @see self::arrayToRegex()
	 *
	 * @returns string regular expression
	 */
	private static function flattenRegexArray(array $regex) {
		if (!$regex) {
			$re = '';
		}
		else if (count($regex) == 1) {
			reset($regex);
			$re = key($regex) . self::flattenRegexArray(current($regex));
		}
		else {
			$re = '(';
			foreach ($regex as $char => $nextRegex) {
				$re .= $char;

				if ($nextRegex)
					$re .= self::flattenRegexArray($nextRegex);

				$re .= '|';
			}
			$re = substr($re, 0, -1);
			$re .= ')';
		}

		return $re;
	}
}

Now I don't pretend that this is by any means the most efficient way of doing this, but it's something that's relatively simple and took only an hour or two to come up with.

Questions? Thoughts?