One of the changes I made to my parser is how smileys are handled. Whereas the old parser had a character tokenizer which made writing the logic for parsing smileys tedious to say the least, I felt it would be worth it to attempt to build an auto regex builder script.
Basically I wanted to be able to convert this:
array(
':)' => 'happy',
':(' => 'sad',
':/' => 'dunno',
':@' => 'yelling',
...
Into:
/(\:(\(|\)|\*|/|@|D|O|P)|;\)|\>\:(\(|D|\[)|B\)|O\:\))/
Since we're dealing with a set of tiny string which, most of which with similar prefixes, my original parsing logic looked a lot like a tree, or state machine if you will. Seeing as a regular expression is basically just a fancy hard to read state machine, I figured it would be a pretty easy way to represent my smileys.
So here's what we ended up with:
Order smiley array so that all prefixes are in order
Create a recursive array definition of the smileys
Collapse the recursive array back into a regex
The intermediate representation looks like so:
array (
'\\:' =>
array (
'\\(' =>
array (),
'\\)' =>
array (),
'\\*' =>
array (),
),
';' =>
array (
'\\)' =>
array (),
),
...
You can see the tree structure taking place. Note the extra backslashes though, these are for escaping all PCRE special characters.
And now, for your enjoyment, the full algorithm.
/** @class RegexUtil
*/
class RegexUtil {
/**
* Convert an array of strings to a regular expression by finding common
* prefixes
*
* @param $array
*
* @see self::flattenRegexArray()
*
* @returns string regular expression enclosed in hash marks '#'
*/
public static function arrayToRegex(array $array) {
sort($array);
$regex = array();
foreach ($array as $match) {
$current =& $regex;
for ($i = 0; $i < strlen($match); $i++) {
$c = preg_quote($match[$i], '#');
if (isset($current[$c])) {
$current =& $current[$c];
}
else {
$current[$c] = array();
$current =& $current[$c];
}
}
}
return '#'. self::flattenRegexArray($regex) .'#';
}
/**
* Flatten an array tree representation of a regular expression into
* a regex string
*
* @param $regex array tree representation of a regular expression
*
* @see self::arrayToRegex()
*
* @returns string regular expression
*/
private static function flattenRegexArray(array $regex) {
if (!$regex) {
$re = '';
}
else if (count($regex) == 1) {
reset($regex);
$re = key($regex) . self::flattenRegexArray(current($regex));
}
else {
$re = '(';
foreach ($regex as $char => $nextRegex) {
$re .= $char;
if ($nextRegex)
$re .= self::flattenRegexArray($nextRegex);
$re .= '|';
}
$re = substr($re, 0, -1);
$re .= ')';
}
return $re;
}
}
Now I don't pretend that this is by any means the most efficient way of doing this, but it's something that's relatively simple and took only an hour or two to come up with.
Questions? Thoughts?




discount ugg boots calf height boot
features genuine twin-face sheepskin. All bootsugg classic cardy
ugg kids
the Classic Collection feature a soft foam insoleugg boots sale
covered with genuine sheepskin and flexible outsole designed for refreshing comfort with every step. With a reinforced heel, raw seams, and signature heelugg boots cheap
label are all aplenty of theugg boots bailey button
precision craftsmanship of discount ugg boots . The discount ugg boots is built to be durable yet comfortable, rugged yet soft, and classic yet stylish. This boots, a signature UGG silhouette, features genuine Twin face sheepskin for incredible comfort.ugg boots knit
The discount ugg bootsugg boots kids
is a ust have for any guy's wardrobe. This boot contains materials that are not waterproof. Experience the tension of the dayugg boots classic tall
release when you slip your feet into a pair of cheap discount ugg boots . The very original boots were disbursed by an Australianugg boots classic cardy
manufacturing company named UGG. These boots were released to the public known as Australian Sheepskin Boots.ugg classic tall
UGG also makes sheepskin slippers and other footwear.