Wednesday, November 14, 2007

The Joys of Regex

I had written a service as part of an application involving an old mainframe system. The mainframe would send me text that I was to turn into a PDF and save to a database.

Shortly after the app went live, I started getting transaction failures on my end. My logs showed "FAILURE: Can't show character with Unicode value U+FFFD".

Since the data is stored in a varbinary field, which accepts any bytes, I was sure the error was being raised while I was writing the PDF rather than while saving it.

After talking to the mainframe guys, I found that they would sometimes send me strange Unicode characters such as �.
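
For anyone chasing a similar error, here is a minimal sketch of how U+FFFD (the Unicode replacement character) can end up in a string and how to check for it. The class and method names are hypothetical, and the UTF-8 decoding step is only an assumption for illustration; I don't know how the mainframe text was actually encoded or decoded.

```csharp
using System;
using System.Text;

public static class ReplacementCharDemo
{
    // Hypothetical helper: true when the text contains U+FFFD.
    public static bool ContainsReplacementChar(string text)
    {
        return text.IndexOf('\uFFFD') >= 0;
    }

    public static void Main()
    {
        // 0xFF is never valid in UTF-8, so the default decoder
        // substitutes U+FFFD for it instead of throwing.
        byte[] raw = { 0x41, 0xFF, 0x42 };
        string decoded = Encoding.UTF8.GetString(raw);

        Console.WriteLine(ContainsReplacementChar(decoded)); // True
        Console.WriteLine(ContainsReplacementChar("AB"));    // False
    }
}
```

A check like this only tells you the bad character is there; it says nothing about what the sender originally meant to send.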

I asked for whichever list would be easier to provide, characters denied or characters allowed, and was sent:
~!@#$%^&*()_+`1234567890-=QWERTYUIOP{}|ASDFGHJKL:"ZXCVBNM<>?qwertyuiop[]\asdfghjkl;'zxcvbnm,./

Well, this looked like a job for regular expressions.
Now, I am no regex expert. Having only used regexes for simple pattern matching, I approached this the same way: match runs of the allowed characters and concatenate the matching strings with spaces.

After hammering away I came up with:

public static string SanatizeInput(string file)
{
    // Requires: using System.Text.RegularExpressions;
    // Runs of allowed characters. \\\\ is a literal backslash, which is on the allowed list.
    Regex reg = new Regex("[\\r\\n\\sA-Za-z0-9~!@#$%^&*()_+`\\-={} |:\"<>?[\\]\\\\;',\\./]*", RegexOptions.Compiled);
    // "." does not match "\n", so each match of ".*" is one line of the input.
    Regex line = new Regex(".*");
    Match lineMatch = line.Match(file);
    string output = string.Empty;
    while (lineMatch.Success)
    {
        string cleanLine = string.Empty;
        Match m = reg.Match(lineMatch.Value);
        while (m.Success)
        {
            foreach (Capture capture in m.Captures)
            {
                // The trailing "*" also produces empty matches; skip those.
                if (!string.IsNullOrEmpty(capture.Value))
                    cleanLine += capture.Value + " ";
            }
            m = m.NextMatch();
        }
        if (!string.IsNullOrEmpty(cleanLine))
            output += cleanLine + "\n";
        lineMatch = lineMatch.NextMatch();
    }
    return output;
}


This grabbed each line, looped through all the matches within it, joined them with spaces, and put the newlines back in.
In the end it worked.

But I was unhappy. It was far too verbose, and just seemed wrong.
My first hint was that a leading ^ inside the brackets negates a character class.
My second was using the regex Replace method as opposed to concatenating matches.

The better solution:
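public static string SanatizeInput(string file)
{
    // [^...] matches any single character that is NOT on the allowed list.
    // \\\\ is a literal backslash, which is on the allowed list.
    return new Regex("[^A-Za-z0-9~!@#$%\\^&*()_+`\\-={} |:\"<>?[\\]\\\\;',./\\s]").Replace(file, " ");
}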
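
To see the replace approach in action, here is a small self-contained sketch. It is a variation rather than the exact code above: the pattern is written as a verbatim string so the backslashes don't need doubling, the backslash itself is listed explicitly (`\\` inside the class) since it appears on the allowed list, and the Regex is built once and reused. The class name `SanitizeDemo` and the field name `Disallowed` are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SanitizeDemo
{
    // Negated character class: matches any single character that is
    // not on the allowed list. In a verbatim string, "" is a literal quote.
    private static readonly Regex Disallowed = new Regex(
        @"[^A-Za-z0-9~!@#$%\^&*()_+`\-={} |:""<>?[\]\\;',./\s]",
        RegexOptions.Compiled);

    public static string SanatizeInput(string file)
    {
        // Each disallowed character is swapped for a single space.
        return Disallowed.Replace(file, " ");
    }

    public static void Main()
    {
        // U+FFFD is not on the allowed list, so it becomes a space.
        Console.WriteLine(SanatizeInput("Total: 42\uFFFD units")); // Total: 42  units
        // Allowed punctuation, including the backslash, passes through.
        Console.WriteLine(SanatizeInput(@"C:\temp [ok]"));         // C:\temp [ok]
    }
}
```

Keeping the Regex in a static field avoids rebuilding it on every call, which only starts to matter if the method is called a lot.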