Thursday, December 13, 2007

Splitting Hairs (Or Comma Separated Values) in Java ...

A recent bug in my Java SMTP client led me down the fun path of figuring out how to conditionally split a string of email addresses on commas in Java.

Since Sun now discourages StringTokenizer in new code (it's a legacy class kept around for compatibility), a regex-based String.split on a comma separated string normally does the trick:

Java:
recipientsArr = recipientsStr.split(",");

However, what if you only want to split on commas that are not inside single or double quotes? Hrmm... tricky.

For example, I want to split the following string to create 2 valid email addresses, not 3 invalid ones:

"Leon, J" <j@email.com>, "M" <j@email.com>

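To make the problem concrete, here is a small sketch of what the plain split does to that string. The class and method names (NaiveSplitDemo, naiveSplit) are made up for illustration.

```java
final class NaiveSplitDemo {
    // Hypothetical helper: splits on every comma, quoted or not.
    static String[] naiveSplit(String recipientsStr) {
        return recipientsStr.split(",");
    }

    public static void main(String[] args) {
        String recipientsStr = "\"Leon, J\" <j@email.com>, \"M\" <j@email.com>";
        String[] parts = naiveSplit(recipientsStr);
        // The comma inside "Leon, J" is treated as a separator too,
        // so we get 3 pieces instead of 2 addresses.
        System.out.println(parts.length + " pieces:");
        for (String part : parts) {
            System.out.println("[" + part + "]");
        }
    }
}
```

The first two pieces are the two halves of the quoted display name, and neither is a usable address on its own.
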
We'll need some fancy Regular Expression goodness. Trouble is, I am not that great with RegEx grammar. Luckily, Neal Ford is. Very good. In his tutorial about "Power" regexes, he has a great example of how to conditionally match a comma only if it's not inside quotes:
RegEx:
,(?=([^']*'[^']*')*(?![^']*'))
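
As far as I can tell, the trick is the lookahead: a comma only matches if it is followed by an even number of quotes (zero or more complete pairs, then no more quotes at all), which means the comma itself can't be sitting inside a quoted run. Below is my own annotated reading of the same pattern, written with Pattern.COMMENTS so the pieces can be commented inline. The class and constant names are invented for this sketch.

```java
import java.util.regex.Pattern;

final class QuoteAwareCommaPattern {
    // Same regex as above, spread out with Pattern.COMMENTS
    // (whitespace and #-comments inside the pattern are ignored).
    static final Pattern COMMA_OUTSIDE_SINGLE_QUOTES = Pattern.compile(
          ","                   + "   # a literal comma\n"
        + "(?="                 + "   # lookahead: checks what follows, consumes nothing\n"
        + "  ([^']*'[^']*')*"   + "   # zero or more complete pairs of quotes\n"
        + "  (?![^']*')"        + "   # after which no further quote may appear\n"
        + ")",
        Pattern.COMMENTS);

    public static void main(String[] args) {
        // The comma inside 'a, b' is skipped; the one after it splits.
        for (String piece : COMMA_OUTSIDE_SINGLE_QUOTES.split("'a, b', c")) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

Note that this reading assumes the quotes are balanced and never escaped inside a quoted run; the even/odd counting has no way to know about either case.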

So, all that's left is to change it to look for double quotes, and to backslash-escape those double quotes so the pattern compiles as a Java string literal.
Final Solution:
recipientsArr = recipientsStr.split( ",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))" );
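
For anyone who wants to try it, here is a minimal, self-contained sketch that runs the final split on the example string from above. The class name, the splitRecipients helper, and the trim() step are my additions for illustration (the split leaves the space after each separating comma attached to the next piece).

```java
final class RecipientSplitter {
    // The regex from the final solution: a comma followed by an even
    // number of double quotes, i.e. a comma outside any quoted run.
    private static final String COMMA_OUTSIDE_DOUBLE_QUOTES =
        ",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))";

    // Hypothetical helper: split, then trim the whitespace left around each piece.
    static String[] splitRecipients(String recipientsStr) {
        String[] recipientsArr = recipientsStr.split(COMMA_OUTSIDE_DOUBLE_QUOTES);
        for (int i = 0; i < recipientsArr.length; i++) {
            recipientsArr[i] = recipientsArr[i].trim();
        }
        return recipientsArr;
    }

    public static void main(String[] args) {
        String recipientsStr = "\"Leon, J\" <j@email.com>, \"M\" <j@email.com>";
        // Prints the two addresses, one per line.
        for (String recipient : splitRecipients(recipientsStr)) {
            System.out.println(recipient);
        }
    }
}
```

This is a sketch under the assumption that quotes are balanced and unescaped; an address list with a stray or backslash-escaped double quote would throw off the counting.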