The Perils of XPath Expressions (Specifically, Escaping Quotes)

Posted on June 28th, 2007 by kushal


The other day, I was grappling with a particularly irritating problem with XPaths. I was using SelectSingleNode to dig some info out of an XML document.

The problem:

… was simple. Escaping a single/double quote in an XPath expression such as this:

string myXPathExpression =
    "books/book[@publisher = 'publisher name here']";

If the publisher name were to have an apostrophe in it (e.g. O'Reilly) I’d be in trouble.

Lazy Hack #1:

The simple, straightforward solution would be the following:

string myXPathExpression =
    "books/book[@publisher = \"O'Reilly\"]";

… i.e. enclose the string literal in double quotes instead of single quotes.
But of course, as is often the case, words like "simple" and "straightforward" are merely a replacement for words like "short-sighted".

The problem with that solution, of course, was this: what if that blasted publisher name had a double quote in it?
Would I go back to enclosing it in single quotes? What if it had both? What if I simply didn’t know, and I was building up the string like this:

string myXPathExpression =
    "books/book[@publisher = '" + publisherName + "']";

… assuming publisherName was a user-entered string I had no control over (which was, in fact, the case).

Lazy Hack #2:

I could, of course, wimp out and prevent the user from entering double or single quotes (or worse, both). I could even rationalise it by pretending this was really because I was thinking of the "bigger picture" and that fixing this issue wasn’t really worth the time and resources. But I decided not to. Mostly because it’s irritating enough listening to pseudo-managerial-cop-out-speak when it isn’t coming from me; I really didn’t need to add to it.

Lazy Hack #3 (the Wrong Solution):

My first thought was that I should replace single quotes with &apos; (or its numeric equivalent &#39;) and double quotes with &quot; (or &#34;), according to the XML 1.0 markup rules. That should have worked, right?

But apparently that isn’t the case, even though the guys at W3C recommend it.

It turns out that I didn’t need to escape any of the standard XML entities [1] in my XPath query at all (even though I positively do need to do this in my XML markup). An XPath engine does not expand entity references inside a string literal, so &apos; is compared as the six literal characters it is made of.

So not only is this a valid XPath expression:

string myXPathExpression =
    "tvshows/tvshow[@name = 'Starsky & Hutch']";
    //no need to use &amp; in place of ampersand.

… but also this would not return the result I would expect:

string myXPathExpression =
    "tvshows/tvshow[@name = 'Starsky &amp; Hutch']";
    // this will *not* return the tvshow node whose name attribute
    // is "Starsky & Hutch"

Solution:

It turned out the only solution was to use the concat function defined in the W3C XPath recommendation.

string myXPathExpression = "books/book[@publisher = " +
   "concat('Single', \"'\", 'quote. Double', '\"', 'quote.')]";
   //looks for a publisher called Single'quote. Double"quote.

i.e. break up my search string around single and double quotes, and concatenate all the bits using this concat function (it takes a variable number of string arguments) – thereby enclosing the single quotes in double quotes, and the double quotes in single quotes.

Pretty crazy, huh? BTW, this is true in .Net, Java [2], Mozilla’s implementation of XPath, as well as Internet Explorer’s. (In IE, you would be using the MSXML parser. More on this below).

So, since I was building up a string like this:

string myXPathExpression =
    "books/book[@publisher = '" + publisherNameHere + "']";

I had no alternative but to write a method that would generate the required concat function call for me. i.e.:

string myXPathExpression = "books/book" +
  "[@publisher = " + GenerateConcatForXPath(publisherNameHere) + "]";

Here is the method written in C#.

//you may want to use constants like HtmlTextWriter.SingleQuoteChar and
//HtmlTextWriter.DoubleQuoteChar instead of strings like "'" and "\""
private static string GenerateConcatForXPath(string a_xPathQueryString)
{
    string returnString = string.Empty;
    string searchString = a_xPathQueryString;
    char[] quoteChars = new char[] { '\'', '"' };
 
    int quotePos = searchString.IndexOfAny(quoteChars);
    if (quotePos == -1)
    {
        returnString = "'" + searchString + "'";
    }
    else
    {
        returnString = "concat(";
        while (quotePos != -1)
        {
            string subString = searchString.Substring(0, quotePos);
            returnString += "'" + subString + "', ";
            if (searchString.Substring(quotePos, 1) == "'")
            {
                returnString += "\"'\", ";
            }
            else
            {
                //must be a double quote
                returnString += "'\"', ";
            }
            searchString = searchString.Substring(quotePos + 1,
                             searchString.Length - quotePos - 1);
            quotePos = searchString.IndexOfAny(quoteChars);
        }
        returnString += "'" + searchString + "')";
    }
    return returnString;
}
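
For the browser-side implementations mentioned above, the same idea can be sketched in JavaScript. This is only an illustrative port, not code from the C# method: the function name generateConcatForXPath is reused for convenience, and the split-based approach is a shortcut chosen for this sketch.

```javascript
// Illustrative JavaScript port of GenerateConcatForXPath (an assumption,
// not code from the original C# method).
function generateConcatForXPath(text) {
  // Splitting on a capturing group keeps the quote characters as their
  // own array elements, e.g. "O'Reilly" -> ["O", "'", "Reilly"].
  const parts = text.split(/(['"])/);

  // No quotes at all: a plain single-quoted literal is enough.
  if (parts.length === 1) {
    return "'" + text + "'";
  }

  // A single quote goes inside double quotes; everything else
  // (including a lone double quote) goes inside single quotes.
  const args = parts.map((part) =>
    part === "'" ? "\"'\"" : "'" + part + "'"
  );
  return "concat(" + args.join(", ") + ")";
}

const publisherName = "Single'quote. Double\"quote.";
const xpath =
  "books/book[@publisher = " + generateConcatForXPath(publisherName) + "]";
console.log(xpath);
// books/book[@publisher = concat('Single', "'", 'quote. Double', '"', 'quote.')]
```

Like the C# version, this emits empty '' arguments when the value starts or ends with a quote, which concat accepts without complaint.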

The Exception (there’s always one):

Microsoft’s MSXML parser (the COM implementation, not the .Net one – and they are different) is still widely in use. It is mostly found in Visual Studio 6 based apps (like VB6), in apps that do client-side XML processing in IE, and in those glorified batch files written for Windows Scripting Host. Also, there are probably more than a few .Net apps using MSXML via the COM Interop Services.

This problem of escaping quotes exists for MSXML too of course, and the solution is the same – but only for MSXML4 and later. For versions 3 and before, you would have to escape single and double quotes with C-style backslashes.
This naturally also means that you would have to escape backslashes themselves with two backslashes – something you need to be aware of if you are porting your application from MSXML 1, 2 or 3 to anything later than that.
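
If you do have to target one of those older MSXML versions, the backslash rule described above could be sketched as follows. The helper name escapeForOldMsxml is hypothetical, and the sketch simply assumes the escaping behaviour stated in this post; check it against the MSXML version you actually use.

```javascript
// Hypothetical helper: applies the C-style escaping described above.
// It assumes the old MSXML behaviour as stated in the post, nothing more.
function escapeForOldMsxml(text) {
  return text
    // Double the backslashes first, so the backslashes added for
    // quotes in the next step are not doubled again.
    .replace(/\\/g, "\\\\")
    // Then put a backslash in front of every single or double quote.
    .replace(/(['"])/g, "\\$1");
}

// Builds an expression for a publisher name containing an apostrophe.
const legacyXPath =
  "books/book[@publisher = '" + escapeForOldMsxml("O'Reilly") + "']";
console.log(legacyXPath);
```

When porting to MSXML 4 or later, a helper like this would have to be swapped for the concat approach, since the backslashes would then be treated as ordinary characters.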

Sigh! Sometimes I miss the old XPath-free days when shoot’em ups were still innovative, they actually ran on two megabytes of RAM, and no-one had heard of Paris Hilton.

[1] Predefined XML entities: &amp;, &lt;, &gt;, &quot; and &apos;
[2] XPaths in Java: I tested it using Apache’s Xalan XSLT Processor, via the compile method, which of course adheres to Sun’s JAXP specification.

Posted in C#, Java, XML

The Null Coalescing Operator (Or how to make Default values sound frightening)

Posted on June 15th, 2007 by kushal

C#

C# 2.0 introduced a little-known and somewhat useful new operator called the Null Coalescing Operator.

It’s like the ternary conditional operator, except less powerful (but admittedly a little neater to look at).
Here’s an example of coaless... coolesc... that new feature:

//assuming formValue is of type string
string nickName = formValue ?? "Dr. Zoidberg";

… which is the same as this:

string nickName = 
        (formValue == null) ? "Dr. Zoidberg" : formValue;

It’s easiest to think of it as the ‘default’ operator, i.e.
nickName is being set to formValue, but with a default.

Note however, that if you try to change this code:

string nickName = 
        string.IsNullOrEmpty(formValue) ? "Dr. Zoidberg" : formValue;

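… to sprinkle some freshly-made coalescing goodness, you could be introducing a subtle bug. (Think empty string: ?? only falls back to the default when formValue is null, so an empty string would now slip through as the nickname.)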

SQL

I’ve never quite understood why people have to come up with the most intimidatory name possible for a simple feature.
Maybe the C# developers wanted to stress similarity with the ANSI SQL function which pretty much does the same thing:

SELECT COALESCE(@nickname, 'Dr. Zoidberg')

… in which case I can somewhat understand. After all, the SQL guys had to spend their time dealing mostly with simplistic sounding keywords like SELECT, CREATE, UPDATE etc … and some guy probably just snapped. Lawyers have their indictments, plaintiffs, subpoenas and what-not. Doctors regularly get to say words like haemoglobin, pericardium and streptokinase. So someone must have looked up the dictionary and come up with a random word.

Javascript

Interestingly enough, even though this feature isn’t supported by Java (as of Java 5), Javascript has long supported it. Of course Javascript really has nothing to do with Java. But it’s hard not to form an association in one’s head.
Anyway, here’s the equivalent in Javascript:

var nickName = (formValue || "Dr Zoidberg");

While on the topic of Javascript and null coalescence, beware though. Don’t get confused with this Javascript statement:

var returnValue = (myObject && myObject.myProperty);

…which is called the “Guard” operator, apparently. You would use this when you really want to return myObject.myProperty, but you aren’t sure whether myObject is null, and want to avoid a null pointer error [1]. Kinda hacky, I know.

If you’re wondering how come all this doesn’t conflict with Javascript’s implementation of the logical OR and AND operators, it’s because they don’t necessarily return booleans, and Javascript treats all objects, non-empty strings and non-zero numbers as true. So both the “guard” and “default” operators are really Javascript’s own peculiar implementation of logical AND and OR operators.
Javascript often strikes me as the Ferris Bueller of programming languages. Not always taken seriously, but still surprisingly inventive and most of all – very, very annoying.
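
To make the behaviour described above concrete, here is a small sketch. The variable names and the withDefault helper are made up for illustration; the point is that || and && hand back one of their operands rather than a boolean, which also means || replaces empty strings and zeros, not just null.

```javascript
// || returns the first truthy operand, or the last operand if none is truthy.
const fromNull = null || "Dr Zoidberg";   // the default is used
const fromEmpty = "" || "Dr Zoidberg";    // the default is used here too
const fromZero = 0 || 42;                 // and here

// Hypothetical helper that only substitutes the default for null or
// undefined, which is closer to what C#'s ?? does.
function withDefault(value, fallback) {
  return value === null || value === undefined ? fallback : value;
}
const keptEmpty = withDefault("", "Dr Zoidberg"); // stays ""

// && returns the first falsy operand, or the last operand if all are
// truthy, so the right-hand side is never evaluated when myObject is null.
const myObject = null;
const guarded = myObject && myObject.myProperty;   // null, no error

const present = { myProperty: "Bender" };
const found = present && present.myProperty;       // "Bender"
```

So the “default” idiom is only a safe stand-in for null coalescing when an empty string or a zero is never a legitimate value.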

[1] The specific error message varies from browser to browser. In IE this would show up as “myProperty is null or not an object”, in Mozilla-based browsers the error message would be “myObject has no properties” (which makes a little more sense, no?)

Posted in C#, Javascript
