The Perils of XPath Expressions (Specifically, Escaping Quotes)

Posted on June 28th, 2007 by kushal


The other day, I was grappling with a particularly irritating problem with XPaths. I was using SelectSingleNode to dig some info out of an XML document.

The problem:

… was simple. Escaping a single/double quote in an XPath expression such as this:

string myXPathExpression =
    "books/book[@publisher = 'publisher name here']";

If the publisher name were to have an apostrophe in it (e.g. O' Reilly) I’d be in trouble.

Lazy Hack #1:

The simple, straightforward solution would be the following:

string myXPathExpression =
    "books/book[@publisher = \"O'Reilly\"]";

… i.e. enclose the PredicateExpr in double quotes instead of single quotes.
But of course as is often the case, words like "simple" and "straightforward" are merely a replacement for words like "short-sighted".

The problem with that solution of course, was what if that blasted publisher name had a double quote in it?
Would I go back to enclosing it in single quotes? What if it had both? What if I simply didn’t know, and I was building up the string like this:

string myXPathExpression =
    "books/book[@publisher = '" + publisherName + "']";

.. assuming publisherName was a user-entered string I had no control over. (which was in fact, the case)

Lazy Hack #2:

I could of course, wimp out and prevent the user from entering double or single quotes (or worse, both). I could even rationalise it by pretending this was really because I was thinking of the "bigger picture" and that resources and time aren’t really worth fixing this issue. But I decided not to. Mostly because its irritating enough listening to pseudo-managerial-cop-out-speak when it isn’t coming from me; I really didn’t need to add to it.

Wrong Solution Lazy Hack #3:

My first thought was that I should replace single quotes with ' (or its hex equivalent ') and double quotes with " (or ") according to the XML 1.0 markup rules. That should have worked right?

But apparently that isnt the case. Even though the guys at W3C recommend it.

It turns out that I didn’t need to escape any of the standard XML entities1 in my XPath query at all. (Even though I positively do need to do this in my XML markup)

So not only is this a valid XPath expression:

string myXPathExpression =
    "tvshows/tvshow[@name = 'Starsky & Hutch']";
    //no need to use & in place of ampersand.

… but also this would not return the result I would expect:

string myXPathExpression =
    "tvshows/tvshow[@name = 'Starsky & Hutch']";
    // this will *not* return the tvshow node with an attribute
    //called "Starsy & Hutch"

Solution:

It turned out the only solution was to use the concat function defined in the W3C XPath recommendation.

string myXPathExpression = "books/book[@publisher = " +
   "concat('Single', "'", 'quote. Double', '"', 'quote.')]";
   //looks for a publisher called Single'quote. Double"quote

i.e. break up my search string around single and double quotes, and concatenate all the bits using this concat function (it takes a variable number of string arguments) – thereby enclosing the single quotes in double quotes, and the double quotes in single quotes.

Pretty crazy, huh? BTW, this is true in .Net, Java2, Mozilla’s implementation of XPaths, as well as Internet Explorer’s. (In IE, you would be using the MSXML parser. More on this below).

So, since I was building up a string like this:

string myXPathExpression =
    "books/book[@publisher = '" + publisherNameHere + "']";

I had no alternative but to write a method that would generate the required concat function call for me. i.e.:

string myXPathExpression = "books/book" +
  "[@publisher = " + GenerateConcatForXPath(publisherNameHere) + "]";

Here is the method written in C#.

GenerateConcatForXPath
//you may want to use constants like HtmlTextWriter.SingleQuoteChar and
//HtmlTextWriter.DoubleQuoteChar intead of strings like "'" and "\""
private static string GenerateConcatForXPath(string a_xPathQueryString)
{
    string returnString = string.Empty;
    string searchString = a_xPathQueryString;
    char[] quoteChars = new char[] { '\'', '"' };
 
    int quotePos = searchString.IndexOfAny(quoteChars);
    if (quotePos == -1)
    {
        returnString = "'" + searchString + "'";
    }
    else
    {
        returnString = "concat(";
        while (quotePos != -1)
        {
            string subString = searchString.Substring(0, quotePos);
            returnString += "'" + subString + "', ";
            if (searchString.Substring(quotePos, 1) == "'")
            {
                returnString += "\"'\", ";
            }
            else
            {
                //must be a double quote
                returnString += "'\"', ";
            }
            searchString = searchString.Substring(quotePos + 1,
                             searchString.Length - quotePos - 1);
            quotePos = searchString.IndexOfAny(quoteChars);
        }
        returnString += "'" + searchString + "')";
    }
    return returnString;
}

The Exception (there’s always one):

Microsoft’s MSXML parser (the COM implementation, not the .Net one – and they are different) is still widely in use. Mostly in Visual Studio 6 based apps (like VB6), on apps with client-side XML processing done on IE, and those glorified batch files written in Windows Scripting Host. Also, there are probably more than a few .Net apps using MSXML via the COM Interop Services.

This problem of escaping quotes exists for MSXML too of course, and the solution is the same – but only for MSXML4 and later. For versions 3 and before, you would have to escape single and double quotes with C-style backslashes.
This naturally also means that you would have to escape backslashes themselves with two backslashes – something you need to be aware of if you are porting your application from MSXML 1, 2 or 3 to anything later than that.

Sigh! Sometimes I miss the old XPath-free days when shoot’em ups were still innovative, they actually ran on two megabytes of RAM, and no-one had heard of Paris Hilton.

1 Predefined XML Entities: &, <, >, " and '
2 XPaths in Java: I tested it using Apache’s Xalan XSLT Processor. And using the compile method which of course adheres to Sun’s JAXP specification.

Posted in C#, Java, XML | 27 Comments »

Archives

Categories

Blogroll