The Perils of XPath Expressions (Specifically, Escaping Quotes)

Posted on June 28th, 2007 by kushal


The other day, I was grappling with a particularly irritating problem with XPaths. I was using SelectSingleNode to dig some info out of an XML document.

The problem:

… was simple. Escaping a single/double quote in an XPath expression such as this:

string myXPathExpression =
    "books/book[@publisher = 'publisher name here']";

If the publisher name were to have an apostrophe in it (e.g. O'Reilly), I’d be in trouble.

Lazy Hack #1:

The simple, straightforward solution would be the following:

string myXPathExpression =
    "books/book[@publisher = \"O'Reilly\"]";

… i.e. enclose the string literal inside the predicate in double quotes instead of single quotes.
But of course as is often the case, words like "simple" and "straightforward" are merely a replacement for words like "short-sighted".

The problem with that solution, of course, was this: what if that blasted publisher name had a double quote in it?
Would I go back to enclosing it in single quotes? What if it had both? What if I simply didn’t know, and I was building up the string like this:

string myXPathExpression =
    "books/book[@publisher = '" + publisherName + "']";

… assuming publisherName was a user-entered string I had no control over (which was, in fact, the case).
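To see the failure concretely, here is a minimal sketch of what happens when the expression is built by plain concatenation. The sample XML, the class name NaiveXPathDemo and the helper CanQuery are all invented for illustration; only SelectSingleNode and XPathException come from the .NET framework.

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class NaiveXPathDemo
{
    // Hypothetical helper: builds the predicate by plain string
    // concatenation, as in the snippet above, and reports whether the
    // resulting expression could be evaluated at all.
    public static bool CanQuery(string publisherName)
    {
        XmlDocument doc = new XmlDocument();
        // Sample document made up for this sketch.
        doc.LoadXml("<books><book publisher=\"O'Reilly\" /></books>");

        string xpath = "books/book[@publisher = '" + publisherName + "']";
        try
        {
            doc.SelectSingleNode(xpath);
            return true;
        }
        catch (XPathException)
        {
            // An apostrophe in the name closed the string literal early,
            // leaving an expression the XPath parser rejects.
            return false;
        }
    }
}
```

With this sketch, CanQuery("Wrox") returns true, while CanQuery("O'Reilly") returns false because the expression is no longer valid XPath.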

Lazy Hack #2:

I could, of course, wimp out and prevent the user from entering double or single quotes (or worse, both). I could even rationalise it by pretending this was really because I was thinking of the "bigger picture" and that this issue wasn’t really worth the time and resources to fix. But I decided not to. Mostly because it’s irritating enough listening to pseudo-managerial-cop-out-speak when it isn’t coming from me; I really didn’t need to add to it.

Lazy Hack #3 (the wrong solution):

My first thought was that I should replace single quotes with &apos; (or its hex equivalent &#x27;) and double quotes with &quot; (or &#x22;) according to the XML 1.0 markup rules. That should have worked, right?

But apparently that isn’t the case, even though the guys at W3C recommend it.

It turns out that I didn’t need to escape any of the standard XML entities [1] in my XPath query at all (even though I positively do need to do this in my XML markup).

So not only is this a valid XPath expression:

string myXPathExpression =
    "tvshows/tvshow[@name = 'Starsky & Hutch']";
    //no need to use &amp; in place of ampersand.

… but also this would not return the result I would expect:

string myXPathExpression =
    "tvshows/tvshow[@name = 'Starsky &amp; Hutch']";
    // this will *not* return the tvshow node whose name attribute
    // is "Starsky & Hutch"
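The difference between the two expressions above can be checked with a small sketch like the following. The sample document and the names EntityDemo and Matches are assumptions made for illustration. Note that the XML markup itself still needs &amp;, because it is parsed as XML; the XPath string is not.

```csharp
using System.Xml;

public static class EntityDemo
{
    // Sample document made up for this sketch. The attribute value is
    // written with &amp; because this string is XML markup; once parsed,
    // the value held in memory is "Starsky & Hutch".
    private const string SampleXml =
        "<tvshows><tvshow name=\"Starsky &amp; Hutch\" /></tvshows>";

    // Hypothetical helper: returns true when the XPath selects a node.
    public static bool Matches(string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc.SelectSingleNode(xpath) != null;
    }
}
```

Here Matches("tvshows/tvshow[@name = 'Starsky & Hutch']") returns true, whereas Matches("tvshows/tvshow[@name = 'Starsky &amp; Hutch']") returns false, since the XPath engine compares against the literal five characters "&amp;".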

Solution:

It turned out the only solution was to use the concat function defined in the W3C XPath recommendation.

string myXPathExpression = "books/book[@publisher = " +
   "concat('Single', \"'\", 'quote. Double', '\"', 'quote.')]";
   //looks for a publisher called Single'quote. Double"quote.

i.e. break up my search string around single and double quotes, and concatenate all the bits using this concat function (it takes a variable number of string arguments) – thereby enclosing the single quotes in double quotes, and the double quotes in single quotes.
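As a quick check of the concat approach, the sketch below runs that expression against a sample document. The document contents and the names ConcatDemo and FindPublisher are invented for this example.

```csharp
using System.Xml;

public static class ConcatDemo
{
    // Hypothetical helper: returns the publisher attribute of the matched
    // book, or null when the expression selects nothing.
    public static string FindPublisher()
    {
        XmlDocument doc = new XmlDocument();
        // Sample document made up for this sketch. In markup the double
        // quote inside the attribute must be written as &quot;.
        doc.LoadXml(
            "<books><book publisher=\"Single'quote. Double&quot;quote.\" /></books>");

        // The apostrophe travels inside a double-quoted XPath literal and
        // the double quote inside a single-quoted one.
        string xpath = "books/book[@publisher = " +
            "concat('Single', \"'\", 'quote. Double', '\"', 'quote.')]";

        XmlNode node = doc.SelectSingleNode(xpath);
        return node == null ? null : node.Attributes["publisher"].Value;
    }
}
```

FindPublisher() returns the string Single'quote. Double"quote. because concat reassembles the search text inside the XPath engine, so neither kind of quote ever has to be escaped.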

Pretty crazy, huh? BTW, this is true in .NET, Java [2], Mozilla’s implementation of XPath, as well as Internet Explorer’s. (In IE, you would be using the MSXML parser. More on this below.)

So, since I was building up a string like this:

string myXPathExpression =
    "books/book[@publisher = '" + publisherNameHere + "']";

I had no alternative but to write a method that would generate the required concat function call for me. i.e.:

string myXPathExpression = "books/book" +
  "[@publisher = " + GenerateConcatForXPath(publisherNameHere) + "]";

Here is the method written in C#.

GenerateConcatForXPath
//you may want to use constants like HtmlTextWriter.SingleQuoteChar and
//HtmlTextWriter.DoubleQuoteChar instead of strings like "'" and "\""
private static string GenerateConcatForXPath(string a_xPathQueryString)
{
    string returnString = string.Empty;
    string searchString = a_xPathQueryString;
    char[] quoteChars = new char[] { '\'', '"' };
 
    int quotePos = searchString.IndexOfAny(quoteChars);
    if (quotePos == -1)
    {
        returnString = "'" + searchString + "'";
    }
    else
    {
        returnString = "concat(";
        while (quotePos != -1)
        {
            string subString = searchString.Substring(0, quotePos);
            returnString += "'" + subString + "', ";
            if (searchString.Substring(quotePos, 1) == "'")
            {
                returnString += "\"'\", ";
            }
            else
            {
                //must be a double quote
                returnString += "'\"', ";
            }
            searchString = searchString.Substring(quotePos + 1,
                             searchString.Length - quotePos - 1);
            quotePos = searchString.IndexOfAny(quoteChars);
        }
        returnString += "'" + searchString + "')";
    }
    return returnString;
}
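The method above always emits a concat call as soon as any quote is present, and it can produce empty '' arguments (for instance when the string ends with a quote). That is harmless, but a more compact variant is possible. The following is only a sketch under my own naming (XPathLiteral and ToXPathLiteral are hypothetical), not a replacement endorsed by any library: it uses a plain literal when only one kind of quote occurs and falls back to concat only when both do.

```csharp
using System.Collections.Generic;

public static class XPathLiteral
{
    // Hypothetical variant of GenerateConcatForXPath.
    public static string ToXPathLiteral(string value)
    {
        // No apostrophe: a single-quoted literal is enough.
        if (!value.Contains("'"))
        {
            return "'" + value + "'";
        }
        // Apostrophes but no double quotes: use a double-quoted literal.
        if (!value.Contains("\""))
        {
            return "\"" + value + "\"";
        }

        // Both kinds present: split on apostrophes. Each piece is free of
        // apostrophes, so it can sit in single quotes; the apostrophes
        // themselves go in as "'" between the pieces.
        List<string> parts = new List<string>();
        string[] pieces = value.Split('\'');
        for (int i = 0; i < pieces.Length; i++)
        {
            if (i > 0)
            {
                parts.Add("\"'\"");
            }
            if (pieces[i].Length > 0)
            {
                parts.Add("'" + pieces[i] + "'"); // skip empty pieces
            }
        }
        return "concat(" + string.Join(", ", parts.ToArray()) + ")";
    }
}
```

For O'Reilly this returns "O'Reilly" (in double quotes) where the original method returns concat('O', "'", 'Reilly'); for Single'quote. Double"quote. it returns concat('Single', "'", 'quote. Double"quote.'). Both forms select the same nodes, so the choice is purely about the size of the generated expression.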

The Exception (there’s always one):

Microsoft’s MSXML parser (the COM implementation, not the .NET one; they are different) is still widely in use, mostly in Visual Studio 6 based apps (like VB6), in apps that do client-side XML processing in IE, and in those glorified batch files written for Windows Script Host. There are probably also more than a few .NET apps using MSXML via COM interop.

This problem of escaping quotes exists for MSXML too of course, and the solution is the same – but only for MSXML4 and later. For versions 3 and before, you would have to escape single and double quotes with C-style backslashes.
This naturally also means that you would have to escape backslashes themselves with two backslashes – something you need to be aware of if you are porting your application from MSXML 1, 2 or 3 to anything later than that.

Sigh! Sometimes I miss the old XPath-free days when shoot’em ups were still innovative, they actually ran on two megabytes of RAM, and no-one had heard of Paris Hilton.

[1] Predefined XML Entities: &amp;, &lt;, &gt;, &quot; and &apos;
[2] XPaths in Java: I tested this using Apache’s Xalan XSLT processor, via the compile method, which of course adheres to Sun’s JAXP specification.

22 Responses

  1. Bruce Walters Says:

    Wonderful analysis of dealing with apostrophes in XPath. I grappled with this for several hours before finding your page. It seems crazy that there isn’t an easier solution but I tried many things and was running out of ideas. Thanks for the research.

  2. kushal Says:

    Glad it came in useful, Bruce.
    & thanks for the comment.

  3. Jim Says:

    Thank you, thank you, thank you, thank you, thank you, thank you, thank you – you saved my sanity.

    Apostrophes have always been a curse in every aspect of programming, I can’t believe the XPath spec couldn’t have just allowed '??!

  4. kushal Says:

    Hi Jim,
    You’re welcome :)
    I guess they couldn’t just allow single quotes because SGML, HTML, XML all had already established the convention of allowing either single or double quotes as delimiters.
    Incidentally, I came across this post by Joel Spolsky which suggests a convention for making encoding-related problems easier to catch. I’ve never implemented this (yet), but I think it’s a very good suggestion all the same.
    Kushal

  5. mark Says:

    This is brilliant, thanks. There are so many other sites out there that have got incorrect solutions to this problem. This has really helped me out, nice work.

  6. kushal Says:

    Hi Mark,
    Glad it could be of help.
    Kushal

  7. James Hutchison Says:

    That’s excellent!

  8. andre Says:

    again, thank you for writing this nice and clear summary of this extremely confusing issue …
    this page needs to show up at the very top of google’s results for “xpath escaping quotes”, so that developers hitting this problem in the future don’t waste so much time on it (and can instead focus on stuff like an innovative re-invention of the shoot’em up genre ;-) ).

  9. kushal Says:

    Hi Andre,
    I’m glad you found it useful. :)
    Kushal

  10. Igs Says:

    Awesome!!! I spent a few hours pulling my hair out trying to figure how to do this in .NET. Very clearly explained with several options that I thought of myself, and neat code.

    Thanks

  11. Ryan Says:

    Thanks for the comprehensive article on a tricky subject – it gave me a great starting point.

    I’ve adapted your GenerateConcatForXPath function as it’s using concat when not necessary (e.g. when the string contains either " or ' but not both) and generates strings with more arguments to concat than are needed.

    See this for more details

    http://stackoverflow.com/questions/642125/encoding-xpath-expressions-with-both-single-and-double-quotes

  12. Bob Deng Says:

    This helps a lot for me, Thanks

  13. Tarun Says:

    Can’t stop but being thankful to u..

  14. Iain Dooley Says:

    Great write up, thanks so much for posting. I’ve added a solution in PHP to my blog:

    http://workingsoftware.com.au/page/Escaping_single_and_double_quotes_in_XPath_queries_in_PHP

  15. kushal Says:

    Hey Iain, Glad it was useful for you :-)
    Kushal

  16. Stephen Gross Says:

    Thanks! Works like a charm!

  17. XML Search Issue - Bizzteams Says:

    [...] Haven't figured out the magic string yet. But supposedly xpath gets ugly when quotes are involved kushalm.com Reply With [...]

  18. Mila Says:

    Thanks! used your solution for my webdriver scripts and it worked!

  19. Programowanie w PHP » Blog Archive » Working Software Blog: Escaping single and double quotes for use with XPath queries in PHP Says:

    […] I’ve been working with the Basecamp API to plugin our IRC bot that we use for time tracking and I’m astounded to learn that escaping single and/or double quotes for XPath queries in PHP does not have a well documented, best practices solution. In fact, it seems as though this is not peculiar to PHP. I took a look around and found this excellent article by “Kushal”: http://kushalm.com/the-perils-of-xpath-expressions-specifically-escaping-quotes. […]

  20. Encoding XPath Expressions with both single and double quotes | Ask & Answers Says:

    […] Adds more arguments to the Concat operation than is necessary e.g. would return //review[@name=concat('Fred', "'", 's ', '"', 'Fancy Pizza', '"', '')] […]

  21. Jay Says:

    C# string extension method

    public static string EncodeTextWrappedIn(this string text, char endQuote)
    {
        var replaceTemplate = "E+DED+E";
        char encapsulatingQuote = (endQuote == '"') ? '\'' : '"';
        replaceTemplate = replaceTemplate.Replace('D', encapsulatingQuote);
        replaceTemplate = replaceTemplate.Replace('E', endQuote);
        return text.Replace(endQuote.ToString(), replaceTemplate);
    }

  22. Steve Says:

    Excellent solution, still very valid all these years later. Thanks

Posted in C#, Java, XML | 22 Comments »