The XHTML as text/html Mess

Note: I want this document to be as impartial as possible (except where explicitly noted). If you have any comments, additions, or questions, I strongly recommend that you mail them to me (ian@hixie.ch) and cc the www-talk mailing list (www-talk@w3.org). Thanks.

The problems

Several people have different (but similar) problems which all basically boil down to the same issue. The main goals that I recall are:

MathML

People want to send MathML embedded in HTML in such a way that suitably enabled browsers can view the equations correctly.

Mozilla is able to render MathML natively, and Windows IE has a plugin which allows it to render MathML. However, whereas Mozilla requires that the document be parsed as text/xml in order for it to recognise the MathML namespace and thus construct the right DOM, Windows IE requires that the document be parsed as text/html in order for it to construct its DOM.
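For illustration, the sort of markup in question embeds MathML under its own namespace inside an otherwise ordinary XHTML page. (This fragment is a made-up example, not taken from any particular site.)

  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>An equation</title>
    </head>
    <body>
      <p>The area of a circle is
        <math xmlns="http://www.w3.org/1998/Math/MathML">
          <mi>&#x3C0;</mi>
          <msup><mi>r</mi><mn>2</mn></msup>
        </math>.
      </p>
    </body>
  </html>

The point of contention is which MIME type such a document should be served with so that both of the browsers above do something useful with the math element.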

(Note that you cannot, while complying with the spirit of current W3C technologies, send XHTML containing non-XHTML namespaced content as text/html. XHTML documents containing any mention of namespaces other than the xmlns attribute on the root <html> element are, of course, invalid to start with, but even taking that into account, section 5.1 of XHTML 1.0 states that only documents that, by virtue of following Appendix C, are compatible with existing UAs may be sent as text/html. Documents containing namespaces are almost certainly not backwards compatible.)

Progress

People want to use XHTML (presumably because it is the "latest and greatest" version of HTML) without losing their target audience. The majority of deployed browsers do not support XHTML, but if the guidelines in Appendix C are followed, then valid XHTML is compatible with existing browsers and thus can be sent to them. The problem is that authors want modern browsers to still use their XML parser on these documents.

There is no normative reason why this would not be a valid thing to do. However, see my rebuttal below.
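For concreteness, here is a rough sketch of the kind of Appendix C style markup being talked about (a hypothetical document; the specification itself has the full list of guidelines):

  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
      <title>An Appendix C style page</title>
    </head>
    <body>
      <p>Empty elements are written with a space before the trailing
      slash, as in <br />, so that legacy tag-soup parsers skip over
      the stray "/" harmlessly.</p>
    </body>
  </html>

Such a document is well-formed XML and also renders acceptably in existing tag-soup browsers; the question is only which parser a modern browser should use on it.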

More Reliable CSS Styling

See this post.

Styling with XSL

To style a web page with XSL, the source has to be an XML document. The theory is that if XHTML documents are sent as text/html with both an XSL stylesheet and a CSS stylesheet, then non-XML browsers will be able to render the pages using the CSS, and browsers supporting XSL will be able to treat the source as XML and render the pages using the XSL.
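A rough sketch of that dual-stylesheet arrangement might look like this (the stylesheet names are hypothetical):

  <?xml-stylesheet type="text/xsl" href="page.xsl"?>
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Styled either way</title>
      <link rel="stylesheet" type="text/css" href="page.css" />
    </head>
    <body>
      <p>...</p>
    </body>
  </html>

A legacy browser ignores the processing instruction and applies the CSS from the link element, while a browser that treats the page as XML and supports XSL would apply the transformation instead. (Note that this already runs into Appendix C, which says to avoid processing instructions in documents served as text/html.)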

In practice, this is unlikely to prove a problem: first, XHTML is more easily styled using CSS than using XSLT and XSL-FO, and second, by the time any web browser supports XSL, XML support is likely to be on most desktops.

The real world

Content sent as text/html is not only HTML 4 but also Tag Soup. Tag Soup is a de facto standard that is only superficially related to SGML. UAs absolutely have to support Tag Soup or they will never gain enough market penetration for their standards support to matter.

The solutions

There are a couple of existing solutions.

  1. I have written a tool that will sniff for the browser version and send IE the MIME type text/html and other browsers text/xml. In fact, this very document is being processed by this script. Unfortunately, this requires access to the server.

  2. Don't use XHTML yet. If you need validation, do it on your side.

The proposals

Given a text/html entity, how might a suitably capable user-agent determine unambiguously and reliably that the author intended it to apply an XML parser to the contents?

People have proposed several ideas for making XML-aware UAs treat documents sent as text/html as though they had been sent as text/xml.

Note, however, that the only XHTML documents that are allowed to be sent as text/html according to the spec are those conforming to Appendix C, and documents conforming to Appendix C will look identical whether treated as text/html or text/xml.

  1. Sniff for an XHTML DOCTYPE. (The fragment after this list shows the kind of prelude that proposals 1-3 would have to detect.) Rebuttal: no algorithm has been proposed which could be executed fast enough (it is vital that page load times not be affected by sniffing for XHTML, since this feature has such a small target audience compared to general fast surfing) while not stumbling on valid HTML documents.

  2. Sniff for an XML declaration. Rebuttal: Appendix C says the XML declaration (like other PIs) should not be included in documents sent as text/html; it is also hard to parse correctly in a hurry.

  3. Sniff for an XHTML namespace declaration. Rebuttal: even harder to parse than an XHTML DOCTYPE.

  4. Only examine the file extension and treat .xhtml files as XML. Rebuttal: there are many files with .xhtml extensions that are plain old HTML.

  5. Magic Comment String. Rebuttal: it is "wrong".
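For reference, the prelude that proposals 1-3 would have to detect at the top of a text/html entity looks something like this (an illustrative fragment; any of the pieces may be absent, and comments or white space may appear between them, which is part of why fast, reliable sniffing is hard):

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml">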

The future (This is the explicitly biased section!)

Personally, I believe the way forward is for UAs to support XHTML (as some browsers are doing, WinIE being the major obstacle here), for these browsers to be distributed widely, and only then for XHTML to begin being used. There is no point running before we can crawl.

I'm still looking for a good reason to write websites in XHTML at the moment, given that the majority of web browsers don't grok XHTML. The only reason I was given (by Dan Connolly) is that it makes managing the content using XML tools easier... but it would be just as easy to convert the XML to tag soup or HTML before publishing it, so I'm not sure I understand that. And even then, having the content as XML for content management is one thing, but why does that require a minority of web browsers to treat the document as XML instead of tag soup? What's the advantage of doing that? Furthermore, if the person in control of the content is using XML tools and so on, they are almost certainly in control of the website as well, so why not do the content type munging on the server side instead of campaigning for UA authors to spend their already limited resources on implementing content type sniffing?

Further reading

This thread in www-talk discusses many of these issues.

Thanks

Arjun Ray contributed to this document.