XML Is A Quarter Century Old

For those of us who have spent entire careers working with structured data, it comes as something of a surprise to be reminded that XML is now 25 years old. You probably missed XML 1.0 becoming a W3C Recommendation on the 10th of February 1998, but it’s almost certain that XML has touched your life in many ways even if you remain unaware of it.

The idea of one strictly compliant universal markup language to rule them all was a compelling one in an era when the Internet was becoming the standard means of interchanging information, and when the walled gardens dating back to the mini- and mainframe era were being replaced with open, standards-based interchange. In the electronic publishing industry, it allowed encyclopedia- and dictionary-sized data sets to be defined to a standard format and easily exchanged. At a much smaller level, it promised a standard way to structure more mundane transactions. Technologies behind acronyms and initialisms such as WAP, SOAP, and XHTML were designed to revolutionize the Web of the 21st century, but chances are those are familiar only to the more grizzled developers.

In practice the one-size-fits-all approach of XML left it unwieldy, giving the likes of JSON and HTML4 the opening to become the standards we used. That’s not to say XML isn’t hiding in plain sight, though: it’s the container for the SVG graphics format, for one. Go on — tell us where else XML can be found, in the comments!

So, XML. When used to standardise large structured datasets it can sometimes be enough to bring the most hardened of developers to tears, but it remains far better than what went before. When hammered to fit into lightweight protocols though, it’s a pain in the backside and is best forgotten. It’s 25 years old, and here to stay!

Header: [Jh20], GFDL v1.2.

58 thoughts on “XML Is A Quarter Century Old”

    1. The overhead is almost completely solved by zipping it, which is common both for HTTP transfers and for formats that save as XML, but unfortunately we’ve never had an “.xmlz” format as a thing that APIs and editors open.
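
      A quick sketch of why zipping helps so much, using Python’s standard gzip module on a made-up, repetitive payload (the repeated element names are exactly what deflate eats for breakfast):

      import gzip

      # A made-up, repetitive XML payload: the tag names repeat thousands of times.
      xml = b"<readings>" + b"".join(
          b"<reading><sensor>42</sensor><value>3.14</value></reading>"
          for _ in range(1000)
      ) + b"</readings>"

      compressed = gzip.compress(xml)
      print(len(xml), len(compressed))  # the gzipped form is a small fraction of the original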

      1. When the solution to an issue involves throwing a lot of cycles/memory at removing something, then more resources at the other end in putting it back in only to require the software to wade through it to find the data again… well, it just sort of gives the impression that many uses of XML might have been less-than-well-thought-out. :-)

        Kidding aside, the XML approach makes a lot of sense when left in the context of the SGML roots that spawned it. My (and many, many others’) personal hell began when the “structured” format was applied in areas where it only served to vastly increase data bloat and created almost comical complexity (I’m looking at you, GML). Add to that the incredible increase in processor, storage, and memory load that happens when you take a lot of numeric data and force it to be serialized and de-serialized to text (and then compressed/decompressed to save storage space and communication bandwidth). I’m talking the multi-megabyte-XSD-type stuff as well as simpler formats that require high-volume transactions. Just because we could, we didn’t stop to think whether we should. Oh, yeah… Namespace Hell. Human-readable and human-comprehensible can be two very different things.

        Even in smaller roles like config-files, XML adds a level of verbosity that rarely seems to help clarity or comprehension (your mileage may vary, and probably is a lot better than mine).

        While it may not be warranted, I’ve ended up viewing XML with the same contempt formerly reserved for the DoD Ada Mandate.

        1. A lot of grief could have been spared if people hadn’t been so married to Vi and ancient consoles.

          HTML and XML should simply have allowed length-prefixed blobs; without the need for base64, binary data would not have been a problem. Vi & co. and the old fossils from whose cold dead hands they would have to be torn were the problem, and unfortunately they’re not dead yet in this regard.

          1. Often it’s not even old timers insisting on these tools, but youngsters who associate these tools with being a “real programmer”.

            Currently folks are on a crusade to make everything fit into 80 characters, indentation and all. I once saw a stock grafana config file with a math formula broken into multiple lines to make it fit into 80 characters.

    2. Kind of funny for our profession how much we hate typing, hence all the complaints about verbosity and too many parentheses. It’s why the Unix-haters book had so much fun with naming conventions for UNIX commands and them sounding like a stomach disorder.

      1. That book was awesome! :D
        And depending on how you look at it, it makes perfect sense.
        Which is kind of both amusing and confusing.

        Speaking of.. The technical term for the phobia of long words is..
        “Hippopotomonstrosesquippedaliophobia” – just kidding.
        It’s “Sesquipedalophobia”. Much better, isn’t it? ;)

  1. I think this is where the biggest misunderstanding of the last 25 years in computing lies.

    XML is a document description language (actually a metalanguage). As such it is passable (bar a few botches, like the entity craziness, which ranges from “unwieldy but harmless escaping mechanism” to “go out on the internet and fetch this document snippet through FTP” (srsly)).

    What’s great is the “ecosystem” of several structure validation languages (DTD, RelaxNG), access and transformation languages (XPath, XSLT) and universal namespaces for things (catalogs).

    As a data description language it sucks rocks. It has been misused for that in horrible ways (you mention SOAP, a typical monster emerging from Microsoft’s dark underbelly).

    It’s this misunderstanding which leads to that idiocy “JSON is better than XML”. No. JSON is a data representation language. Try to encode a document with markup in JSON: you’ll end up re-implementing some weird form of XML.
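
    To make the point concrete, here is a made-up snippet (not from any real schema): a sentence with inline markup as XML, and the ad-hoc node structure you end up inventing once you try to say the same thing in JSON.

    # XML mixed content: text and inline elements interleave naturally.
    xml_doc = '<p>The <em>quick</em> fox jumps <a href="https://example.com">here</a>.</p>'

    # The same sentence in JSON forces an explicit {tag, attrs, children} tree --
    # in other words, a homegrown re-implementation of the XML infoset.
    json_doc = {
        "tag": "p",
        "children": [
            "The ",
            {"tag": "em", "children": ["quick"]},
            " fox jumps ",
            {"tag": "a", "attrs": {"href": "https://example.com"}, "children": ["here"]},
            ".",
        ],
    }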

    And don’t get me started with those calling a JSON-serialised data structure a “document”. I’ll apply Hanlon’s razor there.

    Hammer and screwdriver. You can drive a nail into a wall with a screwdriver (provided it’s big enough). But only we software people get away with pulling such a stunt without being laughed at. Such is our trade’s state, sadly.

    There. That’s a pet peeve :)

    1. +1

      Let’s also not forget that the 90s were a different time.
      Way more diverse and experimental. Sometimes more reasonable.

      For example, Windows 9x didn’t support Unicode yet.
      (The unicows library was merely a 1:1 wrapper.)

      Text files, source code files and also websites were still
      being written using Windows codepages, DOS codepages (CP437 etc)
      or were using “HTML entities”.

      Does anybody remember them?

      For just about any non-American letter, they were the only safe bet.
      Unicode support was still flaky; browsers mainly used the system’s codepage as a reference.
      If you visited foreign websites, you had to change your browser’s encoding setting by hand.

      So for our beloved ß (sharp s), we Germans had to write &szlig;
      And &auml; for ä (ae), &ouml; for ö (oe), &uuml; for ü (ue)..

      This, of course, was quite handy sometimes.
      Using plain ASCII names was sometimes easier to remember.
      What do you mean by Scharf-S? What’s an Eszett?!

      “Oh, that’s the awesome &szlig; sign again! Yeah! Sure I do remember it..”

      :D

      1. Actually, web pages were written in Latin-1, or ISO-8859-1.

        Some perverts used the MS Windows code pages or IBM’s code pages.

        XML is great for verifying data structures in advanced documents. It is a great source format to translate to other formats. It is easy to use in programs, and to verify that a program works as it is supposed to, since its output can be validated. And the documents that verify or translate XML-formatted documents are themselves XML, like XML Schemas.

        And XML uses the UTF-8 encoding, so we can all use any character used by humans in the same document, without entity encoding like in HTML written in its traditional Latin-1 encoding.

        1. “Actually, web pages were written in Latin-1, or ISO-8859-1.”

          Well, yes and no. The sticking point is that Latin-1 and the other ISO charsets are standards that may not be directly available in the operating system.

          Windows used Windows-1252 to display such encoded documents.
          Mac OS 8/9 tried to do the same with MacRoman.

          Also, there were multiple versions.
          ISO-8859-15/Latin-9 rivaled ISO-8859-1/Latin-1 in some way or another..

          And then there was the problem with the Euro sign.
          The HTML entity &euro; always worked, if the underlying OS had the necessary fonts.

          Speaking of, there was an update released by Microsoft which contained new font files for Windows 3.1x and Windows 95, I believe. So users of IE3 or IE5 on those old systems weren’t left out.

          “Some perverts used the MS Windows code pages or IBM’s code pages.”

          The IBM codepages were often used by DOS and DOS-based applications.

          Also, there was the Arachne web browser. A variation was used in those TV set-top boxes for internet access. DOS Latin-1 equaled CP850, the successor to CP437 in Europe (introduced soon after MS-DOS 3.x)..

          The BIOS in an x86 PC used to use CP437 only, too.
          It was an early extension of 7-bit ASCII (the eighth bit was originally a parity bit), albeit not the only one.

          Oh, and I don’t like your language, by the way. So please stop being so disrespectful, that’s not a great achievement, anyway.
          It only makes you look immature. I hope you get this. Thanks.

          “And XML uses the UTF-8 encoding, so we can all use any character used by humans in the same document, without entity encoding like in HTML written in its traditional Latin-1 encoding.”

          Normally, it’s UTF-8, yes, but the encoding must or should be specified at the beginning of the XML ‘document’.

          Also, Windows NT historically used the predecessor of UTF-16, UCS-2..
          Considering NT’s historical value, using UTF-8 isn’t an ideal choice, either.

          The best thing about UTF-8 is perhaps its compatibility with 7-bit ASCII. However, it would have been wiser if it had also included CP437, not just the first 128 ASCII characters. Because ASCII is very US-centric, sadly. It’s in the name, even.

          https://en.wikipedia.org/wiki/Code_page_437

          1. You don’t get it, do you?

            Standards don’t need to be the format used inside programs.
            Files being transferred should use standards, like Latin-1 or UTF-8. But locally you can use whatever format you want, whatever the OS uses.

            That is why we have standards: to have common formats that we can transfer to and from different computers, OSes and applications…

            Yes, I have used the web since 1992. Coded web pages and used web browsers. I used CGI and Python scripts back in 1995.

            The problem with the Euro sign was solved with ISO-8859-15, or even better with ISO-10646/Unicode.
            This is why one should never use anything but Unicode in new standards/applications.

            It is Latin-1 that is the base of Unicode, not ASCII or ISO-646.
            When we encode Unicode compactly, we use UTF-8, which is based on ASCII, because that is a 7-bit character set.

            MS code pages are not real standards. The ISO ones are real standards, and so are UTF-8 and UTF-16.

  2. I was writing technical documents in HPTag in 1990, so when HTML and XML appeared I was right at home. Roots in SGML dating back to the ’70s.
    The problem with XML is that people think it can be used to describe things. It can’t, because it doesn’t itself describe the semantics of elements – despite people using words like “grammar” for XSDs etc. I’d better stop here before I start using words like “ontology”.

    1. > The problem with XML is that people think it can be used to describe things. It can’t because it doesn’t itself describe the semantics of elements – despite people using words like “grammar” for XSDs etc.

      Not sure what you meant by this exactly, but subsequent online searches regarding it did improve my understanding about XML so thanks!

      1. An XML document can be well-formed without an XML Schema (such as an XSD), or validated, where there is an XML Schema to check it against.

        XML isn’t that hard either. Writing schemas can be a bit more complicated, but just well-formed is easy.

        And yes, you can write XML for all structured data. You can’t do that as easily with most of the other popular formats used instead of XML.
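
        A minimal sketch of the two levels of checking, assuming the third-party lxml package and a made-up document/schema pair:

        from lxml import etree

        # Well-formedness: parsing either succeeds or raises XMLSyntaxError.
        doc = etree.fromstring(b"<config><port>8080</port></config>")

        # Validation: additionally check the document against a (hypothetical) XSD.
        schema = etree.XMLSchema(etree.fromstring(b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
          <xs:element name="config">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="port" type="xs:integer"/>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:schema>"""))
        print(schema.validate(doc))  # True: well-formed *and* valid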

  3. DITA, a markup language for technical documents, is basically a variety of XML. DITA looks a bit like HTML with its support for tables, bulleted lists and so on, but is much more precise. I spend a lot of time editing it for work, and running it through an XML syntax checker is a valuable sanity check. All the manuals for dynabook’s laptops are written in DITA, and if you’ve used one in a language other than English, it’s passed through my grubby little hands on the way.

    https://en.wikipedia.org/wiki/Darwin_Information_Typing_Architecture

  4. Let’s bury it and dance upon its grave. I remember the ‘buzz’ around it – XML was the solution to everything. All your data interchange woes – gone. So much time wasted on this crap. Remember BizTalk?

    JSON and protobufs are much much better. Granted they came quite a bit late, but better late than never.

    1. Protobufs blow. Their promised magical “versioning” tolerance is bunk, so their inscrutability isn’t worth it. Not to mention the need to compile for every language you use, assuming that there’s a compiler. Last time I looked, there wasn’t even a native Kotlin one. WTF? Protobufs are Google’s own invention, and Kotlin may as well be.

  5. i initially had a real dismissive attitude towards xml but now i appreciate it for a couple things…

    the first is, i met someone who was just a little older than me but had grown up in a non-structured-data environment (literally “punchcards”), and i happened to be there watching when he was exposed to the idea of xml and i realized there actually is an audience (and a vast institutional process) that is going to learn something new from xml.

    the second is, there are libraries that really do make it pretty convenient to make braindead xml parsers, especially in environments like android java. i generally feel like a lot of these libraries are overwrought, especially when simple parsers aren’t really that hard to write. but it really is convenient to just parse xml with about a hundred lines of java code, knowing details like escaping and UTF and optional termination tags are handled correctly even though i didn’t explicitly consider them.
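
    for what it’s worth, the same convenience exists in python’s standard library; a tiny sketch with a made-up snippet, where ElementTree quietly handles the entity and encoding details:

    import xml.etree.ElementTree as ET

    # a made-up snippet: character references and &amp; are decoded for us
    root = ET.fromstring('<note lang="de"><body>Gr&#252;&#223;e &amp; best wishes</body></note>')
    print(root.get("lang"))        # de
    print(root.find("body").text)  # Grüße & best wishes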

    something i’ve come to resent is that it’s useful enough that it’s been used. android programming is the experience of using an extremely verbose XML-based ui description language combined with an extremely verbose java-based programming language. both of them are perfectly usable and i even like them but it’s an enormous amount of pointless typing. or, to use the correct term, “cut and pasting”.

  6. Four things that killed XML:

    Stupid use of namespaces in XSLT rules, making them unreadable.

    No support for XSLT 2 in the MS world.

    Handling of whitespace characters.

    Dropping support for local files in browsers, which makes XML docs useless there.

    1. Apache Xerces/Xalan, being Java-based, ran just fine on most platforms, including Windows.

      IBM had an XSLT2/XQuery optimizing compiler, which Websphere users could get as XTQHP. Unfortunately Websphere made the Liberty reimplementation an all-hands priority, shelving XML standards-related work.

      I believe at least one other implementation of the newer spec exists, also in Java.

      I periodically think about trying to recreate XTQHP’s concepts in an open-source form, either as an extension of or a replacement for Xalan-J. But given that Xalan-J is currently not far from being declared an inactive project and moved to the Attic, it’s unclear an audience exists to justify the effort of swapping all of this back into wetware. (Xalan-C is definitely being put in the Attic; it was a very early port from the Java code, and toward the end I was having trouble remembering how that branch worked well enough to debug it.)

  7. Only 25? I thought it was younger. I remember when I first heard about it at university; the prof was enthusiastic about it. He told us that even though the format is long-winded, when compressed it would be shorter than a compressed binary format. I am still not sure whether I should believe that or not. Since then I have implemented compression for an XML protocol that transforms its lists of decimal numbers into a binary representation, because efficient storage turned out to be important in some cases. If only XML had a memory- and CPU-efficient binary storage version I would like it even more.
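
    Out of curiosity, here is a back-of-the-envelope sketch (not a proper benchmark) comparing a list of numbers packed as raw binary against the same list as XML, both gzipped:

    import gzip, random, struct
    import xml.etree.ElementTree as ET

    random.seed(0)
    values = [random.randint(0, 2**31 - 1) for _ in range(10_000)]

    binary = struct.pack(f"{len(values)}i", *values)  # 4 bytes per value

    root = ET.Element("values")                       # the same list, verbosely
    for v in values:
        ET.SubElement(root, "v").text = str(v)
    xml_bytes = ET.tostring(root)

    for name, blob in (("binary", binary), ("xml", xml_bytes)):
        print(name, len(blob), "->", len(gzip.compress(blob)))
    # For random numbers, gzipped XML usually stays larger than gzipped binary;
    # repetitive real-world data can narrow the gap considerably.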

    1. EXI exists, but hasn’t seen wide adoption yet. As far as compressed XML somehow being “more compact” than raw uncompressed binary data is concerned: it can be theoretically, but computation time and memory consumption becomes a constraint.

  8. I’m an old guy, and I hate XML. Writing a parser from scratch is a complete pain.
    I suggest we go back to the DOS .INI file structure. :-)
    Easy to read, easy to make changes, easy to add.

    keyword1=value
    keyword212=value
    keyword 8=value
    …….

    Search for your keyword, find the “=” and you are done.
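
    That really is the whole parser; a sketch in Python, assuming the plain key=value form above with no sections or escaping:

    def read_ini_value(path, keyword):
        # Scan line by line, split on the first "=", and you are done.
        with open(path) as f:
            for line in f:
                key, sep, value = line.partition("=")
                if sep and key.strip() == keyword:
                    return value.strip()
        return None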

    It’s interesting that commenters muse about the old days, but don’t go back much further than Windows XP (which still had some .ini files!).

      1. The registry is a giant monolithic database of binary data. As a side note: INI files wouldn’t work for webpage rendering. They aren’t structured enough, plus there has never been any kind of formal INI spec. The closest thing we have now is TOML, but that still likely wouldn’t work well as a markup language.

    1. INI and YAML and JSON are just an unstructured mess, without enough structure to unambiguously encode what you really want.

      And if you don’t want to have an XML Schema, you can still check the syntax of the data file, and it is easy for programs to read. That is well-formed data. And proper use of attributes and elements (arguments and tags) will make it compact.

      With a schema, which isn’t that hard to write, you can easily configure an editor that always generates correct XML files for that format.

      And yes, an XML Schema is easy to write, and with proper tools it is easy to read and to generate a tree data structure from, for the program to extract the configuration. And writing a data structure out to an XML file is even easier.

      And if one doesn’t want to do that by hand, there are some small and easy-to-use libraries for XML.
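
      A sketch of that last point with the standard-library ElementTree (the field names here are made up, not any particular format):

      import xml.etree.ElementTree as ET

      settings = {"host": "10.0.0.5", "port": "502", "poll_ms": "250"}  # hypothetical values

      root = ET.Element("config")
      for key, value in settings.items():
          ET.SubElement(root, key).text = value
      ET.ElementTree(root).write("config.xml", encoding="utf-8", xml_declaration=True)

      # Reading it back is just as short; a missing tag simply falls back to a default.
      loaded = {child.tag: child.text for child in ET.parse("config.xml").getroot()}
      timeout = loaded.get("timeout_ms", "1000")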

  9. Many people don’t realize that XML is actually a subset of SGML, which is a markup language designed by IBM and standardized by ISO. Back in the HTML4 days, the W3C was pushing to make HTML fully compliant with SGML, but this effort was ultimately abandoned when HTML5 was developed independently and adopted. Part of me says that it would have been nice if XHTML had really taken off.

  10. The D-Bus interface description and security policy languages are XML. Not that I’ve needed to use the first yet, I’m lucky enough to use a binding where I can just define stuff in the code.

  11. We still use XML here for our control system database and config files. We never found a need for schema files and all the malarkey they tacked onto the format… Just plain ol’ XML files with tags and some attributes. Simple. Easy to use, easy to read, easy to modify with a text editor if needed. As it is plain ASCII, it is portable across OS platforms and applications. If we need another field it was/is easy to add. If a tag (or attribute) is not present when you read it in, use a default value (so older databases could be loaded, no problem). I really can’t say anything bad about XML. Verbose, yes, but so what. Made it easy to maintain. Disk space is ‘cheap’ so size is no big deal.

    As I typed this I just remembered we currently use it to transfer energy information in XML format to other entities (companies, utilities, government) as well. So yeah, still well used.

    Note that we ‘really’ liked it for the control system applications, as you could use completely different data structures in code or re-arrange fields in the data structures, make them different sizes (like go from 16-bit to 32-bit fields), and the database could be read in as usual. Before we used XML we would generate a ‘binary’ file from all the ASCII database files (our own format) which had to match the program data structures for the system to use, along with a ‘reference table’ to tell the application where to ‘hook up’ each data structure in memory. By going to XML we eliminated that step at the ‘cost’ of having to parse the XML files at start-up, so a slightly longer initialization time before the control system was ‘on-line’. Anyway …

    In my current job, I usually use JSON instead of XML as it maps well onto the lists and dictionaries common in Python and C#. I use JSON a lot for small database applications. Editing JSON files by hand is a bit more tedious, but not bad if necessary. Again, any text editor will do. In Python of course you just use the dump() and load() functions, which take care of the details, so you usually never have to touch the files.

  12. As one of those who participated in XML’s development, my biggest frustration is that the proponents of the “anti-ml” forms insisted on a gratuitous break from all the work we had invested. If folks had been willing to work _with_ XML folks rather than insisting on reinventing everything from scratch, there was a lot of tooling that could have been trivially made available to the new syntaxes.

    It would have been absolutely trivial to map JSON to the XML Infoset, and immediately gain the ability to run XSLT/XPath/XQuery (to take one of my projects) with input, output, or both going directly from and to JSON. It would have been a trifle more difficult to map the stylesheet language itself due to its use of XML Namespaces, but that would not be an impossible migration. Instead we had to watch folks try to reinvent the query language concept, repeatedly, in competition with each other, and with deliberate refusal to consider that there might be something to learn — and manpower that could be tapped — in what had already been done.
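
    Roughly what such a mapping looks like in miniature (a toy sketch, not the Infoset mapping a standards group would have specified):

    import json
    import xml.etree.ElementTree as ET

    def json_to_xml(name, value):
        # Hang a JSON value onto an element tree so XPath/XSLT-style tooling can see it.
        elem = ET.Element(name)
        if isinstance(value, dict):
            for k, v in value.items():
                elem.append(json_to_xml(k, v))
        elif isinstance(value, list):
            for v in value:
                elem.append(json_to_xml("item", v))
        else:
            elem.text = str(value)
        return elem

    doc = json_to_xml("root", json.loads('{"order": {"id": 42, "items": [{"sku": "A1"}]}}'))
    print(doc.find("./order/items/item/sku").text)  # A1 -- ordinary path-style access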

    I understand why folks decided XML was too complicated for them, but this was a serious missed bet, even if folks used it only as transitional tools.

    Sigh. Oh well. Nature of the profession. See XKCD re “standards”.

  13. I spend my days working on XML files. It’s the standard export format for a content management system my company uses a lot. Publishing books involves exporting them to XML, doing an XSL-FO transformation on them and then rendering them as PDF.
    There’s a lot I appreciate about XML: the tags make it self-documenting to an extent. The tooling is excellent (XSLT, XPath, XSL-FO), which makes it easy to exchange data with other systems.
    Its verbosity makes for large files, but it also meant I could figure out what data is in these files just by reading them.
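
    The pipeline in miniature, assuming lxml and a toy stylesheet (nothing like the real XSL-FO one):

    from lxml import etree

    xslt = etree.XSLT(etree.fromstring(b"""<xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/book">
        <output><xsl:value-of select="chapter/title"/></output>
      </xsl:template>
    </xsl:stylesheet>"""))

    result = xslt(etree.fromstring(b"<book><chapter><title>Intro</title></chapter></book>"))
    print(str(result))  # the serialized result, containing <output>Intro</output>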

  14. I remember reading the XML documentation on w3.org in the early 2000s, and finding design goal no 10 for XML strangely poetic. It has stuck with me as a life motto of sorts:

    Terseness is of minimal importance.

    :D

    (OK, now when I checked the documentation on w3.org as I write this comment I see that it actually reads:
    “10. Terseness in XML markup is of minimal importance.”
    Maybe not as poetic…)

  15. For a while there were folks saying XML, and DITA in particular, was getting long in the tooth. They punted because it was too complex for programmers to contribute to the techdoc (pity them). They punted to Markdown and reST under pressure from the programmer types who were not experts in techdoc and the teething issues of the field. Although they were (past tense) simpler formats for writing doc, they are not algorithmically usable, not like DOM models. Now we’re finding that XML (and DITA in particular) is the IDEAL format for the emerging world of knowledge-managed documents. Not only does it support omni-channel publishing, but combined with advanced semantic technologies (taxonomy, ontologies, and knowledge graphs), it has become unbeatable. That’s because when combined, the result is truly a self-describing content corpus. It’s not XML or DITA per se alone, it’s any format that uses a document object model (DOM), which includes various XML dialects including DITA and JSON. Plus, the newer component content management systems (cCMS), content delivery platforms (CDPs), and XML authoring tools make creating, managing, reusing, and repurposing content easier and simpler than using even Markdown or reST. Go figure.
