This is a story of xml. Or rather Apple’s plist files, which happen to be in xml format. And me going down a very brief rabbit hole of figuring out why certain things are in xml files and how to find out. Spoiler: it’s all about the DTD, or document type definition.
Here’s a plist file. I’m using the Steam client’s in this case:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.valvesoftware.steamclean</string>
    <key>Program</key>
    <string>/Users/javorszky/Library/Application Support/Steam/SteamApps/steamclean</string>
    <key>ProgramArguments</key>
    <array>
        <string>/Users/javorszky/Library/Application Support/Steam/SteamApps/steamclean</string>
        <string>Public</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>SteamContentPaths</key>
    <array>
        <string>/Users/javorszky/Library/Application Support/Steam/SteamApps</string>
    </array>
    <key>ThrottleInterval</key>
    <integer>60</integer>
    <key>WatchPaths</key>
    <array>
        <string>/Users/javorszky/Library/Application Support/Steam/Steam.AppBundle/Steam</string>
    </array>
</dict>
</plist>
There’s a lone <true/> self closing element in there. Before I began I thought that xml documents were proper, and they always had nodes with text inside them, so a single self closing element with nothing around it or within it felt weird.
Enter its DTD, or document type definition. Here’s that document that’s linked from the DOCTYPE tag of the xml.
<!ENTITY % plistObject "(array | data | date | dict | real | integer | string | true | false )" >
<!ELEMENT plist %plistObject;>
<!ATTLIST plist version CDATA "1.0" >

<!-- Collections -->
<!ELEMENT array (%plistObject;)*>
<!ELEMENT dict (key, %plistObject;)*>
<!ELEMENT key (#PCDATA)>

<!--- Primitive types -->
<!ELEMENT string (#PCDATA)>
<!ELEMENT data (#PCDATA)> <!-- Contents interpreted as Base-64 encoded -->
<!ELEMENT date (#PCDATA)> <!-- Contents should conform to a subset of ISO 8601 (in particular, YYYY '-' MM '-' DD 'T' HH ':' MM ':' SS 'Z'. Smaller units may be omitted with a loss of precision) -->

<!-- Numerical primitives -->
<!ELEMENT true EMPTY> <!-- Boolean constant true -->
<!ELEMENT false EMPTY> <!-- Boolean constant false -->
<!ELEMENT real (#PCDATA)> <!-- Contents should represent a floating point number matching ("+" | "-")? d+ ("."d*)? ("E" ("+" | "-") d+)? where d is a digit 0-9. -->
<!ELEMENT integer (#PCDATA)> <!-- Contents should represent a (possibly signed) integer number in base 10 -->
This is the structure of the xml. Or rather the specific xml that I’m looking at, Apple’s flavour of xml that it uses as a plist, or property list. There’s a lot more information about plists on Apple’s sites, but I wanted to understand how this DTD works, so let’s do a bunch of passes and figure out what’s what.
Most of what I know about it came from the following links, so this article is a summary of them applied to the above DTD.
Parts of this DTD
Immediately I can see we have three keywords: ELEMENT, ENTITY, and ATTLIST.
After having read the above documents, I know that ELEMENT, ENTITY, and ATTLIST are the beginnings of a declaration. Each is followed by the name of the thing, and then its content.
<!ELEMENT true EMPTY>
This means that we’re declaring an element, where the tag name is true, and it does not have anything inside it. Essentially our <true/>. On the same line there’s this thing following it:
<!-- Boolean constant true -->
You might recognise that that looks exactly like a comment in HTML. It serves the same purpose here as well, that’s a comment to aid the humans reading this.
Okay, so what’s this?
<!ENTITY % plistObject "(array | data | date | dict | real | integer | string | true | false )" >
That one declares an entity. Think of it as a placeholder, or a variable. The name is plistObject, and the content is an array, or some data, or a date, or a dict, or a real, or an integer, or a string, or true, or false. It’s one, and only one, of those in the list. They don’t mean anything in and of themselves; they are a list of ELEMENT names, and they all need to be declared in the same DTD.
It took a while to figure out what the hell the % was doing there, because two out of the three references only showed them working, but they didn’t say anything about it. The “Understanding XML DTDs” article however had an example of both of these:
<!ENTITY name value>
<!ENTITY % name value>
And it explained that the first one is a general entity, the second one, due to the percent sign, is a parameter entity.
With all the above, we can expect plistObject to be reused in other places as well. We’ll get into that, but first we need to talk about the last keyword, ATTLIST.
That’s an attribute list. It attaches to an existing element and declares the attribute’s name, the type of data that can go there, and the default value, if any. Let’s look at the lone ATTLIST in this DTD:
<!ATTLIST plist version CDATA "1.0" >
plist has an attribute called version, which is character data (CDATA), and its default value is "1.0". If we look at the actual plist file from Steam, sure enough:
<plist version="1.0">
Elements in the DTD
The definition has three sections: collections, primitives, and numerical primitives. They are denoted with comments.
Collections first. In your everyday programming language, these would be your set, map, array, list, whatever: essentially a bag of many individual elements. Here we have three of them.
<!ELEMENT array (%plistObject;)*>
Let’s start with this one. It defines an element, that element is named array, and the value is (%plistObject;)*. Let’s break that down starting from the outside. The star, *, is an occurrence indicator, similar to a regex quantifier. It means zero or more of whatever precedes it. The parentheses, (), are for grouping elements, which means there’s more than one of them within those. But there’s only one, you say. Ah, but the thing inside, %plistObject;, is the parameter entity that you’ve read about. It roughly follows the structure of an HTML entity. The £ would be encoded as &pound;, except ours uses a percent sign (%) instead of an ampersand (&) to denote the type of entity being used. The semicolon is the end of the name of the entity.
%plistObject; gets expanded into its value, so these two are (or at least should be) equivalent:
<!ELEMENT array (%plistObject;)*>
<!ELEMENT array (array | data | date | dict | real | integer | string | true | false )*>
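The expansion can be mimicked with a few lines of Python. This is only a naive textual sketch, not how a real XML parser processes a DTD, and the names PARAMETER_ENTITIES and expand_parameter_entities are made up for the illustration. It shows that a literal substitution keeps the entity’s own parentheses, so the result carries one redundant pair. The content model it describes is the same.

```python
import re

# Parameter entities declared in the DTD, keyed by name.
# The value is copied verbatim from the plist DTD.
PARAMETER_ENTITIES = {
    "plistObject": "(array | data | date | dict | real | integer | string | true | false )",
}


def expand_parameter_entities(declaration: str) -> str:
    """Replace every %name; reference with the declared value of that entity."""
    return re.sub(
        r"%(\w+);",
        lambda match: PARAMETER_ENTITIES[match.group(1)],
        declaration,
    )


expanded = expand_parameter_entities("<!ELEMENT array (%plistObject;)*>")
print(expanded)
```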
To recap, the array element can have zero or more of any combination of an array (embedding itself), data, date, dict, real, integer, string, true, or false, all of which are elements themselves. For example:
<array>
    <string>something</string>
    <integer>-45</integer>
    <array>
        <string>hello world</string>
    </array>
</array>
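To see what a plist parser makes of that array, here is a small sketch using Python’s standard plistlib module. Python is my choice for the illustration, not something the DTD or Apple prescribes, and the array is wrapped in a plist root element so that it forms a complete document.

```python
import plistlib

# The array example from above, wrapped in a plist root element.
ARRAY_PLIST = b"""<plist version="1.0">
<array>
    <string>something</string>
    <integer>-45</integer>
    <array>
        <string>hello world</string>
    </array>
</array>
</plist>"""

# fmt is passed explicitly so no format detection is needed.
array_value = plistlib.loads(ARRAY_PLIST, fmt=plistlib.FMT_XML)
print(array_value)
```

The nested array comes back as a nested Python list, which mirrors the “embedding itself” part of the declaration.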
<!ELEMENT dict (key, %plistObject;)*>
Most of it is the same as in the array one, with one key difference. The grouping now contains key, %plistObject;, instead of just the object. The comma there means that those elements need to appear in that specific order. key here is another element.
To turn that into human text: the dict element is zero or more pairs of a key element followed by any of the elements that plistObject can be, which includes another dict. For example:
<dict>
    <key>RandomValue</key>
    <integer>42</integer>
    <key>ShouldThisExist</key>
    <false/>
</dict>
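Again as a sketch with Python’s plistlib (my choice of tool, assuming the dict is wrapped in a plist root element), the key and value pairs end up as an ordinary dictionary:

```python
import plistlib

# The dict example from above, wrapped in a plist root element.
DICT_PLIST = b"""<plist version="1.0">
<dict>
    <key>RandomValue</key>
    <integer>42</integer>
    <key>ShouldThisExist</key>
    <false/>
</dict>
</plist>"""

dict_value = plistlib.loads(DICT_PLIST, fmt=plistlib.FMT_XML)
print(dict_value)
```

Each key element becomes a dictionary key, and the element right after it becomes that key’s value.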
<!ELEMENT key (#PCDATA)>
Last in our collection elements is the key. Nothing new, except the content of the grouping: #PCDATA. It stands for parsed character data. Content inside can only be plain text, and special characters need to be represented by their character entities.
This is used in the dict element. Also note that %plistObject; does not have key as a possible element, which means this one can only be part of a dict.
See the example in the dict section.
<!ELEMENT string (#PCDATA)>
A string element with escaped plain text content; the contents can be arbitrary. For example:
<string>This ice cream is from Ben &amp; Jerry’s</string>
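The escaping matters. In this sketch (Python’s standard plistlib and xml.etree.ElementTree, both my choice for the illustration), the escaped version parses back to the original text, while the same string with a bare ampersand is not even well-formed xml:

```python
import plistlib
import xml.etree.ElementTree as ET

# The ampersand is written as the &amp; character entity.
escaped = (
    '<plist version="1.0">'
    "<string>This ice cream is from Ben &amp; Jerry’s</string>"
    "</plist>"
)
string_value = plistlib.loads(escaped.encode("utf-8"), fmt=plistlib.FMT_XML)
print(string_value)

# The same document with a bare ampersand is rejected by the xml parser.
unescaped = escaped.replace("&amp;", "&")
try:
    ET.fromstring(unescaped)
    unescaped_is_well_formed = True
except ET.ParseError:
    unescaped_is_well_formed = False
print(unescaped_is_well_formed)
```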
<!ELEMENT data (#PCDATA)> <!-- Contents interpreted as Base-64 encoded -->
A data element with escaped plain text content. This is where we need to separate the xml’s DTD from how Apple uses its plist files. From a structural, “is this valid xml?” point of view, any random, arbitrary content is valid as long as it’s escaped plain text. However, the parser that makes sense of the plist file expects the content to not only be escaped plain text, but also a valid base64 encoded value. I would expect the parser to throw an error even though the xml validator would probably have given a thumbs up for random content here.
For example, the text This ice cream is from Ben & Jerry’s would be:
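One way to produce that value is Python’s standard base64 module. The sketch below assumes the text is encoded as UTF-8 before the base64 step (the DTD does not say which text encoding to use), wraps the result in a data element, and checks that plistlib decodes it back to the same bytes.

```python
import base64
import plistlib

text = "This ice cream is from Ben & Jerry’s"

# Assumption: UTF-8 bytes are what gets base64 encoded.
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(f"<data>{encoded}</data>")

# Base64 output only uses A-Z, a-z, 0-9, +, / and =, so no xml escaping is needed.
data_plist = f'<plist version="1.0"><data>{encoded}</data></plist>'.encode("ascii")
data_value = plistlib.loads(data_plist, fmt=plistlib.FMT_XML)
print(data_value.decode("utf-8"))
```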
<!ELEMENT date (#PCDATA)> <!-- Contents should conform to a subset of ISO 8601 (in particular, YYYY '-' MM '-' DD 'T' HH ':' MM ':' SS 'Z'. Smaller units may be omitted with a loss of precision) -->
Same as above, but this time it’s named date, and the content is the one true date format: ISO 8601. For example:
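A date element in the format from the DTD comment could look like the one in this sketch. The timestamp is an arbitrary value picked for illustration, and Python’s plistlib is used only to show how one parser reads it.

```python
import datetime
import plistlib

# An arbitrary timestamp in the YYYY-MM-DDTHH:MM:SSZ form the DTD comment describes.
DATE_PLIST = b'<plist version="1.0"><date>2019-06-01T12:34:56Z</date></plist>'

date_value = plistlib.loads(DATE_PLIST, fmt=plistlib.FMT_XML)
print(date_value.isoformat())
```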
We got to the last group, let’s churn through these. I’m going to go from bottom up.
<!ELEMENT integer (#PCDATA)> <!-- Contents should represent a (possibly signed) integer number in base 10 -->
Similarly to the data element, the content can be valid xml and still be an invalid plist value. For example:
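The split between “valid xml” and “valid plist” can be sketched in Python. The integer content forty-two is a made-up value for the illustration. Note that xml.etree.ElementTree only checks that the document is well-formed and does not validate against the DTD, although the DTD would accept any text inside integer anyway. plistlib stands in for “the parser that makes sense of the plist”; Apple’s own tooling may report the problem differently.

```python
import plistlib
import xml.etree.ElementTree as ET

# Structurally fine: an integer element that contains text.
BAD_INTEGER = b'<plist version="1.0"><integer>forty-two</integer></plist>'

# The xml parser has no complaints.
root = ET.fromstring(BAD_INTEGER)
print(root[0].tag, root[0].text)

# The plist parser cannot turn the text into a number.
try:
    plistlib.loads(BAD_INTEGER, fmt=plistlib.FMT_XML)
    plist_error = None
except ValueError as error:
    plist_error = error
print(repr(plist_error))
```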
<!ELEMENT real (#PCDATA)> <!-- Contents should represent a floating point number matching ("+" | "-")? d+ ("."d*)? ("E" ("+" | "-") d+)? where d is a digit 0-9. -->
Similar to the date, but the restriction on the content from the plist side is different. For example:
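The grammar in the DTD comment translates almost one to one into a regular expression. This is a sketch of that grammar only; the names REAL_PATTERN and matches_dtd_real are made up here, and actual plist parsers may well accept more than the comment describes.

```python
import re

# ("+" | "-")? d+ ("."d*)? ("E" ("+" | "-") d+)?  where d is a digit 0-9
REAL_PATTERN = re.compile(r"[+-]?[0-9]+(\.[0-9]*)?(E[+-][0-9]+)?")


def matches_dtd_real(text: str) -> bool:
    """Return True if the whole text matches the grammar from the DTD comment."""
    return REAL_PATTERN.fullmatch(text) is not None


for candidate in ["3.14", "-45", "1.5E+10", ".5", "1e5"]:
    print(candidate, matches_dtd_real(candidate))
```

By this grammar, 3.14, -45 and 1.5E+10 pass, while .5 fails for lacking a leading digit and 1e5 fails because the exponent needs an uppercase E and an explicit sign.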
<!ELEMENT true EMPTY> <!-- Boolean constant true -->
<!ELEMENT false EMPTY> <!-- Boolean constant false -->
This is where we get the true and false tags. What does EMPTY mean though? It’s the same as the tags in HTML where a tag does not have content inside it, like <hr>. Occasionally also called a “self-closing element” in HTML. Here’s a more detailed explanation about empty elements from the O’Reilly definitive guide to html and xhtml. For example:
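In a plist, the two empty elements show up as values in a dict. The sketch below reuses the RunAtLoad and ShouldThisExist keys from the earlier examples and uses Python’s plistlib (my choice of parser) to show that they map to plain booleans.

```python
import plistlib

# Both empty elements used as dict values.
BOOLEAN_PLIST = b"""<plist version="1.0">
<dict>
    <key>RunAtLoad</key>
    <true/>
    <key>ShouldThisExist</key>
    <false/>
</dict>
</plist>"""

boolean_value = plistlib.loads(BOOLEAN_PLIST, fmt=plistlib.FMT_XML)
print(boolean_value)
```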
Per the DTD, the <true/> is a self closing, or empty, element that’s valid in the context of a plist xml.
Also decoding DTDs is kind of fun. Hope this was of use. If you wouldn’t mind to drop a like, subscribe, and follow for more webtech geekery, I’d sure appreciate it! 😅