Apache Daffodil | Daffodil and the DFDL Infoset

Daffodil is an implementation of DFDL that supports multiple representations of the DFDL Infoset, including several XML representations and JSON. These representations differ somewhat from the DFDL Infoset itself, because Daffodil approximates the Infoset using a subset of XML and JSON features. The tables below describe how Daffodil maps the DFDL Infoset to each supported representation.

JDOM

Document Information Item	org.jdom.Document
root	org.jdom.Element getRootElement()
dfdlVersion	not yet implemented
schema (reserved for future use)	not yet implemented
unicodeByteOrderMark	not yet implemented
Element Information Item	org.jdom.Element
namespace	org.jdom.Namespace getNamespace()
name	String getName()
document	org.jdom.Document getDocument()
datatype	not yet implemented
dataValue	For simple types other than xs:string, the canonical XML representation of the value as returned by String getText(). See XML Illegal Characters for xs:string types containing XML illegal characters.
nilled	The "nil" attribute in the "xsi" namespace (xsi:nil="true").
children	java.util.List<Element> getChildren()
parent	org.jdom.Parent getParent()
schema	not yet implemented
valid	not yet implemented
unionMemberSchema	not yet implemented
"No Value"	An org.jdom.Element with no children (not even Text nodes) is the representation of an element with "no value".
Augmented Infoset	not yet implemented

W3C DOM

Document Information Item	org.w3c.dom.Document
root	org.w3c.dom.Node getFirstChild()
dfdlVersion	not yet implemented
schema (reserved for future use)	not yet implemented
unicodeByteOrderMark	not yet implemented
Element Information Item	org.w3c.dom.Element
namespace	String getNamespaceURI()
name	String getNodeName() if getNamespaceURI() == null, String getLocalName() otherwise
document	org.w3c.dom.Document getOwnerDocument()
datatype	not yet implemented
dataValue	For simple types other than xs:string, the canonical XML representation of the value as returned by String getWholeText(). See XML Illegal Characters for xs:string types containing XML illegal characters.
nilled	The "nil" attribute in the "xsi" namespace (xsi:nil="true").
children	org.w3c.dom.NodeList getChildNodes()
parent	org.w3c.dom.Node getParentNode()
schema	not yet implemented
valid	not yet implemented
unionMemberSchema	not yet implemented
"No Value"	An org.w3c.dom.Element with no children (not even Text nodes) is the representation of an element with "no value".
Augmented Infoset	not yet implemented
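
As a rough illustration of the W3C DOM rows above, the following sketch reads the name, data value, nilled state, and "no value" state from a DOM tree using only JDK classes. The class name DomInfosetSketch, its helper methods, and the urn:example namespace in the usage note are hypothetical and not part of Daffodil. The sketch assumes the document has no comment or processing instruction before the root element, so that getFirstChild() returns the root as the table describes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative only.
final class DomInfosetSketch {
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    // Parse XML text into a namespace-aware DOM document.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // root: first child of the document (assumes no leading comment or PI).
    static Element root(Document doc) {
        return (Element) doc.getFirstChild();
    }

    // name: getNodeName() when there is no namespace, getLocalName() otherwise.
    static String name(Element e) {
        return e.getNamespaceURI() == null ? e.getNodeName() : e.getLocalName();
    }

    // dataValue: whole text of the first Text child, or null if there is none.
    static String dataValue(Element e) {
        Node first = e.getFirstChild();
        return (first instanceof Text) ? ((Text) first).getWholeText() : null;
    }

    // nilled: the xsi:nil attribute is present and set to "true".
    static boolean isNilled(Element e) {
        return "true".equals(e.getAttributeNS(XSI, "nil"));
    }

    // "No Value": an element with no children, not even Text nodes.
    static boolean hasNoValue(Element e) {
        return !e.hasChildNodes();
    }
}
```

For example, given a root element rec in the urn:example namespace with a child n containing 42, name(root(doc)) yields "rec" and dataValue applied to the n element yields "42".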

Scala XML

Document Information Item	The document is represented by the root element. There is no separate document item.
root	root element of the node
dfdlVersion	not yet implemented
schema (reserved for future use)	not yet implemented
unicodeByteOrderMark	not yet implemented
Element Information Item	scala.xml.Elem
namespace	def namespace: String
name	def name: String
document	not supported
datatype	not yet implemented
dataValue	For simple types other than xs:string, the canonical XML representation of the value as returned by def text: String. See XML Illegal Characters for xs:string types containing XML illegal characters.
nilled	The "nil" attribute in the "xsi" namespace (xsi:nil="true").
children	def child: Node*
parent	not supported
schema	not yet implemented
valid	not yet implemented
unionMemberSchema	not yet implemented
"No Value"	A scala.xml.Elem with no children.
Augmented Infoset	not yet implemented

XML Text

Document Information Item	The full text is the document.
root	The first XML tag in the document.
dfdlVersion	not yet implemented
schema (reserved for future use)	not yet implemented
unicodeByteOrderMark	not yet implemented
Element Information Item	An XML tag
namespace	Defined using standard XML namespacing (e.g. xmlns="..." and element prefixes)
name	XML tag name
document	The full text is the document
datatype	not yet implemented
dataValue	For simple types other than xs:string, the canonical XML representation of the value inside the opening/closing XML tags. See XML Illegal Characters for xs:string types containing XML illegal characters.
nilled	The "nil" attribute in the "xsi" namespace (xsi:nil="true").
children	Child XML tags
parent	Parent XML tags
schema	not yet implemented
valid	not yet implemented
unionMemberSchema	not yet implemented
"No Value"	An XML tag with no content in between the opening and closing tags
Augmented Infoset	not yet implemented

JSON

Document Information Item	The full text is the document, containing a single JSON object.
root	The first (and only) JSON string in the document object.
dfdlVersion	not yet implemented
schema (reserved for future use)	not yet implemented
unicodeByteOrderMark	not yet implemented
Element Information Item	The first JSON string in an object.
namespace	not supported
name	The first JSON string in an object.
document	The full text is the document
datatype	not yet implemented
dataValue	For simple types other than xs:string, the canonical XML representation of the value inside double quotes. For xs:string types, a JSON escaped string in double quotes.
nilled	The value of the element is null
children	Child JSON objects
parent	Parent JSON object
schema	not yet implemented
valid	not yet implemented
unionMemberSchema	not yet implemented
"No Value"	The value of the element is empty double quotes.
Augmented Infoset	not yet implemented

SAX

Document Information Item	All callbacks between (inclusive) org.xml.sax.ContentHandler#startDocument and endDocument
root	startElement callback
dfdlVersion	not yet implemented
schema (reserved for future use)	not yet implemented
unicodeByteOrderMark	not yet implemented
Element Information Item	startElement callback
namespace	Accessible three ways: (1) the uri parameter of the startElement/endElement callbacks; (2) the qName parameter together with the startPrefixMapping/endPrefixMapping callbacks; (3) the qName and Attributes parameters of the startElement callback
name	localName or qName parameter of startElement callback
document	not yet implemented
datatype	not yet implemented
dataValue	For simple types other than xs:string, the canonical XML representation of the value from the characters callback. See XML Illegal Characters for xs:string types containing XML illegal characters.
nilled	Using the getIndex method of the Attributes parameter of startElement with the XSI uri and the "nil" localName as arguments.
children	not supported
parent	not supported
schema	not yet implemented
valid	not yet implemented
unionMemberSchema	not yet implemented
"No Value"	not supported
Augmented Infoset	not yet implemented
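
The SAX rows above can be sketched as a small handler that records what each callback exposes. This is a minimal illustration using only JDK classes. The class SaxInfosetSketch, its event strings, and the eventsFor helper are hypothetical and do not reflect how Daffodil itself emits or consumes SAX events.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; records one string per infoset-relevant callback.
final class SaxInfosetSketch extends DefaultHandler {
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startDocument() { events.add("startDocument"); }

    @Override
    public void endDocument() { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);
        // nilled: look up xsi:nil by namespace URI and local name.
        boolean nilled = atts.getIndex(XSI, "nil") >= 0
                && "true".equals(atts.getValue(XSI, "nil"));
        events.add("start {" + uri + "}" + localName + (nilled ? " nilled" : ""));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // dataValue may arrive in several chunks, so accumulate it.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (text.length() > 0) events.add("value " + text);
        text.setLength(0);
        events.add("end " + localName);
    }

    // Parse XML text with a namespace-aware SAX parser and return the events.
    static List<String> eventsFor(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SaxInfosetSketch handler = new SaxInfosetSketch();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }
}
```

Because SAX is a stream of callbacks, parent and children relationships are not available directly; a consumer that needs them has to track nesting itself, for example with a stack.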

XML Illegal Characters

DFDL strings can contain characters that XML does not allow at all. For the XML-based representations, Daffodil maps these characters into the Unicode Private Use Area (PUA). This is similar to the scheme used by Microsoft Visio (see https://msdn.microsoft.com/en-us/library/office/aa218415%28v=office.10%29.aspx), but extended to handle all XML 1.0 illegal characters, including those with 16-bit code point values. The mapping is bidirectional: when parsing, illegal characters are replaced by their legal PUA counterparts, and when unparsing, the reverse transformation is applied. A legal XML document that contains only the mapped PUA characters can therefore be unparsed into a data stream that contains the XML-illegal characters.

These are the legal XML characters (for XML v1.0):

 #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
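
The character production above translates directly into a range check. The following is a minimal sketch; the class and method names are hypothetical.

```java
// Hypothetical helper; mirrors the XML 1.0 legal character ranges listed above.
final class XmlChars {
    // Returns true if the code point is a legal XML 1.0 character.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```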

All other characters are illegal. Illegal characters from #x00 to #x1F (every character in that range except #x9, #xA, and #xD) are mapped to the PUA by adding #xE000 to their character code. Hence, the NUL (#x0) character code becomes #xE000.

Illegal characters from #xD800 to #xDFFF are mapped to the PUA by adding #x1000 to their character code. So #xD800 maps to #xE800, and #xDFFF maps to #xEFFF.

Illegal characters #xFFFE and #xFFFF are mapped to the PUA by subtracting #x0F00 from their character code, giving #xF0FE and #xF0FF respectively.

The legal character #xD (Carriage Return or CR) is mapped to #xA (Line Feed, or LF). The CR character is allowed in the textual representation of XML documents, but is always converted to LF in the XML Infoset. That is, it is read by XML processors, but CRLF is converted to just LF, and CR alone is converted to LF. Daffodil is in a sense a different 'reader' of data into the XML infoset, so to be consistent with XML we map CR and CRLF to LF.
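
The parse-direction rules above can be sketched as follows. This is an illustration of the arithmetic only, not Daffodil's actual implementation; the class PuaRemap and its methods are hypothetical. The sketch walks the string by code point, so a value in the range #xD800 to #xDFFF is only seen when a surrogate is unpaired.

```java
// Hypothetical sketch of the illegal-character-to-PUA mapping described above.
final class PuaRemap {
    // Maps a single code point; legal characters other than CR pass through.
    static int toPua(int cp) {
        if (cp == 0x9 || cp == 0xA) return cp;                 // legal controls are kept
        if (cp == 0xD) return 0xA;                             // CR becomes LF
        if (cp >= 0x0 && cp <= 0x1F) return cp + 0xE000;       // illegal control characters
        if (cp >= 0xD800 && cp <= 0xDFFF) return cp + 0x1000;  // unpaired surrogates
        if (cp == 0xFFFE || cp == 0xFFFF) return cp - 0x0F00;  // non-characters
        return cp;
    }

    // Maps a whole string, collapsing CRLF into a single LF.
    static String remap(String data) {
        StringBuilder out = new StringBuilder(data.length());
        int i = 0;
        while (i < data.length()) {
            int cp = data.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == 0xD && i < data.length() && data.charAt(i) == 0xA) {
                i++; // skip the LF of a CRLF pair; the CR is emitted as LF below
            }
            out.appendCodePoint(toPua(cp));
        }
        return out.toString();
    }
}
```

For example, toPua(0x0) gives 0xE000 and toPua(0xDFFF) gives 0xEFFF, matching the values stated above. Note that the CR-to-LF step loses information, so it cannot be undone when unparsing.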

It is a processing error when parsing if the data-stream contains characters in the parts of the PUA used by this mapping for illegal XML codepoints. When unparsing, the characters such as #xE000 found in the infoset string values are mapped back to the corresponding illegal character code points (#xE000 becomes #x0, aka NUL).
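
The reverse mapping and the parse-time collision check might look like the sketch below. The class PuaUnmap is hypothetical. It assumes that the "parts of the PUA used by this mapping" are exactly the images of the illegal characters, so #xE009, #xE00A, and #xE00D are treated as ordinary PUA characters, because TAB, LF, and CR are legal and never mapped there. That reading is an assumption, not something the text above states explicitly.

```java
// Hypothetical sketch of the unparse-direction mapping and the parse-time check.
final class PuaUnmap {
    // Maps a PUA code point back to the illegal character it stands for.
    static int fromPua(int cp) {
        if (cp >= 0xE000 && cp <= 0xE01F) {
            int original = cp - 0xE000;
            // Assumption: TAB, LF and CR are legal XML, so their PUA slots are unused.
            boolean legal = original == 0x9 || original == 0xA || original == 0xD;
            return legal ? cp : original;
        }
        if (cp >= 0xE800 && cp <= 0xEFFF) return cp - 0x1000;
        if (cp == 0xF0FE || cp == 0xF0FF) return cp + 0x0F00;
        return cp;
    }

    // True if the code point lies in a part of the PUA used by the mapping.
    static boolean isReserved(int cp) {
        return fromPua(cp) != cp;
    }

    // Parse side: data that already contains a reserved code point is an error.
    static void checkNoReserved(String data) {
        data.codePoints().forEach(cp -> {
            if (isReserved(cp)) {
                throw new IllegalStateException(
                        String.format("processing error: data contains U+%04X", cp));
            }
        });
    }

    // Unparse side: restore the illegal characters in an infoset string value.
    static String unmap(String infosetValue) {
        StringBuilder out = new StringBuilder(infosetValue.length());
        infosetValue.codePoints().forEach(cp -> out.appendCodePoint(fromPua(cp)));
        return out.toString();
    }
}
```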

The XML for an infoset can conveniently embed the #xE000 character, or any of the other PUA characters that stand in for "illegal" characters, by using XML numeric character references such as "&#xE000;". The reference is turned into the #xE000 code point when the XML document is loaded. When unparsing, Daffodil then maps this code point to #x0 (aka NUL).
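
To see the loading step in isolation, the sketch below uses the JDK's DOM parser to show that a numeric character reference to #xE000 arrives in the loaded document as the single code point #xE000, whereas a direct reference to NUL is rejected by the XML parser. The class CharRefDemo is hypothetical and does not involve Daffodil.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo of how an XML parser resolves a numeric character reference.
final class CharRefDemo {
    // Parses the XML text and returns the text content of its root element.
    static String loadText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String loaded = loadText("<value>a&#xE000;b</value>");
        // The reference has been replaced by one PUA character.
        System.out.println(Integer.toHexString(loaded.charAt(1))); // prints e000
    }
}
```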

It is a processing error if any DFDL infoset string character is created with a character code greater than #x10FFFF.