Daffodil is an implementation of DFDL that supports multiple representations of the DFDL Infoset, including several XML representations and JSON. These representations differ somewhat from the DFDL Infoset itself, because Daffodil approximates the Infoset using a subset of XML/JSON features. The tables below describe how Daffodil maps the DFDL Infoset to each supported representation.
Document Information Item | org.jdom.Document |
root | org.jdom.Element getRootElement() |
dfdlVersion | not yet implemented |
schema (reserved for future use) | not yet implemented |
unicodeByteOrderMark | not yet implemented |
Element Information Item | org.jdom.Element |
namespace | org.jdom.Namespace getNamespace() |
name | String getName() |
document | org.jdom.Document getDocument() |
datatype | not yet implemented |
dataValue | For simple types other than xs:string, the canonical XML representation of the value as returned by String getText(). See XML Illegal Characters for xs:string types containing XML illegal characters. |
nilled | The "nil" attribute in the "xsi" namespace (xsi:nil). |
children | java.util.List<Element> getChildren() |
parent | org.jdom.Parent getParent() |
schema | not yet implemented |
valid | not yet implemented |
unionMemberSchema | not yet implemented |
"No Value" | An org.jdom.Element with no children (not even Text nodes) is the representation of an element with "no value". |
Augmented Infoset | not yet implemented |
Document Information Item | org.w3c.dom.Document |
root | org.w3c.dom.Node getFirstChild() |
dfdlVersion | not yet implemented |
schema (reserved for future use) | not yet implemented |
unicodeByteOrderMark | not yet implemented |
Element Information Item | org.w3c.dom.Element |
namespace | String getNamespaceURI() |
name | String getNodeName() if getNamespaceURI() == null, String getLocalName() otherwise |
document | org.w3c.dom.Document getOwnerDocument() |
datatype | not yet implemented |
dataValue | For simple types other than xs:string, the canonical XML representation of the value as returned by String getWholeText(). See XML Illegal Characters for xs:string types containing XML illegal characters. |
nilled | The "nil" attribute in the "xsi" namespace (xsi:nil). |
children | org.w3c.dom.NodeList getChildNodes() |
parent | org.w3c.dom.Node getParentNode() |
schema | not yet implemented |
valid | not yet implemented |
unionMemberSchema | not yet implemented |
"No Value" | An org.w3c.dom.Element with no children (not even Text nodes) is the representation of an element with "no value". |
Augmented Infoset | not yet implemented |
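As an illustration of the org.w3c.dom mapping above, here is a minimal sketch that reads infoset members from a DOM tree using only JDK classes. The class name `DomInfosetWalk` and its helper methods are hypothetical, the document is built by parsing XML text rather than by Daffodil, and `root` assumes no comment or processing instruction precedes the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper: reads DFDL Infoset members from a W3C DOM tree.
class DomInfosetWalk {
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    // Parses XML text into a namespace-aware Document (a stand-in for Daffodil output).
    static Document load(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // required for getNamespaceURI()/getLocalName()
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // root: the first child of the Document (assumes nothing precedes the root element).
    static Element root(Document d) {
        return (Element) d.getFirstChild();
    }

    // name: getNodeName() when there is no namespace, getLocalName() otherwise.
    static String name(Element e) {
        return e.getNamespaceURI() == null ? e.getNodeName() : e.getLocalName();
    }

    // dataValue: the whole text of the first Text child, or null if there is none.
    static String dataValue(Element e) {
        Node first = e.getFirstChild();
        return (first instanceof Text) ? ((Text) first).getWholeText() : null;
    }

    // nilled: the "nil" attribute in the xsi namespace.
    static boolean nilled(Element e) {
        return "true".equals(e.getAttributeNS(XSI, "nil"));
    }

    // "No Value": an element with no children, not even Text nodes.
    static boolean noValue(Element e) {
        return !e.hasChildNodes();
    }
}
```

Calling `DomInfosetWalk.name(DomInfosetWalk.root(doc))` returns the local name of the root element when it is namespace-qualified, and the node name otherwise.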
Document Information Item | The document is represented by the root element. There is no separate document item. |
root | The root scala.xml.Elem |
dfdlVersion | not yet implemented |
schema (reserved for future use) | not yet implemented |
unicodeByteOrderMark | not yet implemented |
Element Information Item | scala.xml.Elem |
namespace | def namespace: String |
name | def name: String |
document | not supported |
datatype | not yet implemented |
dataValue | For simple types other than xs:string, the canonical XML representation of the value as returned by def text: String. See XML Illegal Characters for xs:string types containing XML illegal characters. |
nilled | The "nil" attribute in the "xsi" namespace (xsi:nil). |
children | def child: Node* |
parent | not supported |
schema | not yet implemented |
valid | not yet implemented |
unionMemberSchema | not yet implemented |
"No Value" | A scala.xml.Elem with no children. |
Augmented Infoset | not yet implemented |
Document Information Item | The full text is the document. |
root | The first XML tag in the document. |
dfdlVersion | not yet implemented |
schema (reserved for future use) | not yet implemented |
unicodeByteOrderMark | not yet implemented |
Element Information Item | An XML tag |
namespace | Defined using standard XML namespacing (e.g. xmlns="..." and element prefixes) |
name | XML tag name |
document | The full text is the document |
datatype | not yet implemented |
dataValue | For simple types other than xs:string, the canonical XML representation of the value inside the opening/closing XML tags. See XML Illegal Characters for xs:string types containing XML illegal characters. |
nilled | The "nil" attribute in the "xsi" namespace (xsi:nil). |
children | Child XML tags |
parent | Parent XML tags |
schema | not yet implemented |
valid | not yet implemented |
unionMemberSchema | not yet implemented |
"No Value" | An XML tag with no content in between the opening and closing tags |
Augmented Infoset | not yet implemented |
Document Information Item | The full text is the document, containing a JSON single object. |
root | The first (and only) JSON string in the document object. |
dfdlVersion | not yet implemented |
schema (reserved for future use) | not yet implemented |
unicodeByteOrderMark | not yet implemented |
Element Information Item | The first JSON string in an object. |
namespace | not supported |
name | The first JSON string in an object. |
document | The full text is the document |
datatype | not yet implemented |
dataValue | For simple types other than xs:string, the canonical XML representation of the value inside double quotes. For xs:string types, a JSON escaped string in double quotes. |
nilled | The value of the element is null |
children | Child JSON objects |
parent | Parent JSON objects |
schema | not yet implemented |
valid | not yet implemented |
unionMemberSchema | not yet implemented |
"No Value" | The value of the element is empty double quotes. |
Augmented Infoset | not yet implemented |
Document Information Item | All callbacks between (inclusive) org.xml.sax.ContentHandler#startDocument and endDocument |
root | startElement callback |
dfdlVersion | not yet implemented |
schema (reserved for future use) | not yet implemented |
unicodeByteOrderMark | not yet implemented |
Element Information Item | StartElement event |
namespace | Accessible three ways:
|
name | localName or qName parameter of startElement callback |
document | not yet implemented |
datatype | not yet implemented |
dataValue | For simple types other than xs:string, the canonical XML representation of the value from the characters callback. See XML Illegal Characters for xs:string types containing XML illegal characters. |
nilled | Using the getIndex method of the Attributes parameter of startElement with the XSI uri and the "nil" localName as arguments. |
children | not supported |
parent | not supported |
schema | not yet implemented |
valid | not yet implemented |
unionMemberSchema | not yet implemented |
"No Value" | not supported |
Augmented Infoset | not yet implemented |
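The SAX mapping above can be sketched with a small handler that records the callbacks it receives. The class name `SaxInfosetEvents` and the event strings are hypothetical, and the JDK SAX parser reading XML text stands in for Daffodil as the event source; the handler only uses the standard org.xml.sax callbacks named in the table.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records infoset-related SAX callbacks as strings.
class SaxInfosetEvents extends DefaultHandler {
    static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";
    final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override public void startDocument() { events.add("startDocument"); }
    @Override public void endDocument() { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);
        // nilled: look up the "nil" attribute by XSI uri and local name.
        int nil = atts.getIndex(XSI, "nil");
        boolean nilled = nil >= 0 && "true".equals(atts.getValue(nil));
        events.add("start " + localName + (nilled ? " nilled" : ""));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // a value may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (text.length() > 0) events.add("value " + text);
        text.setLength(0);
        events.add("end " + localName);
    }

    // Feeds XML text through a namespace-aware SAX parser and returns the recorded events.
    static List<String> run(String xml) {
        try {
            SAXParserFactory f = SAXParserFactory.newInstance();
            f.setNamespaceAware(true);
            XMLReader reader = f.newSAXParser().getXMLReader();
            SaxInfosetEvents h = new SaxInfosetEvents();
            reader.setContentHandler(h);
            reader.parse(new InputSource(new StringReader(xml)));
            return h.events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because SAX is a stream of events, there is no way to navigate to children or parents from a callback; a handler that needs that context has to track it itself, for example with a stack of open element names.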
Since DFDL strings can contain characters that are not allowed in XML at all, the XML-based representations map these characters into the Unicode Private Use Area (PUA). This is similar to the scheme used by Microsoft Visio (see https://msdn.microsoft.com/en-us/library/office/aa218415%28v=office.10%29.aspx), but extended to handle all the XML 1.0 illegal characters, including those with 16-bit code point values. The mapping is bidirectional: illegal characters are replaced by their legal PUA counterparts when parsing, and the reverse transformation is performed when unparsing. This allows data streams containing XML-illegal characters to be created from legal XML documents that contain only the corresponding PUA characters.
These are the legal XML characters (for XML v1.0):
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
All other characters are illegal.
Illegal characters from #x00 to #x1F are mapped to the PUA by adding #xE000 to their character code. Hence, the NUL (#x0) character code becomes #xE000.
Illegal characters from #xD800 to #xDFFF are mapped to the PUA by adding #x1000 to their character code. So #xD800 maps to #xE800, and #xDFFF maps to #xEFFF.
Illegal characters #xFFFE and #xFFFF are mapped to the PUA by subtracting #x0F00 from their character code, giving characters #xF0FE and #xF0FF.
The legal character #xD (Carriage Return, or CR) is mapped to #xA (Line Feed, or LF). The CR character is allowed in the textual representation of XML documents, but is always converted to LF in the XML Infoset. That is, it is read by XML processors, but CRLF is converted to just LF, and CR alone is converted to LF. Daffodil is in a sense a different 'reader' of data into the XML infoset, so to be consistent with XML we map CR and CRLF to LF.
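The parse-direction rules above can be sketched as a single pass over a string. This is an illustration under stated assumptions, not Daffodil's implementation: the class name `XmlPuaMapper` and method `toXmlSafe` are hypothetical, and the sketch assumes that a surrogate counts as illegal only when it is unpaired, since a valid pair encodes a legal character in [#x10000-#x10FFFF].

```java
// Hypothetical sketch of the illegal-character-to-PUA mapping used when parsing.
class XmlPuaMapper {
    static String toXmlSafe(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i); // a valid surrogate pair yields one code point
            i += Character.charCount(cp);
            if (cp == 0x0D) {
                // CR and CRLF both become a single LF.
                if (i < s.length() && s.charAt(i) == '\n') i++;
                out.append('\n');
            } else if (cp < 0x20 && cp != 0x09 && cp != 0x0A) {
                out.appendCodePoint(cp + 0xE000); // e.g. #x0 -> #xE000
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                out.appendCodePoint(cp + 0x1000); // unpaired surrogate, e.g. #xD800 -> #xE800
            } else if (cp == 0xFFFE || cp == 0xFFFF) {
                out.appendCodePoint(cp - 0x0F00); // #xFFFE -> #xF0FE, #xFFFF -> #xF0FF
            } else {
                out.appendCodePoint(cp); // legal XML character, kept as-is
            }
        }
        return out.toString();
    }
}
```

For example, `XmlPuaMapper.toXmlSafe("a\r\nb")` yields `"a\nb"`, and a string holding a single NUL becomes a string holding the single character #xE000.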
It is a processing error when parsing if the data stream contains characters in the parts of the PUA used by this mapping for illegal XML code points. When unparsing, characters such as #xE000 found in infoset string values are mapped back to the corresponding illegal code points (#xE000 becomes #x0, i.e. NUL).
The XML for an infoset can conveniently embed the #xE000 character, or any of the other "illegal" characters mapped into the PUA, by using XML numeric character references such as "&#xE000;". The reference is turned into the #xE000 code point when the XML document is loaded. Daffodil then maps this to #x0 (NUL) when unparsing.
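A minimal sketch of that round trip, using the JDK DOM parser to load the XML and a hand-written reverse mapping in place of Daffodil's unparser. The class name `PuaUnparseSketch` and both methods are hypothetical. Reversing only the code points that the parse-direction mapping can produce (so leaving #xE009, #xE00A and #xE00D untouched) is an assumption of this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: load XML containing a PUA character, then map it back.
class PuaUnparseSketch {
    // Loads XML text and returns the text content of its root element.
    static String rootText(String xml) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return d.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverses the PUA mapping for the three remapped ranges.
    static String fromXmlSafe(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean fromC0 = c >= 0xE000 && c <= 0xE01F
                    && c != 0xE009 && c != 0xE00A && c != 0xE00D; // assumption: never produced
            if (fromC0) {
                out.append((char) (c - 0xE000)); // #xE000 -> #x0
            } else if (c >= 0xE800 && c <= 0xEFFF) {
                out.append((char) (c - 0x1000)); // #xE800 -> #xD800
            } else if (c == 0xF0FE || c == 0xF0FF) {
                out.append((char) (c + 0x0F00)); // #xF0FE -> #xFFFE
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Loading `<v>a&#xE000;b</v>` with `rootText` gives a three-character string whose middle character is #xE000, and `fromXmlSafe` turns that character into NUL. The CR-to-LF mapping is not reversed, because the original line ending cannot be recovered from the infoset.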
It is a processing error if any DFDL infoset string character is created with a character code greater than #x10FFFF.