The structure of the input records generated by the XML input format is determined at run time, depending on the document being parsed, and on the values specified for the input format parameters.
The XML input format parses an XML document by "visiting" the nodes in the document, and the input record fields are the attributes and values of the nodes that are visited by the XML input format.
By default, nodes are visited from the document root,
that is, the single top-level node in an XML document that contains
all the other nodes in the document.
However, by supplying an XPath either in the from-entity or as the value of the rootXPath parameter, users can specify
that the document nodes are to be visited starting from the node(s)
selected by the XPath.
Before parsing the XML document and returning the input records,
the XML input format initially examines the nodes found along the
paths from the root node or from the node(s) selected by the
user-supplied root XPath to the first n leaf nodes, where
n is the value of the dtNodes parameter.
During this phase, the XML input format creates a representation of
the tree structure ("schema" tree) by merging nodes with the same
name and hierarchical position. When completed, the schema tree
contains one single instance of each node type, and each node
contains an attribute set equal to the union of all the attributes
found in the nodes of that type.
At this point, an input record field is created for each attribute
belonging to a node type and for each node type having a value.
Once the schema tree has been determined and the input record
structure has been created, the XML input format parses the XML
document and generates input records, visiting the document nodes
and extracting their values and attributes.
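The schema-discovery phase described above can be sketched in a few lines of Python. This is only an illustration built on the standard xml.etree.ElementTree module, not the XML input format's actual implementation: the names build_schema and field_names are hypothetical, the sketch examines the whole document instead of honoring dtNodes, and the field ordering (node value first, then attributes) is an assumption chosen to match the examples on this page.

```python
import xml.etree.ElementTree as ET

# Shortened version of the "Cities" example document.
CITIES_XML = """<World>
  <Continent ContinentName='North America'>
    <Country CountryName='USA'>
      <City> Redmond </City>
    </Country>
  </Continent>
  <Continent ContinentName='Europe'>
    <Country CountryName='Italia'>
      <City Population='3350000'> Roma </City>
      <City> Milano </City>
    </Country>
  </Continent>
</World>"""


def build_schema(root):
    """Merge nodes that share the same name and hierarchical position."""
    schema = {}  # path of tag names -> merged description of that node type

    def visit(node, path):
        entry = schema.setdefault(path, {"attrs": [], "has_value": False})
        # Attribute set of a node type = union over all nodes of that type.
        for name in node.attrib:
            if name not in entry["attrs"]:
                entry["attrs"].append(name)
        # A node type "has a value" if any instance carries non-blank text.
        if (node.text or "").strip():
            entry["has_value"] = True
        for child in node:
            visit(child, path + (child.tag,))

    visit(root, (root.tag,))
    return schema


def field_names(schema):
    """One field per node type with a value, one per attribute."""
    fields = []
    for path, entry in schema.items():
        if entry["has_value"]:
            fields.append(path[-1])
        fields.extend(entry["attrs"])
    return fields


schema = build_schema(ET.fromstring(CITIES_XML))
print(field_names(schema))
```

Note that the "Population" field appears even though only one "City" node carries that attribute, because the schema keeps the union of the attributes seen for each node type.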
The XML input format implements three different algorithms to
decide how document nodes will be visited. The three algorithms
represent three different ways in which the information contained
in an XML document can be retrieved, and the choice of an algorithm
depends on the structure of the document and on the structure of
the information that needs to be extracted.
Since different algorithms visit different sets of nodes, the
choice of an algorithm affects which fields (i.e. which node
attributes and values) will be contained in the input records.
Users can specify the algorithm to use through the fMode ("field mode") parameter, which can
be set to "Branch", "Tree", or "Node".
The "Branch" field mode is appropriate for documents in which each
hierarchical level consists of nodes of the same type, as depicted
in the following diagram:
In this structure, the root node contains only nodes of type "A",
and each "A" node contains only nodes of type "B".
For example, the root of the following XML document contains
"Continent" nodes only; each "Continent" node contains "Country"
nodes only, and each "Country" node contains "City" nodes only:
<?xml version="1.0" ?>
<World>
  <Continent ContinentName='North America'>
    <Country CountryName='USA'>
      <City> Redmond </City>
      <City> San Francisco </City>
    </Country>
    <Country CountryName='Canada'>
      <City> Vancouver </City>
      <City> Toronto </City>
    </Country>
  </Continent>
  <Continent ContinentName='Europe'>
    <Country CountryName='Italia'>
      <City Population='3350000'> Roma </City>
      <City> Milano </City>
    </Country>
  </Continent>
</World>
This document can be thought of as containing six "entries", the leaf "City" nodes, with the information associated with each entry being contained in the nodes that are encountered along a path from the root node to the leaf node.
The schema tree extracted from this example document specifies
that the document root node contains nodes of the "Continent" type,
and that nodes of this type have a "ContinentName" attribute.
"Continent" nodes, in turn, contain nodes of the "Country" type,
with a "CountryName" attribute; finally, "Country" nodes contain
nodes of the "City" type, and nodes of this type have a value, and
a "Population" attribute.
The input records generated from the schema tree would thus
contain four fields: "ContinentName", "CountryName", "City", and
"Population".
When using the "Branch" field mode, the XML input format generates one input record for each path from the document root node, or from the node(s) selected by the user-supplied root XPath, to a leaf node. Each input record contains the attributes and values of the nodes encountered along the path:
ContinentName CountryName City          Population
------------- ----------- ------------- ----------
North America USA         Redmond       -
North America USA         San Francisco -
North America Canada      Vancouver     -
North America Canada      Toronto       -
Europe        Italia      Roma          3350000
Europe        Italia      Milano        -
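As a rough illustration of the "Branch" behavior, the following Python sketch walks every path from the root to a leaf and emits one record per path. It uses the standard xml.etree.ElementTree module; the helper names node_fields and branch_records are hypothetical, and fields that are absent from a path are simply left out of the dictionary instead of being reported as empty. It is a sketch of the idea, not the XML input format's own algorithm.

```python
import xml.etree.ElementTree as ET

CITIES_XML = """<World>
  <Continent ContinentName='North America'>
    <Country CountryName='USA'>
      <City> Redmond </City>
      <City> San Francisco </City>
    </Country>
    <Country CountryName='Canada'>
      <City> Vancouver </City>
      <City> Toronto </City>
    </Country>
  </Continent>
  <Continent ContinentName='Europe'>
    <Country CountryName='Italia'>
      <City Population='3350000'> Roma </City>
      <City> Milano </City>
    </Country>
  </Continent>
</World>"""


def node_fields(node):
    """Attributes of a node, plus its value (keyed by tag) when non-blank."""
    fields = dict(node.attrib)
    text = (node.text or "").strip()
    if text:
        fields[node.tag] = text
    return fields


def branch_records(node, inherited=None):
    """Yield one record per root-to-leaf path below `node`."""
    fields = {**(inherited or {}), **node_fields(node)}
    children = list(node)
    if not children:
        # Leaf reached: the accumulated fields form one record.
        yield fields
        return
    for child in children:
        yield from branch_records(child, fields)


records = list(branch_records(ET.fromstring(CITIES_XML)))
for record in records:
    print(record)
```

Each record accumulates the fields of its ancestors, which is why "ContinentName" and "CountryName" repeat across the records for cities in the same country.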
The "Tree" field mode is appropriate for documents in which a specific
hierarchical level contains child nodes all having different types,
as depicted in the following diagram:
In this structure, the root node contains only nodes of type "A";
each "A" node however contains nodes all having different types (a
single "B" node, a single "C" node, and a single "D" node).
For example, the root of the following XML document contains
"Message" nodes; each "Message" node contains a single "From" node,
a single "To" node, and a single "Body" node:
<?xml version="1.0" ?>
<Messages>
  <Message Date='2004-05-28T12:24:05'>
    <From> Gabriele </From>
    <To> Monica </To>
    <Body> How's going? </Body>
  </Message>
  <Message Date='2004-05-28T13:01:14'>
    <From> Monica </From>
    <To> Gabriele </To>
    <Body> Fine, thanks. </Body>
  </Message>
</Messages>
This document can be thought of as containing two "entries", the "Message" subtrees, with the information associated with each entry being contained in all the nodes in the subtree and in the nodes that are encountered along a path from the root node to the subtree root.
The schema tree extracted from this example document specifies
that the document root node contains nodes of the "Message" type,
and that nodes of this type have a "Date" attribute. "Message"
nodes, in turn, contain nodes of the "From", "To", and "Body"
types, each type having a node value.
The input records generated from the schema tree would thus
contain four fields: "Date", "From", "To", and "Body".
When using the "Tree" field mode, the XML input format generates an input record for each subtree that includes all nodes of distinct types. Each input record contains the attributes and values of the nodes found in the subtrees, together with the attributes and values of the nodes encountered along the paths from the document root node or from the node(s) selected by the user-supplied root XPath to the subtree root nodes:
Date                From     To       Body
------------------- -------- -------- -------------
2004-05-28 12:24:05 Gabriele Monica   How's going?
2004-05-28 13:01:14 Monica   Gabriele Fine, thanks.
While parsing an XML document in "Tree" mode, if a subtree is
found containing multiple instances of the same node type, that
subtree is "replicated" combinatorially to generate all the
possible subtrees containing one single instance of each node
type.
The following diagram depicts an XML document in which a subtree
contains multiple instances of the same node type:
In this diagram, the "A" node contains one instance of the "B" node
type, two instances of the "C" node type, and two instances of the
"D" node type.
For example, the "Message" node in the following XML document
contains a single "From" node, two "To" nodes, and two "Body"
nodes:
<?xml version="1.0" ?>
<Messages>
  <Message Date='2004-05-28T12:24:05'>
    <From> Gabriele </From>
    <To> Jeff </To>
    <To> Steve </To>
    <Body Language='ENU'> Review ready? </Body>
    <Body Language='ITA'> E' pronta la review? </Body>
  </Message>
</Messages>
This document can be thought of as a "compact" representation of four different messages:
When using the "Tree" field mode, the "Message" subtree is replicated combinatorially to generate all the possible subtrees containing one single instance of each of the "From", "To", and "Body" node types:
Date                From     To    Body                 Language
------------------- -------- ----- -------------------- --------
2004-05-28 12:24:05 Gabriele Jeff  Review ready?        ENU
2004-05-28 12:24:05 Gabriele Jeff  E' pronta la review? ITA
2004-05-28 12:24:05 Gabriele Steve Review ready?        ENU
2004-05-28 12:24:05 Gabriele Steve E' pronta la review? ITA
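The combinatorial replication can be pictured as a Cartesian product over the groups of same-type child nodes. The Python sketch below illustrates this under simplifying assumptions: it only handles a two-level layout (entry nodes directly under the root, leaf nodes directly under each entry), the helper names node_fields and tree_records are hypothetical, and the "Date" value is kept as the raw attribute string. It is not the XML input format's actual code.

```python
import itertools
import xml.etree.ElementTree as ET

MESSAGES_XML = """<Messages>
  <Message Date='2004-05-28T12:24:05'>
    <From> Gabriele </From>
    <To> Jeff </To>
    <To> Steve </To>
    <Body Language='ENU'> Review ready? </Body>
    <Body Language='ITA'> E' pronta la review? </Body>
  </Message>
</Messages>"""


def node_fields(node):
    """Attributes of a node, plus its value (keyed by tag) when non-blank."""
    fields = dict(node.attrib)
    text = (node.text or "").strip()
    if text:
        fields[node.tag] = text
    return fields


def tree_records(entry):
    """Yield one record per combination of one child of each node type."""
    groups = {}  # child tag -> all children of that type, in document order
    for child in entry:
        groups.setdefault(child.tag, []).append(child)
    base = node_fields(entry)
    # The product picks exactly one instance of every node type per record.
    for combo in itertools.product(*groups.values()):
        record = dict(base)
        for child in combo:
            record.update(node_fields(child))
        yield record


root = ET.fromstring(MESSAGES_XML)
records = [rec for message in root for rec in tree_records(message)]
for record in records:
    print(record)
```

With one "From", two "To", and two "Body" nodes, the product yields 1 x 2 x 2 = 4 records, one per combination of recipient and body language.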
The "Node" field mode is appropriate for situations in which the information
to be retrieved is associated with a specific node type only.
For example, the relevant information in the document depicted by
the following diagram might be associated with "B" node types
only:
When using the "Node" field mode, the XML input format generates an input record for each root node, either the document root or the node(s) selected by the user-supplied root XPath. Each input record contains the attributes and values of that node only:
CountryName
-----------
USA
Canada
Italia
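A small Python sketch can illustrate the "Node" behavior. The root XPath behind the table above is not stated on this page, so the sketch assumes that the "Country" nodes of the "Cities" document are the selected nodes, and it uses the limited path syntax of the standard xml.etree.ElementTree module in place of a full XPath. The helper names node_fields and node_records are hypothetical.

```python
import xml.etree.ElementTree as ET

CITIES_XML = """<World>
  <Continent ContinentName='North America'>
    <Country CountryName='USA'>
      <City> Redmond </City>
    </Country>
    <Country CountryName='Canada'>
      <City> Vancouver </City>
    </Country>
  </Continent>
  <Continent ContinentName='Europe'>
    <Country CountryName='Italia'>
      <City Population='3350000'> Roma </City>
    </Country>
  </Continent>
</World>"""


def node_fields(node):
    """Attributes of a node, plus its value (keyed by tag) when non-blank."""
    fields = dict(node.attrib)
    text = (node.text or "").strip()
    if text:
        fields[node.tag] = text
    return fields


def node_records(root, path):
    """One record per selected node, using that node's own fields only."""
    return [node_fields(node) for node in root.findall(path)]


# Assumed selection: every Country node below every Continent node.
records = node_records(ET.fromstring(CITIES_XML), "./Continent/Country")
for record in records:
    print(record)
```

Neither the ancestors ("ContinentName") nor the descendants ("City", "Population") contribute fields here, which is the difference from the "Branch" and "Tree" modes.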
As an example, the following help command displays the input record structure determined by the XML input format when parsing the previous "Cities" example XML document:
C:\>LogParser -h -i:XML Cities.xml
The structure displayed by this help command will be:
Fields:
  ContinentName (S)  CountryName (S)  City (S)  Population (I)
© 2004 Microsoft Corporation. All rights reserved.