Microsoft.NET

Expertise in .NET Technologies

Introduction to XML

Posted by Ravi Varma Thumati on May 12, 2010

Extensible Markup Language, or XML for short, is a technology for web applications. XML is a World Wide Web Consortium (W3C) standard that lets you create your own tags. XML simplifies business-to-business transactions on the web. XML is a markup language for documents containing structured information.

What is XML?

  • XML stands for Extensible Markup Language.
  • XML is a markup language much like HTML.
  • XML was designed to describe data.
  • XML tags are not predefined in XML. You must define your own tags.
  • XML can use a DTD (Document Type Definition) to formally describe the data.
  • XML with a DTD is designed to be self-describing.
  • XML is a W3C Recommendation

It is important to understand that XML is not a replacement for HTML. The main purpose of HTML is to format the data that is presented through a browser. For displaying data on handheld devices, WML is used. The purpose of XML is not to format the data to be displayed. It is mostly used to store, transfer, and describe data. It is device and language independent and can be used for transmitting data to any device. The parser (the program that understands the tags and returns the text in a valid format) on the corresponding device helps display the data in the required format.

You can define your own tags in an XML file. How these tags are interpreted depends on the program that receives the file. The data embedded within the tags is used according to the logic implemented in the receiving program, which takes the XML as a feed. This point will become clearer when we explain how to use parsers in the next few pages.

The Difference between XML and HTML

XML is not a replacement for HTML.

XML and HTML were designed with different goals:

  • XML was designed to transport and store data, with focus on what data is.
  • HTML was designed to display data, with focus on how data looks.

HTML is about displaying information, while XML is about carrying information.

XML Does not DO Anything

Maybe it is a little hard to understand, but XML does not DO anything. XML was created to structure, store, and transport information.

The following example is a note to Tove from Jani, stored as XML:

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don’t forget me this weekend!</body>
</note>

The note above is quite self descriptive. It has sender and receiver information, it also has a heading and a message body.

But still, this XML document does not DO anything. It is just pure information wrapped in tags. Someone must write a piece of software to send, receive or display it.
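
As a rough illustration of "a piece of software" that displays the note, here is a minimal Python sketch using the standard library's xml.etree.ElementTree module. The function name describe_note and the output layout are assumptions made for this example, and the apostrophe in the body is typed as a plain ASCII character.

```python
import xml.etree.ElementTree as ET

# The note from the example above, held as a plain string.
NOTE_XML = """<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""


def describe_note(xml_text):
    """Turn the note markup into a one-line, human-readable summary."""
    note = ET.fromstring(xml_text)  # parse the text into an element tree
    return "{} -> {}: [{}] {}".format(
        note.findtext("from"),
        note.findtext("to"),
        note.findtext("heading"),
        note.findtext("body"),
    )


print(describe_note(NOTE_XML))
```

The XML itself stays passive; all of the behaviour lives in the program that reads it.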

XML is Just Plain Text

XML is nothing special. It is just plain text. Software that can handle plain text can also handle XML.

However, XML-aware applications can handle the XML tags specially. The functional meaning of the tags depends on the nature of the application.

With XML You Invent Your Own Tags

The tags in the example above (like <to> and <from>) are not defined in any XML standard. These tags are “invented” by the author of the XML document.

That is because the XML language has no predefined tags.

The tags used in HTML (and the structure of HTML) are predefined. HTML documents can only use tags defined in the HTML standard (like <p>, <h1>, etc.).

XML allows the author to define his own tags and his own document structure.
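
To make the idea of invented tags concrete, the following Python sketch builds a small document whose element and attribute names (order, item, qty) are made up purely for this illustration. It assumes the standard library's xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# None of these names come from any standard; the author simply chose them.
order = ET.Element("order")
item = ET.SubElement(order, "item", {"qty": "2"})
item.text = "Widget"

# Serialize the tree back to text.
order_xml = ET.tostring(order, encoding="unicode")
print(order_xml)
```

Any program that agrees on these names can read the result back; a program that does not know them can still parse it as well-formed XML.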

XML is Not a Replacement for HTML

XML is a complement to HTML.

It is important to understand that XML is not a replacement for HTML. In most web applications, XML is used to transport data, while HTML is used to format and display the data.

My best description of XML is this: XML is a software- and hardware-independent tool for carrying information.

Why XML?

In order to appreciate XML, it is important to understand why it was created. XML was created so that richly structured documents could be used over the web. The only viable alternatives, HTML and SGML, are not practical for this purpose.

HTML, as we’ve already discussed, comes bound with a set of semantics and does not provide arbitrary structure.

SGML provides arbitrary structure, but is too difficult to implement just for a web browser. Full SGML systems solve large, complex problems that justify their expense. Viewing structured documents sent over the web rarely carries such justification.

This is not to say that XML can be expected to completely replace SGML. While XML is being designed to deliver structured content over the web, some of the very features it lacks to make this practical make SGML a more satisfactory solution for the creation and long-term storage of complex documents. In many organizations, filtering SGML to XML will be the standard procedure for web delivery.

XML is Everywhere

We have been participating in XML development since its creation. It has been amazing to see how quickly the XML standard has developed, and how quickly a large number of software vendors have adopted the standard.

XML is now as important for the Web as HTML was to the foundation of the Web.

XML is everywhere. It is the most common tool for data transmissions between all sorts of applications, and is becoming more and more popular in the area of storing and describing information.

XML Development Goals

The XML specification sets out the following goals for XML:

  1. It shall be straightforward to use XML over the Internet. Users must be able to view XML documents as quickly and easily as HTML documents. In practice, this will only be possible when XML browsers are as robust and widely available as HTML browsers, but the principle remains.
  2. XML shall support a wide variety of applications. XML should be beneficial to a wide variety of diverse applications: authoring, browsing, content analysis, etc. Although the initial focus is on serving structured documents over the web, it is not meant to narrowly define XML.
  3. XML shall be compatible with SGML. Most of the people involved in the XML effort come from organizations that have a large, in some cases staggering, amount of material in SGML. XML was designed pragmatically, to be compatible with existing standards while solving the relatively new problem of sending richly structured documents over the web.
  4. It shall be easy to write programs that process XML documents. The colloquial way of expressing this goal while the spec was being developed was that it ought to take about two weeks for a competent computer science graduate student to build a program that can process XML documents.
  5. The number of optional features in XML is to be kept to an absolute minimum, ideally zero. Optional features inevitably raise compatibility problems when users want to share documents and sometimes lead to confusion and frustration.
  6. XML documents should be human-legible and reasonably clear. If you don’t have an XML browser and you’ve received a hunk of XML from somewhere, you ought to be able to look at it in your favorite text editor and actually figure out what the content means.
  7. The XML design should be prepared quickly. Standards efforts are notoriously slow. XML was needed immediately and was developed as quickly as possible.
  8. The design of XML shall be formal and concise. In many ways a corollary to rule 4, it essentially means that XML must be expressed in EBNF and must be amenable to modern compiler tools and techniques.
    There are a number of technical reasons why the SGML grammar cannot be expressed in EBNF. Writing a proper SGML parser requires handling a variety of rarely used and difficult to parse language features. XML does not.
  9. XML documents shall be easy to create. Although there will eventually be sophisticated editors to create and edit XML content, they won’t appear immediately. In the interim, it must be possible to create XML documents in other ways: directly in a text editor, with simple shell and Perl scripts, etc.
  10. Terseness in XML markup is of minimal importance. Several SGML language features were designed to minimize the amount of typing required to manually key in SGML documents. These features are not supported in XML. From an abstract point of view, these documents are indistinguishable from their more fully specified forms, but supporting these features adds a considerable burden to the SGML parser (or the person writing it, anyway). In addition, most modern editors offer better facilities to define shortcuts when entering text.

How Is XML Defined?

XML is defined by a number of related specifications:

1. Extensible Markup Language (XML) 1.0

Defines the syntax of XML. The XML specification is the primary focus of this article.

2. XML Pointer Language (XPointer) and XML Linking Language (XLink)

Defines a standard way to represent links between resources. In addition to simple links, like HTML’s <A> tag, XML has mechanisms for links between multiple resources and links between read-only resources. XPointer describes how to address a resource; XLink describes how to associate two or more resources.

3. Extensible Stylesheet Language (XSL)

Defines the standard stylesheet language for XML.

As time goes on, additional requirements will be addressed by other specifications. Currently (Sep, 1998), namespaces (dealing with tags from multiple tag sets), a query language (finding out what’s in a document or a collection of documents), and a schema language (describing the relationships between tags, DTDs in XML) are all being actively pursued.

The Rules of XML

We need to look next at the rules that govern XML documents. The rules can get a little tedious so if you’re in a hurry, just have a quick glance through and refer back later. You’ll find that, once you get into writing your own XML documents, most of these rules will be pretty obvious.

The XML standard itself is available at http://www.w3.org/TR/2000/REC-xml-20001006. To save you a long read, the key rules are explained below. Note that if an XML document obeys these rules, it is said to be well formed (the word “valid” has another meaning in XML, which we’ll look at later):

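These are the most important rules any XML document must obey.

1. XML Version Declaration

XML documents should begin with a statement that describes the version of the XML standard being used:

<?xml version="1.0"?>

In XML 1.0 this declaration is optional, but including it is good practice. It looks like a processing instruction, although the specification formally treats it as the XML declaration.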

2. Close your Tags!

Every XML tag must be properly closed. HTML is more relaxed here, allowing you to use tags like <img> and <br> without closing them. In XML these should be <br></br> or just <br /> if the tag contains no data.

3. XML Tags Must be Nested in the Correct Order

In HTML, a browser will allow you to have <i> <b> Hello World! </i> </b>. In XML this would have to be either <i> <b> Hello World! </b> </i> or <b> <i> Hello World! </i> </b>.

4. XML is Sensitive to UPPERCASE/lowercase

In XML <mytag /> is not the same as <MYTAG />! In HTML you can get away with this — a browser will (generally) treat <BODY></body> as being the same thing.

5. And I Quote…

XML attributes must have quotes around them. In HTML you can get away with <a href=mypage.html>It's a Link!</a>. In XML that has to be <a href="mypage.html">It's a Link!</a>.
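
Rules 2 to 5 can be checked mechanically, because a conforming parser rejects any document that breaks them. The sketch below is one way to try this in Python with the standard library; the helper name is_well_formed and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it raises."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


SAMPLES = {
    "closed tag": "<p>Hello<br /></p>",
    "unclosed tag": "<p>Hello<br></p>",
    "bad nesting": "<i><b>Hello World!</i></b>",
    "case mismatch": "<BODY></body>",
    "unquoted attribute": "<a href=mypage.html>It's a Link!</a>",
    "quoted attribute": '<a href="mypage.html">It\'s a Link!</a>',
}

for label, text in SAMPLES.items():
    print(label, "->", is_well_formed(text))
```

Only the "closed tag" and "quoted attribute" samples are accepted; the others each break one of the rules above.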

6. An XML Document Must have at Least One Element

At least one element, known as the root element, must exist for an XML document to be well formed. This tag doesn’t have to contain anything, though, so the example below is acceptable:

<?xml version="1.0"?>

<root />

7. Naming your Tags

The way you name your XML tags is governed by the following rules:

  • tag names can contain letters, numbers, and other characters (e.g. <mytag3></mytag3> is fine)
  • tag names cannot contain spaces (e.g. <my tag></my tag> is wrong)
  • tag names should not start with the letters xml (including UPPER or mIXeD case), because such names are reserved
  • tag names cannot start with a number or with punctuation such as a period or hyphen (e.g. <3mytag></3mytag> and <.mytag></.mytag> are both wrong); an underscore is allowed.
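
The naming rules above can be approximated with a short check. This Python sketch is deliberately simplified: it only considers ASCII names, whereas the XML specification also allows many non-ASCII letters, and it leaves namespace colons out. The function name is_acceptable_tag_name is hypothetical.

```python
import re

# Simplified, ASCII-only pattern: a letter or underscore first,
# then letters, digits, periods, hyphens, or underscores.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_acceptable_tag_name(name):
    """Apply the naming rules listed above to a candidate tag name."""
    if name.lower().startswith("xml"):
        return False  # names starting with "xml" in any case are reserved
    return _NAME.fullmatch(name) is not None


for candidate in ["mytag3", "my tag", "3mytag", ".mytag", "XmlThing", "_note"]:
    print(candidate, "->", is_acceptable_tag_name(candidate))
```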

8. Special Characters

Within the data you place in a tag or attribute, certain characters must be replaced with entities to prevent them from being mixed up with XML tags and syntax. These characters are:

Character : Entity : Example

" : &quot; : <tag entity="Here is a quote &quot;" />

' : &apos; : <tag entity="Here is an apostrophe &apos;" />

< : &lt; : <tag>1 &lt; 2</tag>

> : &gt; : <tag>2 &gt; 1</tag>

& : &amp; : <tag>Kramer &amp; Kramer</tag>

In PHP, the function htmlspecialchars() will achieve this for most of these characters (single quotes are converted only when the ENT_QUOTES flag is set).
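
For readers working outside PHP, the same replacement can be sketched in Python with xml.sax.saxutils.escape from the standard library. By default that function handles &, < and >; the extra mapping for the two quote characters and the wrapper name escape_text are assumptions added for this example.

```python
from xml.sax.saxutils import escape

# escape() covers &, < and > on its own; quotes are added explicitly.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def escape_text(value):
    """Replace all five special characters with their entities."""
    return escape(value, QUOTES)


print(escape_text("Kramer & Kramer"))
print(escape_text("1 < 2"))
print(escape_text('Here is a quote "'))
```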

9. New Lines and White Space

For new lines in XML, the XML standard supports carriage returns and linefeeds (i.e. \r\n, \r and \n, as in most programming languages, are acceptable). Having said that, XML processors are expected to ‘normalize’ these to \n during processing.

Whitespace in XML consists of space characters, new lines (above), and tab characters. If a document has no DTD (see below), all whitespace must be preserved. If a DTD is provided with the XML document and an element contains nothing but whitespace or other elements, the whitespace can be removed when processing the document; it is down to the DTD (or XML Schema) to specify which elements should have their whitespace preserved.

In most cases you shouldn’t need to worry about this, but in particular where XSLT is concerned, to generate output for humans to read, you may need to be careful.
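
The two behaviours described above, newline normalization and whitespace preservation, can be observed with a small Python sketch. It assumes the standard library's xml.etree.ElementTree module (which uses the expat parser); the element names are only placeholders.

```python
import xml.etree.ElementTree as ET

# A Windows-style line ending inside element content...
crlf = ET.fromstring("<msg>line one\r\nline two</msg>")
# ...and a run of three spaces inside element content.
spaced = ET.fromstring("<film>Sleeping   with the enemy</film>")

print(repr(crlf.text))    # the \r\n pair is reported as a single \n
print(repr(spaced.text))  # the three spaces are kept as they are
```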

XML’s Syntactic Highlights

  1. HTML allows fairly loose structuring in which an end tag, such as </p>, is optional. XML does not allow such omissions. Remember, an XML document is made up of elements, not tags. It requires that a corresponding end tag follow every start tag.
    <paragraph>XML is the future of the Web</paragraph>
  2. All XML elements must be closed. Tags without content and, therefore, without end tags must be closed in the following manner:
    <image url="picture.gif"/>
    Along the same lines, empty elements (<film></film>) may be marked in the following manner:
    <film/>
  3. You cannot overlap elements. For example, the following code
    <actress>Cindy<lname>Crawford</actress></lname>
    is improper syntax. The following is correct:
    <actress>Cindy<lname>Crawford</lname></actress>
  4. All attribute values must be in quotes:
    <photograph url="cindy.gif" width="300px"/>
  5. The contents of an XML element are treated as data; white space is not ignored. Therefore
    <film>Sleeping with the enemy</film>
    is not equivalent to
    <film>Sleeping    with the enemy</film>
  6. There will be times when you will want certain character data to be treated literally. For instance, if the contents of an XML element consist of some sample code, rather than replacing each reserved character with its entity equivalent you can simply mark it as character data:
    <![CDATA[Rope]]>
  7. XML is case sensitive. The following element
    <sportsman>Sachin Tendulkar</sportsman>
    is not equivalent to
    <SPORTSMAN>Sachin Tendulkar</SPORTSMAN>

Document Type Definition

One other construct is important when discussing XML syntax: the DTD, or Document Type Definition. The first thing a parser does when it finds an XML document is look for the existence of a DTD. DTDs are not mandatory in XML but if they exist, they define the XML tags and structure for that document. DTDs are commonly used to define markup tags for a specific industry. By reading a DTD, an XML parser knows how to interpret the markup.

DTDs are also used to validate the “correctness” of an XML data stream. They contain information such as what elements are allowed in the file, what type of data is allowed in each element, whether a certain structure can repeat. The DTD for an XML document can be contained within the document itself or referenced externally.
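
As an illustration of a DTD contained within the document itself, the sketch below embeds a small internal DTD for the earlier note example. The declarations are written for this example and are not taken from any published DTD. Python's standard xml.etree.ElementTree parser reads the document but does not validate it against the DTD; actual validation would need a validating parser (for instance the third-party lxml library).

```python
import xml.etree.ElementTree as ET

# An internal DTD: the declarations sit inside the DOCTYPE brackets.
NOTE_WITH_DTD = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to, from, heading, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>See you this weekend</body>
</note>"""

# Non-validating parse: the DTD is read but not enforced.
note = ET.fromstring(NOTE_WITH_DTD)
child_names = [child.tag for child in note]
print(child_names)
```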

What Do XML Documents Look Like?

If you are conversant with HTML or SGML, XML documents will look familiar. A simple XML document is presented in Example 1.

Example 1. A Simple XML Document

<?xml version="1.0"?>
 
<oldjoke>
 
<burns>Say <quote>goodnight</quote>,
Gracie.</burns>
 
<allen><quote>Goodnight,
Gracie.</quote></allen>
 
<applause/>
 
</oldjoke>

A few things may stand out to you:

  • The document begins with a processing instruction: <?xml ...?>. This is the XML declaration. While it is not required, its presence explicitly identifies the document as an XML document and indicates the version of XML to which it was authored.
  • There’s no document type declaration. Unlike SGML, XML does not require a document type declaration. However, a document type declaration can be supplied, and some documents will require one in order to be understood unambiguously.
  • Empty elements (<applause/> in this example) have a modified syntax. While most elements in a document are wrappers around some content, empty elements are simply markers where something occurs (a horizontal rule for HTML’s <hr> tag, for example, or a cross reference for DocBook’s <xref> tag). The trailing /> in the modified syntax indicates to a program processing the XML document that the element is empty and no matching end-tag should be sought. Since XML documents do not require a document type declaration, without this clue it could be impossible for an XML parser to determine which tags were intentionally empty and which had been left empty by mistake.
    XML has softened the distinction between elements which are declared as EMPTY and elements which merely have no content. In XML, it is legal to use the empty-element tag syntax in either case. It’s also legal to use a start-tag/end-tag pair for empty elements: <applause></applause>. If interoperability is of any concern, it’s best to reserve empty-element tag syntax for elements which are declared as EMPTY and to only use the empty-element tag form for those elements.
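
The equivalence of the two empty-element forms can be seen from a parser's point of view. This is a small Python sketch using the standard library's xml.etree.ElementTree module; the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<applause/>")
long_form = ET.fromstring("<applause></applause>")

# Both parse to an element with no text and no children.
print(short_form.text, len(short_form))
print(long_form.text, len(long_form))

# Serializing either one yields the same markup.
same_output = (
    ET.tostring(short_form, encoding="unicode")
    == ET.tostring(long_form, encoding="unicode")
)
print(same_output)
```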

XML documents are composed of markup and content. There are six kinds of markup that can occur in an XML document: elements, entity references, comments, processing instructions, marked sections, and document type declarations. The following sections introduce each of these markup concepts.

Elements

Elements are the most common form of markup. Delimited by angle brackets, most elements identify the nature of the content they surround. Some elements may be empty, as seen above, in which case they have no content. If an element is not empty, it begins with a start-tag, <element>, and ends with an end-tag, </element>.

Attributes

Attributes are name-value pairs that occur inside start-tags after the element name. For example,

<div class="preface">

is a div element with the attribute class having the value preface. In XML, all attribute values must be quoted.
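
A program reading this element sees the attribute as a simple name-value pair. The following Python sketch, using the standard library's xml.etree.ElementTree module, shows one way to get at it; the element content is invented for the example.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring('<div class="preface">Some introductory text.</div>')

print(div.get("class"))  # look up one attribute by name
print(div.attrib)        # all attributes as a dictionary
```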

Entity References

In order to introduce markup into a document, some characters have been reserved to identify the start of markup. The left angle bracket, < , for instance, identifies the beginning of an element start- or end-tag. In order to insert these characters into your document as content, there must be an alternative way to represent them. In XML, entities are used to represent these special characters. Entities are also used to refer to often repeated or varying text and to include the content of external files.

Every entity must have a unique name. Defining your own entity names is discussed in the section on entity declarations. In order to use an entity, you simply reference it by name. Entity references begin with the ampersand and end with a semicolon.

For example, the lt entity inserts a literal < into a document. So the string <element> can be represented in an XML document as &lt;element>.

A special form of entity reference, called a character reference can be used to insert arbitrary Unicode characters into your document. This is a mechanism for inserting characters that cannot be typed directly on your keyboard.

Character references take one of two forms: decimal references, &#8478;, and hexadecimal references, &#x211E;. Both of these refer to character number U+211E from Unicode (which is the standard Rx prescription symbol, in case you were wondering).
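
The following Python sketch shows a parser resolving both kinds of reference: a decimal and a hexadecimal character reference for U+211E (8478 in decimal), plus the predefined lt and gt entities. It assumes the standard library's xml.etree.ElementTree module, and the element names are placeholders.

```python
import xml.etree.ElementTree as ET

rx = ET.fromstring("<rx>&#8478; &#x211E;</rx>")
tag_text = ET.fromstring("<sample>&lt;element&gt;</sample>")

# Both references resolve to the same single character.
decimal_char, hex_char = rx.text.split(" ")
print(decimal_char == hex_char, hex(ord(decimal_char)))

# The entity references come back as literal angle brackets.
print(tag_text.text)
```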

Comments

Comments begin with <!-- and end with -->. Comments can contain any data except the literal string --. You can place comments between markup anywhere in your document.

Comments are not part of the textual content of an XML document. An XML processor is not required to pass them along to an application.

Processing Instructions

Processing instructions (PIs) are an escape hatch to provide information to an application. Like comments, they are not textually part of the XML document, but the XML processor is required to pass them to an application.

Processing instructions have the form: <?name pidata?>. The name, called the PI target, identifies the PI to the application. Applications should process only the targets they recognize and ignore all other PIs. Any data that follows the PI target is optional; it is for the application that recognizes the target. The names used in PIs may be declared as notations in order to formally identify them.

PI names beginning with xml are reserved for XML standardization.
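
To see how comments and processing instructions reach an application, here is a small Python sketch using xml.dom.minidom from the standard library, which keeps both kinds of node in its tree. The PI target render, its data, and the function name list_markup are invented for this example.

```python
from xml.dom import minidom

DOC = (
    '<?xml version="1.0"?>'
    "<doc><!-- draft only --><?render mode=\"fast\"?><p>Hello</p></doc>"
)


def list_markup(xml_text):
    """Collect the comments and PIs found directly under the root element."""
    dom = minidom.parseString(xml_text)
    found = []
    for node in dom.documentElement.childNodes:
        if node.nodeType == node.COMMENT_NODE:
            found.append(("comment", node.data))
        elif node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            # target is the PI name; data is everything after it
            found.append(("pi", node.target, node.data))
    return found


print(list_markup(DOC))
```

An application would typically act on the PI targets it recognizes and skip the rest, as described above.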

CDATA Sections

In a document, a CDATA section instructs the parser to ignore most markup characters.

Consider a source code listing in an XML document. It might contain characters that the XML parser would ordinarily recognize as markup (< and &, for example). In order to prevent this, a CDATA section can be used.

<![CDATA[
 
*p = &q;
 
b = (i <= 3);
 
]]>
 

Between the start of the section, <![CDATA[ and the end of the section, ]]>, all character data is passed directly to the application, without interpretation. Elements, entity references, comments, and processing instructions are all unrecognized and the characters that comprise them are passed literally to the application.

The only string that cannot occur in a CDATA section is ]]>.
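
A quick way to confirm this behaviour is to parse a CDATA section and inspect what the application receives. The Python sketch below wraps the listing above in a hypothetical listing element and uses the standard library's xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

LISTING = """<listing><![CDATA[
*p = &q;
b = (i <= 3);
]]></listing>"""

# The & and < characters arrive untouched; no entity lookup takes place.
code = ET.fromstring(LISTING).text
print(code)
```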
