XML is a language for describing structured data. In XML or eXtensible Markup Language, user-created tags mark the beginning and end of data values of a particular type. For example, in HTML the tag < ul > represents the begining of a bullet list that is ended by < \ul >. HTML is related to XML but is not a subset of XML because often there are HTML constructs which are not delimited by such parenthesis-like paired tags (e.g., < p > does not require an ending paragraph tag). Every XML document has a root tag which is the outermost tag on the entire document, and serves as its root. Click here to see an example of XML taken from the w3schools tutorial. Note that each tag has a corresponding ending tag and that the strings associated with the tags are created by the user designing the XML document. Notice that < note > is the root tag for this example.
One can define a specification for correct XML data for a specific application using a language called DTD (see source 1 or source 2 or source 3) which was a precursor to the more recent (and more complex) XML schemas. In this assignment, you will be given a DTD description of well-formed bibliographic documents. Your job will be to write a functional parser using higher order functions that checks XML documents for adherence to this DTD specification.
XML documents are built from: elements, tags, attributes, entities, PCDATA, and CDATA. Elements are those constructs in the document which are named by tags; for example, < note > is an element. Tags occur in pairs in XML documents and delimit a specific element. For example, < element_name > marks the beginning of the element and < /element_name > marks the end of that element. Attributes are values that can become associated with an element. For example, in HTML when you include images in a file, they appear in an img element: < img src="./pic.jpeg" / >. The img element is empty so its end is marked by "/", but it has attribute src with value ".pic.jpeg". Entities are variables used to define common text such as "ampersand quot" meaning quotation marks or "ampersand lt" meaning less than symbol. Entities are short cuts for referring to the same text throughout the document, a sort of string macro facility. PCDATA means parsed character data; this is character text which will be parsed as if it is XML. CDATA is character data that will not be parsed.
The following file describes the subset of the DTD language which we will use. This file shows you examples of various DTD constructs from the w3 tutorials referred to above. This language is reminescent of regular expressions with similar operators. Notice that the !ELEMENT, !ATTRIBUTE, !ENTITY are keywords in DTD.
The XML we will be using defines a bibliography format similar to (but not the same as) that of LaTex's bibtex. A copy of the DTD for this bibliography data can be found in bibDTD. This DTD will act as a "grammar" for you to check (i.e., parse) the corresponding XML documents. Note that your checker program will be tailored to check for only this grammar; we are NOT asking you to build a checker generator program. Also, you can assume that the child elements of an element will appear in the XML in the same position they appear left to right in the DTD definition. Please have your parser generate output telling whether or not the test data XML is valid.
We also have constructed a correct XML document, structured according to this DTD for you to use as your initial test data. You can try to view this file in your browser, but it probably won't display correctly :-) Please use the "view source" button in your browser to get the actual file contents.
When you write your own test data, you will need to make sure that you are following the rules of the DTD correctly. We suggest you check this by downloading the Apache XML parser, Xerces2 (http://xerces.apache.org/xerces2-j/), which is written in Java, and run it to parse (i.e., check) your test data. You will need to include the same top two lines of our correct XML document as the first two lines of any test data you create. On my MacG4 (Unix-based) I ran the parser using a program included in their samples directory by executing the following:
java -cp xercesSamples.jar:xercesImpl.jar:xml-apis.jar:resolver.jar sax.Writer -v -xd file://$HOME/correct.xmlall written on one line.
Program parserToshow.sml on paul which was explained in the lecture notes shows how to use higher order functions to accomplish parsing according to a grammar. Your job will be to build a program like that one for our DTD describing bibliographies of interest.
EXTRA CREDIT: It is possible to parse the XML making no assumptions about the position of the child element fields within an element. For example, the order of fields in a bibtex file is arbitrary within a bibtex element. Write a parser in functional programming style that makes no assumptions about child element order. Explain your technque in the documentation of your project.
There will be test data on the programming assignments page at least one week before the assignment is due. We will try to use the handin program (again) and you will be turning in a *.tar file containing the following: