module lang::xml::IO

Basic IO for XML to Rascal and back

Usage

import lang::xml::IO;

Dependencies

import util::Maybe;

Description

The XML binding implemented by this module is untyped. The readers and streamers produce values of type node for every (nested) tag.

To bind the resulting values to more strictly typed ADTs, use Validator.
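Because the values are untyped, they can be inspected with the generic functions of the standard Node module. The following is a minimal sketch, not part of lang::xml::IO: the module name and the helper describeRoot are hypothetical, and it assumes that attributes appear as keyword parameters, as the attributeTest at the end of this page suggests.

```rascal
module DescribeRoot // hypothetical module name

import lang::xml::IO;
import Node;

// Hypothetical helper: summarize the root element of an XML string as its
// tag name, its number of direct children, and its attributes.
tuple[str name, int children, map[str, value] attributes] describeRoot(str xml) {
  // readXML returns `value`, so narrow it to `node` with a typed pattern first
  if (node root := readXML(xml)) {
    return <getName(root), arity(root), getKeywordParameters(root)>;
  }
  throw "expected an XML element at the root";
}
```

For the input of nestedElementTest, `<aap><noot>mies</noot></aap>`, this would yield the name "aap" with one child and no attributes.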

function readXML

value readXML(loc file, bool fullyQualify=false, bool trackOrigins = false, bool includeEndTags=false, bool ignoreComments=true, bool ignoreWhitespace=true, str charset="UTF-8", bool inferCharset=!(charset?))

function streamXML

Stream all the tags in a file, one-by-one, without ever having the entire XML file in memory.

Maybe[value]() streamXML(loc file, str elementName, bool fullyQualify=false, bool trackOrigins = false, bool includeEndTags=false, bool ignoreComments=true, bool ignoreWhitespace=true, str charset="UTF-8", bool inferCharset=!(charset?))

streamXML returns a closure. Each call produces a single value just(...) for the next occurrence of an elementName tag in the input. Once the input is exhausted, the call produces nothing(), so you know when to stop.

IO exceptions can still be thrown while you are streaming, for example when the file has disappeared or its permissions were revoked during the execution of the stream. Only a nothing() result indicates that no further elementName tags are present in the file.

Examples

rascal>import IO;
ok

a prefix of an example XML file from the web

rascal>readFile(|https://www.w3schools.com/xml/cd_catalog.xml|(0,500))
str: "\<?xml version=\"1.0\" encoding=\"UTF-8\"?\>\n\<CATALOG\>\n \<CD\>\n \<TITLE\>Empire Burlesque\</TITLE\>\n \<ARTIST\>Bob Dylan\</ARTIST\>\n \<COUNTRY\>USA\</COUNTRY\>\n \<COMPANY\>Columbia\</COMPANY\>\n \<PRICE\>10.90\</PRICE\>\n \<YEAR\>1985\</YEAR\>\n \</CD\>\n \<CD\>\n \<TITLE\>Hide your heart\</TITLE\>\n \<ARTIST\>Bonnie Tyler\</ARTIST\>\n \<COUNTRY\>UK\</COUNTRY\>\n \<COMPANY\>CBS Records\</COMPANY\>\n \<PRICE\>9.90\</PRICE\>\n \<YEAR\>1988\</YEAR\>\n \</CD\>\n \<CD\>\n \<TITLE\>Greatest Hits\</TITLE\>\n \<ARTIST\>Dolly Parton\</ARTIST"
───
<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<CD>
<TITLE>Hide your heart</TITLE>
<ARTIST>Bonnie Tyler</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>CBS Records</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1988</YEAR>
</CD>
<CD>
<TITLE>Greatest Hits</TITLE>
<ARTIST>Dolly Parton</ARTIST
───
rascal>import lang::xml::IO;
ok

let's read every CD one-by-one

rascal>nextCD = streamXML(|https://www.w3schools.com/xml/cd_catalog.xml|, "CD");
Maybe[value] (): function(|std:///lang/xml/IO.rsc|(3050,8,<54,231>,<54,239>))

every time we call nextCD we get the next one, until the end

rascal>nextCD()
Maybe[value]: just("cd"(
"title"("Empire Burlesque"),
"artist"("Bob Dylan"),
"country"("USA"),
"company"("Columbia"),
"price"("10.90"),
"year"("1985")))
rascal>nextCD()
Maybe[value]: just("cd"(
"title"("Hide your heart"),
"artist"("Bonnie Tyler"),
"country"("UK"),
"company"("CBS Records"),
"price"("9.90"),
"year"("1988")))

or we get the next 500, filtering the final nothing() results:

rascal>[ cd | _ <- [0..500], just(cd) := nextCD()]
list[node]: [
"cd"(
"title"("Greatest Hits"),
"artist"("Dolly Parton"),
"country"("USA"),
"company"("RCA"),
"price"("9.90"),
"year"("1982")),
"cd"(
"title"("Still got the blues"),
"artist"("Gary Moore"),
"country"("UK"),
"company"("Virgin records"),
"price"("10.20"),
"year"("1990")),
"cd"(
"title"("Eros"),
"artist"("Eros Ramazzotti"),
"country"("EU"),
"company"("BMG"),
"price"("9.90"),
"year"("1997")),
"cd"(
"title"("One night only"),
"artist"("Bee Gees"),
"country"("UK"),
"company"("Polydor"),
"price"("10.90"),
"year"("1998")),
"cd"(
"title"("Sylvias Mother"),
"artist"("Dr.Hook"),
"country"("UK"),
"company"("CBS"),
"price"("8.10"),
"year"("1973")),
"cd"(
"title"("Maggie May"),
"artist"("Rod Stewart"),
"country"("UK"),
"company"("Pickwick"),
"price"("8.50"),
"year"("1990")),
"cd"(
"title"("Romanza"),
"artist"("Andrea Bocelli"),
"country"("EU"),
"company"("Polydor"),
"price"("10.80"),
"year"("1996")),
"cd"(
"title"("When a man loves a woman"),
"artist"("Percy Sledge"),
"country"("USA"),
"company"("Atlantic"),
"price"("8.70"),
"year"("1987")),
"cd"(
"title"("Black angel"),
"artist"("Savage Rose"),
"country"("EU"),
"company"("Mega"),
"price"("10.90"),
"year"("1995")),
"cd"(
"title"("1999 Grammy Nominees"),
"artist"("Many"),
"country"("USA"),
"company"("Grammy"),
"price"("10.20"),
"year"("1999")),
"cd"(
"title"("For the good times"),
"artist"("Kenny Rogers"),
"country"("UK"),
"company"("Mucik Master"),
"price"("8.70"),
"year"("1995")),
"cd"(
"title"("Big Willie style"),
"artist"("Will Smith"),
"country"("USA"),
"company"("Columbia"),
"price"("9.90"),
"year"("1997")),
"cd"(
"title"("Tupelo Honey"),
"artist"("Van Morrison"),
"country"("UK"),
"company"("Polydor"),
"price"("8.20"),
"year"("1971")),
"cd"(
"title"("Soulsville"),
"artist"("Jorn Hoel"),
"country"("Norway"),
"company"("WEA"),
"price"("7.90"),
"year"("1996")),
"cd"(
"title"("The very best of"),
"artist"("Cat Stevens"),
"country"("UK"),
"company"("Island"),
"price"("8.90"),
"year"("1990")),
"cd"(
"title"("Stop"),
"artist"("Sam Brown"),
"country"("UK"),
"company"("A and M"),
"price"("8.90"),
"year"("1988")),
"cd"(
"title"("Bridge of Spies"),
"artist"("T\'Pau"),
"country"("UK"),
"company"("Siren"),
"price"("7.90"),
"year"("1987")),
"cd"(
"title"("Private Dancer"),
"artist"("Tina Turner"),
"country"("UK"),
"company"("Capitol"),
"price"("8.90"),
"year"("1983")),
"cd"(
"title"("Midt om natten"),
"artist"("Kim Larsen"),
"country"("EU"),
"company"("Medley"),
"price"("7.80"),
"year"("1983")),
"cd"(
"title"("Pavarotti Gala Concert"),
"artist"("Luciano Pavarotti"),
"country"("UK"),
"company"("DECCA"),
"price"("9.90"),
"year"("1991")),
"cd"(
"title"("The dock of the bay"),
"artist"("Otis Redding"),
"country"("USA"),
"company"("Stax Records"),
"price"("7.90"),
"year"("1968")),
"cd"(
"title"("Picture book"),
"artist"("Simply Red"),
"country"("EU"),
"company"("Elektra"),
"price"("7.20"),
"year"("1985")),
"cd"(
"title"("Red"),
"artist"("The Communards"),
"country"("UK"),
"company"("London"),
"price"("7.80"),
"year"("1987")),
"cd"(
"title"("Unchain my heart"),
"artist"("Joe Cocker"),
"country"("USA"),
"company"("EMI"),
"price"("8.20"),
"year"("1987"))
]
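The comprehension above guesses an upper bound of 500 elements. When the number of elements is unknown, a loop that runs until nothing() may be a better fit. The sketch below is one possible shape, not part of the library: the module name, the helper forEachElement, and its handle callback are hypothetical, and it assumes that stream failures surface as the standard IO(str) exception from the Exception module.

```rascal
module StreamLoop // hypothetical module name

import lang::xml::IO;
import util::Maybe;
import Exception;
import IO;

// Hypothetical helper: call `handle` on every `elementName` element in `file`
// and return how many elements were processed.
int forEachElement(loc file, str elementName, void (value) handle) {
  nextElement = streamXML(file, elementName);
  int processed = 0;
  try {
    // the match fails on nothing(), which ends the loop
    while (just(v) := nextElement()) {
      handle(v);
      processed += 1;
    }
  }
  catch IO(str msg): {
    // assumption: a vanished file or revoked permission is reported as IO(msg)
    println("stream interrupted after <processed> elements: <msg>");
  }
  return processed;
}
```

A call such as forEachElement(|https://www.w3schools.com/xml/cd_catalog.xml|, "CD", void (value cd) { println(cd); }) would print each CD as it arrives, holding only one of them at a time.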

Benefits

  • Low latency for accessing the first element in a long stream, and then the next and the next.
  • Low (constant) memory usage because only one selected element is active at a time on the heap. This works particularly well for XML documents that have huge amounts of sibling elements, like database table dumps.

Pitfalls

  • Selection of elementName greatly influences memory usage. If you select a child of a repeated structure, only the child is cleaned up while the parent structure remains. Memory then grows linearly with the number of parent structures again, defeating the point of calling streamXML.
  • Lower throughput for processing enormous documents. Compared to readXML, and only if enough memory is available to store both the internal DOM and the Rascal node structure, streamXML reaches a lower throughput because of the function call overhead for each next element. If you do run out of memory with readXML, though, streamXML reaches a considerably higher throughput than readXML.
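One way to stay clear of the first pitfall is to stream the repeated structure itself and extract the interesting child from each streamed value by pattern matching. The sketch below assumes the CD catalog from the examples, including the lower-case tag names shown in that transcript; the module name and the helper streamTitles are hypothetical.

```rascal
module StreamTitles // hypothetical module name

import lang::xml::IO;
import util::Maybe;

// Hypothetical helper: stream whole CD elements, but keep only their titles.
list[str] streamTitles(loc file) {
  nextCD = streamXML(file, "CD");
  list[str] titles = [];
  while (just(cd) := nextCD()) {
    // deep match: collect the text of every nested "title" element
    titles += [ t | /"title"(str t) := cd ];
  }
  return titles;
}
```

Only the title strings accumulate here. Each cd value becomes unreachable after its loop iteration, so it no longer counts towards the memory in use.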

function readXML

value readXML(str contents, loc src = |unknown:///|, bool fullyQualify=false, bool trackOrigins = false, bool includeEndTags=false, bool ignoreComments=true, bool ignoreWhitespace=true)

function writeXMLString

Pretty-print any value as an XML string

str writeXMLString(value val, str charset="UTF-8", bool outline=false, bool prettyPrint=true, int indentAmount=4, int maxPaddingWidth=30, bool dropOrigins=true)

This function uses JSoup's DOM functionality to yield a syntactically correct XML string.
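A small sketch of a call, using only the keyword parameters from the signature above. The node value is modeled on the tests at the end of this page; the module name and the function names are hypothetical, and the exact layout of the resulting strings is not shown because it depends on the printer.

```rascal
module WriteString // hypothetical module name

import lang::xml::IO;

// An element with one attribute and one nested element, built as a plain node.
node example = "aap"("noot"("mies"), age="1");

// Default settings: pretty-printed output.
str pretty() = writeXMLString(example);

// The same value with pretty-printing switched off.
str compact() = writeXMLString(example, prettyPrint=false);
```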

function writeXMLFile

Pretty-print any value to an XML file

void writeXMLFile(loc file, value val, str charset="UTF-8", bool outline=false, bool prettyPrint=true, int indentAmount=4, int maxPaddingWidth=30, bool dropOrigins=true)

This function uses JSoup's DOM functionality to yield a syntactically correct XML file.
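The following sketch writes a value to a file and reads it back to check whether it survives the round trip. The module name, the helper roundTrip, and the default location under the tmp scheme are assumptions made for illustration. Whether the comparison holds for a given value depends on the reader and writer options, so treat the result as something to test rather than a guarantee.

```rascal
module RoundTrip // hypothetical module name

import lang::xml::IO;

// Hypothetical helper: write `v` as XML to `file`, read it back, and compare.
bool roundTrip(value v, loc file = |tmp:///round-trip-example.xml|) {
  writeXMLFile(file, v);
  return readXML(file) == v;
}
```

For example, roundTrip("aap"("noot"("mies"))) reports whether the value used in nestedElementTest is preserved.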

Tests

test nestedElementTest

test bool nestedElementTest() {
  example = "\<aap\>\<noot\>mies\</noot\>\</aap\>";
  val = readXML(example);
  return val == "aap"("noot"("mies"));
}

test attributeTest

test bool attributeTest() {
  example = "\<aap age=\"1\"\>\</aap\>";
  val = readXML(example);
  return val == "aap"(age="1");
}

test namespaceTest

test bool namespaceTest() {
  example = "\<aap xmlns:ns=\"http://trivial\" ns:age=\"1\" age=\"2\"\>\</aap\>";
  val = readXML(example);
  return "aap"(\ns-age="1", age="2") := val;
}