module lang::xml::IO
Basic IO for XML to Rascal and back
Usage
import lang::xml::IO;
Dependencies
import util::Maybe;
Description
The XML binding implemented by this module is untyped. The readers and streamers produce
values of type node for every (nested) tag.
To bind the resulting values to more strictly typed ADTs, use Validator.
function readXML
value readXML(loc file, bool fullyQualify=false, bool trackOrigins = false, bool includeEndTags=false, bool ignoreComments=true, bool ignoreWhitespace=true, str charset="UTF-8", bool inferCharset=!(charset?))
function streamXML
Stream all the tags in a file, one-by-one, without ever having the entire XML file in memory.
Maybe[value]() streamXML(loc file, str elementName, bool fullyQualify=false, bool trackOrigins = false, bool includeEndTags=false, bool ignoreComments=true, bool ignoreWhitespace=true, str charset="UTF-8", bool inferCharset=!(charset?))
streamXML returns a closure. Each call to it produces a single value just(...) for the next
occurrence of an elementName tag in the input. The final call produces nothing(), so you know when to stop.
IO exceptions can still be thrown while you are already streaming, for example when the file has disappeared
or permissions were revoked during the execution of the stream. Only a nothing() result indicates
that no further elementName tags are present in the file.
Examples
rascal>import IO;
ok
a prefix of an example XML file from the web
rascal>readFile(|https://www.w3schools.com/xml/cd_catalog.xml|(0,500))
str: "\<?xml version=\"1.0\" encoding=\"UTF-8\"?\>\n\<CATALOG\>\n \<CD\>\n \<TITLE\>Empire Burlesque\</TITLE\>\n \<ARTIST\>Bob Dylan\</ARTIST\>\n \<COUNTRY\>USA\</COUNTRY\>\n \<COMPANY\>Columbia\</COMPANY\>\n \<PRICE\>10.90\</PRICE\>\n \<YEAR\>1985\</YEAR\>\n \</CD\>\n \<CD\>\n \<TITLE\>Hide your heart\</TITLE\>\n \<ARTIST\>Bonnie Tyler\</ARTIST\>\n \<COUNTRY\>UK\</COUNTRY\>\n \<COMPANY\>CBS Records\</COMPANY\>\n \<PRICE\>9.90\</PRICE\>\n \<YEAR\>1988\</YEAR\>\n \</CD\>\n \<CD\>\n \<TITLE\>Greatest Hits\</TITLE\>\n \<ARTIST\>Dolly Parton\</ARTIST"
───
<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<CD>
<TITLE>Hide your heart</TITLE>
<ARTIST>Bonnie Tyler</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>CBS Records</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1988</YEAR>
</CD>
<CD>
<TITLE>Greatest Hits</TITLE>
<ARTIST>Dolly Parton</ARTIST
───
rascal>import lang::xml::IO;
ok
let's read every CD one-by-one
rascal>nextCD = streamXML(|https://www.w3schools.com/xml/cd_catalog.xml|, "CD");
Maybe[value] (): function(|std:///lang/xml/IO.rsc|(3050,8,<54,231>,<54,239>))
every time we call nextCD we get the next one, until the end
rascal>nextCD()
Maybe[value]: just("cd"(
"title"("Empire Burlesque"),
"artist"("Bob Dylan"),
"country"("USA"),
"company"("Columbia"),
"price"("10.90"),
"year"("1985")))
rascal>nextCD()
Maybe[value]: just("cd"(
"title"("Hide your heart"),
"artist"("Bonnie Tyler"),
"country"("UK"),
"company"("CBS Records"),
"price"("9.90"),
"year"("1988")))
or we get the next 500, filtering the final nothing() results:
rascal>[ cd | _ <- [0..500], just(cd) := nextCD()]
list[node]: [
"cd"(
"title"("Greatest Hits"),
"artist"("Dolly Parton"),
"country"("USA"),
"company"("RCA"),
"price"("9.90"),
"year"("1982")),
"cd"(
"title"("Still got the blues"),
"artist"("Gary Moore"),
"country"("UK"),
"company"("Virgin records"),
"price"("10.20"),
"year"("1990")),
"cd"(
"title"("Eros"),
"artist"("Eros Ramazzotti"),
"country"("EU"),
"company"("BMG"),
"price"("9.90"),
"year"("1997")),
"cd"(
"title"("One night only"),
"artist"("Bee Gees"),
"country"("UK"),
"company"("Polydor"),
"price"("10.90"),
"year"("1998")),
"cd"(
"title"("Sylvias Mother"),
"artist"("Dr.Hook"),
"country"("UK"),
"company"("CBS"),
"price"("8.10"),
"year"("1973")),
"cd"(
"title"("Maggie May"),
"artist"("Rod Stewart"),
"country"("UK"),
"company"("Pickwick"),
"price"("8.50"),
"year"("1990")),
"cd"(
"title"("Romanza"),
"artist"("Andrea Bocelli"),
"country"("EU"),
"company"("Polydor"),
"price"("10.80"),
"year"("1996")),
"cd"(
"title"("When a man loves a woman"),
"artist"("Percy Sledge"),
"country"("USA"),
"company"("Atlantic"),
"price"("8.70"),
"year"("1987")),
"cd"(
"title"("Black angel"),
"artist"("Savage Rose"),
"country"("EU"),
"company"("Mega"),
"price"("10.90"),
"year"("1995")),
"cd"(
"title"("1999 Grammy Nominees"),
"artist"("Many"),
"country"("USA"),
"company"("Grammy"),
"price"("10.20"),
"year"("1999")),
"cd"(
"title"("For the good times"),
"artist"("Kenny Rogers"),
"country"("UK"),
"company"("Mucik Master"),
"price"("8.70"),
"year"("1995")),
"cd"(
"title"("Big Willie style"),
"artist"("Will Smith"),
"country"("USA"),
"company"("Columbia"),
"price"("9.90"),
"year"("1997")),
"cd"(
"title"("Tupelo Honey"),
"artist"("Van Morrison"),
"country"("UK"),
"company"("Polydor"),
"price"("8.20"),
"year"("1971")),
"cd"(
"title"("Soulsville"),
"artist"("Jorn Hoel"),
"country"("Norway"),
"company"("WEA"),
"price"("7.90"),
"year"("1996")),
"cd"(
"title"("The very best of"),
"artist"("Cat Stevens"),
"country"("UK"),
"company"("Island"),
"price"("8.90"),
"year"("1990")),
"cd"(
"title"("Stop"),
"artist"("Sam Brown"),
"country"("UK"),
"company"("A and M"),
"price"("8.90"),
"year"("1988")),
"cd"(
"title"("Bridge of Spies"),
"artist"("T\'Pau"),
"country"("UK"),
"company"("Siren"),
"price"("7.90"),
"year"("1987")),
"cd"(
"title"("Private Dancer"),
"artist"("Tina Turner"),
"country"("UK"),
"company"("Capitol"),
"price"("8.90"),
"year"("1983")),
"cd"(
"title"("Midt om natten"),
"artist"("Kim Larsen"),
"country"("EU"),
"company"("Medley"),
"price"("7.80"),
"year"("1983")),
"cd"(
"title"("Pavarotti Gala Concert"),
"artist"("Luciano Pavarotti"),
"country"("UK"),
"company"("DECCA"),
"price"("9.90"),
"year"("1991")),
"cd"(
"title"("The dock of the bay"),
"artist"("Otis Redding"),
"country"("USA"),
"company"("Stax Records"),
"price"("7.90"),
"year"("1968")),
"cd"(
"title"("Picture book"),
"artist"("Simply Red"),
"country"("EU"),
"company"("Elektra"),
"price"("7.20"),
"year"("1985")),
"cd"(
"title"("Red"),
"artist"("The Communards"),
"country"("UK"),
"company"("London"),
"price"("7.80"),
"year"("1987")),
"cd"(
"title"("Unchain my heart"),
"artist"("Joe Cocker"),
"country"("USA"),
"company"("EMI"),
"price"("8.20"),
"year"("1987"))
]
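The comprehension above caps the number of calls at 500. As an alternative, a loop that stops at the first nothing() could look like the minimal sketch below. The module name StreamCount and the function countElements are hypothetical names chosen for illustration; only streamXML and the Maybe constructors come from the library.

```rascal
module StreamCount // hypothetical module name, for illustration only

import lang::xml::IO;
import util::Maybe;

// Count the elements named `elementName` in `file`.
// Only the element returned by the most recent call is held by this function.
int countElements(loc file, str elementName) {
  next = streamXML(file, elementName);
  int count = 0;
  // The pattern match fails on the first nothing(), which ends the loop.
  while (just(_) := next()) {
    count += 1;
  }
  return count;
}
```

Calling countElements(|https://www.w3schools.com/xml/cd_catalog.xml|, "CD") would walk the same catalog as the transcript above, without guessing an upper bound on the number of elements.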
Benefits
- Low latency for accessing the first element in a long stream, and then the next and the next.
- Low (constant) memory usage because only one selected element is active at a time on the heap. This works particularly well for XML documents that have huge amounts of sibling elements, like database table dumps.
Pitfalls
- Selection of elementName greatly influences memory usage. If you select a child of a repeated structure, only the child is cleaned up, while the parent structure remains. Memory will then grow linearly with the number of parent structures again, defeating the point of calling streamXML.
- Lower throughput for processing enormous documents. Compared to readXML, and only if enough memory is available to store both the internal DOM and the Rascal node structure, streamXML reaches a lower throughput because of the function call overhead for each next element. If you do run out of memory with readXML though, then streamXML reaches far higher throughput than readXML.
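One way to act on the first pitfall is to stream the repeated structure itself and pick the child of interest out of each streamed element, instead of streaming the child directly. The sketch below is an illustration under assumptions: the module name StreamChildren and the function childTexts are hypothetical, and the code assumes the child elements contain plain text only. getChildren and getName come from the standard Node module.

```rascal
module StreamChildren // hypothetical module name, for illustration only

import lang::xml::IO;
import util::Maybe;
import Node;

// Collect the text of every `childName` element that sits directly inside
// an `elementName` element, streaming one `elementName` element at a time.
list[str] childTexts(loc file, str elementName, str childName) {
  next = streamXML(file, elementName);
  list[str] result = [];
  while (just(node element) := next()) {
    result += [ text
              | node child <- getChildren(element)
              , getName(child) == childName
              , [str text] := getChildren(child)
              ];
  }
  return result;
}
```

Note that the names compared by getName are the names of the resulting Rascal nodes. In the transcript above these are lower-case ("title") even though the file uses TITLE, so childName may need the same casing as the node names rather than the casing in the XML source.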
function readXML
value readXML(str contents, loc src = |unknown:///|, bool fullyQualify=false, bool trackOrigins = false, bool includeEndTags=false, bool ignoreComments=true, bool ignoreWhitespace=true)
function writeXMLString
Pretty-print any value as an XML string
str writeXMLString(value val, str charset="UTF-8", bool outline=false, bool prettyPrint=true, int indentAmount=4, int maxPaddingWidth=30, bool dropOrigins=true)
This function uses JSoup's DOM functionality to yield a syntactically correct XML string.
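As a minimal sketch of how this might be used, the snippet below serializes the same nested node value that appears in the tests of this module. The module name WriteDemo and the function demoXML are hypothetical; the exact layout of the resulting string depends on the prettyPrint, indentAmount and related options, so no output is shown here.

```rascal
module WriteDemo // hypothetical module name, for illustration only

import IO;
import lang::xml::IO;

// Serialize an untyped node value: an outer "aap" element
// containing a "noot" element with the text "mies".
str demoXML() = writeXMLString("aap"("noot"("mies")));

void main() {
  println(demoXML());
}
```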
function writeXMLFile
Pretty-print any value to an XML file
void writeXMLFile(loc file, value val, str charset="UTF-8", bool outline=false, bool prettyPrint=true, int indentAmount=4, int maxPaddingWidth=30, bool dropOrigins=true)
This function uses JSoup's DOM functionality to yield a syntactically correct XML file.
Tests
test nestedElementTest
test bool nestedElementTest() {
  example = "\<aap\>\<noot\>mies\</noot\>\</aap\>";
  val = readXML(example);
  return val == "aap"("noot"("mies"));
}
test attributeTest
test bool attributeTest() {
  example = "\<aap age=\"1\"\>\</aap\>";
  val = readXML(example);
  return val == "aap"(age="1");
}
test namespaceTest
test bool namespaceTest() {
  example = "\<aap xmlns:ns=\"http://trivial\" ns:age=\"1\" age=\"2\"\>\</aap\>";
  val = readXML(example);
  return "aap"(\ns-age="1", age="2") := val;
}