parse dtd/entity #12

daviehh · 2023-04-27T21:52:12Z

I'm not sure whether this is within the scope of this package, but the DTD does not currently seem to be parsed correctly, in particular entity declarations. For example, with this file saved as test.xml:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE note [
<!ENTITY nbsp "&#xA0;">
<!ENTITY writer "Writer: Donald Duck.">
<!ENTITY copyright "Copyright: W3Schools.">
]>

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<footer>&writer;&nbsp;&copyright;</footer>
</note>

using EzXML.jl or in browser, the footer part is parsed as "Writer: Donald Duck. Copyright: W3Schools."

using EzXML
doc = readxml("test.xml")
doc.root |> eachelement |> collect |> last |> nodecontent |> println
doc.node.owner = TextNode("") # skip gc

but with XML.jl, the entity references are left as verbatim strings: &writer;&nbsp;&copyright;

using XML
doc2 = read("test.xml", Node)
doc2[end][end][1] |> x -> x.value |> println

In addition, glancing over doc2, it appears the DTD itself is not parsed correctly either, e.g. doc2[2] is

Node DTD <!DOCTYPE note [
<!ENTITY nbsp "&#xA0;">

i.e. it matches the next ">" instead of the closing ">" for "<!DOCTYPE"

XML.jl/src/raw.jl, line 262 in 53d7ed3:

j = findnext(==(UInt8('>')), data, i)
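To illustrate the difference between "the next >" and "the closing > of <!DOCTYPE", here is a minimal sketch of a scanner that skips over the bracketed internal subset and over quoted literals. The function name doctype_end is hypothetical and not part of XML.jl. It is an assumption about how such a search could be done, not the fix that was applied in the package, and it deliberately ignores comments and processing instructions inside the DTD.

```julia
# Hypothetical helper (not part of XML.jl): return the index of the '>' that
# closes a "<!DOCTYPE" declaration starting at byte index `i` in `data`,
# or `nothing` if no closing '>' is found.
function doctype_end(data::AbstractVector{UInt8}, i::Integer)
    depth = 0            # nesting level of '[' ... ']' (the internal subset)
    quote_char = 0x00    # non-zero while inside a quoted literal
    for j in i:lastindex(data)
        c = data[j]
        if quote_char != 0x00
            # Inside a literal: only the matching quote ends it.
            c == quote_char && (quote_char = 0x00)
        elseif c == UInt8('"') || c == UInt8('\'')
            quote_char = c
        elseif c == UInt8('[')
            depth += 1
        elseif c == UInt8(']')
            depth -= 1
        elseif c == UInt8('>') && depth == 0
            return j
        end
    end
    return nothing
end

data = collect(codeunits("""<!DOCTYPE note [
<!ENTITY nbsp "&#xA0;">
<!ENTITY writer "Writer: Donald Duck.">
]>
<note/>"""))

naive = findnext(==(UInt8('>')), data, 1)   # stops after the first ENTITY
j = doctype_end(data, 1)                    # stops at the final "]>"
println(String(data[1:j]))
```

With the sample above, the naive search stops at the end of the first <!ENTITY ...> line, while the bracket-aware scan runs through the closing "]>".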

Thanks!

joshday · 2023-04-28T12:06:53Z

Thanks for the report. Parsing DTD is within scope of this package. For now, I was trying to dump everything into the Node's value and figure out parsing later. As you pointed out, that doesn't quite work because it matches the wrong ending tag. I'll work on a fix.

joshday · 2023-04-28T15:52:43Z

Quick fix is done for reading the DTD:

julia> parse(s, Node)[2]
# Node DTD <!DOCTYPE note [
# <!ENTITY nbsp "&#xA0;">
# <!ENTITY writer "Writer: Donald Duck.">
# <!ENTITY copyright "Copyright: W3Schools.">
# ]>

using EzXML.jl or in browser, the footer part is parsed as "Writer: Donald Duck. Copyright: W3Schools."

I'd argue that the Text Node's value ought to be "&writer; &copyright;" to keep the separation of concerns (https://en.wikipedia.org/wiki/Separation_of_content_and_presentation).

That being said, I see a use for a fill_entities!(::Node) function.
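As a rough idea of what such an expansion step might involve, here is a minimal sketch that works on plain strings rather than on Node. The names parse_entities, expand_char_refs and expand_entities are hypothetical and do not exist in XML.jl. The sketch assumes only simple internal declarations of the form <!ENTITY name "value">, does a single non-recursive pass, and leaves unknown references (including predefined ones such as &amp;) untouched.

```julia
# Hypothetical helpers, not part of XML.jl.

# Collect simple internal entity declarations: <!ENTITY name "value">.
function parse_entities(dtd::AbstractString)
    entities = Dict{String,String}()
    pattern = r"""<!ENTITY\s+([^\s%"']+)\s+(?:"([^"]*)"|'([^']*)')\s*>"""
    for m in eachmatch(pattern, dtd)
        # Exactly one of the two quoted alternatives matched.
        entities[String(m[1])] = String(something(m[2], m[3]))
    end
    return entities
end

# Replace numeric character references such as &#xA0; or &#160;.
function expand_char_refs(text::AbstractString)
    hex(ref) = string(Char(parse(UInt32, chop(ref; head=3, tail=1); base=16)))
    dec(ref) = string(Char(parse(UInt32, chop(ref; head=2, tail=1))))
    text = replace(text, r"&#x[0-9A-Fa-f]+;" => hex)
    return replace(text, r"&#[0-9]+;" => dec)
end

# Replace &name; with its declared value; unknown names are left as-is.
function expand_entities(text::AbstractString, entities::AbstractDict)
    lookup(ref) = get(entities, chop(ref; head=1, tail=1), ref)
    return expand_char_refs(replace(text, r"&[^&;#\s]+;" => lookup))
end

dtd = """<!DOCTYPE note [
<!ENTITY nbsp "&#xA0;">
<!ENTITY writer "Writer: Donald Duck.">
<!ENTITY copyright "Copyright: W3Schools.">
]>"""

entities = parse_entities(dtd)
footer = expand_entities("&writer;&nbsp;&copyright;", entities)
println(footer)
```

Keeping the expansion in a separate function like this leaves the parsed text node as "&writer;&nbsp;&copyright;" by default, so callers opt in to the substituted form only when they want it.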
