parse dtd/entity #12

Open
daviehh opened this issue Apr 27, 2023 · 2 comments

Comments

daviehh commented Apr 27, 2023

I'm not sure whether this is within the scope of this package, but the DTD does not seem to be parsed correctly, in particular entity declarations. For example, take this file, saved as test.xml:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE note [
<!ENTITY nbsp "&#xA0;">
<!ENTITY writer "Writer: Donald Duck.">
<!ENTITY copyright "Copyright: W3Schools.">
]>

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
<footer>&writer;&nbsp;&copyright;</footer>
</note>

Using EzXML.jl or a browser, the footer is parsed as "Writer: Donald Duck. Copyright: W3Schools."

using EzXML
doc = readxml("test.xml")
doc.root |> eachelement |> collect |> last |> nodecontent |> println
doc.node.owner = TextNode("") # skip gc

But with XML.jl, the entity references are left as the verbatim string &writer;&nbsp;&copyright;

using XML
doc2 = read("test.xml", Node)
doc2[end][end][1] |> x -> x.value |> println

In addition, glancing over doc2, the DTD itself does not appear to be parsed correctly. For example, doc2[2] is

Node DTD <!DOCTYPE note [
<!ENTITY nbsp "&#xA0;">

i.e., the parser stops at the first ">" after "<!DOCTYPE" instead of the ">" that closes the whole declaration. The search in question:

j = findnext(==(UInt8('>')), data, i)

Thanks!

Member

joshday commented Apr 28, 2023

Thanks for the report. Parsing DTD is within scope of this package. For now, I was trying to dump everything into the Node's value and figure out parsing later. As you pointed out, that doesn't quite work because it matches the wrong ending tag. I'll work on a fix.
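The mismatch described above comes from stopping at the first ">" byte. One way to find the real end of the declaration is to track "[ ... ]" nesting and quoted literals while scanning. The following is only a minimal sketch of that idea, not the fix that went into the package: `find_doctype_end` is a hypothetical name, and comments or processing instructions inside the DTD are not handled.

```julia
# Hypothetical helper (not part of XML.jl): return the index of the ">" that
# closes the declaration starting at byte index `i`, or `nothing` if the
# declaration is unterminated. Brackets and quoted literals are skipped.
function find_doctype_end(data::AbstractVector{UInt8}, i::Integer)
    depth = 0             # nesting depth of '[' ... ']'
    quotechar = 0x00      # currently open quote character, 0x00 if none
    for j in i:lastindex(data)
        c = data[j]
        if quotechar != 0x00
            c == quotechar && (quotechar = 0x00)   # closing quote
        elseif c == UInt8('"') || c == UInt8('\'')
            quotechar = c                          # opening quote
        elseif c == UInt8('[')
            depth += 1
        elseif c == UInt8(']')
            depth -= 1
        elseif c == UInt8('>') && depth == 0
            return j
        end
    end
    return nothing
end

xml = """
<!DOCTYPE note [
<!ENTITY nbsp "&#xA0;">
<!ENTITY writer "Writer: Donald Duck.">
]>
<note/>
"""
data = Vector{UInt8}(xml)
i = first(findfirst("<!DOCTYPE", xml))
j = find_doctype_end(data, i)
println(String(data[i:j]))   # prints the whole declaration, through "]>"
```

Compared with `findnext(==(UInt8('>')), data, i)`, the scan only accepts a ">" when no internal subset or quoted literal is open.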

Member

joshday commented Apr 28, 2023

Quick fix is done for reading the DTD:

julia> parse(s, Node)[2]
# Node DTD <!DOCTYPE note [
# <!ENTITY nbsp "&#xA0;">
# <!ENTITY writer "Writer: Donald Duck.">
# <!ENTITY copyright "Copyright: W3Schools.">
# ]>

> using EzXML.jl or in browser, the footer part is parsed as "Writer: Donald Duck. Copyright: W3Schools."

I'd argue that the Text Node's value ought to be "&writer;&nbsp;&copyright;" to keep the separation of concerns (https://en.wikipedia.org/wiki/Separation_of_content_and_presentation).

That being said, I do see a use for a fill_entities!(::Node) function.
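To make the idea concrete, here is a rough sketch of the string-level work such a function might do. It is written against plain strings rather than XML.jl's Node type, and `entity_table` and `expand_entities` are hypothetical names, not an existing API. It assumes only internal general entities with quoted literal values, and it does not guard against self-referencing entities.

```julia
# Hypothetical helper: collect <!ENTITY name "value"> declarations from raw
# DTD text. The five predefined XML entities are seeded first.
function entity_table(dtd::AbstractString)
    table = Dict{String,String}(
        "amp" => "&", "lt" => "<", "gt" => ">", "quot" => "\"", "apos" => "'",
    )
    for m in eachmatch(r"""<!ENTITY\s+([\w.:-]+)\s+(["'])(.*?)\2\s*>""", dtd)
        # get! keeps an existing entry, so the first declaration wins
        get!(table, String(m[1]), String(m[3]))
    end
    return table
end

# Hypothetical helper: replace &name;, &#NNN; and &#xHHH; references in `text`.
# Unknown names are left untouched.
function expand_entities(text::AbstractString, table::AbstractDict)
    replace(text, r"&(#x[0-9A-Fa-f]+|#[0-9]+|[\w.:-]+);" => function (ref)
        name = String(chop(ref; head=1, tail=1))   # strip '&' and ';'
        if startswith(name, "#x")
            return string(Char(parse(UInt32, name[3:end]; base=16)))
        elseif startswith(name, "#")
            return string(Char(parse(UInt32, name[2:end]; base=10)))
        elseif haskey(table, name)
            # entity values may themselves contain references
            return expand_entities(table[name], table)
        else
            return ref
        end
    end)
end

dtd = """
<!DOCTYPE note [
<!ENTITY nbsp "&#xA0;">
<!ENTITY writer "Writer: Donald Duck.">
<!ENTITY copyright "Copyright: W3Schools.">
]>
"""
table = entity_table(dtd)
println(expand_entities("&writer;&nbsp;&copyright;", table))
```

Keeping the expansion as a separate, opt-in step leaves the Text Node's value as the verbatim "&writer;&nbsp;&copyright;" unless the caller asks for substitution.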
