Error when parsing XHTML #199
Comments
It's probably caused by some underlying lxml behaviour, which is always hard to figure out. I think lxml does not print XML headers (doctype) and removes one of the xmlns attributes because they use the same attribute name. As I remember using |
@gawel Thanks for your help. To find out why
    def __str__(self):
        encoding = str if PY3k else None
        return ''.join([etree.tostring(e, encoding=encoding) for e in self])
    ...
    @with_camel_case_alias
    def outer_html(self, method="html"):
        """Get the html representation of the first selected element::
            >>> d = PyQuery('<div><span class="red">toto</span> rocks</div>')
            >>> print(d('span'))
            <span class="red">toto</span> rocks
            >>> print(d('span').outer_html())
            <span class="red">toto</span>
            >>> print(d('span').outerHtml())
            <span class="red">toto</span>
            >>> S = PyQuery('<p>Only <b>me</b> & myself</p>')
            >>> print(S('b').outer_html())
            <b>me</b>
        ..
        """
        if not self:
            return None
        e0 = self[0]
        if e0.tail:
            e0 = deepcopy(e0)
            e0.tail = ''
        return etree.tostring(e0, encoding=text_type, method=method)
It seems
In [6]: help(etree.tostring)
Help on cython_function_or_method in module lxml.etree:
tostring(element_or_tree, *, encoding=None, method='xml', xml_declaration=None, pretty_print=False, with_tail=True, standalone=None, doctype=None, exclusive=False, with_comments=True, inclusive_ns_prefixes=None)
...
As you can see, |
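The difference between the two code paths quoted above (`__str__` calls `etree.tostring` without `method`, so the default `method='xml'` applies, while `outer_html` passes `method="html"`) can be sketched as follows. This is only an illustration: it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, because its `tostring` accepts the same `method` values. The `app.js` attribute value is made up, and lxml's exact output differs in small details (it writes `<script/>` without the space, as reported in this issue).

```python
import xml.etree.ElementTree as ET

# An empty <script> element, similar to those in an XHTML <head>.
# The src value is a made-up placeholder.
script = ET.fromstring('<script type="text/javascript" src="app.js"></script>')

# method="xml" (the default) collapses an element without content
# into the self-closing form.
as_xml = ET.tostring(script, encoding="unicode", method="xml")

# method="html" keeps the explicit closing tag.
as_html = ET.tostring(script, encoding="unicode", method="html")

print(as_xml)
print(as_html)
```

If the same holds for lxml, this would explain why writing `str(doc)` back to disk turns `<script></script>` into the self-closing form, whereas `outer_html()` does not.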
Yeah the method should be preserved. As I remember it's hard because we use |
@gawel I see. Regarding the DOCTYPE issue, I found something that may be helpful in the lxml documentation, under "The ElementTree class":
In [1]: from lxml import etree
In [2]: root = etree.XML('''\
...: ... <?xml version="1.0"?>
...: ... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
...: ... <root>
...: ... <a>&tasty;</a>
...: ... </root>
...: ... ''')
In [3]: tree = etree.ElementTree(root)
In [4]: tree.docinfo.public_id = '-//W3C//DTD XHTML 1.0 Transitional//EN'
In [5]: >>> tree.docinfo.system_url = 'file://local.dtd'
In [6]: tree.docinfo.system_url = 'file://local.dtd'
In [8]: print(etree.tounicode(tree))
<!DOCTYPE root PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "file://local.dtd" [
<!ENTITY tasty "parsnips">
]>
<root>
<a>parsnips</a>
</root>
In [10]: result=tree.getroot()
In [11]: type(result)
Out[11]: lxml.etree._Element
In [12]: print(etree.tounicode(result))
<root>
<a>parsnips</a>
</root>
In [6]: from io import StringIO
In [7]: root = etree.parse(StringIO('''\
...: <?xml version="1.0"?>
...: <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
...: <root>
...: <a>&tasty;</a>
...: </root>
...: '''))
In [8]: type(root)
Out[8]: lxml.etree._ElementTree
In [10]: print(etree.tounicode(root))
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "parsnips">
]>
<root>
<a>parsnips</a>
</root> |
pyquery already uses etree: https://github.com/gawel/pyquery/blob/master/pyquery/pyquery.py#L96 The problem is that HTML is not XML, and most web pages fail |
I know PyQuery is using
In the source of |
I see. I don't know why it uses getroot. This is some pretty old code, probably here since the first commit :D |
A problem occurred when I was parsing an XHTML file from the index page of the pipenv docs: https://pipenv.readthedocs.io/en/latest/
Here's part of the code I'm using:
Basically, I'm trying to remove the badges from the XHTML file with the help of the html parser.
The original header and <!DOCTYPE> are displayed below:
After running the script once, the badges were removed successfully. I noticed that the <!DOCTYPE> was omitted by f.write(str(doc)):
More confusingly, after running the script a second time (when, of course, there were no badges left to remove), the XHTML file changed once more. The xmlns attribute of <html> was omitted, and the <script></script> tag was changed to <script/>:
Obviously, the second run of the script made the XHTML invalid. I couldn't figure out what's wrong. Is this caused by the parser setting parser='html', or by my incorrect use of pyquery to modify a local XHTML file via f.write(str(doc))?
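One possible workaround for the missing `<!DOCTYPE>`, sketched here under stated assumptions, is to copy the declaration from the original file text and prepend it to the serialized output before writing. The helper name `restore_doctype`, the regular expression, and the sample markup (including the DTD URL) are all hypothetical illustrations, not part of pyquery or lxml. It addresses only the DOCTYPE, not the dropped `xmlns` or the self-closed `<script/>`.

```python
import re

# Matches a DOCTYPE declaration, with an optional internal subset in [...].
# A simplification: it does not handle '>' or ']' inside quoted literals.
_DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*(?:\[[^\]]*\])?\s*>", re.IGNORECASE)


def restore_doctype(original: str, serialized: str) -> str:
    """Prepend the DOCTYPE found in `original` to `serialized` if it is missing."""
    if _DOCTYPE_RE.search(serialized):
        return serialized  # already has a DOCTYPE; leave it untouched
    match = _DOCTYPE_RE.search(original)
    if match is None:
        return serialized  # the original had no DOCTYPE to restore
    return match.group(0) + "\n" + serialized


# Illustrative input only; not the actual pipenv page.
original = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
    '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>'
)
# Stand-in for a serialization that lost the DOCTYPE.
serialized = '<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>'

fixed = restore_doctype(original, serialized)
print(fixed)
```

In the script from the report, this would mean reading the file text once, then writing `restore_doctype(text, str(doc))` instead of `str(doc)`. Calling the helper again on its own output leaves it unchanged, so repeated runs do not stack declarations.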