html parsing #200

Open
KeenCN opened this issue Dec 6, 2018 · 3 comments

KeenCN commented Dec 6, 2018

Hi, when I parse an HTML string, the character references in it come back decoded. Tested in the Python command line:

from pyquery import PyQuery as pq
t = pq('<span class="test">&#xe034;.&#xe034;</span>')
o = t("span.test").html()
print(o)
[ . ]

How do I get the original string?

KeenCN commented Dec 6, 2018

This workaround gives the right output, but it's not what I want:

from pyquery import PyQuery as pq
s = '<span class="test">&#xe034;.&#xe034;</span>'
s = s.replace("&", "&amp;")
t = pq(s)
o = t("span.test").html()
print(o)
[ &#xe034;.&#xe034; ]
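
An alternative to escaping the ampersands before parsing is to let the parser decode the references and re-encode the result afterwards. The sketch below uses only the standard library; `html.unescape` stands in for the decoding an HTML parser performs, and `to_hex_char_refs` is a hypothetical helper name, not part of pyquery. It assumes every non-ASCII character in the output should become a hexadecimal character reference, which may not match the original source if it mixed literal characters and references.

```python
import html


def to_hex_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):x};" for ch in text)


# html.unescape mimics what a parser does to the text content of the span.
decoded = html.unescape("&#xe034;.&#xe034;")

# Re-encode the decoded code points back into reference form.
restored = to_hex_char_refs(decoded)
print(restored)  # &#xe034;.&#xe034;
```

With pyquery, the same helper could presumably be applied to the string returned by `.text()`.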

@CodingMoeButa

I have the same problem as you: #218
If the entity were &lt; (which decodes to <), the problem would be more serious.
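
One way to read this concern: once &lt; has been decoded to a literal <, the text can be mistaken for markup if it is parsed again. The sketch below illustrates that with the standard library's `html` module rather than pyquery itself; the variable names are illustrative and the sample string is an assumption, not taken from #218.

```python
import html

source_text = "1 &lt; 2"  # text content as written in the HTML source

# Decoding turns the entity into a literal "<".
decoded = html.unescape(source_text)
print(decoded)  # 1 < 2

# Re-escaping before the text goes back into markup keeps "<" from being read as a tag.
escaped = html.escape(decoded, quote=False)
print(escaped)  # 1 &lt; 2
```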

liquancss commented Sep 29, 2021

"&#xe034" looks like a kind of icon font which means it has nothing to do with this lib.
There must be a font file(like .woff file) to tell the browser how &#xe034 rendered. Without the corresponding font file or wrong font file, "&#xe034" will looks weird or wrong.
This is commonly used in website to protect secret data(like price) from crawlers which called font encryption.
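
A quick way to check whether a character depends on a custom font is to test whether its code point is in a Unicode private-use area, which has no standard glyph. This is a minimal sketch with the standard library; `is_private_use` is a hypothetical helper name, and a positive result only suggests an icon font, it does not prove that font encryption is in use.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """Return True if the character is in a Unicode private-use area (category 'Co')."""
    return unicodedata.category(ch) == "Co"


icon = chr(0xE034)
print(f"U+{ord(icon):04X}", is_private_use(icon))  # U+E034 True
print(is_private_use("A"))  # False
```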
