<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="assets/xml/rss.xsl" media="all"?><rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Etienne’s blog</title><link>http://etienned.github.io/</link><description>Python, technology and maybe more</description><atom:link href="http://etienned.github.io/rss.xml" type="application/rss+xml" rel="self"></atom:link><language>en</language><lastBuildDate>Sun, 29 Mar 2015 02:26:05 GMT</lastBuildDate><generator>http://getnikola.com/</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Extract text from Word files (docx) simply</title><link>http://etienned.github.io/posts/extract-text-from-word-docx-simply/</link><dc:creator>Etienne Desautels</dc:creator><description><div><p>If you want to extract the text content of a Word file, there are a few <a class="reference external" href="http://stackoverflow.com/questions/42482/best-way-to-extract-text-from-a-word-doc-without-using-com-automation">solutions</a> to do this in Python. Unfortunately, most of these solutions have dependencies, need to run an external command in a subprocess, or are heavy and complex (relying on an office suite, for example). I find that the best solution among those on the Stack Overflow page is <a class="reference external" href="https://github.com/mikemaccana/python-docx">python-docx</a>. But using it brings two dependencies: <em>python-docx</em> itself and <a class="reference external" href="http://lxml.de/">lxml</a>. Installing <em>python-docx</em> is not a big problem. Unfortunately, <em>lxml</em> is sometimes hard to install or, at a minimum, requires compilation.</p>
<p>To avoid that, inspired by <em>python-docx</em>, I created a simple function to extract text from <em>.docx</em> files that does not require any dependencies, using only the standard library. So it’s easy to incorporate into any Python project.</p>
<p>Is there any way to improve it?</p>
<script src="https://gist.github.com/7539105.js"></script><noscript><pre class="literal-block">
"""
Module that extracts text from MS XML Word documents (.docx).
(Inspired by python-docx &lt;https://github.com/mikemaccana/python-docx&gt;)
"""
try:
    from xml.etree.cElementTree import XML
except ImportError:
    # cElementTree was removed in Python 3.9; fall back to ElementTree.
    from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    # The main body of a .docx archive lives in word/document.xml.
    with zipfile.ZipFile(path) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)

    paragraphs = []
    # iter() replaces getiterator(), which was removed in Python 3.9.
    for paragraph in tree.iter(PARA):
        # A paragraph's text may be split across several text nodes.
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)
</pre>
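<p>To try the function without a Word file at hand, here is a minimal sketch that builds a tiny archive in memory and extracts its text. It is only an illustration under a few assumptions: <em>make_minimal_docx</em> is a hypothetical helper that is not part of the gist, the archive it produces contains only <em>word/document.xml</em> (so Word itself would probably refuse to open it), and the copy of <em>get_docx_text</em> uses <em>iter()</em> and a <em>with</em> block so that it runs on current Python versions.</p>

```python
import io
import zipfile
from xml.etree.ElementTree import Element, SubElement, XML, tostring

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
RUN = WORD_NAMESPACE + 'r'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """Same logic as the function above; path may also be a file-like object."""
    with zipfile.ZipFile(path) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)
    paragraphs = []
    for paragraph in tree.iter(PARA):
        texts = [node.text for node in paragraph.iter(TEXT) if node.text]
        if texts:
            paragraphs.append(''.join(texts))
    return '\n\n'.join(paragraphs)


def make_minimal_docx(paragraphs):
    """Hypothetical helper: build a bare-bones .docx archive in memory.

    Each paragraph is a list of strings, one string per run.
    """
    root = Element(WORD_NAMESPACE + 'document')
    body = SubElement(root, WORD_NAMESPACE + 'body')
    for runs in paragraphs:
        para = SubElement(body, PARA)
        for run_text in runs:
            run = SubElement(para, RUN)
            SubElement(run, TEXT).text = run_text
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        # Only the part that get_docx_text reads is written.
        archive.writestr('word/document.xml', tostring(root))
    buffer.seek(0)
    return buffer


# Two runs in the first paragraph, an empty paragraph, then a third one.
sample = make_minimal_docx([['Hello, ', 'world'], [], ['Second paragraph']])
print(get_docx_text(sample))
```

<p>The runs of a paragraph are joined without a separator, paragraphs with no text are skipped, and the remaining paragraphs are separated by a blank line.</p>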
</noscript></div></description><category>Python</category><guid>http://etienned.github.io/posts/extract-text-from-word-docx-simply/</guid><pubDate>Tue, 26 Nov 2013 10:01:59 GMT</pubDate></item><item><title>Using a Raspberry Pi as an air exchanger controller</title><link>http://etienned.github.io/posts/raspberry-pi-as-air-exchanger-controller/</link><dc:creator>Etienne Desautels</dc:creator><description><div class="section" id="the-margarita-project">
<h2>The Margarita project</h2>
<p>This project’s goal is to build a controller for my house’s air exchanger that will optimize its use. It will take into account exterior and interior temperature and humidity to decide what the exchanger should do. In a second phase, I will add a web/phone interface to access and configure the controller. In fact, I have a lot of other ideas and goals, but it is better not to make too many promises!</p>
<p><a href="http://etienned.github.io/posts/raspberry-pi-as-air-exchanger-controller/">Read more…</a> (6 min remaining to read)</p></div></description><category>Margarita</category><category>Python</category><category>Raspberry Pi</category><guid>http://etienned.github.io/posts/raspberry-pi-as-air-exchanger-controller/</guid><pubDate>Sat, 23 Nov 2013 10:30:03 GMT</pubDate></item></channel></rss>