This is with lxml 2.1.1.
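BeautifulSoup is plenty for casual parsing, but for ridiculous web pages that look like they were written in the 19th century, I figured I would die without XPath, so I gave lxml a try.
Here I run XPath against the Hatena top page, which declares UTF-8, from IPython on Windows.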
In [1]: from lxml import etree
In [2]: import urllib
In [3]: url = 'http://www.hatena.ne.jp/'
In [4]: parser = etree.HTMLParser()
In [5]: html = urllib.urlopen(url).read()
In [40]: et=etree.parse(src,parser)
---------------------------------------------------------------------------
<type 'exceptions.UnicodeDecodeError'>    Traceback (most recent call last)
C:\hoge\<ipython console> in <module>()
C:\hoge\lxml.etree.pyx in lxml.etree.parse (src/lxml/lxml.etree.c:22796)()
C:\hoge\parser.pxi in lxml.etree._parseDocument (src/lxml/lxml.etree.c:60205)()
C:\hoge\parser.pxi in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:60449)()
C:\hoge\parser.pxi in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:59596)()
C:\hoge\parser.pxi in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:57106)()
C:\hoge\parser.pxi in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:53512)()
C:\hoge\parser.pxi in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:54372)()
C:\hoge\parser.pxi in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53713)()
<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xe3 in position 337: ordinal not in range(128)
In [41]: from StringIO import StringIO
In [42]: et=etree.parse(StringIO(src),parser)
In [43]: print etree.tostring(et.getroot().xpath('//title')[0], encoding='sjis', pretty_print=True)
<?xml version='1.0' encoding='sjis'?>
<title>はてな</title>
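StringIO is great, isn't it. Judging from the traceback, which goes through _parseDocumentFromURL and _parseDocFromFile, etree.parse() took the plain string as a file name or URL rather than as the document itself. Wrapping it in StringIO hands it a file-like object instead.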
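For reference, here is a minimal sketch of the same idea on Python 3 with a current lxml, which is an assumption on my part since the session above is Python 2 with lxml 2.1.1. The `src` bytes are a made-up stand-in for the downloaded page, so nothing is fetched over the network. It shows two ways to avoid having the document mistaken for a file name: wrapping it in a file-like object, or using `etree.fromstring()`, which takes the document content directly.

```python
from io import BytesIO

from lxml import etree

# Hypothetical stand-in for the downloaded page: UTF-8 bytes with a
# non-ASCII title, similar in spirit to the Hatena top page.
src = (
    '<html><head><meta charset="utf-8">'
    "<title>はてな</title></head><body></body></html>"
).encode("utf-8")

# Tell the parser the encoding explicitly instead of relying on detection.
parser = etree.HTMLParser(encoding="utf-8")

# Option 1: etree.parse() wants a file name, URL, or file-like object,
# so wrap the bytes in BytesIO (the Python 3 counterpart of StringIO here).
tree = etree.parse(BytesIO(src), parser)
title_from_parse = tree.xpath("//title")[0].text

# Option 2: etree.fromstring() takes the document itself and returns
# the root element, so no wrapper is needed.
root = etree.fromstring(src, parser)
title_from_string = root.xpath("//title")[0].text

# Both routes should find the same <title> text.
print(title_from_parse == title_from_string)
```

If the page is only ever held in memory, `fromstring()` is probably the less surprising of the two, since there is no way for it to be confused with a path.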