Handling UTF-8 multibyte characters with lxml on IPython on Windows

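This is with lxml 2.1.1.
BeautifulSoup is plenty for casual parsing, but for ridiculous web pages that look like they were written in the 19th century, I figured I'd die without XPath, so I gave lxml a try.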

Let's XPath the Hatena top page, which declares UTF-8, from IPython on Windows.

In [1]: from lxml import etree
In [2]: import urllib

In [3]: url = 'http://www.hatena.ne.jp/'
In [4]: parser = etree.HTMLParser()
In [5]: src = urllib.urlopen(url).read()

In [40]: et=etree.parse(src,parser)
---------------------------------------------------------------------------
<type 'exceptions.UnicodeDecodeError'>    Traceback (most recent call last)

C:\hoge\<ipython console> in <module>()

C:\hoge\lxml.etree.pyx in lxml.etree.parse (src/lxml/lxml.etree.c:22796)()

C:\hoge\parser.pxi in lxml.etree._parseDocument (src/lxml/lxml.etree.c:60205)()

C:\hoge\parser.pxi in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:60449)()

C:\hoge\parser.pxi in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:59596)()

C:\hoge\parser.pxi in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:57106)()

C:\hoge\parser.pxi in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:53512)()

C:\hoge\parser.pxi in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:54372)()

C:\hoge\parser.pxi in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53713)()

<type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xe3 in position 337: ordinal not in range(128)

In [41]: from StringIO import StringIO

In [42]: et=etree.parse(StringIO(src),parser)

In [43]: print etree.tostring(et.getroot().xpath('//title')[0], encoding='sjis', pretty_print=True)
<?xml version='1.0' encoding='sjis'?>
<title>はてな</title>

StringIO is great, isn't it.
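
Why the wrapper helps: etree.parse() expects a filename, URL, or file-like object, so a plain string holding the HTML itself gets treated as a path. The _parseDocumentFromURL and _parseDocFromFile frames in the traceback point the same way. Below is a minimal sketch of the two ways around that. It is written in Python 3 syntax and assumes a recent lxml rather than 2.1.1; the inline sample page, the variable names, and the encoding='utf-8' argument to HTMLParser are my own additions for illustration, not part of the session above.

```python
from io import BytesIO

from lxml import etree

# Hypothetical stand-in for the downloaded page: UTF-8 bytes that declare
# their charset. This is NOT the real Hatena markup.
src = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    '<title>はてな</title>'
    '</head><body></body></html>'
).encode('utf-8')

# Assumption: state the encoding explicitly instead of relying on detection.
parser = etree.HTMLParser(encoding='utf-8')

# Option 1: etree.parse() wants a file-like object, so wrap the bytes.
tree = etree.parse(BytesIO(src), parser)
title_from_parse = tree.getroot().xpath('//title')[0].text

# Option 2: etree.fromstring() takes the document content directly.
root = etree.fromstring(src, parser)
title_from_string = root.xpath('//title')[0].text

# Serialize to a Python str instead of picking a console encoding up front.
serialized = etree.tostring(root.xpath('//title')[0], encoding='unicode')

print(title_from_parse == title_from_string)
print(serialized)
```

Either option should avoid the filename interpretation. fromstring() is the shorter of the two when the whole page is already in memory.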