Bouldering & Com.

Scraping practice with BeautifulSoup

Python

Practice 1

List the href attribute of every a tag.

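# Python 2 with BeautifulSoup 3
import urllib
import BeautifulSoup

# Download the page ('url' is a placeholder) and parse it.
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen('url'))

# Print the href of every a tag (None when the attribute is missing).
for _a in soup.findAll('a'):
    print(_a.get('href'))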
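
As a cross-check that needs no third-party package, the same enumeration can be sketched on Python 3 with the standard library's html.parser. Note that this swaps BeautifulSoup for HTMLParser; the class name HrefCollector and the sample markup are made up for illustration, and the inline string stands in for the downloaded page.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every a tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            # dict.get returns None when the tag has no href,
            # which mirrors _a.get('href') in the BeautifulSoup loop.
            self.hrefs.append(dict(attrs).get('href'))


# Made-up sample markup standing in for the downloaded page.
SAMPLE_HTML = '''
<a href="index.html">Home</a>
<a href="photo.html"><img src="photo.jpg"></a>
<a name="top">anchor without href</a>
'''

collector = HrefCollector()
collector.feed(SAMPLE_HTML)
collector.close()
for href in collector.hrefs:
    print(href)
```

The third a tag has no href, so it shows up as None, just as it would in the BeautifulSoup loop.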

Practice 2

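Extract the tags that meet all of these conditions:
the tag is an a tag
its href attribute contains the string html
it has an img tag as a child element
the src attribute of that child img tag contains the string jpg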

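import urllib
import BeautifulSoup

# Download the page ('url' is a placeholder) and parse it.
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen('url'))

# The '' defaults keep tags without href or src from raising TypeError.
matches = soup.findAll(lambda tag: tag.name == 'a' and
                                   'html' in tag.get('href', '') and
                                   tag.find('img') is not None and
                                   'jpg' in tag.find('img').get('src', ''))
for _a in matches:
    print(_a.get('href'))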
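
For readers on Python 3, the four conditions could be sketched with the bs4 package (Beautiful Soup 4) as below. This assumes beautifulsoup4 is installed. The function name links_html_page_with_jpg and the sample markup are hypothetical, and an inline string replaces the download so the example runs offline. The predicate treats a missing href or src as an empty string, so tags without those attributes are skipped instead of raising TypeError.

```python
from bs4 import BeautifulSoup  # assumption: the beautifulsoup4 package is installed


def links_html_page_with_jpg(tag):
    """True for an a tag whose href contains 'html' and that wraps a jpg image."""
    if tag.name != 'a':
        return False
    if 'html' not in tag.get('href', ''):
        return False
    # find() searches all descendants, not only direct children.
    img = tag.find('img')
    return img is not None and 'jpg' in img.get('src', '')


# Made-up sample markup: only the first link satisfies every condition.
SAMPLE_HTML = '''
<a href="page1.html"><img src="thumb1.jpg"></a>
<a href="page2.html"><img src="thumb2.png"></a>
<a href="page3.html">text only</a>
<a name="top"><img src="thumb4.jpg"></a>
<a href="photo.php"><img src="thumb5.jpg"></a>
'''

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
matches = soup.find_all(links_html_page_with_jpg)
hrefs = [a['href'] for a in matches]
print(hrefs)
```

Naming the predicate instead of using a lambda makes each condition easy to test on its own, and the img lookup runs only once per tag.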