看到同事看某游戏网站的游戏MM宣传,尺度很大,不过现在的这些图集的帖子,一般都是多多分页来拉流量, 一页就只能看一两张图,很是麻烦,于是打算使用 python 抓取 http://bg.265g.com/yxmn/ 中的美图。
对于图片抓取,其实过程很简单:
- 抓取第一页,获取标题及当页图片
- 获取下一页地址,下载该页源码并获取图片
- 重复步骤2直至最后一页
因为是单文件脚本,所以不打算非 python 标准库,所以只能使用文件分析利器 – 正则表达式
代码:
#-*- coding: utf-8 -*-
import urllib, re, os, argparse
RE_NEXT_URL = ur'<a href=\"(?!javascript\:)([^\"]*?)">下一页<\/a>'
RE_PICTURE = ur'<p\s+(?:align=\"center\")>\s*<img.*?src=\"(.*?)\".*?\/?>'
RE_ARCTITLE = ur'var arctitle=\'(.*?)\';'
def get_html(url):
try:
return urllib.urlopen(url).read().decode('gbk')
except:
return None
def get_next_url(url, html):
m = re.search(RE_NEXT_URL, html, flags=re.I|re.M|re.S)
href = m and m.group(1) or None
if href:
if href.startswith('/'):
return re.sub(r'(http\:\/\/[^\/]+)(.*)', '\\1%s' % href, url)
else:
return re.sub(r'([^\/]*)$', href, url)
else:
return None
def get_pictures(html):
return re.findall(RE_PICTURE, html, flags=re.I|re.M|re.S)
def get_title(html):
m = re.search(RE_ARCTITLE, html, flags=re.I|re.M|re.S)
return m and m.group(1) or None
def download_pictures(dirname, url, html):
for picture in get_pictures(html):
print 'Downloading', picture,
filename = os.path.join(dirname, picture.split('/')[-1])
try:
resp = urllib.urlopen(picture)
open(filename, 'wb+').write(resp.read())
print ' [DONE]'
except:
print ' [FAIL]'
next_url = get_next_url(url, html)
print 'Fetching:', next_url
html = next_url and get_html(next_url) or None
html and download_pictures(dirname, next_url, html)
def execute(url):
print 'Fetching %s' % url
html = get_html(url)
title = html and get_title(html) or None
print 'Title:', title
dirname = re.sub(r'\/', '', title)
os.path.exists(dirname) or os.mkdir(dirname)
title and download_pictures(dirname, url, html)
os.system('nautilus "%s"' % dirname.encode('utf-8'))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Download 265g.com pictures.')
parser.add_argument('url', help='Start url')
args = parser.parse_args()
execute(args.url)
用法:
$ python 265g_pics.py http://bg.265g.com/1209/114289.html
注意:
此段仅在 ubuntu 下测试过,如果没有安装 nautilus,将如下代码注释即可。
os.system('nautilus "%s"' % dirname.encode('utf-8'))