Ask and Learn

python批量下载265g美图

看到同事看某游戏网站的游戏MM宣传,尺度很大,不过现在的这些图集的帖子,一般都是多多分页来拉流量, 一页就只能看一两张图,很是麻烦,于是打算使用 python 抓取 http://bg.265g.com/yxmn/ 中的美图。

对于图片抓取,其实过程很简单:

  1. 抓取第一页,获取标题及当页图片
  2. 获取下一页地址,下载该页源码并获取图片
  3. 重复步骤2直至最后一页

因为是单文件脚本,所以不打算非 python 标准库,所以只能使用文件分析利器 – 正则表达式

代码:

#-*- coding: utf-8 -*-
import urllib, re, os, argparse

RE_NEXT_URL = ur'<a href=\"(?!javascript\:)([^\"]*?)">下一页<\/a>'
RE_PICTURE  = ur'<p\s+(?:align=\"center\")>\s*<img.*?src=\"(.*?)\".*?\/?>'
RE_ARCTITLE = ur'var arctitle=\'(.*?)\';'

def get_html(url):
    try:
        return urllib.urlopen(url).read().decode('gbk')
    except:
        return None

def get_next_url(url, html):
    m = re.search(RE_NEXT_URL, html, flags=re.I|re.M|re.S)
    href = m and m.group(1) or None
   
    if href:
        if href.startswith('/'):
            return re.sub(r'(http\:\/\/[^\/]+)(.*)', '\\1%s' % href, url)
        else:
            return re.sub(r'([^\/]*)$', href, url)
    else:
        return None

def get_pictures(html):
    return re.findall(RE_PICTURE, html, flags=re.I|re.M|re.S)

def get_title(html):
    m = re.search(RE_ARCTITLE, html, flags=re.I|re.M|re.S)
    return m and m.group(1) or None

def download_pictures(dirname, url, html):
    for picture in get_pictures(html):
        print 'Downloading', picture,
        filename = os.path.join(dirname, picture.split('/')[-1])
        try:
            resp = urllib.urlopen(picture)
            open(filename, 'wb+').write(resp.read())
            print ' [DONE]'
        except:
            print ' [FAIL]'
    
    next_url = get_next_url(url, html)
    print 'Fetching:', next_url
    html = next_url and get_html(next_url) or None
    html and download_pictures(dirname, next_url, html)
    
def execute(url):
    print 'Fetching %s' % url
    html = get_html(url)
    title = html and get_title(html) or None
    print 'Title:', title
    dirname = re.sub(r'\/', '', title)
    os.path.exists(dirname) or os.mkdir(dirname)
    title and download_pictures(dirname, url, html)
    os.system('nautilus "%s"' % dirname.encode('utf-8'))

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Download 265g.com pictures.')
    parser.add_argument('url', help='Start url')
    args = parser.parse_args()
    execute(args.url)

用法:

$ python 265g_pics.py http://bg.265g.com/1209/114289.html

注意:

此段仅在 ubuntu 下测试过,如果没有安装 nautilus,将如下代码注释即可。

os.system('nautilus "%s"' % dirname.encode('utf-8'))