content type

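At first I was only doing simple page fetches and had no problems. Now that I need to fetch pages in production, I have run into quite a few. For example: 500 errors after the fetcher has been running for a long time, sites that require cookies, and sites that serve gzip-compressed responses. The code below handles all of these. Character sets are not handled, because the pages I need to fetch have no charset problems. I run the code on tornado: the analysis server sends a request, the page is fetched directly, and the returned content is sent back to the analysis server.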

import gzip
import StringIO
import urllib2
import cookielib

def get_url_context(_url):
    # Cookie support: some sites refuse clients that do not keep cookies
    cj = cookielib.CookieJar()
    _myopener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    _req = urllib2.Request("http://%s" % _url)
    # Request headers
    _req.add_header("Accept-Language", "zh-cn")
    _req.add_header("Content-Type", "text/html; charset=utf-8")
    _req.add_header("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)")
    # Fetch the page; close the response even if read() fails
    _get_page_data = _myopener.open(_req)
    try:
        _get_headers = _get_page_data.info()
        _get_rawdata = _get_page_data.read()
    finally:
        _get_page_data.close()
    # Decompress only when the server says the body is gzip.
    # Header lookup on info() is case-insensitive, so one check is enough.
    if 'gzip' in _get_headers.get('Content-Encoding', '').lower():
        data = StringIO.StringIO(_get_rawdata)
        gz = gzip.GzipFile(fileobj=data)
        _get_rawdata = gz.read()
        gz.close()
    return _get_rawdata
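
The code above leaves character sets alone. For pages that do need it, one possible approach is to read the charset from the response's Content-Type header and decode the body with it. The following is only a rough sketch under that assumption: `decode_body`, its `fallback` parameter, and the regular expression are hypothetical names that are not part of the original code, and pages that declare their charset only in a `<meta>` tag are not covered.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical helper, not part of the original post.
# Matches e.g. "text/html; charset=gbk" or 'charset="UTF-8"'.
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.I)

def decode_body(raw, content_type, fallback="utf-8"):
    """Decode raw response bytes using the charset named in Content-Type."""
    match = _CHARSET_RE.search(content_type or "")
    charset = match.group(1) if match else fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or mislabeled page:
        # decode leniently with the fallback instead of raising.
        return raw.decode(fallback, "replace")
```

To use it with `get_url_context`, that function would also have to return the Content-Type value (for instance `_get_headers.get('Content-Type')`), since it currently returns only the body.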

get_page test code
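
The test code itself is not reproduced on this page. As a stand-in, here is a minimal offline check of the gzip branch that needs no network access. It is an illustration only, not the author's test: `maybe_gunzip` and `gzip_bytes` are hypothetical helpers written for this example.

```python
import gzip
import io

def maybe_gunzip(raw, content_encoding):
    """Return raw unchanged unless Content-Encoding names gzip."""
    if "gzip" in (content_encoding or "").lower():
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw

def gzip_bytes(payload):
    """Compress payload in memory, standing in for a gzip-enabled server."""
    buf = io.BytesIO()
    gz = gzip.GzipFile(fileobj=buf, mode="wb")
    gz.write(payload)
    gz.close()
    return buf.getvalue()

page = b"<html>hello</html>"
# A gzip-encoded body is decompressed back to the original page.
assert maybe_gunzip(gzip_bytes(page), "gzip") == page
# A body with no Content-Encoding, or a non-gzip one, passes through untouched.
assert maybe_gunzip(page, None) == page
assert maybe_gunzip(page, "identity") == page
```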


simonzhang's home
