June 17

Tornado study notes (1)

Study references: http://www.tornadoweb.cn/documentation and http://sebug.net/paper/books/tornado/#tornado-walkthrough.
This post is just my study notes.

1. Deployment

  Let's get started. The server is an internal PC server at 192.168.1.41, running CentOS 5.6 with Python 2.6. Installing the Tornado environment is very simple, just two commands:
# easy_install tornado
# yum install pycurl

  With the environment installed, on to the study. Following the documentation, copy the sample code into a file named main.py:

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, world")

application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()

Run the service:
python main.py

Open http://192.168.1.41:8888 in a browser and the classic "Hello, world" page appears. Tornado is single-process, so on a multi-core server I start several processes and put nginx in front as a load balancer. I wrote my own script to start and stop the service.

The start/stop script, named "main.sh", is shown below. After it runs, a "main.port" file is created in the same directory, recording the list of ports that were started.

#!/bin/sh
#
# Filename:    main.sh
# Revision:    1.0
# Date:        2012-06-14
# Author:      simonzhang
# web:         www.simonzhang.net
# Email:       simon-zzm@163.com
#
### END INIT INFO

# Source function library.
. /etc/profile

# Set the base values: number of worker processes and the first listen port
listen_line=4
listen_start=8880

# 
CWD=`pwd`
cd $CWD

# See how we were called.
case "$1" in
  start)
        /bin/rm -rf main.port
        for (( i=0 ; i<${listen_line} ; i++ )); do
            listen_port=$[${listen_start}+${i}]
            echo ${listen_port} >> main.port
            python main.py ${listen_port} &
        done
        echo "start ok !"
        ;;
  stop)
        get_port_line=`/bin/cat main.port`
        for i in ${get_port_line};do
             # match only the main.py process started on this port
             now_pid=`/bin/ps -ef|grep "main.py ${i}"|grep -v grep|awk '{print $2}'`
             /bin/kill -9 $now_pid
        done
        echo "stop"
        ;;
  status)
        get_port_line=`/bin/cat main.port`
        for i in ${get_port_line};do
             now_pid=`/bin/ps -ef|grep "main.py ${i}"|grep -v grep`
             if [ -z "${now_pid}" ] ; then
                 echo ${i} "is stopped"
             else
                 echo ${now_pid}
             fi
        done
        ;;
  restart)
        $0 stop
        $0 start
        ;;
  *)
        echo $"Usage: $0 {start|stop|restart|status}"
        exit 1
esac

exit 0
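
For reference, typical usage of the script looks like this; the ports come from the listen_start and listen_line variables at the top:

# sh main.sh start     # start 4 tornado processes on ports 8880-8883 and write main.port
# sh main.sh status    # report, for each recorded port, whether its main.py process is still running
# sh main.sh stop      # kill the main.py processes recorded in main.port
# sh main.sh restart   # stop followed by start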

  The startup file also needs a small change so it takes the port as an argument; the updated main.py is:

#!/usr/bin/env python
#-*- coding:utf-8 -*-
# Filename:    main.py
# Revision:    1.0
# Date:        2012-06-14
# Author:      simonzhang
# web:         www.simonzhang.net
# Email:       simon-zzm@163.com
### END INIT INFO
import sys
import tornado.ioloop
import tornado.web


class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, world")


application = tornado.web.Application([
    (r"/", MainHandler),
])


if __name__ == "__main__":
    # the listen port is passed on the command line by main.sh
    listen_port = int(sys.argv[1])
    application.listen(listen_port)
    tornado.ioloop.IOLoop.instance().start()

  Now configure nginx; only the relevant connection settings inside the http block are listed:

   upstream  www_test_com {
            server 192.168.1.41:8880;
            server 192.168.1.41:8881;
            server 192.168.1.41:8882;
            server 192.168.1.41:8883;
       }
   server {
        listen       80;
        server_name 192.168.1.41;
        location / {
        proxy_cache_key $host$uri$is_args$args;
        proxy_redirect          off;
        proxy_set_header        X-Real-IP $remote_addr;
        proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header        Host $http_host;
        client_max_body_size   10m;
        proxy_connect_timeout  300;
        proxy_send_timeout     300;
        proxy_read_timeout     300;
        proxy_buffer_size      16k;
        proxy_buffers          4 32k;
        proxy_busy_buffers_size 64k;
        proxy_temp_file_write_size 64k;
        access_log  logs/access.log  main;
        proxy_pass              http://www_test_com;
        }
     }

  After reloading the nginx configuration, entering http://192.168.1.41 in the browser shows the classic page again.
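
For the record, the reload itself can be done with nginx's signal handling; the binary path below depends on how nginx was installed, so treat it as an assumption:

# /usr/local/nginx/sbin/nginx -t          # test the edited configuration first
# /usr/local/nginx/sbin/nginx -s reload   # reload workers without dropping connections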

A simple load test with ab gives the results below; the CPU is clocked at 3.0 GHz.

# ./ab -c1000 -n10000 http://192.168.1.41:8880/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.1.41 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software: TornadoServer/2.1.1
Server Hostname: 192.168.1.41
Server Port: 8880

Document Path: /
Document Length: 21 bytes

Concurrency Level: 1000
Time taken for tests: 8.013 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 1790000 bytes
HTML transferred: 210000 bytes
Requests per second: 1247.92 [#/sec] (mean)
Time per request: 801.331 [ms] (mean)
Time per request: 0.801 [ms] (mean, across all concurrent requests)
Transfer rate: 218.14 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0  166  680.6      0    3004
Processing:    20  115   46.5    100     873
Waiting:       20  115   46.5    100     873
Total:         47  281  686.0    100    3325

Percentage of the requests served within a certain time (ms)
  50%    100
  66%    101
  75%    106
  80%    124
  90%    290
  95%   3105
  98%   3134
  99%   3149
 100%   3325 (longest request)

However, performance actually dropped when going through the nginx load balancer; partial results:
Failed requests: 9609
(Connect: 0, Receive: 0, Length: 9609, Exceptions: 0)
Write errors: 0
Non-2xx responses: 391
Total transferred: 2256920 bytes
HTML transferred: 274515 bytes
Requests per second: 714.81 [#/sec] (mean)

Summary: Tornado's performance is good, but after putting nginx in front with plain proxy_pass forwarding, quite a lot of resources are spent on the connections and throughput drops. This differs noticeably from what the reference material describes; I will dig into the exact cause later. Purely in terms of performance, the nginx + uwsgi + Django combination is quite efficient. To make Tornado use a multi-core server there are two options: use LVS for load balancing at a lower layer, or handle it in the application code with multiple threads. On WSGI, the Tornado documentation says: "Tornado provides only limited support for WSGI. Since WSGI does not support non-blocking requests, if you use WSGI instead of Tornado's own HTTP server you give up Tornado's asynchronous, non-blocking request handling; for example @tornado.web.asynchronous, the httpclient module and the auth module can no longer be used." One more note: node.js is very fast, but node.js + Express is not that efficient; anyone with spare time can test it themselves.
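
As a reference for that quote, here is a minimal sketch of the asynchronous style on Tornado's own HTTP server; the fetched URL is only a placeholder and the handler would be registered in the Application just like MainHandler above:

import tornado.web
import tornado.httpclient


class AsyncHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(self):
        # the request stays open until self.finish() is called in the callback
        client = tornado.httpclient.AsyncHTTPClient()
        client.fetch("http://www.simonzhang.net/", callback=self.on_response)

    def on_response(self, response):
        self.write("fetched %d bytes" % len(response.body))
        self.finish()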

I hear gevent is also good, but after a quick look I did not find any Chinese documentation. For my personal needs Tornado's performance is already more than enough, so on to further study.

June 09

Python: counting occurrences of a substring in a string

Count how many times a substring appears in a string, for example how many times "takes" appears in "It takes only a minute to get a crush on someone, an hour to like someone, and a day to love someone - but it takes a lifetime to forget someone".

src_str = 'It takes only a minute to get a crush on someone,an hour to like someone,and a day to love someone- but it takes a lifetime to forget someone'
get_count = count_find_str(src_str, 'takes')
print get_count

The function is as follows:

def count_find_str(src_str, find_str):
    # walk through src_str, jumping past each occurrence of find_str
    _pos = src_str.find(find_str)
    _find_str_count = 0
    while _pos != -1:
        _find_str_count = _find_str_count + 1
        _pos = _pos + len(find_str)
        _pos = src_str.find(find_str, _pos)
    return _find_str_count

Note: with very short search strings this gives wrong results. For example, when counting "a", besides the standalone word "a", every letter "a" inside other words gets counted too. So it is best used for counting longer strings. I wrote this to count database connections, and in practice it works well.
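
For whole-word counting, a regular expression with word boundaries avoids that problem; Python's built-in str.count() also covers the plain substring case. This is just a sketch, not what my script uses:

import re

src_str = ('It takes only a minute to get a crush on someone, an hour to like someone, '
           'and a day to love someone - but it takes a lifetime to forget someone')

print src_str.count('takes')              # plain substring count: 2
print len(re.findall(r'\ba\b', src_str))  # whole words only: the "a" inside "takes" is not counted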

May 27

Loading data from MySQL into Redis: a test

  Tests elsewhere show that Redis throughput drops when persistence is enabled, so persistence is not used here. Now let's test how fast data can be fetched from MySQL and loaded into Redis.
  The server has an 8-core 2.6 GHz CPU, 8 GB of RAM, SAS disks, CentOS 5.6 64-bit, Python 2.6 and redis 2.4.13.
  The test code below fetches two columns from the photo table in MySQL and loads them into Redis; both columns are indexed and the table holds 280,000 rows.

#!/bin/env python
# -------------------------------------------------
# Filename:    
# Revision:    
# Date:        2012-05-27
# Author:      simonzhang
# Email:       simon-zzm@163.com
# -------------------------------------------------
import MySQLdb
import redis


def redis_run(sql_data):
    try:
        r = redis.Redis(host='192.168.1.100', password='123456', port=6379, db=0)
    except redis.RedisError, e:
        print "Error %s" % e
    for i in sql_data:
        # one key per photo id, the value is the owning user id
        r.set(str(i[0]), i[1])


def mysql_run(sql):
    try:
        db = MySQLdb.connect(host='192.168.1.100', user='test', passwd ='123456', db='photo')
        cursor = db.cursor()   
    except MySQLdb.Error, e:
        print "Error %d:%s" % (e.args[0],e.args[1])
        exit(1)
    try:
        result_set = ''
        cursor.execute('%s' % sql)
        result_set=cursor.fetchall()
        cursor.close()
        db.close()
        return  result_set
    except MySQLdb.Error, e:
        print "Error %d:%s" % (e.args[0], e.args[1])
        cursor.close()
        db.close()

def main():
    _loop = 0
    _limit_start = 0
    _limit_span = 10000     # rows fetched from MySQL per batch
    _count_result = 5       # anything > 0, just to enter the loop
    while _count_result > 0:
        result_data = ''
        # page through the photo table, _limit_span rows at a time
        sql = "select id as pid, userid as uid from photo LIMIT %s,%s" % (_limit_start + _limit_span * _loop, _limit_span)
        result_data = mysql_run(sql)
        _count_result = len(result_data)
        redis_run(result_data)
        _loop += 1


if __name__ == '__main__':
    main()

Tests were run fetching 500,000, 100,000, 50,000 and 10,000 rows per batch; the results:

500,000 per batch
real 0m26.239s
user 0m16.816s
sys 0m5.745s

100,000 per batch
real 0m24.019s
user 0m15.670s
sys 0m4.932s

50,000 per batch
real 0m26.061s
user 0m15.789s
sys 0m4.674s

10,000 per batch
real 0m28.705s
user 0m15.778s
sys 0m4.913s

Conclusion: fetching 100,000 rows per batch gives the best results, and the load on the operating system is low, so hardware is not a concern.
Both columns here store ids. Assuming the user id and the photo id are both 9 digits, one pair is 18 bytes, so 100 million pairs need only about 2 GB of memory (10^8 x 18 bytes is roughly 1.8 GB).
By that math, 280,000 rows take about 24 seconds, so a full import of 100 million rows would take roughly two and a half hours; in-memory storage itself is not the problem. I do not know whether an SSD would speed this up, since I do not have one to test. So there are three things to do: first, build a proper cluster and keep the data synced to other server rooms in time, with a home-grown program doing scheduled syncs (with plain master-slave replication, a master that restarts empty is a real headache); second, use Redis persistence, which is certainly faster than re-fetching everything from MySQL; third, burn incense every day and hope nothing goes down.
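
If the load speed ever becomes a problem, batching the SET commands through redis-py's pipeline should cut down on network round trips. A minimal sketch under that assumption (not part of the test above):

import redis


def redis_run_pipelined(sql_data):
    # same connection parameters as redis_run above
    r = redis.Redis(host='192.168.1.100', password='123456', port=6379, db=0)
    pipe = r.pipeline(transaction=False)
    for i in sql_data:
        pipe.set(str(i[0]), i[1])
    pipe.execute()  # send all SET commands in one batch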

May 22

Python: calling a WSDL web service interface

  The requirement: call http://192.168.1.100:8080/Service?wsdl to fetch a statistic. The "Count" operation takes the parameters "user:string, pwd:string".
  WSDL is the Web Service Description Language, an interface definition language used to describe a web service's interface.
  First, install SOAPpy:
easy_install SOAPpy
The code is as follows:

import SOAPpy


def get_service():
    _url = "http://192.168.1.100:8080/Service?wsdl"
    _user = "test"
    _pwd = "test"
    try:
        # call the Count operation exposed by the service
        server = SOAPpy.SOAPProxy(_url)
        get_result = server.Count(_user, _pwd)
    except:
        get_result = "Error!"
    return "%s" % get_result
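
Calling the helper then just prints the returned count, or "Error!" on failure:

print get_service()
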
May 13

Scrapy primer: study notes (2)

  The data the spider brings back needs to be parsed, so first a quick look at XPath. XPath is a language for finding information in XML documents; it navigates them through elements and attributes and is a W3C standard, which I will study properly when I have time. Also worth a quick look is the HTML DOM (Document Object Model).

  First, use scrapy shell for some interactive parsing experiments, with the following command:
# scrapy shell http://www.simonzhang.net/

The page is fetched and you are dropped into the shell:
<......>DEBUG: Crawled (200) (referer: None)
[s] Available Scrapy objects:
[s] hxs
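
Inside the shell, XPath expressions can be tried directly against hxs; for example (just an illustration, the output depends on the page):

>>> hxs.select('//title/text()').extract()
>>> hxs.select('//html/head/link/@href').extract()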

  Now modify the code to grab the links in the page head. The code is as follows:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class DmozSpider(BaseSpider):
    name = "simonzhang"
    allowed_domains = ["simonzhang.net"]
    start_urls = [
        "http://www.simonzhang.net/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        site_code = hxs.select('//html/head')
        for l in site_code:
            _link = l.select('link/@href').extract()
            print "================"
            print _link
            print "================"

  Run the spider to get the scraped results. The results need to be stored, which is where Item comes in: an Item object works like a Python dict, associating fields with values. Edit the items.py file one level above the spiders directory:

from scrapy.item import Item, Field


class ScrapytestItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    head_link = Field()
    head_meta = Field()

  Then the spider file needs to be modified as follows:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from simonzhang.items import ScrapytestItem  # ScrapytestItem is the class defined in items.py one level up


class DmozSpider(BaseSpider):
    name = "simonzhang"
    allowed_domains = ["simonzhang.net"]
    start_urls = [
        "http://www.simonzhang.net/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        site_code = hxs.select('//html/head')
        items = []
        for l in site_code:
            item = ScrapytestItem()
            item['title'] = l.select('title/text()').extract()
            item['head_link'] = l.select('link/@href').extract()
            item['head_meta'] = l.select('meta').extract()
            items.append(item)
        return items

  Run the following command to crawl; on success a JSON file holding the scraped content is created in the same directory. For small scraping projects this is plenty.
scrapy crawl simonzhang -o simonzhang.json -t json

  In the command above, "-o" specifies the output file and "-t" the output format; for more options see "scrapy crawl --help".

Previous post: Scrapy primer: study notes (1)