May 13

Scrapy Beginner's Study Notes, Part 2

Posted on May 13, 2012 by 张子萌

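The data a spider brings back has to be parsed, and for that you first need to know a little XPath. XPath is a language for finding information in XML documents; it navigates a document through its elements and attributes. XPath is a W3C standard, and I will study it properly when I have time. For now it is enough to take a quick look at the HTML DOM (Document Object Model).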
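
To make "navigating through elements and attributes" concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree, which supports only a limited subset of XPath. The sample page is made up for this illustration and is deliberately well-formed XML; real HTML usually needs an HTML-aware parser such as the one Scrapy uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page (an assumption, not taken from the real site).
PAGE = """<html>
  <head>
    <title>Sample page</title>
    <link rel="stylesheet" href="/style.css"/>
    <link rel="alternate" href="/feed"/>
  </head>
  <body><p>Hello</p></body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Navigate by element name: every <link> directly under <head>.
hrefs = [link.get("href") for link in root.findall("./head/link")]

# Navigate by attribute: only the <link> whose rel attribute is "alternate".
feeds = [link.get("href")
         for link in root.findall("./head/link[@rel='alternate']")]

# Text content of the <title> element.
title = root.findtext("./head/title")

print(hrefs)
print(feeds)
print(title)
```

ElementTree cannot select an attribute directly with a path such as link/@href, so the sketch reads the attribute with get() after selecting the element.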

First, experiment with interactive parsing in the scrapy shell, using the following command:
# scrapy shell http://www.simonzhang.net/

The page is fetched and you are dropped into the shell:
<......>DEBUG: Crawled (200) (referer: None)
[s] Available Scrapy objects:
[s] hxs

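Next, modify the spider code to grab the links in the page's head section. The code is as follows (it uses the Scrapy API as it was in 2012; later releases replaced BaseSpider with scrapy.Spider and the HtmlXPathSelector select() calls with response.xpath()):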

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class DmozSpider(BaseSpider):
    name = "simonzhang"
    allowed_domains = ["simonzhang.net"]
    start_urls = ["http://www.simonzhang.net/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Select the <head> element, then the href of every <link> inside it.
        site_code = hxs.select('//html/head')
        for l in site_code:
            _link = l.select('link/@href').extract()
            print("================")
            print(_link)
            print("================")

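Run the spider to see the scraped results. The results need to be saved, and this is where items come in. An Item object behaves like a Python dictionary, associating field names with values. Edit the items.py file in the directory above spiders.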

from scrapy.item import Item, Field

class ScrapytestItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    head_link = Field()
    head_meta = Field()

Then modify the spider file. The code becomes:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from simonzhang.items import ScrapytestItem  # the Item class defined in items.py, one directory up


class DmozSpider(BaseSpider):
    name = "simonzhang"
    allowed_domains = ["simonzhang.net"]
    start_urls = ["http://www.simonzhang.net/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        site_code = hxs.select('//html/head')
        items = []
        for l in site_code:
            # Each extract() call returns a list of matching strings.
            item = ScrapytestItem()
            item['title'] = l.select('title/text()').extract()
            item['head_link'] = l.select('link/@href').extract()
            item['head_meta'] = l.select('meta').extract()
            items.append(item)
        return items

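Run the following command to crawl. On success, a JSON file containing the scraped content is created in the directory where the command was run. This is enough for small scraping projects.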
scrapy crawl simonzhang -o simonzhang.json -t json

In the command above, "-o" specifies the output file and "-t" specifies the output format. For more options, see "scrapy crawl --help".
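
Once the file exists it can be read back with Python's standard json module. The sketch below assumes the export is a JSON array with one object per item, and that each field holds a list of strings because extract() returns a list. The sample text and the summarize helper are hypothetical and stand in for the real simonzhang.json.

```python
import json

# Hypothetical sample standing in for the contents of simonzhang.json.
SAMPLE = """[
  {"title": ["Example title"],
   "head_link": ["/style.css", "/feed"],
   "head_meta": ["<meta name='generator'>"]}
]"""


def summarize(text):
    """Return (first title, number of head links) for each exported item."""
    items = json.loads(text)
    summary = []
    for item in items:
        titles = item.get("title", [])
        first_title = titles[0] if titles else ""
        summary.append((first_title, len(item.get("head_link", []))))
    return summary


print(summarize(SAMPLE))
```

To use it on a real export, read the file contents first, for example with open("simonzhang.json").read(), and pass the text to summarize.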

Scrapy Beginner's Study Notes, Part 1
http://www.simonzhang.net/wp-admin/post.php?post=1108&action=edit

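April 17

Python resources, an ongoing collection

Posted on April 17, 2012 by 张子萌

Installation and use:
easy_install is an install and upgrade tool; download the version that matches your needs. Once it is installed, a lot of manual steps are no longer needed.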
http://pypi.python.org/pypi/setuptools#downloads
wget -q http://peak.telecommunity.com/dist/ez_setup.py

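Setting up a development environment:
Eclipse integration, installed from this update site
http://pydev.org/updates

Development frameworks:
py2exe
http://starship.python.net/crew/theller/py2e

Webcam capture on Windows
http://videocapture.sourceforge.net/

Converting Python to C and compiling it
http://cython.org/

Learning:
watchdog
watchdog is a Python API and a set of shell utilities for monitoring file system events.

pattern
Pattern is a web data mining module. It can be used for data mining, natural language processing, machine learning, and network analysis.

django-sentry
Real-time exception logging for Django; a handler that logs Django exceptions to the database.

Working with Excel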
http://www.python-excel.org/

March 29

Shell: checking the length of a string

Posted on March 29, 2012 by 张子萌

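I needed to print a sequence of numbers in a loop, each two digits wide, so the single-digit values at the start of the sequence have to be detected by checking their string length. For a variable h, the string length is "${#h}". The code is as follows:

#!/bin/bash
# The (( )) loop and test are bash syntax, so run this with bash rather than a plain POSIX sh.
for (( h=0 ; h<20 ; h++ )); do
   if (( ${#h} == 1 )); then
      echo "0$h"
   else
      echo "$h"
   fi
done

Output:
# bash 123.sh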
00
01
02
03
04
05
06
07
08
09
10
11
12
13
14
15
16
17
18
19
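
For comparison, here is a minimal Python sketch of the same length check. The function name two_digits is made up for this illustration; the last line shows that built-in formatting gives the same result in one step.

```python
def two_digits(n):
    """Left-pad a non-negative integer with '0' when it has only one digit."""
    text = str(n)
    if len(text) == 1:  # same check as ${#h} == 1 in the shell script
        text = "0" + text
    return text


sequence = [two_digits(h) for h in range(20)]
print("\n".join(sequence))

# Built-in formatting pads to two digits without an explicit length check.
assert sequence == ["%02d" % h for h in range(20)]
```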

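March 1

Shell: printing copy progress

Posted on March 1, 2011 by 张子萌

I wrote a script that copies a large amount of data, but the copy takes a long time and there was no way to tell how far it had got or whether it had hung. So I added some output showing the percentage copied; see the script below.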

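#!/bin/bash
# -------------------------------------------------
# Filename: copy_percent.sh
# Revision: 1.0
# Date: 2011-03-01
# Author: simon-zzm
# Email: simon-zzm@163.com
# -------------------------------------------------

SOURCE=$1
TARGET=$2

# Start the copy in the background so progress can be polled while it runs.
/bin/cp -r "$SOURCE" "$TARGET" &

# Set the global variable i to the target size as a percentage of the source size.
compute_percent()
{
SOURCE_SIZE=`/usr/bin/du -s "${1}" | awk '{print $1}'`
# The target may not exist yet right after cp starts; treat that as size 0.
TARGET_SIZE=`/usr/bin/du -s "${2}" 2>/dev/null | awk '{print $1}'`
let "i=(${TARGET_SIZE:-0}*100)/${SOURCE_SIZE}"
}

compute_percent "${SOURCE}" "${TARGET}"
while [ "${i}" -lt 100 ]
do
compute_percent "${SOURCE}" "${TARGET}"
echo "${i}%"
sleep 1
done
echo "ok"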
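
The percentage calculation on its own can also be sketched in Python. This is an illustration under stated assumptions, not a port of the script: tree_size and copy_percent are hypothetical names, and sizes are summed in bytes with os.path.getsize, whereas du counts disk blocks, so the two can report slightly different figures.

```python
import os


def tree_size(path):
    """Total size in bytes of the regular files under path (or of path itself)."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    # os.walk yields nothing for a missing directory, so the result is 0.
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def copy_percent(source, target):
    """Size of target as an integer percentage of source, capped at 100."""
    source_size = tree_size(source)
    if source_size == 0:
        return 100  # nothing to copy
    return min(100, tree_size(target) * 100 // source_size)
```

A caller would start the copy in another process or thread and print copy_percent(source, target) once a second until it reaches 100, as the shell loop does.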

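April 7

The exit argument in shell

Posted on April 7, 2010 by 张子萌

[Compiled by 张子萌, April 2010]
exit is mainly used to leave a script when an input check fails, and to report the exit status when the script has finished running.

The exit statement takes an optional argument, an integer exit status code.
The exit status code is returned to the parent process, where it is stored in $?.
0 means the script ran to completion successfully.

A non-zero value such as 1 means the program ended abnormally.

If exit is given no argument, the script exits with the current value of $?, that is, the status of the last command it executed.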
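
How the parent process sees these codes can be checked from Python with the standard subprocess module. This is a minimal sketch that assumes an sh binary is available on the PATH; exit_status is a name made up for this illustration.

```python
import subprocess


def exit_status(script):
    """Run a shell snippet with sh -c and return the status the parent receives."""
    return subprocess.run(["sh", "-c", script]).returncode


print(exit_status("exit 0"))       # explicit success
print(exit_status("exit 3"))       # explicit non-zero status
print(exit_status("false; exit"))  # no argument: status of the last command
```

The last call shows the no-argument case: false sets the status to 1, and a bare exit passes that value on to the parent.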

simonzhang's Home

A friend has come from afar...

Tag Archives: shell
