使用scrapy抓取代理ip实例

使用scrapy抓取代理ip实例

准备

如果你尚未安装scrapy,或者不知道怎么创建爬虫项目,请参考 python scrapy框架 安装一节。

我们创建项目collectips;新建爬虫xicidaili。

主要代码解析

初始化

1
2
3
name = 'xicidaili'
allowed_domains = ['www.xicidaili.com']
start_urls = ['http://www.xicidaili.com/nn/1']

解析

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
def parse(self, response):
ip_list = response.xpath('//*[@id="ip_list"]')
trs = ip_list[0].xpath('tr')

for tr in trs[1:]:
item = CollectipsItem()

item['ip'] = tr.xpath('td[2]/text()').extract()
item['port'] = tr.xpath('td[3]/text()').extract()
item['server_address'] = tr.xpath('td[4]/a/text()').extract()
item['is_gao_ni'] = tr.xpath('td[5]/text()').extract()
item['ip_type'] = tr.xpath('td[6]/text()').extract()
item['speed'] = tr.xpath('td[7]/div/@title').extract()[0]
item['connection_time'] = tr.xpath('td[8]/div/@title').extract()[0]
item['alive_time'] = tr.xpath('td[9]/text()').extract()
item['check_time'] = tr.xpath('td[10]/text()').extract()
myProxy = str(tr.xpath('td[2]/text()').extract()[0]) + str(':') + str(tr.xpath('td[3]/text()').extract()[0])
if self.is_valid_ip(myProxy) == 1:
print 'save a valid ip'
objFile = open('ip_port_list.txt', 'a')
rowData = myProxy
objFile.write(rowData)
objFile.write('\n')
objFile.close()
yield item

pagination = response.xpath('//*[@id="body"]/div[2]')
alist = pagination[0].xpath('a')
pageNum = response.url.split('/')[-1]
if pageNum >= 5:
print 'crawl end...'
exit(0)

for a in alist:
class_name = a.xpath('text()').extract()
class_name2 = class_name[0].strip()
if u'下一页 ›' == class_name2:
print 'next page...............'
nextPageHref = a.xpath('@href').extract()[0]
nextPageHrefFullUrl = response.urljoin(nextPageHref)
yield scrapy.Request(nextPageHrefFullUrl, callback=self.parse)

检验代理ip可用性

我们需要对抓取的代理ip的可用性进行验证,方式如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
def is_valid_ip(self, proxy):
print 'check ip...'

try:
protocol = 'http://'
proxies = {protocol: proxy}
if requests.get('http://www.baidu.com', proxies=proxies, timeout=2).status_code == 200:
return 1
except:
print 'ip is not valid...'
pass

return 0

修改配置

  1. 修改下载延迟
1
2
3
4
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
  1. 新增下载中间件
1
2
3
4
5
6
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
# 'collectips.middlewares.MyCustomDownloaderMiddleware': 543,
'collectips.middlewares.CollectipsSpiderMiddleware': 100,
}
  1. 代理IP
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
IP_POOL = [
{"ipaddr":"59.40.69.231:8010"},
{"ipaddr":"27.219.36.127:8118"},
{"ipaddr":"118.187.58.34:53281"},
{"ipaddr":"117.78.37.198:8000"},
{"ipaddr":"42.55.171.123:80"},
{"ipaddr":"119.115.21.17:80"},
{"ipaddr":"17-11-05 11:44:4395"},
{"ipaddr":"59.40.50.169:8010"},
{"ipaddr":"61.129.70.131:8080"},
{"ipaddr":"61.152.81.193:9100"},
{"ipaddr":"120.204.85.29:3128"},
{"ipaddr":"219.228.126.86:8123"},
{"ipaddr":"61.152.81.193:9100"},
{"ipaddr":"218.82.33.225:53853"},
{"ipaddr":"223.167.190.17:42789"},
]
  1. USER_AGENT设置
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]

gogogo西刺代理

运行:

1
scrapy crawl xicidaili -o my_ip.json

抓取数据示例

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
{
"connection_time":"0.034秒",
"check_time":[
"17-11-12 16:37"
],
"is_gao_ni":[
"高匿"
],
"ip":[
"119.130.240.25"
],
"server_address":[
"广东广州"
],
"alive_time":[
"2小时"
],
"ip_type":[
"HTTPS"
],
"speed":"0.174秒",
"port":[
"8118"
]
}

引用

源码:collectips


来源:http://leunggeorge.github.io/

0%