Multiprocess Crawler

Collecting data in a single process spends most of its time waiting on the network. By opening several processes to fetch pages concurrently, we can noticeably improve throughput.

Basic usage of multiprocessing:

``` python
from multiprocessing import Pool

# Create a pool of 4 worker processes; map() applies func to each item of
# iterable, blocks until all results are ready, and returns them in input order
pool = Pool(processes=4)
pool.map(func, iterable)
```
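To make the pattern concrete, here is a minimal, self-contained sketch of `Pool.map` (the worker function `square` is purely illustrative, not part of the crawler):

``` python
from multiprocessing import Pool

def square(x):
    # Worker function; each call may run in a different process
    return x * x

if __name__ == "__main__":
    # The __main__ guard is required on platforms that spawn child
    # processes (e.g. Windows), or the pool would re-import and re-run
    # this module recursively.
    with Pool(processes=4) as pool:
        results = pool.map(square, [1, 2, 3, 4, 5])
    print(results)  # [1, 4, 9, 16, 25]
```

`map` preserves the order of the input iterable, which is why the results come back as `[1, 4, 9, 16, 25]` regardless of which process finished first.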
### Example (scraping Qiushibaike)
``` python
import re
import time
from multiprocessing import Pool

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
}

def re_script(url):
    res = requests.get(url, headers=headers)
    names = re.findall('<h2>(.*?)</h2>', res.text, re.S)
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
    # Anchor the two number patterns to their labels ("好笑" = laughs,
    # "评论" = comments); the same bare pattern would capture both counts
    # and the two lists would be identical.
    laughs = re.findall(r'<i class="number">(\d+)</i> 好笑', res.text, re.S)
    comments = re.findall(r'<i class="number">(\d+)</i> 评论', res.text, re.S)
    for name, content, laugh, comment in zip(names, contents, laughs, comments):
        info = {
            'name': name.strip(),
            'content': content.strip(),
            'laugh': laugh.strip(),
            'comment': comment.strip()
        }
        print(info)

if __name__ == "__main__":
    urls = ['https://www.qiushibaike.com/8hr/page/{}/'.format(i) for i in range(1, 3)]

    # Sequential baseline: one request after another
    start_1 = time.time()
    for url in urls:
        re_script(url)
    end_1 = time.time()
    print('One URL after another took:', end_1 - start_1)

    # Same URLs, distributed over 2 worker processes
    start_2 = time.time()
    pool = Pool(processes=2)
    pool.map(re_script, urls)
    pool.close()
    pool.join()
    end_2 = time.time()
    print('Two processes took:', end_2 - start_2)

    # And over 4 worker processes
    start_3 = time.time()
    pool = Pool(processes=4)
    pool.map(re_script, urls)
    pool.close()
    pool.join()
    end_3 = time.time()
    print('Four processes took:', end_3 - start_3)
```