Web Scraping in Practice, Part 1: Qiushibaike

Goals:
(1) Scrape the hot jokes from Qiushibaike.
(2) Filter out jokes that contain images.
(3) Display each joke's publication time, author, content, and upvote count.

Determine the URL and fetch the page:

Start with the simplest possible code to see whether the fetch succeeds.

import urllib.request
import urllib.error
import re

page = 2
url = 'https://www.qiushibaike.com/8hr/page/' + str(page) + '/?s=5002335'
try:
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    print(response.read())
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)

Result:
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
Analysis: this is most likely the server validating request headers. Let's add a headers field (a browser-like User-Agent) and try again.

import urllib.request
import urllib.error
import re

page = 2
url = 'https://www.qiushibaike.com/8hr/page/' + str(page) + '/?s=5002335'
try:
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36')
    response = urllib.request.urlopen(req)
    print(response.read())
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)

Result: the HTML of page 2 is printed.
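Since every later step needs the same "fetch with a User-Agent" logic, it can be wrapped into a small helper. This is just a sketch; the function name `fetch_page` is my own, not part of the original code:

```python
import urllib.request
import urllib.error

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/56.0.2924.87 Safari/537.36')

def fetch_page(url):
    """Fetch url with a browser-like User-Agent; return decoded HTML, or None on error."""
    req = urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
    try:
        with urllib.request.urlopen(req) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as e:
        if hasattr(e, 'code'):
            print(e.code)
        if hasattr(e, 'reason'):
            print(e.reason)
        return None
```

With this in place, fetching page N is just `fetch_page('https://www.qiushibaike.com/8hr/page/' + str(n) + '/')`.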

Extract all jokes on a page (and filter out those with images):
import urllib.request
import urllib.error
import re

page = 2
url = 'https://www.qiushibaike.com/8hr/page/' + str(page) + '/?s=5002335'
try:
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36')
    response = urllib.request.urlopen(req)
    content = response.read().decode('utf-8')
    # Four capture groups: author, joke text, the HTML between </a> and the
    # stats <div> (which contains <img> for image posts), and the upvote count.
    pattern = re.compile('title=.*?<h2>(.*?)</h2>.*?<span>(.*?)</span>.*?</a>(.*?)<div class="stats">.*?class="number">(.*?)</i>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        haveImg = re.search('img', item[2])
        if not haveImg:
            print(item[0] + '\n' + item[1] + '\n' + item[3])
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)

Regex analysis:
(1) Each (.*?) is a non-greedy capture group. This pattern contains four such groups; when iterating over the matches, item[0] holds the content of the first (.*?), item[1] the second, and so on.
(2) re.S puts the dot into "dot-all" mode, so that . also matches newline characters.

Notice that posts with images contain an <img> tag in the HTML between </a> and the stats <div>, while text-only posts do not. The third group, item[2], captures exactly that stretch of HTML, so for a post without an image item[2] contains no "img", and we can filter on it.
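To see the grouping and the image filter in action, the same pattern can be run against a minimal fragment. The fragment below is a simplified mock-up of the page structure, not real Qiushibaike markup:

```python
import re

# Simplified mock-up of two posts; the second carries an <img> tag
# between </a> and the stats <div>, like an image post on the real page.
sample = ('title="a"<h2>user1</h2><span>text only joke</span></a>'
          '<div class="stats"><i class="number">120</i>'
          'title="b"<h2>user2</h2><span>joke with pic</span></a>'
          '<img src="x.jpg"><div class="stats"><i class="number">45</i>')

pattern = re.compile('title=.*?<h2>(.*?)</h2>.*?<span>(.*?)</span>'
                     '.*?</a>(.*?)<div class="stats">.*?class="number">(.*?)</i>', re.S)
items = re.findall(pattern, sample)

for author, text, between, votes in items:
    if not re.search('img', between):   # keep only text-only posts
        print(author, text, votes)
```

Both posts match (so `items` has two 4-tuples), but only user1's post survives the filter, because user2's third group captures the `<img>` tag.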

Original article: http://blog.csdn.net/finna_xu/article/details/68070662
