Goals: (1) scrape the hot jokes from Qiushibaike; (2) filter out the jokes that contain images; (3) display each joke's publish time, author, content, and number of likes.
Determine the URL and fetch the page source: start with the simplest possible code and see whether the fetch succeeds.
import urllib.request
import urllib.error
import re

page = 2
url = 'https://www.qiushibaike.com/8hr/page/' + str(page) + '/?s=5002335'
try:
    # Build the request and fetch the page, with no extra headers for now
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    print(response.read())
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)
Run result: raise RemoteDisconnected("Remote end closed connection without") http.client.RemoteDisconnected: Remote end closed connection without response
Result analysis: the server is most likely rejecting requests that fail its headers check, so let's add a User-Agent header and try again.
import urllib.request
import urllib.error
import re

page = 2
url = 'https://www.qiushibaike.com/8hr/page/' + str(page) + '/?s=5002335'
try:
    req = urllib.request.Request(url)
    # Pretend to be a regular browser so the request is not rejected
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36')
    response = urllib.request.urlopen(req)
    print(response.read())
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)
Run result: the HTML of page 2 is printed.
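As a side note, the same User-Agent can also be passed through the headers parameter of urllib.request.Request instead of add_header(), and decoding the raw bytes makes the printed HTML readable. A minimal sketch, reusing the page number and header string from above:

import urllib.request
import urllib.error

page = 2
url = 'https://www.qiushibaike.com/8hr/page/' + str(page) + '/?s=5002335'
# Same User-Agent as above, passed via the headers dict instead of add_header()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
try:
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)
    # decode() turns the raw bytes into a readable UTF-8 string
    html = response.read().decode('utf-8')
    print(html[:200])  # print only the first 200 characters as a sanity check
except urllib.error.URLError as e:
    print(getattr(e, 'code', ''), getattr(e, 'reason', ''))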
Extract all the jokes on a given page (and filter out the ones with images):
import urllib.request
import urllib.error
import re

page = 2
url = 'https://www.qiushibaike.com/8hr/page/' + str(page) + '/?s=5002335'
try:
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36')
    response = urllib.request.urlopen(req)
    content = response.read().decode('utf-8')
    # Groups: author name, joke text, the markup that may contain an <img> tag, like count
    pattern = re.compile('title=.*?<h2>(.*?)</h2>.*?<span>(.*?)</span>.*?</a>(.*?)<div class="stats">.*?class="number">(.*?)</i>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        # item[2] is the markup between the joke text and the stats block;
        # if it contains "img" the joke has a picture and is skipped
        haveImg = re.search('img', item[2])
        if not haveImg:
            print(item[0] + '\n' + item[1] + '\n' + item[3])
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)
Regular expression analysis: (1) (.*?) denotes a capture group; this regular expression contains four such groups, and when we iterate over items, item[0] holds the content matched by the first (.*?), and so on. (2) re.S puts the dot into "match anything" mode, so that . also matches newline characters.
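To make the effect of re.S concrete, here is a small self-contained check; the HTML snippet is made up purely for illustration:

import re

# A made-up snippet where the content we want spans a newline
snippet = '<h2>\nauthor\n</h2>'

# Without re.S the dot does not match the newline, so nothing is captured
print(re.findall('<h2>(.*?)</h2>', snippet))        # []
# With re.S the dot matches newlines as well, so the group is captured
print(re.findall('<h2>(.*?)</h2>', snippet, re.S))  # ['\nauthor\n']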
We can see that jokes with pictures contain img-like markup in that part of the page, while jokes without pictures do not. So the third group of our regular expression, item[2], is what captures the picture markup; if a joke has no picture, item[2] contains no img code.
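A tiny sketch of that filter in isolation, using two made-up item[2] values (one with an <img> tag, one without) just to show how re.search('img', ...) separates them:

import re

# Two made-up examples of what item[2] might look like
with_image = '<div class="thumb"><img src="pic.jpg"/></div>'
without_image = '\n\n'

for fragment in (with_image, without_image):
    haveImg = re.search('img', fragment)
    if not haveImg:
        print('kept: no picture in this joke')
    else:
        print('skipped: picture found')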
http://blog.csdn.net/finna_xu/article/details/68070662