2017-07-21

Accessing Web Pages with urllib

urllib.request is a Python module for accessing URLs. Its urlopen function can open URLs over several protocols.

A simple script that reads the content of a URL:

import urllib.request
response = urllib.request.urlopen("http://www.fishc.com")   # open the page
html = response.read()     # read the response body; the result is a bytes object
html = html.decode("utf-8")   # decode the bytes into a str
print(html)

Downloading an image and saving it:

import urllib.request
response = urllib.request.urlopen('https://ss2.baidu.com/73Z1bjeh1BF3odCf/it/u=2991169923,1001222125&fm=202')
aa = response.read()
with open('ss.jpg', 'wb') as f:
    f.write(aa)

=================== RESTART: C:\Users\dell\Desktop\dad.py ===================
>>> response.geturl()
'https://ss2.baidu.com/73Z1bjeh1BF3odCf/it/u=2991169923,1001222125&fm=202'
>>> response.info()
<http.client.HTTPMessage object at 0x04340430>
>>> print(response.info())
Server: bfe/1.0.8.13-sslpool-patch
Date: Fri, 21 Jul 2017 03:20:48 GMT
Content-Type: image/jpeg
Content-Length: 2876
Connection: close
ETag: 0ba505cffa84a3ec5ff73df2ba148aef
Last-Modified: Thu, 01 Jan 1970 00:00:00 GMT
Expires: Tue, 08 Aug 2017 00:55:09 GMT
Age: 658546
Cache-Control: max-age=2628000
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Ohc-Response-Time: 1 0 0 0 0 0
Timing-Allow-Origin: http://www.baidu.com
>>> response.getcode()
200

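response.geturl() returns the URL of the resource that was actually retrieved (here, the image address).
response.info() returns an http.client.HTTPMessage object holding the response headers.
print(response.info()) prints those headers.
response.getcode() returns the HTTP status code of the response.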
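
To try these three calls without depending on an external site, here is a minimal sketch that starts a throwaway HTTP server on localhost and inspects the response. The handler class DemoHandler and its payload are made up for illustration, and the empty ProxyHandler is only there so that a system proxy cannot intercept the localhost request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny text body."""

    def do_GET(self):
        payload = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the console quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# An empty ProxyHandler bypasses any configured proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    final_url = response.geturl()                    # URL actually retrieved
    status = response.getcode()                      # HTTP status code
    content_type = response.info()["Content-Type"]   # one header field
    body = response.read()

server.shutdown()
server.server_close()
print(final_url, status, content_type, body)
```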

Translating text with Youdao Dictionary:

GET is a request that asks the server for data, while POST is a request that submits data to the server.
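
With urllib the choice between the two follows from whether a request body is supplied. The sketch below only builds Request objects and asks which method they would use; the URL is a placeholder and no network traffic is sent.

```python
import urllib.parse
import urllib.request

url = "http://www.example.com/search"  # placeholder URL, never contacted

# No data: urllib will issue a GET.
get_req = urllib.request.Request(url)

# With data (which must be bytes): urllib will issue a POST.
form = urllib.parse.urlencode({"q": "python"}).encode("ascii")
post_req = urllib.request.Request(url, data=form)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```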

To disguise the program as a browser, set the request headers.

urllib.parse (the urlparse module in Python 2) handles URL parsing and encoding.
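
As a small illustration of what the module does, the sketch below splits the Youdao URL used later in this post into its parts and encodes a form dictionary. The form values are made up for the example.

```python
from urllib.parse import parse_qs, urlencode, urlparse

url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc"

parts = urlparse(url)
print(parts.scheme)   # http
print(parts.netloc)   # fanyi.youdao.com
print(parts.path)     # /translate

# Repeated query keys are collected into a list.
query = parse_qs(parts.query)
print(query)          # {'smartresult': ['dict', 'rule', 'ugc']}

# urlencode goes the other way: dict -> query string (spaces become '+').
encoded = urlencode({"i": "hello world", "doctype": "json"})
print(encoded)        # i=hello+world&doctype=json
```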

encode('utf-8') converts a str into bytes using the UTF-8 encoding.
decode('utf-8') converts UTF-8 encoded bytes back into a str.
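
A quick round trip makes the direction of each call concrete. The sample text is arbitrary.

```python
text = "你好"                  # a str (Unicode text)

raw = text.encode("utf-8")     # str -> bytes
print(type(raw).__name__, len(raw))   # bytes 6 (each character takes 3 bytes in UTF-8)

restored = raw.decode("utf-8")  # bytes -> str
print(restored == text)         # True
```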

import urllib.request
import urllib.parse
import json
import time
while True:
    content = input("Enter the text to translate (enter q to quit): ")
    if content == 'q':
        break
    url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc'
    data={}
    data['type'] = 'AUTO'
    data['i'] = content
    data['doctype'] = 'json'
    data['xmlversion'] = '2.1'
    data['keyfrom'] = 'fanyi.web'
    data['ue'] = 'UTF-8'
    data['action'] = 'FY_BY_CL1CKBUTTON'
    data['typoResult'] = 'true'
    data = urllib.parse.urlencode(data).encode('utf-8')  # URL-encode the form with urllib.parse, then convert it to bytes
    req = urllib.request.Request(url,data)
    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.108 Safari/537.36 2345Explorer/8.6.2.15784')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    target = json.loads(html)    # parse the JSON text into a Python dict
    target = target['translateResult'][0][0]['tgt']
    print(target)
    time.sleep(5)  # wait 5 seconds before the next iteration

json.loads(html) parses the JSON string html into a Python dict, which is assigned to target so that its contents can be inspected.
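
The indexing step can be tried offline. The JSON string below is a hypothetical sample shaped to match the target['translateResult'][0][0]['tgt'] lookup in the script; the real response from the service may contain other fields.

```python
import json

# Hypothetical sample response, not captured from the real service.
html = '{"errorCode": 0, "translateResult": [[{"src": "hello", "tgt": "你好"}]]}'

target = json.loads(html)          # JSON text -> Python dict
print(type(target).__name__)       # dict
print(list(target.keys()))         # ['errorCode', 'translateResult']

# translateResult is a list of lists of dicts; take the first entry.
translation = target["translateResult"][0][0]["tgt"]
print(translation)
```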

There are two ways to modify the headers:
1. Through the headers parameter of Request:

head = {}
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.108 Safari/537.36 2345Explorer/8.6.2.15784'
req = urllib.request.Request(url, data, head)   # head must be built before it is passed in

2. Through the Request.add_header() method:

req = urllib.request.Request(url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.108 Safari/537.36 2345Explorer/8.6.2.15784')
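
Both approaches end up storing the same header on the request. The sketch below checks this without sending anything; the URL is a placeholder and the User-Agent string is shortened for readability.

```python
import urllib.request

url = "http://www.example.com/"              # placeholder, never contacted
ua = "Mozilla/5.0 (Windows NT 10.0; WOW64)"  # shortened User-Agent for illustration

# Method 1: pass a dict through the headers parameter.
req1 = urllib.request.Request(url, headers={"User-Agent": ua})

# Method 2: add the header after construction.
req2 = urllib.request.Request(url)
req2.add_header("User-Agent", ua)

# Request stores header names via str.capitalize(), so the key is "User-agent".
print(req1.header_items())   # [('User-agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64)')]
print(req1.get_header("User-agent") == req2.get_header("User-agent"))  # True
```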

Headers:

We discuss one particular HTTP header here to show how to add headers to your HTTP request.

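Some websites dislike being visited by programs, or send different versions to different browsers. By default, urllib identifies itself as Python-urllib/x.y (where x and y are the major and minor Python version numbers, e.g. Python-urllib/2.5), which may confuse the site or simply not work. A browser identifies itself through the User-Agent header. When you create a Request object, you can pass in a dictionary of headers or add them afterwards. The following example makes a POST request like the one above, but identifies itself as a browser through a Mozilla-style User-Agent string.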
import urllib.parse
import urllib.request
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
data = {'name':'Michael Foord',
          'location':'Northampton',
          'language':'Python' }
data = urllib.parse.urlencode(data).encode('ascii')
req = urllib.request.Request(url, data)
req.add_header('User-Agent', user_agent)
response = urllib.request.urlopen(req)
html = response.read()
print(html)
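
To see the default identification string mentioned above, one option is to look at the headers of a freshly built opener. This is a minimal sketch; nothing is sent over the network.

```python
import sys
import urllib.request

# build_opener() returns an OpenerDirector carrying urllib's default headers.
opener = urllib.request.build_opener()
default_agent = dict(opener.addheaders)["User-agent"]
print(default_agent)  # for example "Python-urllib/3.11", depending on the interpreter

# The version part is the running interpreter's major.minor version.
expected_agent = "Python-urllib/%d.%d" % sys.version_info[:2]
print(default_agent == expected_agent)
```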

Handling exceptions:

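urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError and TypeError may also be raised).
HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.
The exception classes are exported from the urllib.error module.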

URLError:
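Typically, URLError is raised because there is no network connection (no route to the specified server) or because the specified server does not exist. In this case the exception carries a 'reason' attribute; in Python 3 this is either a message string or another exception instance (such as an OSError) describing the failure.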
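
One way to provoke a URLError without touching the internet is to connect to a local port that nothing is listening on. The sketch below assumes the port stays unused between the two steps, which is normally the case on a quiet machine.

```python
import socket
import urllib.error
import urllib.request

# Find a local port with no listener: bind, note the number, then close.
with socket.socket() as s:
    s.bind(("127.0.0.1", 0))
    closed_port = s.getsockname()[1]

# An empty ProxyHandler bypasses any configured proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
error = None
try:
    opener.open("http://127.0.0.1:%d/" % closed_port, timeout=5)
except urllib.error.URLError as e:
    error = e
    print("We failed to reach a server.")
    print("Reason:", e.reason)   # here an OSError instance, not a tuple
```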

HTTPError:
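Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers deal with some of these responses for you (for example, if the response is a "redirection" that asks the client to fetch the document from a different URL, urllib handles that for you). For those it cannot handle, urlopen raises an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required). The HTTPError instance raised has an integer 'code' attribute, which corresponds to the error sent by the server.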
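
The 'code' attribute can be observed with a local test server that always answers 404. This is only a sketch: NotFoundHandler is a made-up handler, and the empty ProxyHandler keeps a system proxy from intercepting the localhost request.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that rejects every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/missing" % server.server_port

opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
code = None
reason = None
try:
    opener.open(url, timeout=5)
except urllib.error.HTTPError as e:
    code = e.code      # integer status code sent by the server
    reason = e.reason  # short reason phrase for that status
    print("The server couldn't fulfill the request:", code, reason)
finally:
    server.shutdown()
    server.server_close()
```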

If you want your program to be prepared for both HTTPError and URLError, you can write the code like this:

import urllib.request
import urllib.error
req = urllib.request.Request('http://www.pretend_server.org')
try:
    response = urllib.request.urlopen(req)
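except urllib.error.URLError as e:
    # Check 'code' first: in Python 3 an HTTPError also has a 'reason' attribute,
    # so testing 'reason' first would hide every HTTP error.
    if hasattr(e, 'code'):
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)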
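
Another common layout uses two except clauses instead of hasattr checks. Because HTTPError is a subclass of URLError, the HTTPError clause has to come first. The helper name fetch and its return convention are choices made for this sketch only; the usage lines rely on urllib's built-in support for data: and file: URLs so that they run without a network.

```python
import urllib.error
import urllib.request


def fetch(url):
    """Return (body, error_message); exactly one of the two is None."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read(), None
    except urllib.error.HTTPError as e:   # must precede URLError
        return None, "HTTP error %d" % e.code
    except urllib.error.URLError as e:
        return None, "could not reach server: %s" % e.reason


# A data: URL succeeds without any network access.
ok_body, ok_err = fetch("data:text/plain,hello")
print(ok_body, ok_err)

# A file: URL pointing at a missing file raises URLError.
bad_body, bad_err = fetch("file:///no/such/file-urllib-demo")
print(bad_body, bad_err)
```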

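Title: Accessing Web Pages with urllib

Author: sweet

Published: 2017-07-21, 09:55:24

Last updated: 2017-10-21, 19:00:38

Original link: http://yoursite.com/2017/07/21/使用urllib访问网页/

License: "Attribution-NonCommercial-ShareAlike 4.0". Please keep the original link and author when reposting.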