Python Web Scraping: From Getting Started to Giving Up (2)

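Today we move on to more advanced techniques. Yesterday's approach only works for static pages; today we take the first step toward dynamic pages by looking more closely at how to use urllib.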

Sending requests

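Yesterday's urlopen call sent a GET request. Sometimes we need a POST request instead, which means passing the data argument (as bytes); timeout is the timeout in seconds:

request.urlopen('https://enterdawn.top/login.php', data=b'word=hello', timeout=10)

We can also send a POST with parameters. urlencode takes a dictionary and converts it into a URL-encoded string: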

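from urllib import parse, request
# urlencode returns a str such as 'value=english&page=1'; POST data must be bytes
params = parse.urlencode({"value": "english", "page": 1}).encode("utf-8")
f = request.urlopen("http://www.musi-cal.com/cgi-bin/query", data=params)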
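
To see how the POST body is put together without sending anything, here is a minimal sketch. It reuses the form values and URL from the example above; building a Request object does not touch the network, so it is safe to run offline.

```python
from urllib import parse, request

# Form fields, reusing the values from the example above.
form = {"value": "english", "page": 1}

# urlencode returns a str; the POST body has to be bytes.
body = parse.urlencode(form).encode("utf-8")

# Creating a Request sends nothing; passing data turns it into a POST.
req = request.Request("http://www.musi-cal.com/cgi-bin/query", data=body)

print(body)              # b'value=english&page=1'
print(req.get_method())  # POST
```

Passing the resulting req to request.urlopen would then actually send it.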

When a site checks the User-Agent, we can set custom headers:

from urllib import request
url = 'https://enterdawn.top'
headers = {'user-agent': 'Mozilla/5.0 (Windows; Windows NT 10.0) Chrome/66.0.3359.181 '}
req = request.Request(url, headers=headers)
resp = request.urlopen(req)
print(resp.read().decode())
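
Before sending, it can help to check which headers a Request will carry. The sketch below is only an illustration: the User-Agent value and the extra Accept-Language header are hypothetical placeholders, and nothing is sent over the network.

```python
from urllib import request

# Hypothetical User-Agent string, used only for illustration.
headers = {"user-agent": "my-crawler/0.1"}
req = request.Request("https://enterdawn.top", headers=headers)

# Headers can also be added after the Request has been created.
req.add_header("Accept-Language", "en-US")  # hypothetical extra header

# Request stores header names normalized with str.capitalize().
print(req.get_header("User-agent"))       # my-crawler/0.1
print(req.has_header("Accept-language"))  # True
```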

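Common Request parameters: urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None). Note that Request has no cookies parameter.
Sometimes we also need to work with cookies. We use cookiejar from the http package to collect the cookies a server sends: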

from http import cookiejar
from urllib import request
cookie = cookiejar.CookieJar()  # create a CookieJar object
cookies = request.HTTPCookieProcessor(cookie)  # create the cookie processor
opener = request.build_opener(cookies)  # create the opener
response = opener.open('https://www.bing.com')  # send the request
for i in cookie:  # print the cookies that were received
    print(i)

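We can attach a cookie to a request directly (for example with req.add_header('Cookie', ...)), or put it in the headers by adding a key-value pair, for example headers = {'Cookie': 'name=value'}. The value must be a string, not the HTTPCookieProcessor object created above.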
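
A minimal sketch of the header approach, assuming the cookies are already known as name/value pairs. The cookie names and values here are hypothetical; real ones would come from a login response or from the browser. Nothing is sent over the network.

```python
from urllib import request

# Hypothetical cookies, for illustration only.
cookies = {"sessionid": "abc123", "lang": "en"}

# A Cookie header is one string of name=value pairs joined by "; ".
cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())

req = request.Request("https://www.bing.com", headers={"Cookie": cookie_header})
print(req.get_header("Cookie"))  # sessionid=abc123; lang=en
```
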
We can also set a proxy to hide our real IP:

from urllib import request

url = 'https://enterdawn.top'
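proxy = {'http':'127.0.0.1:80','https':'127.0.0.1:80'}  # one proxy for http URLs and one for https URLs (they may differ)
proxies = request.ProxyHandler(proxy)  # proxy handler
opener = request.build_opener(proxies)  # opener object
response = opener.open(url)
# With the third-party requests library (import requests) this can also be done directly:
# requests.get(url, proxies=proxy)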
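
If every request should go through the proxy, the opener can be installed globally so that plain request.urlopen calls use it too. This is a sketch under assumptions: the proxy address is a placeholder and no proxy is expected to be listening there, so the code only builds and installs the opener without sending a request.

```python
from urllib import request

# Placeholder proxy address; replace with a real proxy before sending requests.
proxy = {"http": "http://127.0.0.1:80", "https": "http://127.0.0.1:80"}

handler = request.ProxyHandler(proxy)
opener = request.build_opener(handler)

# After install_opener, request.urlopen() uses this opener by default.
request.install_opener(opener)

print(handler.proxies["https"])  # http://127.0.0.1:80
```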

Miscellaneous

parse.quote is used for URL encoding:

from urllib import parse
keyword = '启明'
parse.quote(keyword)

Use unquote() to decode.
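
A quick round trip shows that quote and unquote are inverses of each other. It uses the same keyword as the example above:

```python
from urllib import parse

keyword = "启明"

# quote percent-encodes the UTF-8 bytes of the string.
encoded = parse.quote(keyword)
print(encoded)  # %E5%90%AF%E6%98%8E

# unquote reverses the encoding.
print(parse.unquote(encoded) == keyword)  # True
```
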
urllib.parse.urlencode() takes a dictionary and is handy for generating URL parameters in bulk:

from urllib import parse
p = {'co': '启明', 'p': '1', 'h': '17'}
parse.urlencode(p)  # produces 'co=%E5%90%AF%E6%98%8E&p=1&h=17'
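
The encoded string is usually appended to a URL after a "?". The sketch below does that and then parses the query back with parse_qs to confirm nothing was lost. The /search path is a hypothetical endpoint chosen only for illustration.

```python
from urllib import parse

p = {"co": "启明", "p": "1", "h": "17"}
query = parse.urlencode(p)

# Hypothetical endpoint; only the query string matters here.
url = "https://enterdawn.top/search?" + query
print(url)

# parse_qs decodes the query back into a dict; each value is a list
# because a key may appear more than once in a query string.
parsed = parse.parse_qs(parse.urlsplit(url).query)
print(parsed)  # {'co': ['启明'], 'p': ['1'], 'h': ['17']}
```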

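urllib.error defines two commonly used exceptions: URLError and HTTPError. HTTPError is a subclass of URLError.
HTTPError has three attributes: code (the HTTP status code), reason (the cause of the error), and headers (the response headers).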
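
Because HTTPError is a subclass of URLError, it has to be caught first, otherwise the URLError branch swallows it. A minimal sketch follows; the fetch helper is a hypothetical name introduced here, not part of urllib.

```python
from urllib import error, request


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        # The server answered, but with an error status such as 404.
        print("HTTP error:", e.code, e.reason)
    except error.URLError as e:
        # No valid response at all, e.g. DNS failure or connection refused.
        print("Failed to reach the server:", e.reason)
    return None
```

A call such as fetch('https://enterdawn.top') returns the page bytes on success and None on either kind of failure.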

urllib3

urllib3 is a third-party library that provides many features missing from urllib in the Python standard library:

pip install urllib3

The simplest usage example:

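import urllib3
urllib3.disable_warnings()  # suppress warnings; unverified HTTPS requests would otherwise emit them
http = urllib3.PoolManager()  # a connection pool manager
resp = http.request('GET', 'https://enterdawn.top')  # send a request
print(resp.status)  # status code
print(resp.data)  # response body (bytes)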

We can pass various arguments to http.request:

# url_params is a dict of URL query parameters
http.request('GET', 'https://enterdawn.top', headers={'X-Something': 'value'}, fields=url_params, timeout=3)

Proxies are configured on the connection pool:

urllib3.ProxyManager('http://127.0.0.1:80', headers=headers)

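urllib3 cannot set cookies directly; they can only be set through the headers.
To send JSON data that has already been serialized and encoded, pass it as the body and set the Content-Type header: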

import json
import urllib3
url = "http://httpbin.org"
data = {'name':'enterdawn'}
jsond = json.dumps(data).encode('utf-8')  # serialize to JSON, then encode to bytes
http = urllib3.PoolManager()
r = http.request('POST', url + "/post", body=jsond, headers={'Content-Type': 'application/json'})
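
The encoding step can be checked on its own, without any network access. The sketch below uses only the standard library and shows the round trip from dict to bytes and back. It assumes the data is UTF-8 encoded JSON, as in the example above.

```python
import json

data = {"name": "enterdawn"}

# Serialize to a JSON string, then encode to bytes for the request body.
jsond = json.dumps(data).encode("utf-8")
print(jsond)  # b'{"name": "enterdawn"}'

# Decoding reverses both steps. A response body in r.data (bytes) can be
# handled the same way, assuming the server replied with UTF-8 JSON.
decoded = json.loads(jsond.decode("utf-8"))
print(decoded == data)  # True
```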

Upload a file using multipart/form-data:

with open('example.txt') as fp:
    file_data = fp.read()
# each field is (filename, file contents); http is the PoolManager created earlier
r = http.request('POST', 'http://httpbin.org/post', fields={'filefield': ('example.txt', file_data)})

To send raw binary data, simply pass it as the body argument. It is also a good idea to set the Content-Type header:

with open('example.jpg', 'rb') as fp:
    binary_data = fp.read()
r = http.request('POST', 'http://httpbin.org/post', body=binary_data, headers={'Content-Type': 'image/jpeg'})

Reference: Python urllib, urllib2, urllib3 usage and differences, IoneFine, https://blog.csdn.net/jiduochou963/article/details/87564467, CC 4.0 BY-SA
