零基础自学用Python 3开发网络爬虫(三): 伪装浏览器君_编程语言论坛

论坛 >编程语言 >零基础自学用Python 3开发网络爬虫(三): 伪装浏览器君

零基础自学用Python 3开发网络爬虫(三): 伪装浏览器君

课课家iOS游客发布于 2017-09-28 09:41查看:910回复:1

添加超时跳过功能

首先, 我简单地将

改为

运行后发现, 当发生超时, 程序因为exception中断. 于是我把这一句也放在try .. except 结构里, 问题解决.

支持自动跳转

在爬 http://baidu.com 的时候, 爬回来一个没有什么内容的东西, 这个东西告诉我们应该跳转到 http://www.baidu.com . 但是我们的爬虫并不支持自动跳转, 现在我们来加上这个功能, 让爬虫在爬 baidu.com 的时候能够抓取 www.baidu.com 的内容.

首先我们要知道爬 http://baidu.com 的时候他返回的页面是怎么样的, 这个我们既可以用 Fiddler 看, 也可以写一个小爬虫来抓取. 这里我抓到的内容如下, 你也应该尝试一下写几行 python 来抓一抓.

看代码我们知道这是一个利用 html 的 meta 来刷新与重定向的代码, 其中的0是等待0秒后跳转, 也就是立即跳转. 这样我们再像上一次说的那样用一个正则表达式把这个url提取出来就可以爬到正确的地方去了. 其实我们上一次写的爬虫已经可以具有这个功能, 这里只是单独拿出来说明一下 http 的 meta 跳转.

伪装浏览器正规军

前面几个小内容都写的比较少. 现在详细研究一下如何让网站们把我们的Python爬虫当成正规的浏览器来访. 因为如果不这么伪装自己, 有的网站就爬不回来了. 如果看过理论方面的知识, 就知道我们是要在 GET 的时候将 User-Agent 添加到header里.

如果没有看过理论知识, 按照以下关键字搜索学习吧 :D

HTTP 报文分两种: 请求报文和响应报文

请求报文的请求行与首部行

GET, POST, HEAD, PUT, DELETE 方法

我用 IE 浏览器访问百度首页的时候, 浏览器发出去的请求报文如下:

然后百度收到这个消息后, 返回给我的的响应报文如下(有删节):

如果能够看懂这段话的第一句就OK了, 别的可以以后再配合 Fiddler 慢慢研究. 所以我们要做的就是在 Python 爬虫向百度发起请求的时候, 顺便在请求里面写上 User-Agent, 表明自己是浏览器君.

在 GET 的时候添加 header 有很多方法, 下面介绍两种方法.

第一种方法比较简便直接, 但是不好扩展功能, 代码如下:

第二种方法使用了 build_opener 这个方法, 用来自定义 opener, 这种方法的好处是可以方便的拓展功能, 例如下面的代码就拓展了自动处理 Cookies 的功能.

上述代码运行后通过 Fiddler 抓到的 GET 报文如下所示:

可见我们在代码里写的东西都添加到请求报文里面了.

保存抓回来的报文

顺便说说文件操作. Python 的文件操作还是相当方便的. 我们可以讲抓回来的数据 data 以二进制形式保存, 也可以经过 decode() 处理成为字符串后以文本形式保存. 改动一下打开文件的方式就能用不同的姿势保存文件了. 下面是参考代码:

查看评分情况

全部评分

此主贴暂时没有点赞评分

总计：赞0次

回复分享

版主推荐

房屋出租管理系统小程序版（附项目源码）视频教程

人脸识别门禁系统（附vue+springboot项目源码）视频教程

健身房会员管理系统（附springmvc项目源码）视频教程

房屋出租管理系统（附vue+springboot项目源码）视频教程

车牌识别停车场管理系统（附SSM项目源代码）视频教程

上一篇：零基础自学用Python 3开发网络爬虫(二): 用到的数据结构简介以及爬虫Ver1.0 alpha
下一篇：python 线程之 Condition

精品在线课程【一线专家讲授+24小时内答疑+永久免费观看+市场1/10价格】

[换一换]

共有1条评论

希尔瓦娜斯
楼主说的好，继续加油
2017-09-29 09:15赞 (0)回复沙发

本论坛发帖,请先登录

发布新贴

版主招版主啦

IT宅男
mr jack
Mr ken
Mright
cappuccino
YUI
课课家运营团队
课课家技术团队1
酸酸~甜甜

课程推荐

[换一换]

《C深度解析》第一章 c程序的编译链接、目标文件格式、c程序内存结构视频教程: 11141人学习

十天掌握VB程序设计语言视频教程: 11360人学习

跟着王进老师学Python之Django篇第一季：Django框架入门视频教程: 21496人学习

Struts2+Hibernate5+Spring4五分钟上手视频课程: 11342人学习

楼主关注

发布新贴

选择版块:
标题:
内容
验证码:

编辑帖子

标题:
内容
<h2 style="border: 0px; margin: 0px 0px 20px; padding: 0px; font-size: 24px; font-stretch: normal; line-height: 36px; font-family: "Microsoft YaHei", "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; color: rgb(46, 46, 46); white-space: normal; background-color: rgb(255, 255, 255);">     添加超时跳过功能</h2><p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        首先, 我简单地将<img src="/Public/forum/ueditor/image/20170928/1506562373636551.png" title="1506562373636551.png" alt="image.png"/> <p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        改为<p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);"><img src="/Public/forum/ueditor/image/20170928/1506562393709196.png" title="1506562393709196.png" alt="image.png"/><p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        运行后发现, 当发生超时, 程序因为exception中断. 于是我把这一句也放在try .. except 结构里, 问题解决.<h2 style="border: 0px; margin: 0px 0px 20px; padding: 0px; font-size: 24px; font-stretch: normal; line-height: 36px; font-family: "Microsoft YaHei", "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; color: rgb(46, 46, 46); white-space: normal; background-color: rgb(255, 255, 255);">     支持自动跳转</h2><p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        在爬 http://baidu.com 的时候, 爬回来一个没有什么内容的东西, 这个东西告诉我们应该跳转到 http://www.baidu.com . 但是我们的爬虫并不支持自动跳转, 现在我们来加上这个功能, 让爬虫在爬 baidu.com 的时候能够抓取 www.baidu.com 的内容.<p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        首先我们要知道爬 http://baidu.com 的时候他返回的页面是怎么样的, 这个我们既可以用 Fiddler 看, 也可以写一个小爬虫来抓取. 这里我抓到的内容如下, 你也应该尝试一下写几行 python 来抓一抓.<p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);"><img src="/Public/forum/ueditor/image/20170928/1506562419901510.png" title="1506562419901510.png" alt="image.png"/><p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        看代码我们知道这是一个利用 html 的 meta 来刷新与重定向的代码, 其中的0是等待0秒后跳转, 也就是立即跳转. 这样我们再像上一次说的那样用一个正则表达式把这个url提取出来就可以爬到正确的地方去了. 其实我们上一次写的爬虫已经可以具有这个功能, 这里只是单独拿出来说明一下 http 的 meta 跳转.<h2 style="border: 0px; margin: 0px 0px 20px; padding: 0px; font-size: 24px; font-stretch: normal; line-height: 36px; font-family: "Microsoft YaHei", "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; color: rgb(46, 46, 46); white-space: normal; background-color: rgb(255, 255, 255);">     伪装浏览器正规军</h2><p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        前面几个小内容都写的比较少. 现在详细研究一下如何让网站们把我们的Python爬虫当成正规的浏览器来访. 因为如果不这么伪装自己, 有的网站就爬不回来了. 如果看过理论方面的知识, 就知道我们是要在 GET 的时候将 User-Agent 添加到header里.<p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        如果没有看过理论知识, 按照以下关键字搜索学习吧 :D<ul style="border: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; list-style-position: outside; list-style-image: initial; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);" class=" list-paddingleft-2"><ul class=" list-paddingleft-2" style="list-style-type: square;">       HTTP 报文分两种: 请求报文和响应报文       请求报文的请求行与首部行       GET, POST, HEAD, PUT, DELETE 方法</ul></ul><p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        我用 IE 浏览器访问百度首页的时候, 浏览器发出去的请求报文如下:<p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);"><img src="/Public/forum/ueditor/image/20170928/1506562526992513.png" title="1506562526992513.png" alt="image.png"/><p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        然后百度收到这个消息后, 返回给我的的响应报文如下(有删节):<p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);"><img src="/Public/forum/ueditor/image/20170928/1506562551795760.png" title="1506562551795760.png" alt="image.png"/><p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        如果能够看懂这段话的第一句就OK了, 别的可以以后再配合 Fiddler 慢慢研究. 所以我们要做的就是在 Python 爬虫向百度发起请求的时候, 顺便在请求里面写上 User-Agent, 表明自己是浏览器君.<p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        在 GET 的时候添加 header 有很多方法, 下面介绍两种方法.<p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        第一种方法比较简便直接, 但是不好扩展功能, 代码如下:<p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);"><img src="/Public/forum/ueditor/image/20170928/1506562572912888.png" title="1506562572912888.png" alt="image.png"/> <p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);"><strong style="border: 0px; margin: 0px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        第二种方法使用了 build_opener 这个方法, 用来自定义 opener, 这种方法的好处是可以方便的拓展功能, 例如下面的代码就拓展了自动处理 Cookies 的功能. <p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);"><img src="/Public/forum/ueditor/image/20170928/1506562593632248.png" title="1506562593632248.png" alt="image.png"/><p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        上述代码运行后通过 Fiddler 抓到的 GET 报文如下所示:<p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);"><img src="/Public/forum/ueditor/image/20170928/1506562800996408.png" title="1506562800996408.png" alt="image.png"/><p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        可见我们在代码里写的东西都添加到请求报文里面了.<p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);"><h2 style="border: 0px; margin: 0px 0px 20px; padding: 0px; font-size: 24px; font-stretch: normal; line-height: 36px; font-family: "Microsoft YaHei", "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; color: rgb(46, 46, 46); white-space: normal; background-color: rgb(255, 255, 255);">     保存抓回来的报文</h2><p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">        顺便说说文件操作. Python 的文件操作还是相当方便的. 我们可以讲抓回来的数据 data 以二进制形式保存, 也可以经过 decode() 处理成为字符串后以文本形式保存. 改动一下打开文件的方式就能用不同的姿势保存文件了. 下面是参考代码:<p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);"><img src="/Public/forum/ueditor/image/20170928/1506562868408487.png" title="1506562868408487.png" alt="image.png"/> <p style="border: 0px; margin-top: 0px; margin-bottom: 20px; padding: 0px; font-size: 15px; color: rgb(46, 46, 46); font-family: "Microsoft YaHei", 宋体, "Myriad Pro", Lato, "Helvetica Neue", Helvetica, Arial, sans-serif; white-space: normal; background-color: rgb(255, 255, 255);">
选择版块:

关注微信公众号，可下载APP应用。

零基础自学用Python 3开发网络爬虫(三): 伪装浏览器君

添加超时跳过功能

支持自动跳转

伪装浏览器正规军

保存抓回来的报文

版主推荐

精品在线课程【一线专家讲授+24小时内答疑+永久免费观看+市场1/10价格】

共有1条评论

明星会员

课程推荐

楼主关注

版主推荐

热门贴子

发布新贴

编辑帖子

移动帖子x

粤ICP备13047178号粤公网安备44010602001432号

广州挪贤计算机科技有限公司版权所有

Copyright @ 2013-2025 KokoJia.com Inc. All Rights Reserved.

关注微信公众号，可下载APP应用。

零基础自学用Python 3开发网络爬虫(三): 伪装浏览器君

添加超时跳过功能

支持自动跳转

伪装浏览器正规军

保存抓回来的报文

版主推荐

精品在线课程【一线专家讲授+24小时内答疑+永久免费观看+市场1/10价格】

共有1条评论

明星会员

课程推荐

楼主关注

版主推荐

热门贴子

发布新贴

编辑帖子

移动帖子x

粤ICP备13047178号 粤公网安备44010602001432号

广州挪贤计算机科技有限公司 版权所有

Copyright @ 2013-2025 KokoJia.com Inc. All Rights Reserved.

粤ICP备13047178号粤公网安备44010602001432号

广州挪贤计算机科技有限公司版权所有