A Super-Simple Beginner Web Scraping Tutorial: Up and Running in Five Minutes

张开发
2026/4/3 18:51:32 · 15 min read
1. Approach

1. First, be clear about what you want to scrape. This example targets a top-100 movie leaderboard. (Douban has been unreachable lately, possibly a server outage, so the walkthrough below uses Maoyan instead.) The data itself is simple: movie title, release date, and rating. That is exactly what we will extract.
2. Then write the script.

2. Prerequisites

1. Basic Python: opening and writing files with a file object, plus `for` loops.
2. CSV files can be opened directly in Excel.
3. XPath path syntax:
   1. `.` (dot): the current node, i.e. inside the current `dd`.
   2. `//` (double slash): search at any depth, no matter how deeply things are nested.
   3. `p[@class="name"]`: find a `p` tag whose `class` attribute equals `name`. `p` is the tag name, the `[]` holds the filter condition, and the `@` symbol means "attribute".
   4. `/a` (single slash plus `a`): a direct child, i.e. the `a` tag right under that element.
   5. `/text()`: take the text content, i.e. the text sandwiched between the tags (here, the movie title).

   Put together: `.//p[@class="name"]/a/text()`

3. Step by Step

The example target is the Maoyan movie board: https://www.maoyan.com/board/4?offset=10

1. Imports

```python
import csv
import requests          # for sending HTTP requests
from lxml import etree   # etree from lxml, a third-party HTML parsing library
```

2. Fetch the page's HTML

We are scraping a web page, so how do we get that page's content into Python? That is what the `requests` package is for; it sends our request and hands back the response:

```python
url = f"https://www.maoyan.com/board/4?offset={x}"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",  # to look like a human
    "Cookie": "....",  # paste your own cookie here; press F12 and look under the Network tab
    "Referer": "https://www.maoyan.com/board/4",  # optional, also to look like a human
    "Host": "www.maoyan.com",  # optional, also to look like a human
}
resp = requests.get(url, headers=headers, timeout=10)
html = resp.text

parse = etree.HTML(html)  # etree from lxml parses the raw HTML into a searchable tree
```

3. Filter the HTML for the information we need

To locate the movie title, release date, and rating, open the URL, press F12 to open the inspector, and dig in. You will find that each movie sits inside a `dd` ("Definition Description") container. Taking the title as an example, in my own home-made shorthand the path is: `dd` -> the `p` whose class is `name` -> `a`. Grab that and you have the title. Converted into real XPath:

`.//p[@class="name"]/a/text()`

The code:

```python
all_dd = parse.xpath("//dd")

for dd in all_dd:
    name = dd.xpath('.//p[@class="name"]/a/text()')[0].strip()
    rtime = dd.xpath('.//p[@class="releasetime"]/text()')[0].strip()
    integ = dd.xpath('.//i[@class="integer"]/text()')[0].strip()
    fra = dd.xpath('.//i[@class="fraction"]/text()')[0].strip()
    score = integ + fra   # the rating is split into an integer part and a fraction part

    movie_info = {
        "name": name,
        "time": rtime,
        "score": score,
    }
    print(movie_info)
```

4. The combined script

```python
import csv
import requests          # for sending HTTP requests
from lxml import etree   # etree from lxml, a third-party HTML parsing library


for x in range(0, 100, 10):   # the board paginates by 10: offset=0, 10, ..., 90
    url = f"https://www.maoyan.com/board/4?offset={x}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Cookie": "...",
        "Referer": "https://www.maoyan.com/board/4",
        "Host": "www.maoyan.com",
    }
    resp = requests.get(url, headers=headers, timeout=10)
    html = resp.text
    parse = etree.HTML(html)
    all_dd = parse.xpath("//dd")
    for dd in all_dd:
        name = dd.xpath('.//p[@class="name"]/a/text()')[0].strip()
        rtime = dd.xpath('.//p[@class="releasetime"]/text()')[0].strip()
        integ = dd.xpath('.//i[@class="integer"]/text()')[0].strip()
        fra = dd.xpath('.//i[@class="fraction"]/text()')[0].strip()
        score = integ + fra

        movie_info = {
            "name": name,
            "time": rtime,
            "score": score,
        }
        print(movie_info)
```

5. Export to a CSV table

```python
import csv
import requests          # for sending HTTP requests
from lxml import etree   # etree from lxml, a third-party HTML parsing library

with open("movie.csv", "w", encoding="utf_8_sig", newline="") as fp:
    fieldnames = ["name", "time", "score"]
    writer = csv.DictWriter(fp, fieldnames=fieldnames)
    writer.writeheader()

    for x in range(0, 100, 10):
        url = f"https://www.maoyan.com/board/4?offset={x}"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Cookie": "...",
            "Referer": "https://www.maoyan.com/board/4",
            "Host": "www.maoyan.com",
        }
        resp = requests.get(url, headers=headers, timeout=10)
        html = resp.text
        parse = etree.HTML(html)
        all_dd = parse.xpath("//dd")
        for dd in all_dd:
            name = dd.xpath('.//p[@class="name"]/a/text()')[0].strip()
            rtime = dd.xpath('.//p[@class="releasetime"]/text()')[0].strip()
            integ = dd.xpath('.//i[@class="integer"]/text()')[0].strip()
            fra = dd.xpath('.//i[@class="fraction"]/text()')[0].strip()
            score = integ + fra

            movie_info = {
                "name": name,
                "time": rtime,
                "score": score,
            }
            writer.writerow(movie_info)
```
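Before touching the live site, you can verify the XPath selectors offline against a hand-written snippet. The markup below is a hypothetical stand-in modeled on the class names used in this tutorial, not a copy of Maoyan's actual page, so treat it only as a way to check the selector syntax:

```python
from lxml import etree

# Hypothetical one-movie snippet using the same class names as the tutorial.
html = """
<dl>
  <dd>
    <p class="name"><a href="/films/1">Movie A</a></p>
    <p class="releasetime">1993-07-26</p>
    <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
  </dd>
</dl>
"""

parse = etree.HTML(html)
movies = []
for dd in parse.xpath("//dd"):
    name = dd.xpath('.//p[@class="name"]/a/text()')[0].strip()
    rtime = dd.xpath('.//p[@class="releasetime"]/text()')[0].strip()
    # The rating is split across two <i> tags; concatenate the halves.
    score = (dd.xpath('.//i[@class="integer"]/text()')[0]
             + dd.xpath('.//i[@class="fraction"]/text()')[0])
    movies.append({"name": name, "time": rtime, "score": score})

print(movies)
```

If the selectors are right, this prints one dict with the title, date, and the reassembled "9.5" score; swapping in the real page's HTML should behave the same way.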
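It is also worth confirming that the CSV step round-trips cleanly. The sketch below uses only the standard library and placeholder rows in the same shape the scraper produces (the rows are made-up data, not real scraped output):

```python
import csv

# Placeholder rows shaped like the scraper's movie_info dicts.
rows = [
    {"name": "Movie A", "time": "1993-07-26", "score": "9.5"},
    {"name": "Movie B", "time": "1994-09-10", "score": "9.5"},
]

# utf_8_sig prepends a BOM so Excel detects the encoding correctly;
# newline="" stops csv from writing blank lines between rows on Windows.
with open("movie.csv", "w", encoding="utf_8_sig", newline="") as fp:
    writer = csv.DictWriter(fp, fieldnames=["name", "time", "score"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back with DictReader to confirm the header and rows survived.
with open("movie.csv", encoding="utf_8_sig", newline="") as fp:
    back = list(csv.DictReader(fp))

print(back == rows)  # True
```

Reading with the same `utf_8_sig` codec strips the BOM again, so the first field name comes back clean rather than as `\ufeffname`.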
