一个Python案例带你掌握xpath数据解析方法
目录
xpath基本概念xpath解析原理环境安装如何实例化一个etree对象xpath(‘xpath表达式’)xpath爬取58二手房实例xpath图片解析下载实例xpath爬取全国城市名称实例xpath爬取简历模板实例xpath基本概念
xpath解析:最常用且最便捷高效的一种解析方式。通用性强。
xpath解析原理
1.实例化一个etree的对象,且需要将被解析的页面源码数据加载到该对象中
2.调用etree对象中的xpath方法结合xpath表达式实现标签的定位和内容的捕获。
环境安装
pip install lxml
如何实例化一个etree对象
from lxml import etree
1.将本地的html文件中的远吗数据加载到etree对象中:
etree.parse(filePath)
2.可以将从互联网上获取的原码数据加载到该对象中:
etree.HTML(‘page_text")
xpath(‘xpath表达式’)
1./:表示的是从根节点开始定位。表示一个层级
2.//:表示多个层级。可以表示从任意位置开始定位
3.属性定位://div[@class="song"] tag[@attrName="attrValue"]
4.索引定位://div[@class="song"]/p[3] 索引从1开始的
5.取文本:
/text()获取的是标签中直系的文本内容//text()标签中非直系的文本内容(所有文本内容)6.取属性:/@attrName ==>img/src
xpath爬取58二手房实例
完整代码
from lxml import etree import requests if __name__ == "__main__": headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36" } url = "https://xa.58.com/ershoufang/" page_text = requests.get(url=url,headers=headers).text tree = etree.HTML(page_text) div_list = tree.xpath("//section[@class="list"]/div") fp = open("./58同城二手房.txt","w",encoding="utf-8") for div in div_list: title = div.xpath(".//div[@class="property-content-title"]/h3/text()")[0] print(title) fp.write(title+"\n"+"\n")
xpath图片解析下载实例
完整代码
import requests,os from lxml import etree if __name__ == "__main__": headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36" } url = "https://pic.netbian.com/4kmeinv/" page_text = requests.get(url=url,headers=headers).text tree = etree.HTML(page_text) li_list = tree.xpath("//div[@class="slist"]/ul/li/a") if not os.path.exists("./piclibs"): os.mkdir("./piclibs") for li in li_list: detail_url ="https://pic.netbian.com" + li.xpath("./img/@src")[0] detail_name = li.xpath("./img/@alt")[0]+".jpg" detail_name = detail_name.encode("iso-8859-1").decode("GBK") detail_path = "./piclibs/" + detail_name detail_data = requests.get(url=detail_url, headers=headers).content with open(detail_path,"wb") as fp: fp.write(detail_data) print(detail_name,"seccess!!")
xpath爬取全国城市名称实例
完整代码
import requests from lxml import etree if __name__ == "__main__": url = "https://www.aqistudy.cn/historydata/" headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36", } page_text = requests.get(url=url,headers=headers).content.decode("utf-8") tree = etree.HTML(page_text) #热门城市 //div[@class="bottom"]/ul/li #全部城市 //div[@class="bottom"]/ul/div[2]/li a_list = tree.xpath("//div[@class="bottom"]/ul/li | //div[@class="bottom"]/ul/div[2]/li") fp = open("./citys.txt","w",encoding="utf-8") i = 0 for a in a_list: city_name = a.xpath(".//a/text()")[0] fp.write(city_name+"\t") i=i+1 if i == 6: i = 0 fp.write("\n") print("爬取成功")
xpath爬取简历模板实例
完整代码
import requests,os from lxml import etree if __name__ == "__main__": url = "https://sc.chinaz.com/jianli/free.html" headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36", } page_text = requests.get(url=url,headers=headers).content.decode("utf-8") tree = etree.HTML(page_text) a_list = tree.xpath("//div[@class="box col3 ws_block"]/a") if not os.path.exists("./简历模板"): os.mkdir("./简历模板") for a in a_list: detail_url = "https:"+a.xpath("./@href")[0] detail_page_text = requests.get(url=detail_url,headers=headers).content.decode("utf-8") detail_tree = etree.HTML(detail_page_text) detail_a_list = detail_tree.xpath("//div[@class="clearfix mt20 downlist"]/ul/li[1]/a") for a in detail_a_list: download_name = detail_tree.xpath("//div[@class="ppt_tit clearfix"]/h1/text()")[0] download_url = a.xpath("./@href")[0] download_data = requests.get(url=download_url,headers=headers).content download_path = "./简历模板/"+download_name+".rar" with open(download_path,"wb") as fp: fp.write(download_data) print(download_name,"success!!")
以上就是一个Python案例带你掌握xpath数据解析方法的详细内容,更多关于Python xpath数据解析的资料请关注脚本之家其它相关文章!
X 关闭
X 关闭
- 15G资费不大降!三大运营商谁提供的5G网速最快?中国信通院给出答案
- 2联想拯救者Y70发布最新预告:售价2970元起 迄今最便宜的骁龙8+旗舰
- 3亚马逊开始大规模推广掌纹支付技术 顾客可使用“挥手付”结账
- 4现代和起亚上半年出口20万辆新能源汽车同比增长30.6%
- 5如何让居民5分钟使用到各种设施?沙特“线性城市”来了
- 6AMD实现连续8个季度的增长 季度营收首次突破60亿美元利润更是翻倍
- 7转转集团发布2022年二季度手机行情报告:二手市场“飘香”
- 8充电宝100Wh等于多少毫安?铁路旅客禁止、限制携带和托运物品目录
- 9好消息!京东与腾讯续签三年战略合作协议 加强技术创新与供应链服务
- 10名创优品拟通过香港IPO全球发售4100万股 全球发售所得款项有什么用处?