Asynchronously Scraping the Zhihu Hot List with Python: A Worked Example

Source: 脚本之家  Date: 2022-04-12 08:54:34

1. Faulty code: the excerpt and the detail URL cannot be retrieved

import asyncio
from bs4 import BeautifulSoup
import aiohttp
 
headers={
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "referer": "https://www.baidu.com/s?tn=02003390_43_hao_pg&isource=infinity&iname=baidu&itype=web&ie=utf-8&wd=%E7%9F%A5%E4%B9%8E%E7%83%AD%E6%A6%9C"
}
async def getPages(url):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as resp:
            print(resp.status)  # print the HTTP status code
            html=await resp.text()
    soup=BeautifulSoup(html,"lxml")
    items=soup.select(".HotList-item")
    for item in items:
        title=item.select(".HotList-itemTitle")[0].text
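        try:
            abstract=item.select(".HotList-itemExcerpt")[0].text
        except IndexError:
            # no excerpt element was found in the HTML for this entry
            abstract="No Abstract"
        hot=item.select(".HotList-itemMetrics")[0].text
        try:
            # select() returns a list, so take the first <img> before reading "src"
            img=item.select(".HotList-itemImgContainer img")[0]["src"]
        except (IndexError, KeyError):
            img="No Img"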
        print("{}\n{}\n{}".format(title,abstract,img))
 
if __name__ == "__main__":
    url="https://www.zhihu.com/billboard"
    # asyncio.run() creates the event loop, runs the coroutine, and closes the loop
    asyncio.run(getPages(url))

2. Inspecting the JS code

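It turns out that the detail link, the image link, the question excerpt, and so on are all inside the page's JS (the CSDN developer-assistant browser plugin is genuinely handy for spotting this).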

A regular expression can then pull this information out of the page source.
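As a rough illustration of the idea, the snippet below runs that kind of pattern over a tiny hand-written string. The sample markup and its values are invented for the demo, and `extract_hot_list` is a hypothetical helper name; only the `"hotList"` and `"guestFeeds"` key names are taken from the page discussed here.

```python
import json
import re

# Invented, heavily simplified stand-in for the real page source.
SAMPLE_HTML = (
    '<script>'
    '{"hotList":[{"target":{"titleArea":{"text":"Example question"}}}],'
    '"guestFeeds":null}'
    '</script>'
)

def extract_hot_list(html):
    """Return the list stored under "hotList", or [] if the pattern is absent."""
    # Non-greedy capture of everything between the two JSON keys.
    match = re.search(r'"hotList":(.*?),"guestFeeds":', html)
    if match is None:
        return []
    return json.loads(match.group(1))

items = extract_hot_list(SAMPLE_HTML)
print(items[0]["target"]["titleArea"]["text"])
```

Returning an empty list when nothing matches keeps the caller from hitting an `AttributeError` on `None` if the page layout changes.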

The complete code follows:

import asyncio
import json
import re
import aiohttp
 
headers={
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "referer": "https://www.baidu.com/s?tn=02003390_43_hao_pg&isource=infinity&iname=baidu&itype=web&ie=utf-8&wd=%E7%9F%A5%E4%B9%8E%E7%83%AD%E6%A6%9C"
}
async def getPages(url):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as resp:
            print(resp.status)  # print the HTTP status code
            html=await resp.text()
 
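    # The hot-list data is embedded as JSON between the "hotList" and "guestFeeds" keys
    regex=re.compile(r'"hotList":(.*?),"guestFeeds":')
    text=regex.search(html).group(1)
    # print(json.loads(text))   # parse the JSON text into Python objects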
    for item in json.loads(text):
        title=item["target"]["titleArea"]["text"]
        question=item["target"]["excerptArea"]["text"]
        hot=item["target"]["metricsArea"]["text"]
        link=item["target"]["link"]["url"]
        img=item["target"]["imageArea"]["url"]
        if not img:
            img="No Img"
        if not question:
            question="No Abstract"
        print("Title: {}\nPopular: {}\nQuestion: {}\nLink: {}\nImg: {}".format(title,hot,question,link,img))
 
if __name__ == "__main__":
    url="https://www.zhihu.com/billboard"
    # asyncio.run() creates the event loop, runs the coroutine, and closes the loop
    asyncio.run(getPages(url))
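
The `for` loop above assumes that every entry carries all five fields. If that assumption turns out to be shaky, the nested keys can be read defensively instead. The sketch below is one possible way to do that: `dig` and `summarize` are hypothetical helper names, and the sample entry is invented to mirror the key layout used in the code above.

```python
def dig(data, *keys, default=""):
    """Walk nested dicts key by key, returning `default` if any level is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data

def summarize(item):
    """Flatten one hot-list entry into a dict of the five printed fields."""
    target = item.get("target", {})
    return {
        "title": dig(target, "titleArea", "text"),
        "hot": dig(target, "metricsArea", "text"),
        "question": dig(target, "excerptArea", "text") or "No Abstract",
        "link": dig(target, "link", "url"),
        "img": dig(target, "imageArea", "url") or "No Img",
    }

# Invented entry that has no excerpt and no image area at all.
sample = {"target": {"titleArea": {"text": "Example question"},
                     "metricsArea": {"text": "100 hot"},
                     "link": {"url": "https://example.com/q/1"}}}
print(summarize(sample))
```

With this shape, a missing key produces the same "No Abstract" or "No Img" fallback as an empty value, rather than a `KeyError`.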
