
A simple Python crawler for scraping Bilibili comments

Posted by an anonymous user on 2023-07-21 14:59:17


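I. Introduction

Bilibili comments have no search function, so I threw together a crawler that scrapes a video's comments and saves them to a local txt file.

First, install Python's requests library and the BeautifulSoup library:

pip install requests

pip install bs4

If the output contains "Successfully installed", the installation worked.

These are all the libraries the script imports: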

import requests
from bs4 import BeautifulSoup
import re
import json
from pprint import pprint
import time

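II. Analyzing the page

Viewing the page source shows no comment data at all. Scrolling down to the comment area, the comments take a moment to load, so I guessed that they arrive through a separate request that has to be captured.

Open the developer tools with F12 and filter the Network tab for "reply"; the comment data shows up there.

[Figure 1]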

I extracted the URL to look at the data it returns.

[Figure 2]

I am not sure why, but the URL only displays properly in the browser after the data following callback is deleted from it.
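A likely explanation for the callback behaviour is JSONP: when a callback parameter is present, a server typically wraps the JSON in a JavaScript function call, which a JSON viewer cannot parse. The sketch below illustrates that idea under this assumption and is not taken from the original post: strip_callback and unwrap_jsonp are hypothetical helper names, and the callback value jQuery123_456 is made up.

```python
import json
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_callback(url: str) -> str:
    """Return the URL without its callback query parameter."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != "callback"]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def unwrap_jsonp(text: str) -> dict:
    """Parse either plain JSON or a JSONP body such as name({...});"""
    match = re.fullmatch(r"\s*[\w$.]+\s*\((.*)\)\s*;?\s*", text, flags=re.DOTALL)
    return json.loads(match.group(1) if match else text)


# Hypothetical captured URL; only the callback value is invented.
captured = ("https://api.bilibili.com/x/v2/reply/main"
            "?callback=jQuery123_456&jsonp=jsonp&next=0&type=1"
            "&oid=470113786&mode=3&plat=1")
print(strip_callback(captured))
print(unwrap_jsonp('jQuery123_456({"code": 0});'))
```

Dropping the parameter up front, as the article does, is the simpler route; unwrapping is only needed if the wrapped response has already been saved.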

Installing the JSON Formatter extension in Edge makes the response easier to read.

[Figure 3]

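A single package does not contain all the comments. Scrolling further down and looking for more reply requests in F12 gives another URL.

[Figure 4]

Only next changes between the requests. So what is next=1? Testing shows that next=1 and next=0 return the same data, so the program can start directly from 1.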
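The paging rule above can be sketched as a small generator. This is a minimal illustration, not the article's code: iter_root_pages and fetch_page are hypothetical names, and the stub dictionary stands in for real network responses so the loop can be followed without calling the API.

```python
from typing import Callable, Iterator, Optional


def iter_root_pages(fetch_page: Callable[[int], Optional[list]],
                    start: int = 1) -> Iterator[list]:
    """Yield one list of replies per page until a page comes back empty.

    fetch_page(cursor) is assumed to return the replies list for next=cursor,
    or None / an empty list when there is nothing left.
    """
    cursor = start  # next=0 and next=1 return the same data, so start at 1
    while True:
        replies = fetch_page(cursor)
        if not replies:
            return
        yield replies
        cursor += 1


# Stub standing in for the API: two pages, then nothing.
fake_pages = {1: ["comment a", "comment b"], 2: ["comment c"]}
pages = list(iter_root_pages(fake_pages.get))
print(pages)
```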

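These packages contain only root comments, not child comments, so I suspected that the child comments live in another package. Expanding the child comments of one comment produced a new package in F12.

Extracting its URL in the same way shows that replies holds the child comments we need. One page again cannot hold every reply, and between the requests only pn differs.

[Figure 6]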

So how are child comments linked to their root comment?

The child-comment URL has a root parameter. Comparing the root and child packages shows that the root comment's rpid is the child request's root, which gives us the link.

[Figure 7]
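The rpid-to-root link can be sketched as follows. The endpoint and parameter names come from the URLs captured in this article; build_sub_reply_url is a hypothetical helper, and the rpid value 1001 is made up for illustration.

```python
from urllib.parse import urlencode

SUB_REPLY_ENDPOINT = "https://api.bilibili.com/x/v2/reply/reply"


def build_sub_reply_url(oid: int, root_rpid: int, pn: int = 1, ps: int = 10) -> str:
    """Build the child-comment URL for one root comment and one page."""
    params = {
        "jsonp": "jsonp",
        "pn": pn,           # page number of the child comments
        "type": 1,
        "oid": oid,         # id of the video the comments belong to
        "ps": ps,           # page size
        "root": root_rpid,  # rpid of the root comment being expanded
    }
    return f"{SUB_REPLY_ENDPOINT}?{urlencode(params)}"


# Made-up root comment in the same shape as an entry of data.replies.
root_comment = {"rpid": 1001, "content": {"message": "root comment"}}
url = build_sub_reply_url(oid=470113786, root_rpid=root_comment["rpid"])
print(url)
```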

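While writing the code I ran into one more issue: some root comments do not need expanding. In that case replies in the child-comment package is empty, and the replies are stored in the root-comment package instead. A simple check handles this.

With the structure understood, the coding is much simpler.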
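The check described above might look like this in isolation. It is a sketch under the assumption that both packages use the replies field as described; choose_child_replies is a hypothetical name and the sample data is invented.

```python
from typing import Optional


def choose_child_replies(root_comment: dict, fetched: Optional[list]) -> list:
    """Prefer replies from the child-comment package; fall back to the ones
    embedded in the root-comment package when the child package is empty."""
    if fetched:
        return fetched
    return root_comment.get("replies") or []


# Invented root comment whose replies are embedded (no expansion needed).
embedded = {"rpid": 1001, "replies": [{"content": {"message": "inline reply"}}]}
# Invented root comment with no replies at all.
bare = {"rpid": 1002, "replies": None}

print(choose_child_replies(embedded, None))
print(choose_child_replies(bare, None))
```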

III. Code

1. Headers

# Request headers
headers = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "referer" : "https://www.bilibili.com/"
}

2. Fetching root comments

def get_rootReply(headers):
    num = 1          # page cursor; next=0 and next=1 return the same data
    reply_index = 1  # running number written in front of each root comment
    while True:
        URL = (f"https://api.bilibili.com/x/v2/reply/main?&jsonp=jsonp&next={num}&type=1&oid=470113786&mode=3&plat=1&_=1680096302818")   # root-comment request captured in F12
        respond = requests.get(URL, headers=headers, timeout=10)  # fetch one page of root comments
        if respond.status_code != 200:  # continue only on a 200 response
            print("respond error!")
            break
        respond.encoding = "UTF-8"
        json_html = json.loads(respond.text)  # parse the body into a dict so the fields are easy to look up

        replies = (json_html.get('data') or {}).get('replies')
        if not replies:  # None or an empty list means there are no more pages
            break

        for item in replies:  # one page holds at most 20 root comments
            reply = item['content']['message'].replace('\n', ',')
            root = item['rpid']  # the child-comment request uses this as its root parameter
            file.write(str(reply_index) + '.' + reply + '\n')
            if item['replies'] is not None:
                # If the child-comment package is empty, the replies are
                # already embedded in the root-comment package.
                if not get_SecondReply(headers, root):
                    for child in item['replies']:
                        reply = child['content']['message'].replace('\n', ',')
                        file.write("        " + reply + '\n')
            reply_index += 1
        num += 1

        time.sleep(0.5)  # short pause between pages
    file.close()

3. Fetching child comments

# Note: the original request used oid=824175427 here, which differs from the
# oid in get_rootReply. Both requests should target the same video, so the
# default below reuses the root request's oid.
def get_SecondReply(headers, root, oid=470113786):
    """Write every child comment of one root comment.

    Returns False if the first page was already empty (the caller then falls
    back to the replies embedded in the root package), True otherwise.
    """
    pn = 1
    while True:
        URL = (f"https://api.bilibili.com/x/v2/reply/reply?jsonp=jsonp&pn={pn}&type=1&oid={oid}&ps=10&root={root}&_=1679992607971")
        respond = requests.get(URL, headers=headers, timeout=10)  # fetch one page of child comments
        if respond.status_code != 200:
            print("Sreply error!")
            raise SystemExit(-1)
        respond.encoding = "UTF-8"
        json_html = json.loads(respond.text)

        replies = (json_html.get('data') or {}).get('replies')
        if not replies:
            return pn > 1  # False only when nothing was found on page 1

        for item in replies:
            reply = item['content']['message'].replace('\n', ',')
            file.write("        " + reply + '\n')
        pn += 1
        time.sleep(0.5)  # short pause between pages

That completes all the modules.
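To make the layout of the resulting txt file easier to picture, here is a small sketch of the formatting both functions apply: a numbered root comment, child comments indented by eight spaces, and line breaks inside a comment replaced by commas. format_thread is a hypothetical helper and the messages are invented; the article's code writes these lines directly to the file instead.

```python
def format_thread(index: int, root_message: str, child_messages: list) -> str:
    """Return the text written for one root comment and its child comments."""
    def flatten(message: str) -> str:
        # Keep every comment on a single line, as the crawler does.
        return message.replace("\n", ",")

    lines = [f"{index}.{flatten(root_message)}"]
    lines += ["        " + flatten(child) for child in child_messages]
    return "\n".join(lines) + "\n"


sample = format_thread(1, "first line\nsecond line", ["reply one", "reply\ntwo"])
print(sample, end="")
```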

IV. Full code

import requests
from bs4 import BeautifulSoup
import re
import json
from pprint import pprint
import time
 
# Request headers
headers = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "referer" : "https://www.bilibili.com/"
}

# Output file shared by both functions; closed at the end of get_rootReply
file = open('lanyin.txt', 'w', encoding='utf-8')


# Note: the original request used oid=824175427 here, which differs from the
# oid in get_rootReply. Both requests should target the same video, so the
# default below reuses the root request's oid.
def get_SecondReply(headers, root, oid=470113786):
    """Write every child comment of one root comment.

    Returns False if the first page was already empty (the caller then falls
    back to the replies embedded in the root package), True otherwise.
    """
    pn = 1
    while True:
        URL = (f"https://api.bilibili.com/x/v2/reply/reply?jsonp=jsonp&pn={pn}&type=1&oid={oid}&ps=10&root={root}&_=1679992607971")
        respond = requests.get(URL, headers=headers, timeout=10)  # fetch one page of child comments
        if respond.status_code != 200:
            print("Sreply error!")
            raise SystemExit(-1)
        respond.encoding = "UTF-8"
        json_html = json.loads(respond.text)

        replies = (json_html.get('data') or {}).get('replies')
        if not replies:
            return pn > 1  # False only when nothing was found on page 1

        for item in replies:
            reply = item['content']['message'].replace('\n', ',')
            file.write("        " + reply + '\n')
        pn += 1
        time.sleep(0.5)  # short pause between pages


def get_rootReply(headers):
    num = 1          # page cursor; next=0 and next=1 return the same data
    reply_index = 1  # running number written in front of each root comment
    while True:
        URL = (f"https://api.bilibili.com/x/v2/reply/main?&jsonp=jsonp&next={num}&type=1&oid=470113786&mode=3&plat=1&_=1680096302818")   # root-comment request captured in F12
        respond = requests.get(URL, headers=headers, timeout=10)  # fetch one page of root comments
        if respond.status_code != 200:  # continue only on a 200 response
            print("respond error!")
            break
        respond.encoding = "UTF-8"
        json_html = json.loads(respond.text)  # parse the body into a dict so the fields are easy to look up

        replies = (json_html.get('data') or {}).get('replies')
        if not replies:  # None or an empty list means there are no more pages
            break

        for item in replies:  # one page holds at most 20 root comments
            reply = item['content']['message'].replace('\n', ',')
            root = item['rpid']  # the child-comment request uses this as its root parameter
            file.write(str(reply_index) + '.' + reply + '\n')
            if item['replies'] is not None:
                # If the child-comment package is empty, the replies are
                # already embedded in the root-comment package.
                if not get_SecondReply(headers, root):
                    for child in item['replies']:
                        reply = child['content']['message'].replace('\n', ',')
                        file.write("        " + reply + '\n')
            reply_index += 1
        num += 1

        time.sleep(0.5)  # short pause between pages
    file.close()


if __name__ == '__main__':
    get_rootReply(headers)
    print("successful")

V. Summary

This is code I wrote off the cuff and it is fairly rough, so corrections from more experienced readers are welcome.

Reposted from: https://blog.csdn.net/ClushioAqua/article/details/129834114
