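I. Introduction
Bilibili's comment section has no search function, so I threw together a crawler that fetches a video's comments and saves them to a local txt file.
First, install Python's requests library and the BeautifulSoup library: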
pip install requests
pip install bs4
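If the output contains "Successfully installed", the installation worked.
These are all the imports in the script (the final code only actually calls requests, json and time):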
import requests
from bs4 import BeautifulSoup
import re
import json
from pprint import pprint
import time
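II. Analyzing the page
Viewing the page source shows no comment data at all. Scrolling down to the comment area, the comments take a moment to load before they appear, so I guessed that they arrive in a separate request and that I would have to capture that request to get them.
Open the developer tools with F12 and search the Network tab for "reply"; the request carrying the comment data shows up there.
I copied its URL and opened it to inspect the fields.
For some reason the URL only displays properly once the callback parameter and the data after it are deleted.
Installing the JSON Formatter extension in Edge makes the response easier to read.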
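One plausible reason the URL only works without callback is that, with it, the server answers in JSONP (JSON wrapped in a function call) rather than plain JSON. If that is the case, the wrapper can also be removed in code. A minimal sketch under that assumption; strip_jsonp and the sample callback name are made up for illustration and are not taken from the captured request:

```python
import json
import re

def strip_jsonp(text):
    """Return the parsed payload from either plain JSON or a JSONP wrapper."""
    text = text.strip()
    # JSONP looks like someCallback({...}), optionally followed by a semicolon
    match = re.fullmatch(r"[\w$.]+\((.*)\);?", text, flags=re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

# Invented sample response in JSONP form
sample = 'jQuery123_456({"code": 0, "data": {"replies": []}})'
print(strip_jsonp(sample)["data"]["replies"])
```

Plain JSON passes through unchanged, so the same helper can be used whether or not the callback parameter is present.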
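A single response does not contain all the comments. Scrolling further down and watching the developer tools for more reply requests gives the URL of the next page.
Comparing the URLs, only next changes. So what is next=1? Testing shows that next=1 and next=0 return the same data, so the program can simply start from 1.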
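The paging rule can be sketched as a loop that raises next by one until a page comes back empty. This is only an illustration: fetch_page is a hypothetical stand-in for the real HTTP request, and the fake pages are invented data playing the role of the replies list.

```python
def iter_root_pages(fetch_page, start=1):
    """Yield (page, replies) for each non-empty page, raising the cursor by one each time."""
    page = start
    while True:
        replies = fetch_page(page)
        if not replies:  # None or [] both mean there is nothing left
            break
        yield page, replies
        page += 1

# Invented stand-in for the API: two pages of data, then nothing.
fake_pages = {1: ["a", "b"], 2: ["c"]}
pages = list(iter_root_pages(fake_pages.get))
print(pages)
```

Passing the fetch function in as an argument keeps the loop testable without any network access.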
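These responses, however, contain only root comments and no child comments (replies to a comment), so I suspected the child comments come from a different request. Expanding the replies of one comment produced a new request in the developer tools.
Extracting its URL in the same way, the replies field holds the child comments we need. Again, one page cannot show every reply; comparing the requests shows that the only parameter that changes from page to page is pn.
So how are child comments tied to their root comment?
The child-comment URL has a root parameter. Comparing root and child data shows that a root comment's rpid is exactly the root value of its child comments, which gives us the link.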
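The rpid/root relationship amounts to a grouping key. A small sketch with invented records: real responses carry many more fields, and storing root on each child record is an assumption made here for illustration (in the text, the value appears in the child-comment URL).

```python
from collections import defaultdict

# Invented sample data: two root comments and three child comments.
roots = [{"rpid": 101, "message": "first"}, {"rpid": 102, "message": "second"}]
children = [
    {"root": 101, "message": "reply to first"},
    {"root": 102, "message": "reply to second"},
    {"root": 101, "message": "another reply to first"},
]

def group_by_root(children):
    """Map each root rpid to the messages of the child comments that point at it."""
    grouped = defaultdict(list)
    for child in children:
        grouped[child["root"]].append(child["message"])
    return grouped

grouped = group_by_root(children)
for root in roots:
    print(root["message"], grouped[root["rpid"]])
```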
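While writing the code I ran into one more issue: some root comments have so few replies that there is nothing to expand. For these, replies in the child-comment response is empty, and the replies are instead stored inside the root-comment response, so a simple check handles it.
With the structure understood, the coding is much easier.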
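The check for comments that need no expanding could look roughly like this. It is a sketch, not the code used later: choose_children and both argument names are made up here, standing for the replies list from the child-comment request and the replies list embedded in the root-comment response.

```python
def choose_children(fetched, inline):
    """Prefer replies from the child-comment request; fall back to the inline ones."""
    if fetched:            # the child request returned something
        return fetched
    return inline or []    # inline may be None when a comment has no replies

print(choose_children(None, [{"content": {"message": "short thread"}}]))
print(choose_children([{"content": {"message": "page 1"}}], None))
print(choose_children(None, None))
```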
III. Code
1. Headers
# request headers
headers = {
"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
"referer" : "https://www.bilibili.com/"
}
2. Fetching root comments
def get_rootReply(headers):
    num = 1            # page cursor, sent as the next parameter
    replay_index = 1   # running number written in front of each root comment
    while True:
        # API URL for one page of root comments
        URL = (f"https://api.bilibili.com/x/v2/reply/main?&jsonp=jsonp&next={num}&type=1&oid=470113786&mode=3&plat=1&_=1680096302818")
        respond = requests.get(URL, headers=headers)  # request the captured endpoint
        # print(respond.status_code)
        reply_num = 0
        if respond.status_code == 200:  # continue on HTTP 200, otherwise stop
            respond.encoding = "UTF-8"
            html = respond.text
            # parse the JSON text: easier to pprint and to locate the fields we need
            json_html = json.loads(html)
            if json_html['data']['replies'] is None or len(json_html['data']['replies']) == 0:
                break
            for i in range(0, len(json_html['data']['replies'])):  # a page holds at most 20 comments
                reply = json_html['data']['replies'][reply_num]['content']['message']
                root = json_html['data']['replies'][reply_num]['rpid']
                reply = reply.replace('\n', ',')
                # print(reply)
                file.write(str(replay_index) + '.' + reply + '\n')
                if json_html['data']['replies'][reply_num]['replies'] is not None:
                    # 0 means the child-comment request was empty: use the inline replies instead
                    if get_SecondReply(headers, root) == 0:
                        for j in range(0, len(json_html['data']['replies'][reply_num]['replies'])):
                            reply = json_html['data']['replies'][reply_num]['replies'][j]['content']['message']
                            reply = reply.replace('\n', ',')
                            file.write(" " + reply + '\n')
                reply_num += 1
                replay_index += 1
            num += 1
            time.sleep(0.5)  # pause between pages
        else:
            print("respond error!")
            break
    file.close()
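One way to make this function easier to test is to separate the formatting from the network and file access. The sketch below is an alternative arrangement, not the original code: format_root_page is a hypothetical helper, and the sample dictionary is invented data that follows the data / replies / content / message layout used above.

```python
def format_root_page(payload, start_index=1):
    """Turn one parsed root-comment response into numbered text lines."""
    lines = []
    # Tolerate a missing or null data/replies field by treating it as an empty page
    replies = (payload.get("data") or {}).get("replies") or []
    for offset, item in enumerate(replies):
        message = item["content"]["message"].replace("\n", ",")
        lines.append(f"{start_index + offset}.{message}")
    return lines

# Invented sample response with two root comments
sample = {"data": {"replies": [
    {"rpid": 1, "content": {"message": "hello\nworld"}, "replies": None},
    {"rpid": 2, "content": {"message": "second"}, "replies": None},
]}}
print(format_root_page(sample))
```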
3. Fetching child comments
def get_SecondReply(headers, root):
    pn = 1  # page number of the child-comment list
    while True:
        # note: this oid differs from the one in the root-comment URL; presumably both should be the oid of the same video
        URL = (f"https://api.bilibili.com/x/v2/reply/reply?jsonp=jsonp&pn={pn}&type=1&oid=824175427&ps=10&root={root}&_=1679992607971")
        respond = requests.get(URL, headers=headers)  # request the captured endpoint
        reply_num = 0
        if respond.status_code == 200:
            respond.encoding = "UTF-8"
            html = respond.text
            json_html = json.loads(html)
            # None or an empty list both mean there are no (more) child comments
            if not json_html['data']['replies']:
                if pn == 1:
                    return 0  # nothing even on page 1: caller falls back to the inline replies
                else:
                    return 1  # at least one page was written
            for i in range(0, len(json_html['data']['replies'])):
                reply = json_html['data']['replies'][reply_num]['content']['message']
                reply = reply.replace('\n', ',')
                # print(reply)
                reply_num += 1
                file.write(" " + reply + '\n')
            pn += 1
            time.sleep(0.5)  # pause between pages
        else:
            print("Sreply error!")
            exit(-1)
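The 0/1 return value works, but it forces the caller to remember what each number means. As an alternative sketch (not the original design), the function could return the collected messages and let an empty list signal that nothing was found. fetch_replies is a hypothetical callable standing in for the HTTP request, and the fake data is invented.

```python
def collect_child_messages(fetch_replies, root):
    """Gather child-comment messages for one root comment across all pages."""
    messages = []
    pn = 1
    while True:
        replies = fetch_replies(root, pn)
        if not replies:  # None or [] ends the paging
            return messages
        for item in replies:
            messages.append(item["content"]["message"].replace("\n", ","))
        pn += 1

# Invented stand-in: root 7 has two pages of replies, every other root has none.
fake = {
    (7, 1): [{"content": {"message": "a\nb"}}],
    (7, 2): [{"content": {"message": "c"}}],
}

def fetch(root, pn):
    return fake.get((root, pn))

print(collect_child_messages(fetch, 7))
print(collect_child_messages(fetch, 8))
```

With this shape the caller can test the result for emptiness and fall back to the replies embedded in the root-comment response.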
That gives us all the pieces.
IV. Full code
import requests
from bs4 import BeautifulSoup
import re
import json
from pprint import pprint
import time
# request headers
headers = {
    "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "referer" : "https://www.bilibili.com/"
}

# output file shared by both functions; closed at the end of get_rootReply
file = open('lanyin.txt', 'w', encoding='utf-8')

def get_SecondReply(headers, root):
    pn = 1  # page number of the child-comment list
    while True:
        # note: this oid differs from the one in the root-comment URL; presumably both should be the oid of the same video
        URL = (f"https://api.bilibili.com/x/v2/reply/reply?jsonp=jsonp&pn={pn}&type=1&oid=824175427&ps=10&root={root}&_=1679992607971")
        respond = requests.get(URL, headers=headers)  # request the captured endpoint
        reply_num = 0
        if respond.status_code == 200:
            respond.encoding = "UTF-8"
            html = respond.text
            json_html = json.loads(html)
            # None or an empty list both mean there are no (more) child comments
            if not json_html['data']['replies']:
                if pn == 1:
                    return 0  # nothing even on page 1: caller falls back to the inline replies
                else:
                    return 1  # at least one page was written
            for i in range(0, len(json_html['data']['replies'])):
                reply = json_html['data']['replies'][reply_num]['content']['message']
                reply = reply.replace('\n', ',')
                # print(reply)
                reply_num += 1
                file.write(" " + reply + '\n')
            pn += 1
            time.sleep(0.5)  # pause between pages
        else:
            print("Sreply error!")
            exit(-1)

def get_rootReply(headers):
    num = 1            # page cursor, sent as the next parameter
    replay_index = 1   # running number written in front of each root comment
    while True:
        # API URL for one page of root comments
        URL = (f"https://api.bilibili.com/x/v2/reply/main?&jsonp=jsonp&next={num}&type=1&oid=470113786&mode=3&plat=1&_=1680096302818")
        respond = requests.get(URL, headers=headers)  # request the captured endpoint
        # print(respond.status_code)
        reply_num = 0
        if respond.status_code == 200:  # continue on HTTP 200, otherwise stop
            respond.encoding = "UTF-8"
            html = respond.text
            # parse the JSON text: easier to pprint and to locate the fields we need
            json_html = json.loads(html)
            if json_html['data']['replies'] is None or len(json_html['data']['replies']) == 0:
                break
            for i in range(0, len(json_html['data']['replies'])):  # a page holds at most 20 comments
                reply = json_html['data']['replies'][reply_num]['content']['message']
                root = json_html['data']['replies'][reply_num]['rpid']
                reply = reply.replace('\n', ',')
                # print(reply)
                file.write(str(replay_index) + '.' + reply + '\n')
                if json_html['data']['replies'][reply_num]['replies'] is not None:
                    # 0 means the child-comment request was empty: use the inline replies instead
                    if get_SecondReply(headers, root) == 0:
                        for j in range(0, len(json_html['data']['replies'][reply_num]['replies'])):
                            reply = json_html['data']['replies'][reply_num]['replies'][j]['content']['message']
                            reply = reply.replace('\n', ',')
                            file.write(" " + reply + '\n')
                reply_num += 1
                replay_index += 1
            num += 1
            time.sleep(0.5)  # pause between pages
        else:
            print("respond error!")
            break
    file.close()

if __name__ == '__main__':
    get_rootReply(headers)
    print("successful")
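Because file is a module-level global that is only closed at the end of get_rootReply, the output file stays open if an exception is raised partway through. A hedged alternative is to pass the file object in and open it with a with block. write_comments and its arguments are hypothetical names, and io.StringIO stands in for the real file so the sketch runs without touching the disk.

```python
import io

def write_comments(out, comments):
    """Write numbered root comments, each followed by its indented replies."""
    for index, (message, replies) in enumerate(comments, start=1):
        out.write(f"{index}.{message}\n")
        for reply in replies:
            out.write(f"    {reply}\n")

# Invented sample data: (root message, list of child messages)
comments = [("root one", ["child a", "child b"]), ("root two", [])]

# In a real run this would be:
#   with open("lanyin.txt", "w", encoding="utf-8") as out:
#       write_comments(out, comments)
buffer = io.StringIO()
write_comments(buffer, comments)
print(buffer.getvalue())
```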
V. Summary
This is code I threw together quickly and it is rough; corrections from more experienced readers are welcome.