Python Knowledge Sharing Network (Python222): a site dedicated to learning Python
Scraping and Parsing cnblogs Home Page Post Data with Python BeautifulSoup
Published: 2023-10-29 20:53:00

2024: Master Python Web Scraping in One Day [Basics], covering requests, beautifulsoup, and selenium

https://www.bilibili.com/video/BV1Ju4y1Y7k6/

 

Let's scrape the information for every post on the https://www.cnblogs.com/ home page, including each post's title, its URL, and its author.

First we fetch the page with requests, then we parse it with bs4.

Reference code:

import requests
from bs4 import BeautifulSoup

url = "https://www.cnblogs.com/"

r = requests.get(url)

# Set the encoding of the response object
r.encoding = "utf-8"

# print(r.text)

soup = BeautifulSoup(r.text, 'lxml')

# Select every <article> element that has the class "post-item"
article_list = soup.select("article.post-item")
# print(article_list)

for article in article_list:
    print("==========")
    # Author link of the post
    author = article.find("a", class_="post-item-author")
    print(author.get_text())
    # Title link of the post: its text is the title, its href is the URL
    link = article.find("a", class_="post-item-title")
    print(link.get_text())
    print(link.attrs["href"])
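
The loop above prints each field directly and assumes every post contains both links. As a minimal sketch, the same selectors can be wrapped in a function that returns a list of dictionaries and tolerates a missing link. The function name `parse_posts`, the sample HTML, and the use of the built-in `html.parser` (instead of `lxml`) are assumptions made for illustration so the example runs without a network connection. The real cnblogs markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the structure the selectors above expect;
# it is not copied from the live cnblogs page.
SAMPLE_HTML = """
<article class="post-item">
  <a class="post-item-title" href="https://www.cnblogs.com/alice/p/1.html">First post</a>
  <a class="post-item-author" href="https://www.cnblogs.com/alice/"><span>alice</span></a>
</article>
<article class="post-item">
  <a class="post-item-title" href="https://www.cnblogs.com/bob/p/2.html">Second post</a>
</article>
"""


def parse_posts(html):
    """Return a list of dicts with title, url, and author for each post."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for article in soup.select("article.post-item"):
        link = article.find("a", class_="post-item-title")
        author = article.find("a", class_="post-item-author")
        if link is None:
            # Skip entries without a title link instead of raising AttributeError
            continue
        posts.append({
            "title": link.get_text(strip=True),
            "url": link.get("href"),
            # Keep the post even if the author link is missing
            "author": author.get_text(strip=True) if author is not None else None,
        })
    return posts


posts = parse_posts(SAMPLE_HTML)
for post in posts:
    print(post["author"], post["title"], post["url"])
```

To use this against the live page, one would pass `r.text` from the request above to `parse_posts` in place of the sample string. Returning dictionaries rather than printing makes it easier to save the results to a file or filter them later.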

 

Reposted from: