Day 45 - Beautiful Soup & Web Scraping & robots.txt


Posted by pei_______ on 2022-05-26

Learning from 100 Days of Code: The Complete Python Pro Bootcamp for 2022


Beautiful Soup Documentation


robots.txt

A plain-text file that follows the Robots Exclusion Standard and contains one or more rules.

These rules disallow (or allow) specific crawlers from accessing particular file paths on the website.

Unless you specify otherwise in the robots.txt file, all files are allowed to be crawled.

Here is a simple robots.txt file with two rules:

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml
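
Python's standard library can check these rules before crawling. A minimal sketch using urllib.robotparser, assuming the example file above were actually served at www.example.com:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # download and parse the rules

# Googlebot is barred from /nogooglebot/, everyone else may fetch anything
print(rp.can_fetch("Googlebot", "http://www.example.com/nogooglebot/page.html"))  # False
print(rp.can_fetch("*", "http://www.example.com/"))  # True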

Beautiful Soup Basic Syntax

Importing the tools & loading page data

  1. If the website blocks requests sent by Python's requests library, you need to supply a custom header.
  2. HTTP HEADER
from bs4 import BeautifulSoup
import requests

# HTML from a local file on disk
with open('website.html', encoding="utf-8") as file:
    contents = file.read()

# HTML fetched online
URL = "https://example.com"  # hypothetical placeholder; set to the page you want to scrape
header = {
    "User-Agent": "537.36 (KHTML, like Gecko) Chrome",
    "Accept-Language": "zh-TW"
}

response = requests.get(url=URL, headers=header)
response.raise_for_status()  # surface HTTP errors (e.g. a 403 when blocked)
html_content = response.text

Print the page's HTML

soup = BeautifulSoup(contents, 'html.parser')
html_code_beautified = soup.prettify()
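
For a quick look at what prettify() returns, a tiny self-contained demo:

from bs4 import BeautifulSoup

demo = BeautifulSoup('<div><p>hi</p></div>', 'html.parser')
print(demo.prettify())

'''
<div>
 <p>
  hi
 </p>
</div>
'''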

Print "all" matching elements on the page

find_all_item = soup.find_all(name='h3')
select_item = soup.select('h3')

'''
[<h3 class="capital">FIRST_DIV</h3>, <h3>first_div</h3>, <h3 class="capital">SECOND_DIV</h3>]
[<h3 class="capital">FIRST_DIV</h3>, <h3>first_div</h3>, <h3 class="capital">SECOND_DIV</h3>]
'''
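
These outputs (and the ones in the next sections) assume a local website.html shaped roughly like the hypothetical sketch below; the course's actual file differs, but this markup reproduces the same results:

from bs4 import BeautifulSoup

sample_html = """
<div class="first_div">
    <h3 class="capital">FIRST_DIV</h3>
    <h3>first_div</h3>
</div>
<div class="second_div">
    <h3 class="capital">SECOND_DIV</h3>
</div>
"""
soup = BeautifulSoup(sample_html, 'html.parser')
print(soup.find_all(name='h3'))

'''
[<h3 class="capital">FIRST_DIV</h3>, <h3>first_div</h3>, <h3 class="capital">SECOND_DIV</h3>]
'''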

Print the "first" matching element

find_item = soup.find(name='h3', class_='capital')
select_one_item = soup.select_one('.first_div h3')
in_h3_tag = soup.h3

'''
<h3 class="capital">FIRST_DIV</h3>
<h3 class="capital">FIRST_DIV</h3>
<h3 class="capital">FIRST_DIV</h3>
'''
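
find() and the CSS-selector methods are interchangeable for queries like this; the class filter above can be written either way:

# keyword-argument filter vs. the equivalent CSS selector
print(soup.find(name='h3', class_='capital'))
print(soup.select_one('h3.capital'))

'''
<h3 class="capital">FIRST_DIV</h3>
<h3 class="capital">FIRST_DIV</h3>
'''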

Print the text inside a tag

print(select_one_item.string)
print(select_one_item.text)
print(select_one_item.getText())

'''
FIRST_DIV
FIRST_DIV
FIRST_DIV
'''
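
All three calls agree here because the tag holds a single text node. They diverge on nested tags: .string returns None when a tag has more than one child, while .text / .getText() concatenate all descendant text. A quick illustration with throwaway markup:

from bs4 import BeautifulSoup

mixed = BeautifulSoup('<p>hello <b>world</b></p>', 'html.parser').p
print(mixed.string)  # None: the <p> has two children, a text node and <b>
print(mixed.text)

'''
None
hello world
'''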

Print a tag's attributes

  • Can also be used to read the link stored in an 'href' attribute (see the sketch after the output below)
print(select_one_item.get('class'))

'''
['capital']
'''
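
The same .get() pattern reads any attribute, for example pulling the URL out of an <a> tag (hypothetical markup, not from the course file):

from bs4 import BeautifulSoup

link = BeautifulSoup('<a href="https://example.com">go</a>', 'html.parser').a
print(link.get('href'))

'''
https://example.com
'''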

Scrape the most upvoted story from Hacker News (Y Combinator)

from bs4 import BeautifulSoup
import requests

response = requests.get('https://news.ycombinator.com/')
yc_web_page = response.text

soup = BeautifulSoup(yc_web_page, 'html.parser')

# each story's title link has class "titlelink" inside a "title" cell
articles = soup.select('.title .titlelink')
article_strings = []
article_links = []

for article in articles:
    article_string = article.getText()
    article_strings.append(article_string)
    article_link = article.get('href')
    article_links.append(article_link)

# each score renders as e.g. "123 points"; keep just the number
article_upvotes = [int(score.getText().split()[0]) for score in soup.find_all(name='span', class_='score')]
highest_score = max(article_upvotes)
largest_index = article_upvotes.index(highest_score)

print(article_strings[largest_index])
print(article_links[largest_index])
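
One caveat (my note, not part of the course solution): rows such as job postings have a title link but no score span, so article_strings and article_upvotes can drift out of alignment. A defensive sketch that pairs each score with its own row first, using the class names Hacker News served at the time this post was written:

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get('https://news.ycombinator.com/').text, 'html.parser')

scored = []
for row in soup.select('tr.athing'):            # one <tr class="athing"> per story
    title_tag = row.select_one('.titlelink')
    subtext_row = row.find_next_sibling('tr')   # the score lives in the following row
    score_tag = subtext_row.find('span', class_='score') if subtext_row else None
    if title_tag and score_tag:
        scored.append((int(score_tag.getText().split()[0]),
                       title_tag.getText(),
                       title_tag.get('href')))

top = max(scored)  # tuples compare by their first element, the score
print(top[1], top[2])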

Scrape the top 100 movies

import requests
from bs4 import BeautifulSoup

URL = "https://web.archive.org/web/20200518073855/https://www.empireonline.com/movies/features/best-movies-2/"

response = requests.get(url=URL)
website_html = response.text
soup = BeautifulSoup(website_html, 'html.parser')

# each ranked title sits in a span with class "title"
movies = [title.getText() for title in soup.select('.article-title-description__text .title')]
movies = movies[::-1]  # the page counts down from 100, so reverse to write #1 first

with open('movie.txt', 'w', encoding='UTF-8') as file:
    for movie in movies:
        file.write(f"{movie}\n")

#Python #Course Notes #100 Days of Code
