Learning from 100 Days of Code: The Complete Python Pro Bootcamp for 2022
robots.txt is a plain-text file that follows the Robots Exclusion Standard and contains one or more rules.
These rules block (or allow) specific crawlers from accessing particular file paths on a site.
Unless you specify otherwise in the robots.txt file, all files may be crawled.
Here is a simple robots.txt file with two rules:
User-agent: Googlebot
Disallow: /nogooglebot/
User-agent: *
Allow: /
Sitemap: http://www.example.com/sitemap.xml
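Before scraping, you can check these rules programmatically with Python's standard-library urllib.robotparser; a minimal sketch against the example file above (example.com is a placeholder, so a live fetch is illustrative only):
from urllib.robotparser import RobotFileParser
parser = RobotFileParser('http://www.example.com/robots.txt')
parser.read()  # fetch and parse the robots.txt file
print(parser.can_fetch('Googlebot', 'http://www.example.com/nogooglebot/page.html'))  # False per the rule above
print(parser.can_fetch('*', 'http://www.example.com/'))  # True - everything else is allowed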
Beautiful Soup basic syntax
Importing the tools & loading page data
- If a website blocks requests sent by Python's requests library, supply a custom header
- HTTP headers identify the client (e.g. User-Agent) and its preferences (e.g. Accept-Language)
from bs4 import BeautifulSoup
import requests

# HTML from a local file on disk
with open('website.html', encoding="utf-8") as file:
    contents = file.read()

# HTML fetched from the web
URL = "https://www.example.com"  # placeholder - substitute the page you want to scrape
header = {
    # Some sites reject requests' default User-Agent; a browser-like value usually passes
    "User-Agent": "537.36 (KHTML, like Gecko) Chrome",
    "Accept-Language": "zh-TW"
}
response = requests.get(url=URL, headers=header)
html_content = response.text
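Whichever source you use, it helps to fail fast when a site blocks the request; requests can raise on bad status codes:
response.raise_for_status()  # raises requests.exceptions.HTTPError on a 4xx/5xx response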
Printing the page's HTML
soup = BeautifulSoup(contents, 'html.parser')
html_code_beautified = soup.prettify()
印出網站中"全部"符合條件的資料
find_all_item = soup.find_all(name='h3')
select_item = soup.select('h3')
'''
[<h3 class="capital">FIRST_DIV</h3>, <h3>first_div</h3>, <h3 class="capital">SECOND_DIV</h3>]
[<h3 class="capital">FIRST_DIV</h3>, <h3>first_div</h3>, <h3 class="capital">SECOND_DIV</h3>]
'''
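Both calls accept extra filters: find_all() takes keyword arguments, while select() takes a CSS selector. A small sketch against the same sample HTML:
# Narrow the match to h3 tags with class "capital"
capital_h3 = soup.find_all(name='h3', class_='capital')
capital_h3_css = soup.select('h3.capital')
'''
[<h3 class="capital">FIRST_DIV</h3>, <h3 class="capital">SECOND_DIV</h3>]
[<h3 class="capital">FIRST_DIV</h3>, <h3 class="capital">SECOND_DIV</h3>]
'''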
印出網站中"第一筆"符合條件的資料
find_item = soup.find(name='h3', class_='capital')
select_one_item = soup.select_one('.first_div h3')
in_h3_tag = soup.h3
'''
<h3 class="capital">FIRST_DIV</h3>
<h3 class="capital">FIRST_DIV</h3>
<h3 class="capital">FIRST_DIV</h3>
'''
Printing the text inside a tag
print(select_one_item.string)
print(select_one_item.text)
print(select_one_item.getText())
'''
FIRST_DIV
FIRST_DIV
FIRST_DIV
'''
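One caveat: .string and .text are not interchangeable. .string returns None when a tag holds more than one child, while .text/.getText() join all nested text. A minimal sketch with a made-up two-child tag:
nested = BeautifulSoup('<div><h3>FIRST_DIV</h3><p>note</p></div>', 'html.parser').div
print(nested.string)  # None - the div holds two children
print(nested.text)    # 'FIRST_DIVnote' - all nested text concatenated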
Printing a tag's attributes
- can be used to print the link behind an 'href'
print(select_one_item.get('class'))
'''
['capital']
'''
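The same pattern extracts links by reading each anchor tag's 'href'; a short sketch (assuming the page contains <a> tags):
# Collect every hyperlink on the page
all_links = [a.get('href') for a in soup.find_all(name='a')]
print(all_links)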
Scraping the most upvoted story from Hacker News (Y Combinator)
from bs4 import BeautifulSoup
import requests
response = requests.get('https://news.ycombinator.com/')
yc_web_page = response.text
soup = BeautifulSoup(yc_web_page, 'html.parser')
articles = soup.select('.title .titlelink')
article_strings = []
article_links = []
for article in articles:
    article_string = article.getText()
    article_strings.append(article_string)
    article_link = article.get('href')
    article_links.append(article_link)
# Each score renders as e.g. "123 points"; keep just the integer
article_upvotes = [int(score.getText().split()[0]) for score in soup.find_all(name='span', class_='score')]
highest_score = max(article_upvotes)
largest_index = article_upvotes.index(highest_score)
print(article_strings[largest_index])
print(article_links[largest_index])
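One fragile assumption above: the title list and score list must line up one-to-one (a story without a score, such as a job post, would shift every index). A defensive variant, assuming the lists do align, pairs them up and lets max() compare by score:
# Pair each score with its title and link, then take the highest-scoring tuple
top_score, top_title, top_link = max(zip(article_upvotes, article_strings, article_links))
print(top_title, top_link)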
Scraping the top 100 movies
import requests
from bs4 import BeautifulSoup
URL = "https://web.archive.org/web/20200518073855/https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(url=URL)
website_html = response.text
soup = BeautifulSoup(website_html, 'html.parser')
movies = [title.getText() for title in soup.select('.article-title-description__text .title')]
# The article lists the movies from #100 down to #1, so reverse to write them 1..100
movies = movies[::-1]
with open('movie.txt', 'w', encoding='UTF-8') as file:
    for movie in movies:
        file.write(f"{movie}\n")
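To sanity-check the result, read the file back (movie.txt as written above):
with open('movie.txt', encoding='UTF-8') as file:
    print(file.read().splitlines()[:5])  # the first five titles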