Web Crawler in Python

What is a Web Crawler?

  • A bot that automatically crawls target web pages and collects the desired data
  • Steps (see the sketch after this list):
    1. Fetch the HTML under the target domain
    2. Parse that HTML to extract the target data
    3. Loop
  • Package overview:
    • requests => sends requests to the target page's server; built on urllib
    • BeautifulSoup => parses HTML; delegates the actual parsing to an underlying parser such as html.parser or lxml
    • pandas => scrapes tables
    • selenium => a browser automation/testing tool, handy for pages that rely heavily on JavaScript
    • re => regular expressions, for extracting text passages that need finer-grained matching
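
Putting the three steps together, here is a minimal sketch of the fetch/parse/loop cycle. The URL and the '.item-title' / 'next' selectors are placeholders for illustration, not any real site's markup:

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/page1'  # placeholder listing page

while url:
    # Step 1: fetch the HTML under the target domain
    response = requests.get(url)
    # Step 2: parse it and extract the target data
    soup = BeautifulSoup(response.text, 'html.parser')
    for item in soup.select('.item-title'):
        print(item.text)
    # Step 3: loop - follow the "next page" link until there is none
    next_link = soup.find('a', 'next')
    url = next_link['href'] if next_link else None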

LABS

Before LAB

  • On Windows, open the Command Prompt as administrator and run:

    pip3 install requests
    pip3 install beautifulsoup4

    to install the required packages.
  • On Linux, open a terminal and run the same commands as root to install the requirements.
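  • Either way, a quick import check confirms the packages are installed:

    import requests
    import bs4

    print(requests.__version__)
    print(bs4.__version__)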

Lab0 - Check the status code a site returns

  • Check the status the page sends back:

    import requests

    url = 'https://ck101.com/forum.php?mod=viewthread&tid=4349663&extra=page%3D8'
    response = requests.get(url)
    print(response)
  • The HTTP status code in the response falls into one of five classes:
    • 1xx informational, 2xx success, 3xx redirection, 4xx client error, 5xx server error

  • Running the code above will usually return a 403 status code, because some sites block
    crawlers
    => Workaround: disguise the crawler as a browser when requesting the page

import requests

url = 'https://ck101.com/forum.php?mod=viewthread&tid=4349663&extra=page%3D8'
fake_browser = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
response = requests.get(url, headers = fake_browser)

print(response)
  • With this header in place, the request gets through and the page content can be accessed normally
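
  • Instead of eyeballing the printed <Response [200]>, the status can also be checked in code; status_code and raise_for_status() are standard parts of the requests API:

import requests

url = 'https://ck101.com/forum.php?mod=viewthread&tid=4349663&extra=page%3D8'
fake_browser = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
response = requests.get(url, headers=fake_browser)

print(response.status_code)   # e.g. 200 once the request gets through
response.raise_for_status()   # raises requests.HTTPError on any 4xx/5xx answer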

Lab1-1 - Copy a web page

import requests

url = 'https://www.gamer.com.tw/'
response = requests.request('get', url)

file_name = 'gamer.html'
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(response.text)

print('Success!')
  • with ... as <file_variable>: behaves like <file_variable> = ..., except that the file is also closed automatically when the block exits
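
  • Roughly, the with-block above corresponds to this hand-written version (placeholder content, for illustration):

f = open('gamer.html', 'w', encoding='utf-8')
try:
    f.write('<html>...</html>')  # placeholder for response.text
finally:
    f.close()  # with ... as performs this close automatically, even if write() raises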

Lab1-2 - Search for elements within a page

import requests
from bs4 import BeautifulSoup

url = 'http://www.gamer.com.tw'
response = requests.request('get', url)

# Convert the HTML text into a BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')

# Now we can use it to search the page's content
title = soup.find('title').text
print(title)
  • Besides going through HTML elements, you can also search with CSS selectors
import requests
from bs4 import BeautifulSoup

url = 'https://www.gamer.com.tw'
response = requests.request('get', url)

soup = BeautifulSoup(response.text, 'html.parser')

# Alternatively, use a CSS selector
side_titles = soup.select('.BA-left li a')

for title in side_titles:
    print(title.text)
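
  • Note that find() returns only the first matching tag, while select() returns a list of every match; a toy example with inline HTML:

from bs4 import BeautifulSoup

html = '<ul class="BA-left"><li><a>one</a></li><li><a>two</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('a').text)                    # one  (first match only)
print([a.text for a in soup.select('li a')])  # ['one', 'two']  (every match)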

Lab2-1 - Scrape text from a web page

import re
import requests
from bs4 import BeautifulSoup

url = 'https://ck101.com/thread-4284476-1-1.html'
fake_browser = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
response = requests.get(url, headers = fake_browser)

soup = BeautifulSoup(response.text, 'html.parser')
response_crawling = soup.find('table', id = re.compile('pid109830383')).find('td', id = re.compile('postmessage_109830383'))

print(response_crawling.text)
  • Printing the tag itself with print(response_crawling) would include the HTML tags along with the text
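
  • A toy example of the difference, using an inline HTML snippet instead of the ck101 page:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<td id="postmessage_1">Hello <b>world</b></td>', 'html.parser')
tag = soup.find('td')

print(tag)       # <td id="postmessage_1">Hello <b>world</b></td>  (tags included)
print(tag.text)  # Hello world  (tags stripped)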

Lab2-2 - Save the scraped text to a file

import re
import requests
from bs4 import BeautifulSoup

url = 'https://ck101.com/thread-4284476-1-1.html'
fake_browser = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
response = requests.get(url, headers=fake_browser)

soup = BeautifulSoup(response.text, 'html.parser')
response_crawling = soup.find('table', id = re.compile('pid109830383')).find('td', id = re.compile('postmessage_109830383'))

file_name = 'Lab2_text.txt'
file = open(file_name, 'w', encoding='utf-8')
file.write(response_crawling.text)
file.close()
  • In the open() call, the first argument is the file name; with mode 'w', the file is overwritten if it already exists, or created anew if it does not
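
  • Other modes change that behavior; for example, 'a' appends instead of overwriting and 'x' refuses to touch an existing file:

# 'w' - overwrite the file if it exists, create it otherwise
# 'a' - append to the end of the file instead of overwriting
# 'x' - create only; raises FileExistsError if the file already exists
with open('Lab2_text.txt', 'a', encoding='utf-8') as file:
    file.write('appended line\n')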

Lab3-1 - Scrape a web image and save it

import requests
from bs4 import BeautifulSoup

url = 'https://mobile.dcard.tw/f/pet/p/228155814'
fake_header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
response = requests.get(url, headers = fake_header)
soup = BeautifulSoup(response.text, 'html.parser')
picture = soup.find('img', 'GalleryImage__Image-iw2fq7-0 vXfwx')
response_crawling = requests.get(picture['src'])

print(response_crawling.content)

file = open('Lab3_img.jpg', 'wb')
file.write(response_crawling.content)
file.close()

Lab3-2 - Save every image on the page to files

import os
import requests
from bs4 import BeautifulSoup

url = 'https://mobile.dcard.tw/f/pet/p/228155814'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
req = requests.get(url, headers=header)
soup = BeautifulSoup(req.text, 'html.parser')

# Make sure the output directory exists before writing into it
route = 'Lab3-allimg/'
os.makedirs(route, exist_ok=True)

cnt = 0
images = soup.find_all('img', 'GalleryImage__Image-iw2fq7-0 vXfwx')
for i in images:
    filename = 'Lab_img' + str(cnt) + '.jpg'
    # Download each image and close its file before moving to the next
    rr = requests.get(i['src'])
    file = open(route + filename, 'wb')
    file.write(rr.content)
    file.close()
    cnt += 1
