Scraping Data by Simulating Browser Behavior with Python + Selenium

2022-09-29

Web scraping

/

selenium

Selenium#

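Selenium is a web automation tool that was originally developed for automated website testing. It drives a real browser directly and supports all mainstream browsers, which can also run in headless mode (older Selenium versions additionally supported headless browsers such as PhantomJS). It accepts commands that make the browser load pages automatically, extract the data you need, and even take screenshots.

As we know, a traditional crawler scrapes a site by simulating HTTP requests. Because such requests differ noticeably from real browser traffic, many sites deploy anti-scraping measures against them. Selenium instead scrapes by driving a browser, so its behavior is almost identical to a real user's, and anti-scraping measures have a hard time telling whether a request comes from Selenium or from a real user.

The drawback of a Selenium crawler is that it is slow. If scraping speed does not matter to you, or if the data source encrypts its responses so that the underlying API is hard to reverse-engineer, Selenium is worth trying.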

Preparation#

Installing Selenium#

Installing with pip3 is recommended; just run the following command:

pip3 install selenium

Enter the Python interactive shell and import the Selenium package. If no error is raised, the installation succeeded.

$ python3
>>> import selenium

WebDriver#

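The core component of Selenium is WebDriver, which drives a browser by calling the native automation API that the browser provides. Every mainstream browser has a corresponding WebDriver. This article uses Chrome as the example and shows how to obtain and configure its WebDriver.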

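Verifying ChromeDriver#

Go to the ChromeDriver download page and get the ChromeDriver that matches your browser version: ChromeDriver - WebDriver for Chrome

PS: Make sure the version numbers match. Enter chrome://version/ in Chrome's address bar to check your Chrome version.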
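The version check above can also be done in code. The following is a minimal sketch, assuming both Chrome and ChromeDriver report four-part version strings and that a matching major version is what matters. The helper names `major_version` and `versions_match` and the sample version strings are illustrative only.

```python
import re


def major_version(version_string):
    """Return the major version found in a string such as '105.0.5195.127'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_string)
    if match is None:
        raise ValueError(f"no version number found in {version_string!r}")
    return int(match.group(1))


def versions_match(chrome_version, driver_version):
    """Check whether Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


if __name__ == "__main__":
    # Made-up sample strings in the shape shown by chrome://version/
    # and by running `chromedriver --version`.
    print(versions_match("105.0.5195.127", "ChromeDriver 105.0.5195.52"))
```

If the two major versions differ, download the ChromeDriver release that matches your browser before continuing.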

After the download finishes, place ChromeDriver on the D: drive or at any other location, then run the following script to verify it:

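from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object;
# the older executable_path argument is deprecated.
browser = webdriver.Chrome(service=Service("D:/chromedriver.exe"))
browser.get('https://www.baidu.com')
sleep(2)
browser.quit()  # close the browser and stop the chromedriver process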

If a Chrome window pops up, loads the Baidu page, and closes after 2 seconds, everything is working.

WebDrivers for other browsers

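Example#

This walkthrough uses the China Customs registration information page for overseas manufacturers of imported food as the example.

Background#

I originally assumed the page's API could be scraped directly, so I opened the page and inspected the API's structure. It turned out that the returned data is encrypted and cannot be read as-is, which makes this page a suitable example for Selenium.

Steps#

Getting the page data#

Inspecting the page shows that its main body is a table, so locating the corresponding table tag is enough to get at the data.

The find_element_by_xxx and find_elements_by_xxx methods are no longer available in recent Selenium 4 releases (they were deprecated in 4.0 and removed in 4.3), so this article uses the newer find_element / find_elements interface. A generic example of that interface: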

from selenium.webdriver.common.by import By

fruits = driver.find_element(By.ID, "fruits")
fruit = fruits.find_elements(By.CLASS_NAME, "tomatoes")
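Inspecting the page's nodes shows that the table sits under the element whose class is fixed-table-body. From that table, fetch each tr and then its td cells to build the list.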

from selenium.webdriver.common.by import By


def get_page_data(driver):
    """
    Collect the rows of the table on the current page.

    driver: the Selenium WebDriver whose current page contains the table
    returns: a list of rows, each a list of cell texts
    """
    data_list = []
    try:
        # Locate every table row by XPath.
        # (The old find_elements_by_xpath API is deprecated.)
        tr_list = driver.find_elements(
            By.XPATH, '//div[@class="fixed-table-body"]/table/tbody/tr')
        for tr in tr_list:
            item = tr.find_elements(By.XPATH, './td')
            td_list = []
            # Read each td's text; fall back to an empty string on failure.
            for it in item:
                try:
                    col = it.text
                except Exception:
                    col = ""
                td_list.append(col)
            data_list.append(td_list)
    except Exception as e:
        print(e)
    return data_list
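To see the shape of the data this function returns without opening a browser, the same row and cell XPath steps can be tried on a small hand-written fragment. This is only an offline sketch: it uses the standard library's `xml.etree.ElementTree` in place of Selenium, and the sample markup, company names, and the helper `parse_rows` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that mimics the table structure described above.
SAMPLE_HTML = """
<html><body>
  <div class="fixed-table-body"><table><tbody>
    <tr><td>1</td><td>Example Foods Ltd.</td><td></td></tr>
    <tr><td>2</td><td>Sample Dairy Co.</td><td>Valid</td></tr>
  </tbody></table></div>
</body></html>
"""


def parse_rows(markup):
    """Return the table rows as a list of lists of cell text."""
    root = ET.fromstring(markup.strip())
    rows = root.findall(".//div[@class='fixed-table-body']/table/tbody/tr")
    data_list = []
    for tr in rows:
        # An empty cell has no text node, so fall back to an empty string.
        data_list.append([(td.text or "").strip() for td in tr.findall("./td")])
    return data_list


if __name__ == "__main__":
    for row in parse_rows(SAMPLE_HTML):
        print(row)
```

Each table row becomes one inner list of strings, which is the same list-of-lists structure that is later written to Excel row by row.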

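Clicking through to the next page#

The ActionChains class provides methods for keyboard, mouse, pen, and similar input actions on the page, which helps you control the browser more precisely. For example:

Find the next-page element and trigger a click on it: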

from selenium.webdriver import ActionChains

clickable = driver.find_element(
    By.XPATH, '//li[@class="page-item page-next"]/a')
ActionChains(driver).click(clickable).perform()

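Note that clicking the next-page link triggers a data request, so you should wait a few seconds for it to finish before scraping. An implicit wait (implicitly_wait) only controls how long element lookups keep retrying and does not pause the script by itself, so use a short sleep or an explicit wait here.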
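One way to wait for the new page of data, rather than pausing for a fixed time, is to poll until the table's content changes. The following is a minimal sketch in plain Python. `wait_until` and its parameters are illustrative names and not part of Selenium, which ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for this purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """
    Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


if __name__ == "__main__":
    start = time.monotonic()
    # Stand-in condition: becomes true about one second after start.
    wait_until(lambda: time.monotonic() - start > 1.0, timeout=5.0, poll=0.2)
    print("condition met")
```

With a real driver, the condition could compare the table before and after the click, for example by saving `old = get_page_data(driver)[:1]` before clicking and then calling `wait_until(lambda: get_page_data(driver)[:1] != old)`.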

Complete example#

import time

import openpyxl
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


def get_page_data(driver):
    """Return the table rows on the current page as a list of cell-text lists."""
    data_list = []
    try:
        # Locate every table row by XPath.
        # (The old find_elements_by_xpath API is deprecated.)
        tr_list = driver.find_elements(
            By.XPATH, '//div[@class="fixed-table-body"]/table/tbody/tr')
        for tr in tr_list:
            item = tr.find_elements(By.XPATH, './td')
            td_list = []
            # Read each td's text; fall back to an empty string on failure.
            for it in item:
                try:
                    col = it.text
                except Exception:
                    col = ""
                td_list.append(col)
            data_list.append(td_list)
    except Exception as e:
        print(e)
    return data_list


def write_excel(file_name, write_list):
    """
    Write the collected rows to an Excel file, one list per row.

    file_name: path of the Excel file to write
    write_list: list of rows, each a list of cell values
    """
    full_excel = openpyxl.Workbook()
    full_sheet = full_excel.active
    for row in write_list:
        full_sheet.append(row)
    full_excel.save(file_name)


if __name__ == '__main__':
    url = "https://ciferquery.singlewindow.cn/"
    options = webdriver.ChromeOptions()
    # Local proxy used in this example; adjust or remove it for your environment.
    options.add_argument('--proxy-server=http://127.0.0.1:7890')
    # Selenium 4 takes the driver path via Service and the options via `options`.
    driver = webdriver.Chrome(service=Service("D:/chromedriver.exe"),
                              options=options)
    driver.implicitly_wait(3)
    driver.get(url)

    every_page_result_list = []  # accumulates the rows from every page
    # Loop over the pages and collect their data.
    pages = 10
    for index in range(1, pages + 1):
        every_page_result_list += get_page_data(driver)
        # Find the next-page link and click it.
        clickable = driver.find_element(
            By.XPATH, '//li[@class="page-item page-next"]/a')
        ActionChains(driver).click(clickable).perform()
        time.sleep(1)  # give the next page's data request time to finish
        print("Page {0} scraped successfully...".format(index))
    # Output path for the Excel file; the file does not need to exist yet.
    write_excel(r"D:\test.xlsx", every_page_result_list)

    time.sleep(3)
    driver.quit()
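If openpyxl is not available, the same list of rows can be saved as CSV with the standard library instead. This is an alternative sketch and not part of the original script. The function names `rows_to_csv_text` and `write_csv` are illustrative, and the `utf-8-sig` encoding is chosen on the assumption that the file will be opened in Excel and may contain Chinese text.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Serialize a list of rows (lists of cell values) to CSV text."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def write_csv(file_name, rows):
    """Write the rows to file_name as CSV, with a BOM so Excel detects UTF-8."""
    with open(file_name, "w", newline="", encoding="utf-8-sig") as f:
        f.write(rows_to_csv_text(rows))


if __name__ == "__main__":
    # Made-up rows in the same list-of-lists shape that get_page_data returns.
    sample = [["1", "Example Foods Ltd.", ""], ["2", "Sample Dairy Co.", "Valid"]]
    print(rows_to_csv_text(sample))
```

In the main script, a call such as `write_csv(r"D:\test.csv", every_page_result_list)` would take the place of `write_excel`.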