colirx · colirx · Jan 18, 2025
diff --git a/source/_posts/07.python/02.爬虫/02.part2.md b/source/_posts/07.python/02.爬虫/02.part2.md
@@ -2,12 +2,137 @@
 title: Dynamic scraping with requests, and multi-process and multi-threaded scraping
 categories: 
   - python
-  - spider
 tags: 
   - spider
 author: causes
 date: 2024-11-17 10:27:30
 permalink: /pages/93515a/
 ---
-## Dynamic scraping with requests
+
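+## Page rendering
+
+There are two common ways a page gets rendered:
+
+- Server-side rendering: the data you need is directly in the page source, which makes it the easier case.
+- JS rendering: the data you need cannot be found in the page source; the server returns it separately and the front end loads it dynamically.
+
+Common anti-scraping measures:
+
+- User-Agent
+
+    The browser's identification string, sent to the server in the request headers to describe the browser making the request.
+
+    Anti-scraping: the server checks whether a UA is present and whether it looks legitimate.
+
+- IP restrictions (usually worked around with proxy IPs)
+- CAPTCHA
+- Dynamically loaded pages
+- Data encryption
+- ……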
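+
+To see why a User-Agent check catches naive scrapers, it helps to look at what urllib sends by default. The snippet below is a minimal sketch that performs no network access; the replacement UA string is only an illustrative value.
+
+```python
+import urllib.request
+
+# A default opener identifies itself as "Python-urllib/<version>"
+opener = urllib.request.build_opener()
+default_headers = dict(opener.addheaders)
+print(default_headers['User-agent'])
+
+# Setting a browser-like UA on a Request (illustrative value, not a required one)
+request = urllib.request.Request(
+    'http://www.baidu.com',
+    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'},
+)
+# Request normalizes header names, so the key is stored as 'User-agent'
+print(request.get_header('User-agent'))
+```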
+
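+Common HTTP status codes:
+
+- 200: the request succeeded and the server returned the requested data
+- 100 - 199: informational; the request was received and the client should continue
+- 200 - 299: the request succeeded
+- 400 - 499: client errors; 404 means the requested resource does not exist or could not be found
+- 500 - 599: server errors; the server failed to complete the client's request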
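+
+As a rough illustration of the ranges above, the status classes can be told apart with the standard library alone. This is a minimal sketch: `describe_status` is a hypothetical helper name, and the 300 - 399 range (redirection), which the list above does not cover, is included for completeness.
+
+```python
+from http import HTTPStatus
+
+
+def describe_status(code: int) -> str:
+    """Return a coarse description of an HTTP status code (hypothetical helper)."""
+    if 100 <= code <= 199:
+        return 'informational'
+    if 200 <= code <= 299:
+        return 'success'
+    if 300 <= code <= 399:
+        return 'redirection'
+    if 400 <= code <= 499:
+        return 'client error'
+    if 500 <= code <= 599:
+        return 'server error'
+    return 'unknown'
+
+
+# HTTPStatus supplies the standard reason phrase for a known code
+print(HTTPStatus(404).phrase, '->', describe_status(404))
+```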
+
+## urllib
+
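+urllib can also send requests, but its API differs from that of requests. It is fairly basic, though good enough for a quick demo.
+
+No installation is needed; it ships with Python.
+
+1. Basic request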
+
+    ```python
+    import urllib.request
+
+    url = 'http://www.baidu.com'
+    # Send the request
+    response = urllib.request.urlopen(url)
+    # Status code
+    # 200
+    print(response.getcode())
+    # Final URL of the response
+    # http://www.baidu.com
+    print(response.geturl())
+    # Response headers
+    # [('Bdpagetype', '1'), ……]
+    print(response.getheaders())
+    # Read and decode the response body
+    html = response.read().decode('UTF-8')
+    # Download the page and save it as baidu.html
+    urllib.request.urlretrieve(url, filename='baidu.html')
+    ```
+
+1. Passing headers
+
+    Search
+
+    ```python
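+    import urllib.parse
+    import urllib.request
+
+    url = 'https://www.sogou.com/web?query=%E9%A3%9E%E6%9C%BA'
+    # URL-decode to see the readable form
+    # https://www.sogou.com/web?query=飞机
+    print(urllib.parse.unquote(url))
+
+    # Requesting this URL as-is returns little useful data: that is anti-scraping at work
+    # The most common check is on the UA, so add a request header; copy the value from the browser's developer tools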
+    headers = {
+        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0'
+    }
+    request = urllib.request.Request(url, headers=headers)
+    response = urllib.request.urlopen(request)
+    print(response.read().decode('UTF-8'))
+    ```
+
+    Dynamic search
+
+    ```python
+    import urllib.parse
+
+    keyword = '飞机'
+    # URL-encode the keyword ('飞机' means "airplane")
+    # https://www.sogou.com/web?query=%E9%A3%9E%E6%9C%BA
+    url = f'https://www.sogou.com/web?query={urllib.parse.quote(keyword)}'
+    ```
+
+1. Passing data with a POST request
+
+    ```python
+    import urllib.request
+    import urllib.parse
+    import json
+
+    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
+    headers = {
+        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15',
+    }
+
+    formData = {
+        'cname': '',
+        'pid': '',
+        'keyword': '北京',
+        'pageIndex': 1,
+        'pageSize': 10,
+    }
+    # For a POST request, urlencode the dict and encode the result to bytes
+    formData = urllib.parse.urlencode(formData).encode('utf-8')
+    req = urllib.request.Request(url, data=formData, headers=headers)
+
+    response = urllib.request.urlopen(req)
+    print(response.getcode())
+    print(response.geturl())
+
+    data = json.loads(response.read().decode('utf-8'))
+    for i in data['Table1']:
+        print(i)
+    ```
+
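+1. Scraping multiple pages
+
+    In short, to scrape multiple pages you only change the request parameters: the URL stays the same, and only the parameters sent with it differ from page to page.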
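+
+    A minimal sketch of that idea, reusing the KFC endpoint from the POST example above. `build_page_request` and `fetch_pages` are hypothetical helpers, and the stop condition (an empty `Table1`) is an assumption about the response, not documented behavior of the endpoint.
+
+    ```python
+    import json
+    import urllib.parse
+    import urllib.request
+
+    URL = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
+
+
+    def build_page_request(page, keyword='北京', page_size=10):
+        """Build the POST request for one page; only pageIndex changes between pages."""
+        form = {'cname': '', 'pid': '', 'keyword': keyword,
+                'pageIndex': page, 'pageSize': page_size}
+        data = urllib.parse.urlencode(form).encode('utf-8')
+        return urllib.request.Request(URL, data=data, headers={'User-Agent': 'Mozilla/5.0'})
+
+
+    def fetch_pages(max_pages=3):
+        """Yield store records page by page (performs network requests)."""
+        for page in range(1, max_pages + 1):
+            with urllib.request.urlopen(build_page_request(page)) as response:
+                rows = json.loads(response.read().decode('utf-8')).get('Table1', [])
+            if not rows:  # assumed: an empty page means there is no more data
+                break
+            yield from rows
+    ```
+
+    Calling `list(fetch_pages(2))` would request pages 1 and 2 in turn; building the request separately from sending it keeps the parameter change easy to inspect.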
+
+## requests