Web Scraping and CAPTCHAs | 涂寐's Blogs

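Disclaimer

This tutorial is for learning and reference only. Do not use it for anything illegal; anyone who does bears the consequences themselves, and the author takes no responsibility. - 涂寐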

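Recognition methods

Manual recognition
Automated recognition
- http://fast.95man.com/

Using a cloud CAPTCHA-solving platform

Call it as described in the platform's official developer documentation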

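Logging in to gushiwen.cn (古诗文网)

Summary

Fetch the CAPTCHA image from the login page
Call the CAPTCHA-solving platform's API to recognize it
In the browser DevTools: Network -> Preserve log -> capture the login request
After analyzing the request, simulate the login. To keep the same session across requests, send all of them through one requests.Session object (session.get() / session.post())

Hands-on notes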

import os
import requests
from lxml import etree


class KSClient(object):

    def __init__(self):
        self.username = ''

        self.Token = ''

        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    # Fetch the API token (the service's API spells it "Taken")
    def GetTaken(self, username, password):
        brtn = False
        r = requests.get(
            'http://api.95man.com:8888/api/Http/UserTaken?user=' + username + '&pwd=' + password + '&isref=0',
            headers=self.headers)
        # On success the response starts with "1|" followed by the token
        arrstr = r.text.split('|')
        if (arrstr[0] == '1'):
            self.username = username
            self.Token = arrstr[1]
            brtn = True
        return brtn

    # Recognize a CAPTCHA image
    def PostPic(self, filepath, codetype):
        """
        filepath: path of the image file to upload
        codetype: CAPTCHA type; 1 is the general type. For more precise types see http://fast.net885.com/auth/main.html
        """
        strRtn = ''
        # Read the image as bytes and close the file afterwards
        with open(filepath, 'rb') as f:
            imbyte = f.read()
        filename = os.path.basename(filepath)

        files = {'imgfile': (filename, imbyte)}
        r = requests.post(
            'http://api.95man.com:8888/api/Http/Recog?Taken=' + self.Token + '&imgtype=' + str(codetype) + '&len=0',
            files=files, headers=self.headers)
        arrstr = r.text.split('|')
        # Response format: recognition ID|recognized text|account balance
        if (int(arrstr[0]) > 0):
            strRtn = arrstr[1]

        return strRtn

    # Report a wrong recognition result
    def ReportError(self, imageid):
        """
        imageid: ID of the image whose recognition result was wrong
        """
        r = requests.get('http://api.95man.com:8888/api/Http/ReportErr?Taken=' + self.Token + '&ImgID=' + str(imageid),
                         headers=self.headers)
        arrstr = r.text.split('|')
        if (arrstr[0] == '1'):
            print('Error report submitted.')
        else:
            print('Error report failed: ' + arrstr[1])


if __name__ == '__main__':

    session = requests.Session()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
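    # Fetch the login page
    url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
    page_text = session.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    code_img_src = 'https://so.gushiwen.cn' + tree.xpath('//*[@id="imgCode"]/@src')[0]
    # Download the CAPTCHA image through the same session
    img_data = session.get(url=code_img_src, headers=headers).content
    with open('./a.jpg', 'wb') as fp:
        fp.write(img_data)
    # Recognize it through the CAPTCHA-solving platform
    result = ''  # stays empty if the token request or the recognition fails
    Ks95man = KSClient()
    if Ks95man.GetTaken('ceroxg28594@chacuo.net', 'ceroxg28594@chacuo.net'):
        # Token obtained. Fetching it once is enough: the token never changes unless it is refreshed with "isref=1". If the token is hard-coded in your software, use "isref=1" with care, otherwise every copy that carries the old token has to be updated.

        # Start recognition
        # PostPic reads the file as bytes and uploads it
        result = Ks95man.PostPic('a.jpg', 1)
        print('Recognition result: ' + result)
        # Report a wrong result (88 is a hard-coded image ID in this note)
        Ks95man.ReportError(88)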
    # Simulate the login
    login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
    data = {
        '__VIEWSTATE': '3uK9eaUXyiZDe048G8d9KUZLVRTkC9BJiFsJawOYOuXppFLYhil5viqWYspVcZuittimiM+97bFsX9GRxBR13VessGfbfhhJxMyn08gr4PBxTeIeFg5d9tI5n10=',
        '__VIEWSTATEGENERATOR': 'C93BE1AE',
        'from': 'http://so.gushiwen.cn/user/collect.aspx',
        'email': 'npytig62015@chacuo.net',
        'pwd': 'npytig62015@chacuo.n',
        'code': result,
        'denglu': '登录'
    }
    # login_page_text = requests.post(url=login_url, headers=headers, data=data).text
    # with open('guishiwen.html', 'w', encoding='utf-8') as fp:
    #     fp.write(login_page_text)
    # Print the response status code
    # response = requests.post(url=login_url, headers=headers, data=data)
    # print(response.status_code)
    # The login itself succeeds; the post-login page could not be fetched because the site redirects again and re-authenticates, e.g. with a cookie.
    # From what I found, a bare requests.post() does not keep anything between calls: once the login call returns, its cookie is gone.
    # Requests sent through the shared session object keep the session, so the cookie is still there.
    login_session_text = session.post(url=login_url, headers=headers, data=data).text
    with open('gushiwenwang.html', 'w', encoding='utf-8') as fp:
        fp.write(login_session_text)
    print("Done!")
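
The login payload in the script hard-codes __VIEWSTATE and __VIEWSTATEGENERATOR, copied from one captured request. ASP.NET pages normally emit these as hidden input fields, so reading them from the login page that was just fetched is likely more robust than pasting them in. The sketch below uses only the standard library; the names HiddenFieldParser and extract_hidden_fields and the sample HTML are made up for illustration, and the real page's markup has not been checked here.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attr = dict(attrs)
        # Only hidden inputs that carry a name are form state we must echo back
        if (attr.get('type') or '').lower() == 'hidden' and attr.get('name'):
            self.fields[attr['name']] = attr.get('value') or ''


def extract_hidden_fields(page_text):
    """Return a dict of all hidden form fields found in page_text."""
    parser = HiddenFieldParser()
    parser.feed(page_text)
    return parser.fields


# Hypothetical stand-in for the login page HTML
SAMPLE_HTML = '''
<form method="post" action="./login.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123==" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" value="C93BE1AE" />
  <input type="text" name="email" />
</form>
'''
```

In the script, something like data.update(extract_hidden_fields(page_text)) just before the POST would overwrite the pasted values with the ones from the current page.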

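Cookies

Key points and Q&A

HTTP and HTTPS are stateless protocols
The login request reports an incorrect verification code: the login request was not sent in the same session state that fetched the CAPTCHA.
Handling cookies to get past anti-scraping checks
- Manual: put the cookie in the headers, headers = {'cookie': '***'}
- Automatic (session object)
  1. The POST request that simulates the login receives the cookie returned by the server
  2. Session object: cookies produced by requests sent through a session are stored in the session object automatically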
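
The "incorrect verification code" symptom can be reproduced offline. The sketch below is a toy, not the real site: a local server (DemoHandler, a made-up name) issues a cookie when the CAPTCHA is fetched and only accepts a login that carries it. It uses the standard library's cookie jar instead of requests so that it runs without network access or extra packages; requests.Session keeps a cookie jar in the same spirit.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: /captcha issues a session cookie, any other path needs it."""

    def do_GET(self):
        if self.path == '/captcha':
            body = b'captcha issued'
            self.send_response(200)
            self.send_header('Set-Cookie', 'sid=demo123; Path=/')
        else:
            has_session = 'sid=demo123' in (self.headers.get('Cookie') or '')
            body = b'login ok' if has_session else b'captcha incorrect'
            self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(opener, url):
    with opener.open(url, timeout=5) as resp:
        return resp.read().decode()


def run_demo():
    """Return (login result without a cookie jar, login result with one)."""
    server = HTTPServer(('127.0.0.1', 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = 'http://127.0.0.1:%d' % server.server_port
    no_proxy = urllib.request.ProxyHandler({})  # ignore proxy env variables
    try:
        # Stateless client: every request starts from scratch
        plain = urllib.request.build_opener(no_proxy)
        fetch(plain, base + '/captcha')
        without_jar = fetch(plain, base + '/login')
        # Stateful client: the jar stores the cookie and sends it back
        jar = http.cookiejar.CookieJar()
        stateful = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(jar))
        fetch(stateful, base + '/captcha')
        with_jar = fetch(stateful, base + '/login')
    finally:
        server.shutdown()
        server.server_close()
    return without_jar, with_jar
```

Calling run_demo() returns the two login responses, first for the client that forgot the cookie and then for the one that kept it.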

Session workflow

Create the object: session = requests.session()
Send requests with session.post() or session.get()

Example

See the gushiwen.cn hands-on notes

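Proxy servers

What proxies are for

Counter anti-scraping measures: get around per-IP limits
Hide the real IP

Proxy types

http: used for URLs that use the HTTP protocol
https: used for URLs that use the HTTPS protocol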
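
To make the http/https distinction concrete: requests picks the proxy entry whose key matches the scheme of the target URL, so an entry under 'https' does nothing for an http:// URL. The helper below is a simplified imitation of that lookup, not the library's own code; the name pick_proxy is made up for this note, and the real lookup also understands host-specific keys.

```python
from urllib.parse import urlparse


def pick_proxy(url, proxies):
    """Return the proxy URL that applies to `url`, or None for a direct connection.

    Simplified: tries the URL scheme first, then an optional 'all' fallback.
    """
    scheme = urlparse(url).scheme.lower()
    return proxies.get(scheme) or proxies.get('all')
```

With proxies = {'https': 'http://222.128.171.133:3128'}, pick_proxy('http://httpbin.org/ip', proxies) gives None, which means the request would go out directly and show the real IP.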

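Proxy anonymity levels

Transparent: the server knows a proxy is in use and knows the real IP
Anonymous: the server knows a proxy is in use but not the real IP
High anonymity: the server knows neither that a proxy is in use nor the real IP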
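
One rough way to tell these levels apart is to look at the headers that reach the target server. The sketch below assumes the common convention that proxies announce themselves through headers such as Via or X-Forwarded-For; real proxies vary, and the function name and the header list are assumptions made for illustration.

```python
import re

# Headers that commonly betray a proxy (an assumed, non-exhaustive list)
PROXY_HINT_HEADERS = ('via', 'x-forwarded-for', 'forwarded', 'proxy-connection')


def classify_anonymity(received_headers, real_ip):
    """Classify a proxy as 'transparent', 'anonymous' or 'high anonymity'.

    received_headers: headers as seen by the target server (dict).
    real_ip: the client's real public IP address.
    """
    lowered = {k.lower(): str(v) for k, v in received_headers.items()}
    hints = [lowered[h] for h in PROXY_HINT_HEADERS if h in lowered]
    for value in hints:
        # Split into tokens so '1.2.3.4' does not match inside '11.2.3.45'
        if real_ip in re.split(r'[\s,;="]+', value):
            return 'transparent'   # proxy visible, real IP leaked
    if hints:
        return 'anonymous'         # proxy visible, real IP hidden
    return 'high anonymity'        # nothing reveals a proxy
```

The received headers could come from an echo service such as httpbin.org, which returns the headers it saw.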

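Proxy example 1

Honestly, the IP did not change when I ran this. The most likely reason: requests picks a proxy by the scheme of the target URL, and the dict originally had only an 'https' entry while the URL is http://, so the proxy was never used. The version here registers the proxy for both schemes. The proxy URL itself uses http:// because a proxy on a port like 3128 usually speaks plain HTTP, even when it tunnels HTTPS traffic.

import requests

# Free proxy; it may no longer be alive
proxy = {
    'http': 'http://222.128.171.133:3128',
    'https': 'http://222.128.171.133:3128',
}

response = requests.get("http://httpbin.org/ip", proxies=proxy, timeout=10)
print(response.text)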

Proxy example 2

This one went even worse: with the same proxy, requests to Baidu kept failing.
Yes, yes, I did search for it; maybe Baidu is holding something back.
I won't dwell on it. I'm leaving the notes here and will fix them up when I need them.

Confirmed: you have to pay. Free proxies are basically gone.

#!/usr/bin/env python3
# -*-coding:utf-8-*-
import time

import requests

if __name__ == "__main__":
    # session = requests.Session()
    # session.trust_env = False
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
    url = "https://www.baidu.com/s?ie=UTF-8&wd=ip"
    # First error (typical when the proxy URL says https:// but the proxy expects plain HTTP)
    # requests.exceptions.SSLError: HTTPSConnectionPool(host='www.baidu.com', port=443): Max retries exceeded with url: /s?ie=UTF-8&wd=ip (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1076)')))
    # Second error (the proxy refused the connection, so it was probably dead)
    # requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.baidu.com', port=443): Max retries exceeded with url: /s?ie=UTF-8&wd=ip (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002426CB7C408>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it.')))
    proxies = {'https': 'http://222.128.171.133:3128'}
    i = 0
    # Try a few times instead of looping forever
    while i < 3:
        try:
            response = requests.get(url=url, headers=headers, proxies=proxies, verify=False, timeout=10)
            print("status:", response.status_code)
            with open('./proxy.html', 'w', encoding='utf-8') as fp:
                fp.write(response.text)
            print('Take a look at ./proxy.html')
            break
        except requests.exceptions.RequestException as exc:
            i = i + 1
            print("Request failed", i, exc)
            time.sleep(5)
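
Since free proxies tend to die quickly, one way to cope is to test each candidate before relying on it. The sketch below is only an illustration: first_working_proxy and check_with_requests are made-up names, the test URL is an arbitrary choice, and the check function is passed in so the selection logic can be tried without any network access.

```python
def check_with_requests(proxy_url, test_url='http://httpbin.org/ip', timeout=5):
    """Return True if test_url can be fetched through proxy_url (needs network)."""
    import requests  # imported here so the helper below works without it
    try:
        r = requests.get(test_url,
                         proxies={'http': proxy_url, 'https': proxy_url},
                         timeout=timeout)
        return r.ok
    except requests.exceptions.RequestException:
        return False


def first_working_proxy(candidates, check=check_with_requests):
    """Return the first proxy URL for which check() is true, else None."""
    for proxy_url in candidates:
        if check(proxy_url):
            return proxy_url
    return None
```

A call such as first_working_proxy(['http://222.128.171.133:3128']) would then return either a usable proxy URL or None, in which case a direct connection or a paid proxy is the fallback.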
