Foreword
The experiment code is kept as a single standalone module rather than split by category. The notes are dense, and they are much easier to follow with the code at hand.
Goals

Asynchronous approaches

- Multithreading / multiprocessing
  - Advantage: a dedicated thread or process can be opened for each blocking operation, giving asynchrony.
  - Drawback: threads and processes cannot be opened without limit.
- Thread pool / process pool
  - Advantage: lowers how often the system creates and destroys threads/processes, which reduces memory overhead.
  - Drawback: the pool holds only a fixed number of threads/processes.
- Single thread + async coroutines
  - event_loop: the event-loop object. Methods waiting to run (coroutine objects) are registered on it.
  - coroutine: a coroutine object. A function defined with the async keyword returns a coroutine object when called instead of executing immediately; the event loop drives it.
  - task: a task object. It wraps a coroutine object and tracks the task's state.
  - future: a task that has run or has yet to run; essentially no different from a task.
  - async: defines a coroutine.
  - await: suspends execution at a blocking call.

The selenium module

- How is selenium related to crawling? What is the selenium module?
- Usage steps:
  1. Install the environment: pip install selenium
  2. Download the browser driver.
  3. Instantiate a browser object:
     from selenium import webdriver
     from selenium.webdriver.chrome.service import Service
     service = Service(executable_path='./chromedriver.exe')
     driver = webdriver.Chrome(service=service)
  4. Write the browser-automation code:
     - send a request: get(url)
     - locate elements: find_element() together with the By class
     - interact with elements: send_keys('xxx')
     - run JavaScript: execute_script('js code')
     - go back: back()
     - go forward: forward()
     - close the browser: quit()
- Handling iframes with selenium: when the target element sits inside an iframe tag, use switch_to.frame() to change the locating scope.
- Action chains (import with: from selenium.webdriver import ActionChains)
  - click and hold: click_and_hold(element)
  - drag: move_by_offset(x, y)
  - execute the queued actions immediately: perform()
  - release the chain: release()
- Options-class methods
  - binary_location: set the location of the Chrome binary
  - add_argument: add a launch argument
  - add_extension / add_encoded_extension: add an extension
  - add_experimental_option: add an experimental option
  - debugger_address: set the debugger address

Code examples
The rest of this post walks through the example code, with comments on individual statements.
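The asyncio terms above map onto only a few lines of code. A minimal sketch (the `fetch` function and the `'a'` tag are illustrative placeholders, not part of the examples below):

```python
import asyncio

async def fetch(tag):
    # `await` suspends this coroutine and yields control to the event loop
    await asyncio.sleep(0.1)
    return 'done-' + tag

async def main():
    coro = fetch('a')                   # calling fetch() only builds a coroutine object
    task = asyncio.ensure_future(coro)  # wrapping it in a Task schedules it on the loop
    return await task                   # a Task's result is the coroutine's return value

result = asyncio.run(main())  # asyncio.run creates the event loop and drives main()
print(result)                 # done-a
```

The sections below build these pieces up one by one, starting from plain synchronous code.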
Synchronous crawler test
First, get a feel for a synchronous crawler.
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62'
}

urls = [
    'https://downsc.chinaz.net/Files/DownLoad/jianli/202112/jianli16623.rar',
    'https://downsc.chinaz.net/Files/DownLoad/jianli/202112/jianli16611.rar',
    'https://downsc.chinaz.net/Files/DownLoad/jianli/202112/jianli16615.rar',
]


class RequestGet(object):
    def get_content(self, url):
        print('Crawling: ' + url)
        response = requests.get(url=url, headers=headers)
        if response.status_code == 200:
            return response.content

    def parse_content(self, content):
        print('Data length:', len(content))


if __name__ == "__main__":
    request_get = RequestGet()
    for url in urls:
        # Fully serial: each download must finish before the next one starts
        content = request_get.get_content(url)
        request_get.parse_content(content)
```
Basic thread pool usage
Compares the effect of synchronous single-threaded serial execution against a thread pool.
```python
import time
from multiprocessing.dummy import Pool  # a thread pool behind the multiprocessing API


def get_page(name):  # parameter renamed from `str` to avoid shadowing the builtin
    print('Downloading ' + name)
    time.sleep(2)  # simulate a blocking download
    print('Downloaded ' + name)


name_list = ['xiaozi', 'aa', 'bb', 'cc']

start_time = time.time()
pool = Pool(4)                 # four worker threads
pool.map(get_page, name_list)  # blocks until every task has finished
end_time = time.time()
print('%d second' % (end_time - start_time))  # ~2s with the pool vs ~8s serially
```
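multiprocessing.dummy.Pool is simply a thread pool dressed in the multiprocessing API; the standard library's concurrent.futures exposes the same idea more explicitly. A hedged sketch of the equivalent (with a shorter sleep so it runs fast):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_page(name):
    time.sleep(0.5)  # stand-in for a blocking download
    return 'done ' + name


name_list = ['xiaozi', 'aa', 'bb', 'cc']

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Same semantics as Pool.map: submit all items, collect results in order
    results = list(pool.map(get_page, name_list))
elapsed = time.time() - start

print(results)
print('%.1f seconds' % elapsed)  # roughly 0.5s: all four sleeps overlap
```

The `with` block also handles shutting the pool down, which `multiprocessing.dummy.Pool` leaves to you.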
Crawling Pearvideo with a thread pool
Pearvideo now loads videos via Ajax and obfuscates the video URL inside the Ajax response; see the code below for the details. One more aside: the thread pool's time savings are more obvious when the download volume is large.
```python
import os
import re
import time
from multiprocessing.dummy import Pool

import requests
from lxml import etree


class Video(object):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62'
    }

    def video_url(self):
        """Collect a {name, url} dict for every video on the category page."""
        urls = []
        url = 'https://www.pearvideo.com/category_130'
        page_text = requests.get(url=url, headers=self.headers).text
        tree = etree.HTML(page_text)
        li_list = tree.xpath('//ul[@id="categoryList"]/li')
        for li in li_list:
            detail_url = 'https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
            name = li.xpath('./div/a/div[2]/text()')[0] + '.mp4'
            print(name + ' --> ' + detail_url)

            # The numeric id at the end of the detail URL identifies the video
            ex = 'video_(.*)'
            video_id = re.findall(ex, detail_url)[0]

            # The real video address comes from an Ajax endpoint, not the page itself
            ajax_url = ('https://www.pearvideo.com/videoStatus.jsp?'
                        'contId={false_id}&mrd=0.6923628803380188'.format(false_id=video_id))
            headers = {
                'User-Agent': self.headers['User-Agent'],
                # The endpoint refuses requests whose Referer is not the detail page
                'Referer': detail_url
            }
            ajax_page_json = requests.get(url=ajax_url, headers=headers).json()

            # srcUrl in the response embeds a fake id; splice the real one back in
            ea = "', 'srcUrl': '.*?(-.*?)'}}}"
            eb = ", 'srcUrl': '(.*/)"
            video_url_a = re.findall(ea, str(ajax_page_json))[0]
            video_url_b = re.findall(eb, str(ajax_page_json))[0]
            video_url = video_url_b + 'cont-' + video_id + video_url_a
            print(video_url)

            urls.append({'name': name, 'url': video_url})
        return urls

    def video_data(self, item):  # parameter renamed from `dict` to avoid shadowing the builtin
        url = item['url']
        time.sleep(19)  # long pause to avoid hammering the server
        data = requests.get(url=url, headers=self.headers).content
        print(item['name'] + ' downloading ...')
        with open('./pearvideo/' + item['name'], 'wb') as fp:
            fp.write(data)
        print(item['name'] + ' done!')


if __name__ == '__main__':
    os.makedirs('./pearvideo', exist_ok=True)  # the output directory must exist
    video = Video()
    urls = video.video_url()
    start = time.time()
    pool = Pool(4)
    pool.map(video.video_data, urls)
    end = time.time()
    print('All downloads finished ...')
    print(end - start)
```
Single-task coroutine
To understand the hard parts, first understand the simpler pieces underneath. Enough said; read the code.
```python
import asyncio


async def request(url):
    print('Requesting url: ' + url)
    print('Request done: ' + url)
    return url


# Calling an async def function does NOT run it; it returns a coroutine object
c = request('0xtlu.me')


def callback_func(task):
    # task.result() is the coroutine's return value
    print(task.result())


loop = asyncio.get_event_loop()        # create/fetch the event loop
task = asyncio.ensure_future(c)        # wrap the coroutine object in a Task
task.add_done_callback(callback_func)  # fires once the task finishes
loop.run_until_complete(task)          # drive the loop until the task is done
```
Multi-task coroutines
With single tasks covered, on to multiple tasks.
```python
import asyncio
import time


async def request(url):
    print('Downloading ' + url)
    # asyncio.sleep is awaitable; a blocking time.sleep(2) here would serialize everything
    await asyncio.sleep(2)
    print('Downloaded ' + url)


urls = ['www.baidu.com', '0xtlu.me', 'www.google.com', 'www.sogou.com']

tasks = []
start = time.time()
for url in urls:
    c = request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))  # run all four tasks concurrently

print(time.time() - start)  # ~2 seconds, not 8
```
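The pattern above uses get_event_loop/ensure_future, which still works but is discouraged on newer Pythons. Since Python 3.7 the same multi-task pattern is usually spelled with asyncio.run plus asyncio.gather; a sketch of the equivalent, with a shorter sleep so it finishes quickly:

```python
import asyncio
import time


async def request(url):
    await asyncio.sleep(0.2)  # non-blocking stand-in for a download
    return 'downloaded ' + url


async def main(urls):
    # gather schedules all coroutines concurrently and preserves the input order
    return await asyncio.gather(*(request(u) for u in urls))


urls = ['www.baidu.com', '0xtlu.me', 'www.google.com', 'www.sogou.com']
start = time.time()
results = asyncio.run(main(urls))  # creates, drives, and closes the loop
elapsed = time.time() - start

print(results)
print('%.1f s' % elapsed)  # roughly 0.2 s, not 0.8 s
```

asyncio.run owns the loop's whole lifetime, so there is no dangling loop to close by hand.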
Multi-task coroutine experiment
This is an unfortunate example: it uses a synchronous module, so no multi-task async coroutine effect can be observed.
```python
import asyncio
import time

import requests

start = time.time()
urls = [
    'https://0xtlu.me/article/c2461216.html',
    'https://0xtlu.me/article/82b683e3.html',
    'https://0xtlu.me/article/c86c0afc.html',
]


async def get_page(url):
    print('Downloading ' + url)
    # requests is synchronous: this call blocks the whole event loop,
    # so the three "coroutines" still run one after another
    response = requests.get(url=url)
    print('Downloaded ' + response.text)


tasks = []
for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print('Elapsed: ' + str(end - start))
```
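Swapping in aiohttp (next section) is one fix. Another, if a blocking library must stay, is to push the blocking call onto a worker thread with asyncio.to_thread (Python 3.9+). A sketch using time.sleep as a stand-in for requests.get, with placeholder URLs:

```python
import asyncio
import time


def blocking_fetch(url):
    time.sleep(0.5)  # stands in for a blocking requests.get(url)
    return 'page of ' + url


async def get_page(url):
    # to_thread runs the blocking function in a thread, so the loop stays free
    return await asyncio.to_thread(blocking_fetch, url)


async def main():
    urls = ['a.example', 'b.example', 'c.example']
    return await asyncio.gather(*(get_page(u) for u in urls))


start = time.time()
pages = asyncio.run(main())
elapsed = time.time() - start

print(pages)
print('elapsed %.1f s' % elapsed)  # roughly 0.5 s: the three blocking calls overlap
```

This keeps the requests-style code intact at the cost of one thread per in-flight call; a真正 async client like aiohttp avoids the threads entirely.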
Multi-task async coroutines with aiohttp
The solution to the previous problem: replace the synchronous requests module with the aiohttp module so the coroutines actually run concurrently.
```python
import asyncio
import time

import aiohttp

start = time.time()
urls = [
    'https://0xtlu.me/article/c2461216.html',
    'https://0xtlu.me/article/82b683e3.html',
    'https://0xtlu.me/article/c86c0afc.html',
]


async def get_page(url):
    print('Downloading ' + url)
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # text() is a coroutine here, so it must be awaited
            page_text = await response.text()
            print('Downloaded ' + page_text)


tasks = []
for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print('Elapsed: ' + str(end - start))
```
Basic use of the selenium module
I had seen Chrome automation in action before; after using the selenium module it turns out this was the culprit all along.
```python
from time import sleep

from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service(executable_path='./chromedriver.exe')
driver = webdriver.Chrome(service=service)

driver.get('http://scxk.nmpa.gov.cn:81/xk/')

# page_source is the DOM *after* JavaScript has run -- the point of using selenium
page_text = driver.page_source
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@id="gzlist"]/li')
for li in li_list:
    name = li.xpath('./dl/@title')[0]
    print(name)

sleep(6)
driver.close()
```
Other selenium features
The previous section covered basic use; now look at element location, data interaction, back/forward navigation, and so on.
```python
from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

service = Service(executable_path='./chromedriver.exe')
driver = webdriver.Chrome(service=service)

driver.get('https://www.taobao.com/')

# Locate the search box by id and type into it
q_id = driver.find_element(By.ID, 'q')
q_id.send_keys('iphone')

# Locate the search button via a CSS selector and click it
button = driver.find_element(By.CSS_SELECTOR, '.btn-search')
button.click()

# Run arbitrary JavaScript: scroll one screen height down
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')

driver.get('https://0xtlu.github.io')
sleep(2)
driver.back()     # browser "back"
sleep(2)
driver.forward()  # browser "forward"
sleep(6)
driver.quit()
```
selenium: iframes and action chains
Login pages usually include some kind of verification, and it commonly lives inside an iframe tag. After switching into the iframe's scope, a sequence of movements has to be performed: this is where action chains come in.
```python
from time import sleep

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

service = Service(executable_path='./chromedriver.exe')
driver = webdriver.Chrome(service=service)

driver.get('https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')

# The draggable element lives inside an iframe, so switch the
# locating scope into that frame first
driver.switch_to.frame('iframeResult')
draggable_id = driver.find_element(By.ID, 'draggable')

action = ActionChains(driver)
action.click_and_hold(draggable_id)  # press and hold the element

for i in range(5):
    # move_by_offset(x, y) queues a move; perform() executes it immediately
    action.move_by_offset(50, 0).perform()
    sleep(0.3)

action.release().perform()  # let go of the mouse button
sleep(3)
action.reset_actions()      # clear the stored actions
driver.quit()
```
QQ-zone login (not completed)
The login page has not changed, but aligning the slider with the gap is fiddly, so I have not bothered to work it out yet. Still, this is a good place to learn action chains.
```python
from time import sleep

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

service = Service(executable_path='./chromedriver.exe')
driver = webdriver.Chrome(service=service)

driver.get('https://qzone.qq.com/')

# The login form sits inside an iframe
driver.switch_to.frame('login_frame')

# Switch from QR-code login to account/password login
a_tag = driver.find_element(By.ID, 'switcher_plogin')
a_tag.click()

username = driver.find_element(By.ID, 'u')
password = driver.find_element(By.ID, 'p')
username.clear()
password.clear()
username.send_keys('yourUser')
password.send_keys('yourPasswd!')

login = driver.find_element(By.ID, 'login_button')
sleep(2)
login.click()
sleep(6)

# The slider captcha sits inside yet another iframe
driver.switch_to.frame('tcaptcha_iframe')
slide_block = driver.find_element(By.ID, 'slideBlock')

action = ActionChains(driver)
action.click_and_hold(slide_block)
# Drag the block right; a fixed 183px rarely lines up with the gap,
# which is why this login stays unfinished
action.move_by_offset(183, 0).perform()
sleep(3)
action.release().perform()
sleep(36)
driver.quit()
```
Headless mode and evading bot detection
Headless operation (no visible browser window) is achieved through Chrome launch flags. The anti-detection part works mainly by masking automation fingerprints: hiding the automation-control banner, disabling certain Blink automation features, injecting JavaScript, and so on.
```python
from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

option = Options()
# Headless mode: no visible browser window
option.add_argument('--headless')
option.add_argument('--disable-gpu')
# Headless Chrome advertises "HeadlessChrome" in its UA string, so override it
option.add_argument(
    'user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"'
)
# Hide the "Chrome is being controlled by automated software" banner
option.add_experimental_option('excludeSwitches', ['enable-automation', 'load-extension'])

service = Service(executable_path='./chromedriver.exe')
driver = webdriver.Chrome(service=service, options=option)

driver.get('https://www.baidu.com/')
print(driver.page_source)
sleep(6)
driver.quit()
```
Logging in to 12306
Ta-da, time to see the evasion in action. Note that no headless processing is applied here.
```python
from time import sleep

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

option = Options()
# Strip the automation banner and the automation extension
option.add_experimental_option('excludeSwitches', ['enable-automation', 'load-extension'])
option.add_argument('--disable-blink-features=AutomationControlled')
option.add_experimental_option('useAutomationExtension', False)

# Disable the password-manager popups
prefs = {
    'credentials_enable_service': False,
    'profile.password_manager_enabled': False,
}
option.add_experimental_option('prefs', prefs)

service = Service(executable_path='./chromedriver.exe')
driver = webdriver.Chrome(service=service, options=option)

# Inject JS into every new document so navigator.webdriver reads undefined
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
    """
})

driver.get('https://kyfw.12306.cn/otn/resources/login.html')

username = driver.find_element(By.ID, 'J-userName')
password = driver.find_element(By.ID, 'J-password')
sleep(1)
username.clear()
password.clear()
sleep(2)
username.send_keys('yourUser')
sleep(1)
password.send_keys('yourPasswd')
sleep(2)

login = driver.find_element(By.ID, 'J-login')
sleep(2)
login.click()
sleep(2)

# The slider captcha: drag the block all the way to the right
slide = driver.find_element(By.ID, 'nc_1_n1z')
sleep(2)
action = ActionChains(driver)
action.click_and_hold(slide)
action.move_by_offset(300, 0).perform()
action.release().perform()
sleep(1)

sleep(10)
print(driver.page_source)
action.reset_actions()
driver.quit()
```
Afterword
Done, written through once. Understood? Not yet. Which is exactly why notes exist: for review later on.