Disclaimer
This tutorial is for learning and reference only. Do not use it for any illegal purpose; violators bear the consequences themselves, and the author assumes no responsibility. – 涂寐
Focused Crawler
Coding Workflow
- Specify the URL
- Send the request
- Parse the data
- Persist the results (a minimal sketch of all four steps follows this list)
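A minimal sketch of the workflow, assuming an arbitrary demo target (httpbin.org) and a throwaway output file — neither is part of the original tutorial:

```python
import re

import requests

if __name__ == "__main__":
    # 1. specify the URL (demo target chosen only for illustration)
    url = 'https://httpbin.org/html'
    headers = {'User-Agent': 'Mozilla/5.0'}
    # 2. send the request
    page_text = requests.get(url=url, headers=headers).text
    # 3. parse the data: pull the <h1> text with a simple regex
    titles = re.findall(r'<h1>(.*?)</h1>', page_text, re.S)
    # 4. persist the results
    with open('./demo.txt', 'w', encoding='utf-8') as fp:
        fp.write('\n'.join(titles))
```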
Parsing Methods
- Regular expressions
- bs4 parsing
- xpath parsing
Usage in Brief
- The content you want is stored either between tags or as a tag attribute
- Locate the tag
- Extract the content from between the tags or from the attribute value (a short bs4 sketch follows this list)
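For example, in a fragment like `<a href="/article/1" class="title">文章标题</a>`, the title sits between the tags while the link sits in the href attribute. A minimal "locate, then extract" sketch with bs4 — the fragment and class name are invented for illustration:

```python
from bs4 import BeautifulSoup

# invented fragment: text between the tags vs. value stored in an attribute
snippet = '<a href="/article/1" class="title">文章标题</a>'
soup = BeautifulSoup(snippet, 'lxml')

a_tag = soup.find('a', class_='title')  # locate the tag
print(a_tag.string)                     # between the tags -> 文章标题
print(a_tag['href'])                    # attribute value  -> /article/1
```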
Regular Expressions
Sample page source
```html
<div class="thumb">
    <a href="/article/124982889" target="_blank">
        <img src="//pic.qiushibaike.com/system/pictures/12498/124982889/medium/B39EVD457VB64VZH.jpg" alt="糗事#124982889" class="illustration" width="100%" height="auto">
    </a>
</div>
```
Regex to extract the src attribute
```python
ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
```
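Run against the sample source above, re.findall returns only the captured group — a quick check (variable names are just for illustration):

```python
import re

page_text = '''<div class="thumb">
<a href="/article/124982889" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12498/124982889/medium/B39EVD457VB64VZH.jpg" alt="糗事#124982889" class="illustration" width="100%" height="auto">
</a>
</div>'''

ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
# re.S lets .*? match across line breaks, which real page source always contains
img_src_list = re.findall(ex, page_text, re.S)
print(img_src_list)  # ['//pic.qiushibaike.com/system/pictures/12498/124982889/medium/B39EVD457VB64VZH.jpg']
```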
Qiushibaike (糗事百科) images
```python
import os.path
import re

import requests

if __name__ == "__main__":
    # create the output directory on first run
    if os.path.exists('./qiutuLibs') is False:
        os.mkdir('./qiutuLibs')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
    }
    url = 'https://www.qiushibaike.com/imgrank/page/%d/'
    for pageNum in range(1, 2):
        new_url = url % pageNum  # fill the %d placeholder with the page number
        print(new_url)
        page_text = requests.get(url=new_url, headers=headers).text
        # capture the src of every thumbnail image
        ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
        img_src_list = re.findall(ex, page_text, re.S)
        for src in img_src_list:
            src = 'https:' + src
            img_data = requests.get(url=src, headers=headers).content
            img_name = src.split('/')[-1]
            imgPath = './qiutuLibs/' + img_name
            with open(imgPath, 'wb') as fp:  # with closes the file automatically
                fp.write(img_data)
            print(img_name, "下载成功")
```
BS4 Parsing
How it works, briefly
- Instantiate a BeautifulSoup object and load the page source into it
- Call the BeautifulSoup object's attributes and methods to locate tags and extract data
Environment setup
```bash
pip install bs4
pip install lxml
```
Usage overview
```python
from bs4 import BeautifulSoup
```
Instantiating a BeautifulSoup object
Load a local HTML document into a BeautifulSoup object
```python
fp = open('./localWeb.html', 'r', encoding='utf-8')
soup = BeautifulSoup(fp, 'lxml')
```
Or load page source fetched from a website into the BeautifulSoup object
```python
page_text = response.text
soup = BeautifulSoup(page_text, 'lxml')
```
Relevant attributes and methods
```python
# The test HTML here is the page source of the author's blog home page
soup.tagName                                        # returns the first tagName tag in the document
soup.tagName['PropertyName']                        # extracts the value of the attribute named PropertyName from that tag
soup.find('tagName')                                # returns the first tagName tag in the document
soup.find('a', class_="active")                     # narrows the match further by attribute
soup.find_all('tagName')                            # returns a list of all tagName tags
soup.select('.selectorName')                        # returns a list matched by an id/class/CSS selector
soup.select('.header-drawer > ul > li > a')         # hierarchical selector; each > is one level
soup.select('.header-drawer > ul a')                # a space spans multiple levels
soup.select('.header-drawer > ul a')[2]             # list-style indexing; here the a tag at position 3
soup.select('.header-drawer > ul a')[1].text        # .text returns all text under the tag
soup.select('.header-drawer > ul a')[3].string      # .string returns only the tag's direct text; try it with find()
soup.select('.header-drawer > ul a')[4].get_text()  # get_text() returns all text under the tag
```
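The blog home page used above isn't included here, so here is a self-contained sketch on a made-up fragment; it shows the difference between text/get_text() (all descendant text) and string (direct text only):

```python
from bs4 import BeautifulSoup

# made-up fragment standing in for the blog home page used above
html = '''
<div class="header-drawer">
  <ul>
    <li><a href="/archives">归档</a></li>
    <li><a href="/tags">标签 <span>(42)</span></a></li>
  </ul>
</div>'''
soup = BeautifulSoup(html, 'lxml')

a_list = soup.select('.header-drawer > ul a')  # a space spans multiple levels
print(soup.a['href'])                          # attribute value        -> /archives
print(a_list[1].text)                          # all text under the tag -> 标签 (42)
print(a_list[1].string)                        # direct text only; this tag has mixed children -> None
print(a_list[1].get_text())                    # same as .text          -> 标签 (42)
```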
Local test
```python
from bs4 import BeautifulSoup

if __name__ == "__main__":
    fp = open('./localWeb.html', 'r', encoding='utf-8')
    soup = BeautifulSoup(fp, 'lxml')
    print(soup.a['href'])
```
Romance of the Three Kingdoms (三国演义) novel
```python
import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    # re-encode to work around the site's mismatched encoding declaration
    page_text = requests.get(url=url, headers=headers).text.encode('ISO-8859-1')
    print(page_text)
    soup = BeautifulSoup(page_text, 'lxml')
    # every chapter link sits in an li under div.book-mulu
    li_list = soup.select('.book-mulu > ul > li')
    fp = open('./sanguo.txt', 'w', encoding='utf-8')
    for li in li_list:
        title = li.a.string
        detail_url = 'https://www.shicimingju.com' + li.a['href']
        detail_page_text = requests.get(url=detail_url, headers=headers).text.encode('ISO-8859-1').decode('utf-8')
        detail_soup = BeautifulSoup(detail_page_text, 'lxml')
        div_tag = detail_soup.find('div', class_='chapter_content')
        content = div_tag.get_text()
        fp.write(title + ':' + content + '\n')
        print(title, '爬取成功!!!')
    fp.close()
```
XPath Parsing
How it works
- Instantiate an etree object and load the page source to be parsed into it
- Call the etree object's xpath method with an XPath expression to locate tags and capture content
Environment
- pip install lxml
Usage examples
- Local file: etree.parse(filePath)
- Network response: etree.HTML(page_text)
XPath expressions
```
/: one level; start locating from the html (root) tag
//: multiple levels; start locating from anywhere in the document
tagName[@class="PropertyName"]: attribute-based, precise location
tagName[@class="PropertyName"]/tagName[1]: index-based location; indexing starts at 1
tagName[@class="PropertyName"]/tagName/text(): /text() takes the tag's direct text, //text() takes all text below the tag; both return a list you can index into
tagName[@class="PropertyName"]/tagName/@PropertyName: @ extracts an attribute value from the tag
```
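The rules above, exercised on a small made-up fragment (the element names, classes, and text are invented only to illustrate each expression):

```python
from lxml import etree

# invented fragment to exercise the expressions above
html = '''
<div class="song">
  <p class="title">曲名</p>
  <ul>
    <li><a href="/a1">第一首</a></li>
    <li><a href="/a2">第二首</a></li>
  </ul>
</div>'''
tree = etree.HTML(html)  # the HTML parser wraps the fragment in html/body

print(tree.xpath('/html/body/div/p/text()'))                  # / walks one level at a time      -> ['曲名']
print(tree.xpath('//div[@class="song"]/ul/li[1]//text()'))    # attribute + index (1-based)      -> ['第一首']
print(tree.xpath('//li/a/@href'))                             # @ extracts attribute values      -> ['/a1', '/a2']
```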
Local test
```python
from lxml import etree

if __name__ == "__main__":
    # the default parser can garble non-ASCII pages, so set the encoding explicitly
    parser = etree.HTMLParser(encoding='utf-8')
    tree = etree.parse('localWeb.html', parser=parser)
    r = tree.xpath('//div[@class="s-icon-list"]/span[1]')
    r = tree.xpath('//div[@class="header-drawer"]//li//text()')
    print(r)
```
58.com (58二手房) second-hand housing listings
```python
import requests
from lxml import etree

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36 Edg/96.0.1054.53'
    }
    url = 'https://bj.58.com/ershoufang/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # each listing is a div under <section class="list">
    div_list = tree.xpath('//section[@class="list"]/div')
    fp = open('./58.txt', 'w', encoding='utf-8')
    for div in div_list:
        title = div.xpath('./a/div[2]//h3/text()')[0]
        fp.write(title + '\n')
    fp.close()
```
Netbian (彼岸图网) wallpapers
```python
import os

import requests
from lxml import etree

if __name__ == "__main__":
    if os.path.exists('./biantuwang') is False:
        os.mkdir('./biantuwang')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36 Edg/96.0.1054.53'
    }
    url = 'https://pic.netbian.com/4kdongman/'
    response = requests.get(url=url, headers=headers)
    page_text = response.text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    for li in li_list:
        img_src = 'https://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        # the site serves GBK, so re-encode the file name to avoid mojibake
        img_name = img_name.encode('ISO-8859-1').decode('gbk')
        img_data = requests.get(url=img_src, headers=headers).content
        img_path = './biantuwang/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        print(img_name + '下载成功!!!')
```
Air quality monitoring platform: city list
```python
import requests
from lxml import etree

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36 Edg/96.0.1054.53'
    }
    url = 'https://www.aqistudy.cn/historydata/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # hot cities and all cities both sit under div.bottom, so // matches every li
    li_list = tree.xpath('//div[@class="bottom"]/ul//li')
    all_city_names = []
    fp = open('./citys.txt', 'w', encoding='utf-8')
    for li in li_list:
        city_name = li.xpath('./a/text()')[0]
        fp.write(city_name + '\n')
        all_city_names.append(city_name)
        print(city_name)
    fp.close()
```
Chinaz (站长之家) free resume templates
```python
import os

import requests
from lxml import etree

if __name__ == "__main__":
    if os.path.exists('./jianli') is False:
        os.mkdir('./jianli')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36 Edg/96.0.1054.53'
    }
    url = 'https://sc.chinaz.com/jianli/free_{num}.html'
    for x in range(29, 30):
        # format into a new variable so the template still works on later pages
        new_url = url.format(num=x)
        page_text = requests.get(url=new_url, headers=headers).text
        tree = etree.HTML(page_text)
        div_list = tree.xpath('//div[@id="main"]/div/div')
        for div in div_list:
            a_href = 'https:' + div.xpath('./a/@href')[0]
            # re-encode to avoid mojibake in the resume title
            a_href_text = requests.get(url=a_href, headers=headers).text.encode('ISO-8859-1')
            a_tree = etree.HTML(a_href_text)
            a_src = a_tree.xpath('//div[@class="down_wrap"]/div[2]/ul/li[3]/a/@href')[0]
            a_name = a_tree.xpath('//div[@class="bgwhite"]/div/h1/text()')[0].strip() + '.rar'
            rar_page = requests.get(url=a_src, headers=headers).content
            with open('./jianli/' + a_name, 'wb') as fp:
                fp.write(rar_page)
            print(a_name + '-->下载完毕!!!')
        print("\n当前拉取到第{}页".format(x))
    print('\n全部简历拉取结束!!!')
```