BeautifulSoup: A Python Web-Scraping Tool

Table of Contents

1. BeautifulSoup Overview
- 1.1. Installation
2. Kinds of Objects
- 2.1. BeautifulSoup
- 2.2. NavigableString (strings)
- 2.3. Comment
- 2.4. Tag
  - 2.4.1. Getting a tag's name
  - 2.4.2. Getting a tag's attributes
  - 2.4.3. Getting a tag's content
    - 2.4.3.1. tag.string
    - 2.4.3.2. tag.strings
    - 2.4.3.3. tag.text
    - 2.4.3.4. tag.stripped_strings
  - 2.4.4. Nested selection
  - 2.4.5. Children and descendants
  - 2.4.6. Parent and ancestors
  - 2.4.7. Siblings
3. Searching the Document Tree
- 3.1. find_all (find multiple)
  - 3.1.1. The name parameter
    - 3.1.1.1. String (search by tag name)
    - 3.1.1.2. Regular expression
    - 3.1.1.3. List
    - 3.1.1.4. Function
    - 3.1.1.5. True
  - 3.1.2. Keyword arguments (search by attribute value)
  - 3.1.3. The string parameter (search by text content)
  - 3.1.4. The limit parameter
  - 3.1.5. The recursive parameter
- 3.2. find (find one)
- 3.3. find_parents() and find_parent()
- 3.4. find_next_siblings() and find_next_sibling()
- 3.5. find_all_next() and find_next()

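1. BeautifulSoup Overview

In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. It parses HTML or XML that you have already downloaded; it does not fetch pages itself. The official description reads roughly as follows:
Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.
It is a toolkit that parses a document and hands you the data you want to scrape; because it is simple, a complete application takes very little code.
References: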
https://developer.aliyun.com/article/1632482
https://www.cnblogs.com/banchengyanyu/articles/18122650

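1.1. Installation

pip install beautifulsoup4

Many examples below use the lxml parser, which is a separate package (pip install lxml). Python's built-in 'html.parser' needs no extra install.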

2. Kinds of Objects

Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:
BeautifulSoup, NavigableString, Comment, and Tag.

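2.1. BeautifulSoup

The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a special Tag; below we inspect the type of its name, the name itself, and its attributes.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
print(type(soup.name))
# <class 'str'>
print(soup.name)
# [document]
print(soup.attrs)
# {} (an empty dict)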

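2.2. NavigableString (strings)

Strings usually sit inside a Tag. Beautiful Soup wraps the text inside a Tag in the NavigableString class.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.string)
# Extremely bold
print(type(tag.string))
# <class 'bs4.element.NavigableString'>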

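2.3. Comment

If the string is an HTML comment, it is represented as a Comment, a special subclass of NavigableString.

from bs4 import BeautifulSoup

html_doc = '<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>'

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.a.string)        # " Elsie " (the comment text, without the <!-- --> markers)
print(type(soup.a.string))  # <class 'bs4.element.Comment'>

The content of the a tag is really a comment, yet .string prints it with the comment markers stripped, so it can be mistaken for ordinary text. Checking the type of the value before using it avoids that surprise.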
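
One way to guard against this is to test the value with isinstance before treating it as text. The sketch below is only an illustration: the helper name visible_text and the sample markup are made up for this example and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html_doc = '<p><a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")


def visible_text(tag):
    """Hypothetical helper: return tag.string unless it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


# Map each link id to its visible text (None for the commented-out one).
results = {a["id"]: visible_text(a) for a in soup.find_all("a")}
print(results)  # {'link1': None, 'link2': 'Lacie'}
```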

2.4. Tag

Put simply, a Tag is one HTML element. A Tag object corresponds to a tag in the original XML or HTML document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
print(type(tag))			# <class 'bs4.element.Tag'>

A Tag has many methods and attributes. The two most important attributes are its name and its attributes (attrs).

2.4.1. Getting a tag's name

The tag.name attribute gives the name of the current tag.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
print(tag.name)				# b

2.4.2. Getting a tag's attributes

The tag.attrs attribute gives the current tag's attributes as a dictionary.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
print(tag.attrs)			# {'class': ['boldest']}
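
Besides reading the whole attrs dictionary, a single attribute can be read with dictionary-style access. The following is a small sketch with made-up markup; it also shows that class is multi-valued and therefore comes back as a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest big" id="x">Extremely bold</b>', "html.parser")
tag = soup.b

classes = tag["class"]      # multi-valued attribute: a list of class names
tag_id = tag["id"]          # single-valued attribute: a plain string
missing = tag.get("href")   # .get() returns None instead of raising KeyError
has_id = tag.has_attr("id")

print(classes, tag_id, missing, has_id)  # ['boldest', 'big'] x None True
```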

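2.4.3. Getting a tag's content

2.4.3.1. tag.string

The tag.string attribute returns the text inside the current tag.
It only gives a result when the tag contains a single string (or a single child tag that itself contains a single string); if the tag has several children, tag.string is None.

# The tag contains exactly one string, so .string returns it.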
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
print(tag.string)			# Extremely bold
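
The sketch below contrasts the two cases, using made-up markup: a tag with a single chain of children, where .string works, and a tag with mixed content, where .string is None and get_text() is the safer choice.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p><b>only child</b></p><p>one <i>two</i></p></div>", "html.parser"
)
first_p, second_p = soup.find_all("p")

single = first_p.string      # <p> has one child <b>, which has one string
multiple = second_p.string   # <p> has two children: "one " and <i>
joined = second_p.get_text() # concatenates every string under the tag

print(repr(single), multiple, repr(joined))  # 'only child' None 'one two'
```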

2.4.3.2. tag.strings

The tag.strings attribute yields every string nested inside the current tag. It is a generator, not a method, so it is read without parentheses.

soup = BeautifulSoup('<div><b class="boldest">Extremely bold1;</b><b class="boldest">Extremely bold2.</b></div>', 'lxml')
tag = soup.div
print(tag.strings)			# <generator object Tag._all_strings at 0x0000015C50110BA0>
print(list(tag.strings))	# ['Extremely bold1;', 'Extremely bold2.']

2.4.3.3. tag.text

The tag.text attribute collects all the strings inside the current tag and joins them into a single string.

soup = BeautifulSoup('<div><b class="boldest">Extremely bold1;</b><b class="boldest">Extremely bold2.</b></div>', 'lxml')
tag = soup.div
print(tag.text)				# Extremely bold1;Extremely bold2.

2.4.3.4. tag.stripped_strings

The tag.stripped_strings attribute yields the same strings as tag.strings, but with leading and trailing whitespace removed and whitespace-only strings skipped.
It is also a generator. For example, loop over the text inside the tag:

soup = BeautifulSoup('<div><b class="boldest">Extremely bold1;</b><b class="boldest">Extremely bold2.</b></div>', 'lxml')
tag = soup.div
for line in tag.stripped_strings:
    print(line)
# Extremely bold1;
# Extremely bold2.
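
The markup above contains no extra whitespace, so the stripping is not visible. A minimal sketch with indented markup (invented for this illustration) makes the difference between .strings and .stripped_strings clearer.

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
    <b>  Extremely bold1;  </b>
    <b>Extremely bold2.</b>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.div

raw = list(tag.strings)             # includes the newlines and indentation
clean = list(tag.stripped_strings)  # whitespace-only strings are dropped

print(raw)
print(clean)  # ['Extremely bold1;', 'Extremely bold2.']
```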

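2.4.4. Nested selection

Nested selection walks from a parent to a child by chaining tag names as attributes; each step returns the first matching tag.
In the example below, the text attribute then reads that tag's text.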

soup = BeautifulSoup('<div><b class="boldest">Extremely bold1;</b><b class="boldest">Extremely bold2.</b></div>', 'lxml')
print(soup.div.b.text)		# Extremely bold1;

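2.4.5. Children and descendants

In BeautifulSoup, a tag's child nodes are available through the .contents and .children attributes.
.contents returns a list of all direct children,
.children returns an iterator over the direct children,
.descendants returns an iterator over all descendants (children, grandchildren, and so on).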

soup = BeautifulSoup('<div><b class="boldest">Extremely bold1;</b><b class="boldest">Extremely bold2.</b></div>', 'lxml')
tag = soup.div

print("All children of div")
print(type(tag.contents), tag.contents)
# All children of div
# <class 'list'> [<b class="boldest">Extremely bold1;</b>, <b class="boldest">Extremely bold2.</b>]

print("An iterator over the direct children of div")
print(type(tag.children), tag.children)
for child in tag.children:
    print(type(child), child)
# An iterator over the direct children of div
# <class 'generator'> <generator object Tag.children.<locals>.<genexpr> at 0x0000026B8DA00C80>
# <class 'bs4.element.Tag'> <b class="boldest">Extremely bold1;</b>
# <class 'bs4.element.Tag'> <b class="boldest">Extremely bold2.</b>

print("An iterator over all descendants of div")
print(type(tag.descendants), tag.descendants)
for child in tag.descendants:
    print(type(child), child)
# An iterator over all descendants of div
# <class 'generator'> <generator object Tag.descendants at 0x0000026B8DA00C80>
# <class 'bs4.element.Tag'> <b class="boldest">Extremely bold1;</b>
# <class 'bs4.element.NavigableString'> Extremely bold1;
# <class 'bs4.element.Tag'> <b class="boldest">Extremely bold2.</b>
# <class 'bs4.element.NavigableString'> Extremely bold2.

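2.4.6. Parent and ancestors

.parent returns the tag's parent node,
.parents iterates over all of the tag's ancestors, starting with the direct parent and ending at the BeautifulSoup object at the top of the tree.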

soup = BeautifulSoup('<div><b class="boldest">Extremely bold1;</b><b class="boldest">Extremely bold2.</b></div>', 'lxml')
tag = soup.div.b 
print(type(tag.parent), tag.parent)		# <class 'bs4.element.Tag'> <div><b class="boldest">Extremely bold1;</b><b class="boldest">Extremely bold2.</b></div>

print(type(tag.parents), tag.parents)	# <class 'generator'> <generator object PageElement.parents at 0x00000255E6380C10>
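
Printing the generator itself shows little, so here is a short sketch that walks it. It uses 'html.parser' (which does not add <html> or <body> wrappers) and invented markup, so the ancestor chain stays short.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>bold</b></p></div>", "html.parser")
tag = soup.b

# .parents starts at the direct parent and ends at the BeautifulSoup object.
ancestor_names = [parent.name for parent in tag.parents]

print(tag.parent.name)   # p
print(ancestor_names)    # ['p', 'div', '[document]']
```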

2.4.7. Siblings

.next_sibling returns the next sibling node,
.previous_sibling returns the previous sibling node,
.next_siblings returns a generator over all following siblings.

soup = BeautifulSoup('<div><b class="boldest">Extremely bold1;</b><b class="boldest">Extremely bold2.</b></div>', 'lxml')
tag = soup.div.b 
print(type(tag.next_sibling), tag.next_sibling)			# <class 'bs4.element.Tag'> <b class="boldest">Extremely bold2.</b>
print(type(tag.next_siblings), tag.next_siblings)		# <class 'generator'> <generator object PageElement.next_siblings at 0x000001E0E2350BA0>
print(type(tag.previous_sibling), tag.previous_sibling)	# <class 'NoneType'> None
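
In the example above the two <b> tags touch each other. In real pages there is usually whitespace between tags, and that whitespace is a sibling node too. The sketch below, with invented markup, shows the effect and one possible way to keep only Tag siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><b>one</b>\n<b>two</b></div>", "html.parser")
first = soup.b

nxt = first.next_sibling  # the "\n" string between the tags, not the second <b>
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

print(repr(nxt))      # '\n'
print(tag_siblings)   # [<b>two</b>]
```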

3. 文档树搜索

recursive 是否从当前位置递归往下查询，如果不递归，只会查询当前 soup 文档的子元素
string 这里是通过 tag 的内容来搜索，并且返回的是类容，而不是 tag 类型的元素
**kwargs 自动拆包接受属性值，所以才会有 soup.find_all(‘a’,id=‘title’) ，id='title’为 **kwargs 自动拆包掺入
BeautifulSoup 定义了很多搜索方法，这里着重介绍2个：find() 和 find_all()

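3.1. find_all (find multiple)

Syntax:

find_all(name, attrs, recursive, string, limit, **kwargs)

name: the tag name to look for; may be a string, a regular expression, a list, a function, or True.
attrs: a dictionary of attribute filters, mapping attribute names to the values to match.
recursive: whether to search all descendants; defaults to True.
string: the text content to look for; may be a string, a regular expression, a list, a function, or True.
limit: the maximum number of results to return.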

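3.1.1. The name parameter

name accepts five kinds of filter: a string, a regular expression, a list, a function, or True.

3.1.1.1. String (search by tag name)

Pass in a tag name.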

from bs4 import BeautifulSoup

html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="first_p"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

soup = BeautifulSoup(html_doc, 'lxml')

#  `soup.find_all(name='a')` returns every `<a>` tag.
tags = soup.find_all(name='a')
print(type(tags), tags)			# <class 'bs4.element.ResultSet'>  [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
for tag in tags:
    print(tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

3.1.1.2. Regular expression

A regular expression can be used to match tag names.

# Find tags whose names start with "b"; this matches both <body> and <b>
import re
tags = soup.find_all(name=re.compile('^b'))
print(type(tags), tags)

# <class 'bs4.element.ResultSet'> [<body>
# <p class="title" name="first_p"><b>The Dormouse's story</b></p>
# <p class="story">Once upon a time there were three little sisters; and their names were
#     <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#     <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#     <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#     and they lived at the bottom of a well.</p>
# <p class="story">...</p>
# </body>, <b>The Dormouse's story</b>]

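3.1.1.3. List

If you pass a list, Beautiful Soup returns everything that matches any item in the list.
For example, soup.find_all(name=['a', 'b']) returns all <a> tags and <b> tags in the document.

#  `soup.find_all(name=['a', 'b'])` returns all `<a>` and `<b>` tags in the document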
tags = soup.find_all(name=['a', 'b'])
print(type(tags), tags)		# <class 'bs4.element.ResultSet'> [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

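3.1.1.4. Function

If none of the other filters fits, you can define a function to match elements.
The function takes a single tag as its argument and returns True if that tag matches, False otherwise.

# Return only tags that have a class attribute but no id attribute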
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

tags = soup.find_all(name=has_class_but_no_id)
print(type(tags), tags)
# <class 'bs4.element.ResultSet'> [<p class="title" name="first_p"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
#     <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#     <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#     <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#     and they lived at the bottom of a well.</p>, <p class="story">...</p>]

3.1.1.5. True

find_all(True) matches every tag but returns no string nodes.
The loop below prints the name (tag.name) of each matched tag.

tags = soup.find_all(True)
for tag in tags:
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

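3.1.2. Keyword arguments (search by attribute value)

Keyword arguments search by attribute value.
If a named argument is not one of the built-in parameter names, Beautiful Soup treats it as a filter on the attribute with that name.
For example, passing href filters on each tag's href attribute.
Exact attribute value:
soup.find_all(href="http://example.com/tillie") returns all tags whose href equals "http://example.com/tillie".
Regular expression on the attribute value:
soup.find_all(href=re.compile("^http://")) returns all tags whose href starts with "http://".
Multiple attributes:
soup.find_all(href=re.compile("http://"), id='link1') returns tags whose href contains "http://" and whose id equals "link1".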

# Return all tags whose `href` attribute equals "http://example.com/tillie".
tags = soup.find_all(href="http://example.com/tillie")
print(type(tags), tags)			# <class 'bs4.element.ResultSet'> [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Return all tags whose `href` attribute starts with "http://".
tags = soup.find_all(href=re.compile("^http://"))
print(type(tags), tags)			# <class 'bs4.element.ResultSet'> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Tags that have an id attribute, whatever its value
tags = soup.find_all(id=True)
print(type(tags), tags)			# <class 'bs4.element.ResultSet'> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Multiple attributes: all filters must match
tags = soup.find_all(href=re.compile("http://"), id='link1')
print(type(tags), tags)			# <class 'bs4.element.ResultSet'> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# Note: class is a Python keyword, so the class attribute is filtered with class_
tags = soup.find_all("a", class_="sister")
print(type(tags), tags)			# <class 'bs4.element.ResultSet'> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

tags = soup.find_all("a", attrs={"href": re.compile("^http://"), "id": re.compile("^link[12]")})
print(type(tags), tags)			# <class 'bs4.element.ResultSet'> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# Attributes whose names cannot be written as Python keyword arguments (such as data-*) can be searched through the attrs dictionary:
tags = soup.find_all(attrs={"data-foo": "value"})
print(type(tags), tags)			# <class 'bs4.element.ResultSet'> []
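
The search above returns an empty list only because the sample document has no data-foo attribute. Here is a small sketch with made-up markup where the attribute exists. It also covers the HTML attribute called name, which has to go through attrs because the name argument of find_all() already means the tag name.

```python
from bs4 import BeautifulSoup

html_doc = """
<p name="first_p" data-foo="value">first</p>
<p data-foo="other">second</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# data-foo is not a valid Python identifier, so it cannot be a keyword argument.
by_data = soup.find_all(attrs={"data-foo": "value"})

# name=... would filter on the tag name, so the attribute goes through attrs.
by_name_attr = soup.find_all(attrs={"name": "first_p"})

print([p.get_text() for p in by_data])       # ['first']
print([p.get_text() for p in by_name_attr])  # ['first']
```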

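3.1.3. The string parameter (search by text content)

The string parameter (called text in older versions) searches by text content. It accepts a string, a list, or a regular expression. Used on its own, it returns the matching strings (NavigableString objects), not the tags that contain them.
String:
Matches strings that are exactly equal to the given text.
For example, soup.find_all(string="Elsie") returns every string equal to "Elsie".
List:
Matches strings equal to any item in the list.
For example, soup.find_all(string=["Tillie", "Elsie", "Lacie"]) returns every string equal to "Tillie", "Elsie", or "Lacie".
Regular expression:
Matches strings in which the regular expression finds a match.
For example, soup.find_all(string=re.compile("Dormouse")) returns every string containing "Dormouse".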

# Return every string equal to "Elsie" (strings, not tags)
tags = soup.find_all(string="Elsie")
print(type(tags), tags)			# <class 'bs4.element.ResultSet'> ['Elsie']

# Return every string equal to "Tillie", "Elsie", or "Lacie".
tags = soup.find_all(string=["Tillie", "Elsie", "Lacie"])
print(type(tags), tags)			# <class 'bs4.element.ResultSet'> ['Elsie', 'Lacie', 'Tillie']

# Return every string that contains "Dormouse".
# A partial match is enough here because a regular expression is used.
tags = soup.find_all(string=re.compile("Dormouse"))
print(type(tags), tags)			# <class 'bs4.element.ResultSet'> ["The Dormouse's story", "The Dormouse's story"]
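
When the tags themselves are wanted rather than the strings, two approaches are common: combine string with a tag name, or step from each matching string to its .parent. The following sketch uses a shortened version of the sample document.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

strings_only = soup.find_all(string="Elsie")   # NavigableString objects
tags = soup.find_all("a", string="Elsie")      # <a> tags whose .string is "Elsie"
via_parent = [s.parent for s in strings_only]  # the tag wrapping each string

print(strings_only)                # ['Elsie']
print([t["id"] for t in tags])     # ['link1']
```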

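3.1.4. The limit parameter

find_all() returns every match, which can be slow on a large document tree.
If you don't need all the results, use limit to cap how many are returned; it works like the LIMIT keyword in SQL. Once the number of matches reaches limit, the search stops and the results are returned.
For example, soup.find_all("a", limit=2) returns the first two <a> tags.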

tags = soup.find_all("a")
print(type(tags), len(tags), tags)	# <class 'bs4.element.ResultSet'> 3 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

tags = soup.find_all("a", limit=2)
print(type(tags), len(tags), tags)	# <class 'bs4.element.ResultSet'> 2 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

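3.1.5. The recursive parameter

recursive controls whether the search descends recursively.
By default, Beautiful Soup examines all descendants of the current tag. To search only the tag's direct children, set recursive=False.
For example, soup.body.find_all("div", recursive=False) looks for <div> tags only among the direct children of <body>.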

from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <div>
    <p class="title" name="first_p"><b>The Dormouse's story</b></p>
    </div>
    <div>
        <div>
        ...
        </div>
    </div>
</body>
</html>
    """

soup = BeautifulSoup(html_doc, 'lxml')

# Default (recursive=True): find every <div> below <body>, at any depth.
# print(soup.body)
tags = soup.body.find_all("div")
print(type(tags), len(tags))
print(tags)
# <class 'bs4.element.ResultSet'> 3 
# [<div>
# <p class="title" name="first_p"><b>The Dormouse's story</b></p>
# </div>, <div>
# <div>
#         ...
#         </div>
# </div>, <div>
#         ...
#         </div>]

# recursive=False: only <div> tags that are direct children of <body>.
tags = soup.body.find_all("div", recursive=False)
print(type(tags), len(tags))
print(tags)
# <class 'bs4.element.ResultSet'> 2 
# [<div>
# <p class="title" name="first_p"><b>The Dormouse's story</b></p>
# </div>, <div>
# <div>
#         ...
#         </div>
# </div>]

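3.2. find (find one)

find() searches the document for tags that satisfy the conditions and returns the first match.
It takes name, attrs, recursive, string, and keyword arguments to filter the results.

find(name, attrs, recursive, string, **kwargs)

find_all() returns a list; find() returns the element itself.
find_all() returns every matching tag, but sometimes only one result is wanted.
For example, when the document has only one <title> tag,
find_all() is an awkward way to find it;
calling find() directly is better than calling find_all() with limit=1.
The two calls below find the same tag. The difference is that find_all() wraps it in a ResultSet, while find() returns the Tag itself (or None when nothing matches):

tags = soup.find_all('title', limit=1)
print(type(tags), len(tags))
# <class 'bs4.element.ResultSet'> 1

tag = soup.find('title')
print(type(tag), tag)
# <class 'bs4.element.Tag'> <title>The Dormouse's story</title>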
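
Because find() returns None when nothing matches, chaining straight onto its result can raise AttributeError. The sketch below shows one defensive pattern; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>", "html.parser"
)

title = soup.find("title")
missing = soup.find("table")          # no <table> here, so this is None
missing_all = soup.find_all("table")  # find_all() gives an empty list instead

# Guard against None before reading attributes or text.
title_text = title.get_text() if title is not None else ""
table_text = missing.get_text() if missing is not None else ""

print(repr(title_text), repr(table_text), missing_all)
```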

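3.3. find_parents() and find_parent()

find_parents() and find_parent() search upward through the ancestors of the current tag.
find_parents():
Returns all matching ancestors as a list (a ResultSet), nearest first.
Accepts the same filter arguments as find_all() to narrow down the ancestors.
find_parent():
Returns the first (nearest) matching ancestor.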
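
A minimal sketch of both methods follows, using invented markup with two nested <div> tags so that the ordering of the results is visible.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="outer"><div id="inner"><p><b>bold</b></p></div></div>', "html.parser"
)
tag = soup.b

nearest_div = tag.find_parent("div")  # closest enclosing <div>
all_divs = tag.find_parents("div")    # every enclosing <div>, nearest first
ids = [d["id"] for d in all_divs]

print(nearest_div["id"])  # inner
print(ids)                # ['inner', 'outer']
```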

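3.4. find_next_siblings() and find_next_sibling()

find_next_siblings() and find_next_sibling() search the siblings that come after the current tag.
find_next_siblings():
Returns all matching following siblings as a list.
Accepts filter arguments to narrow down the siblings.
find_next_sibling():
Returns the first matching following sibling.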
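
The sketch below applies both methods to a shortened version of the sample links. Unlike .next_sibling, these methods skip the text nodes between the tags when a tag name is given.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.find("a", id="link1")

plain_next = first.next_sibling             # the ", " text node
next_link = first.find_next_sibling("a")    # the next <a>, skipping the text
later_ids = [a["id"] for a in first.find_next_siblings("a")]

print(repr(plain_next))   # ', '
print(next_link["id"])    # link2
print(later_ids)          # ['link2', 'link3']
```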

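3.5. find_all_next() and find_next()

find_all_next() and find_next() search everything that comes after the current tag in document order, both tags and strings.
find_all_next():
Returns all matching tags and strings that follow, as a list (a ResultSet).
Accepts filter arguments to narrow down the results.
find_next():
Returns the first matching tag or string that follows.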
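
Here is a short sketch with invented markup. Note that "after the current tag" follows parse order, so the tag's own text counts as coming after it, and the search is not limited to siblings.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<div><p id="intro"><b>Title</b></p></div>'
    '<p id="story">Once <a id="link1">Elsie</a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
start = soup.b

following_p = start.find_next("p")                     # first <p> after <b>
following_strings = start.find_all_next(string=True)   # every later string
following_tag_names = [t.name for t in start.find_all_next(True)]

print(following_p["id"])     # story
print(following_strings)     # ['Title', 'Once ', 'Elsie']
print(following_tag_names)   # ['p', 'a']
```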