[toc]
I. Declaration
```python
from bs4 import BeautifulSoup
```
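```python
# markup: the scraped content (a string or file object); parser: a parser name such as "html.parser"
soup = BeautifulSoup(markup, parser)
```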
II. Basic Elements
1. Understanding the BeautifulSoup library
BeautifulSoup is a library for parsing, traversing, and maintaining a "tag tree".
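The three verbs above (parse, traverse, maintain) can be illustrated with a small sketch. The markup string and variable names here are made up for illustration; only `html.parser` is used, so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Parse: turn a markup string into a tag tree.
soup = BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "html.parser")

# Traverse: walk the tree and read the text of each <li>.
items = [li.string for li in soup.find_all("li")]
print(items)  # ['a', 'b']

# Maintain: edit an existing node and append a new one.
soup.ul.li.string = "A"
new_item = soup.new_tag("li")
new_item.string = "c"
soup.ul.append(new_item)
print(soup)  # <ul><li>A</li><li>b</li><li>c</li></ul>
```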
2. The BeautifulSoup class
(1) Principle
```mermaid
flowchart LR
    HTML <--> T[Tag tree]
    T <--> B[BeautifulSoup class]
```
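```python
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html>data</html>", "html.parser")  # build from a string
# Build from a file; the with-block closes the file handle afterwards
with open("D://demo.html") as f:
    soup2 = BeautifulSoup(f, "html.parser")
```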
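As a rough check of the correspondence between an HTML document, its tag tree, and a `BeautifulSoup` object, the sketch below parses a string, inspects the root node, and serializes the tree back to text. It assumes the built-in `html.parser`, which leaves this small fragment unchanged; other parsers may add wrapper tags.

```python
from bs4 import BeautifulSoup

html = "<html>data</html>"
soup = BeautifulSoup(html, "html.parser")

# The BeautifulSoup object exposes the tag tree...
root = soup.html
print(root.name, root.string)  # html data

# ...and the tree can be turned back into markup.
round_trip = str(soup)
print(round_trip == html)  # True with html.parser for this input
```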
(2) Parsers
| Parser | Usage | Requirement |
| --- | --- | --- |
| bs4's HTML parser | `BeautifulSoup(mk, "html.parser")` | Install the bs4 library |
| lxml's HTML parser | `BeautifulSoup(mk, "lxml")` | `pip install lxml` |
| lxml's XML parser | `BeautifulSoup(mk, "xml")` | `pip install lxml` |
| html5lib's parser | `BeautifulSoup(mk, "html5lib")` | `pip install html5lib` |
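Because the parsers in the table other than `html.parser` are optional installs, one possible pattern is to try them in order and fall back to the built-in one. This is only a sketch: `make_soup` and its `preferred` parameter are hypothetical names, not part of bs4. It relies on bs4 raising `FeatureNotFound` when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no parser from the preferred list is available")


soup, used = make_soup("<p>hello</p>")
print(used, soup.p.string)
```

Which parser is reported depends on what is installed locally, and different parsers may repair broken markup differently, so pinning one parser explicitly is often the safer choice for reproducible results.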
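(3) Basic elements
| Basic element | Description |
| --- | --- |
| Tag | A tag, the most basic element; opened with `<>` and closed with `</>` |
| Name | The name of the tag |
| Attributes | The tag's attributes, organized as a dictionary; accessed as `<tag>.attrs` |
| NavigableString | The non-attribute string inside a tag, i.e. the text between `<>` and `</>` |
| Comment | The comment portion of a string inside a tag |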
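The five elements in the table can be seen on a small made-up fragment. The markup, the URL, and the variable names below are illustrative assumptions; the classes `Tag`, `NavigableString`, and `Comment` come from `bs4.element`.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

markup = '<a href="http://example.com" id="link1">text</a><b><!--note--></b>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.a                       # Tag
print(isinstance(tag, Tag))        # True
print(tag.name)                    # Name: a
print(tag.attrs)                   # Attributes, as a dict
print(tag["href"])                 # a single attribute value
print(type(tag.string).__name__)   # NavigableString

comment = soup.b.string            # the only child of <b> is a comment
print(type(comment).__name__)      # Comment
```

Note that `Comment` is a subclass of `NavigableString`, so checking `type(...)` or `isinstance(..., Comment)` is needed to tell ordinary text from comments.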
III. Usage
1. Loading
(1) Building from a string
```python
html = '''
<html lang="zh-cn">
<head>
<meta charset="utf-8" />
</head>
<body>
<div id="main">
<span role="heading" aria-level="2">span</span>
<h1>h1</h1>
<p>p</p>
</div>
</body>
</html>
'''
```
```python
# The parser name is 'html.parser' (with a dot); 'html_parser' raises FeatureNotFound
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
```
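Once the string is parsed, the tree can be queried directly. The sketch below repeats the sample document so that it runs on its own, then reads a few values from it; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '''
<html lang="zh-cn">
<head>
<meta charset="utf-8" />
</head>
<body>
<div id="main">
<span role="heading" aria-level="2">span</span>
<h1>h1</h1>
<p>p</p>
</div>
</body>
</html>
'''

soup = BeautifulSoup(html, "html.parser")

lang = soup.html["lang"]                 # attribute of the root tag
div_id = soup.div["id"]                  # attribute of the first <div>
span_attrs = soup.span.attrs             # all attributes of <span> as a dict
heading = soup.h1.string                 # text inside <h1>
child_tags = [t.name for t in soup.div.find_all(True)]  # tags inside the <div>

print(lang, div_id, span_attrs, heading, child_tags)
```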
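(2) Loading from a file
```python
# '测试.html' ("test.html") is a local file; the class name is BeautifulSoup and the parser is 'html.parser'
with open('测试.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
```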
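To try file loading without an existing HTML file, a self-contained sketch can first write a small document to a temporary directory and then load it. The file name `demo.html` and its contents are assumptions made for the example.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp:
    # Write a small UTF-8 document to a throwaway file.
    path = Path(tmp) / "demo.html"
    path.write_text("<html><body><p>café</p></body></html>", encoding="utf-8")

    # Open it with the same encoding and hand the file object to BeautifulSoup.
    with path.open(encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

# The tree stays usable after the file is closed and removed.
text = soup.p.string
print(text)
```

Passing `encoding='utf-8'` to `open` avoids depending on the platform's default encoding, which matters for files containing non-ASCII text.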