Web Scraping: Using BeautifulSoup

Posted on 2018-01-28 | Category: Python3

Installing BeautifulSoup today and getting the related code to run took me quite a while.

Problem: installing BeautifulSoup

In cmd.exe, enter:

pip install beautifulsoup4
pip install lxml

Problem: adding the bs4 module in the IDE

File -> Settings... -> Project Interpreter (on the right) -> pip (double-click) -> type bs4 -> click Install at the bottom

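Problem: running the code fails with cannot import name 'HTMLParseError'

Open the file Python35\Lib\site-packages\bs4\builder\_htmlparser.py and comment out the HTMLParseError import. Python 3.5 removed HTMLParseError from html.parser, so upgrading with pip install --upgrade beautifulsoup4 should also avoid the error without editing installed files.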

Importing the bs4 library:

from bs4 import BeautifulSoup     # the capitalization matters

Basic usage:

(1) Create a BeautifulSoup object
        soup=BeautifulSoup(html,'lxml')
(2) Print the contents of the soup object
        print(soup.prettify())
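
The two steps above can be sketched end to end as follows. This is only a minimal sketch: the html string is a made-up sample, and it uses the built-in "html.parser" instead of 'lxml' so that it runs without extra packages (the assumption is that either parser behaves the same on markup this simple).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p class='intro'>Hello</p></body></html>"

# "html.parser" ships with Python; pass "lxml" instead if it is installed.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parsed tree as a string, one tag or string per line,
# indented by nesting depth.
pretty = soup.prettify()
print(pretty)

# Tags can be reached by name as attributes of the soup object.
title_text = soup.title.string
print(title_text)
```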

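The four kinds of objects:

Tag: name is the tag's name, and attrs usually holds a tag's class, id, and other attributes

NavigableString: the text inside a tag

BeautifulSoup: represents the entire content of the document

Comment: a comment in the markup; if type() of the result prints as Comment, the text is a comment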
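
A small sketch that shows the four kinds of objects side by side. The markup is a made-up sample, and "html.parser" is used here only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, used only for illustration.
html = '<p class="note" id="first">plain text</p><b><!-- a comment --></b>'
soup = BeautifulSoup(html, "html.parser")

# Tag: has a name and a dict of attributes.
p = soup.p
print(type(p).__name__, p.name, p.attrs)

# NavigableString: the text inside a tag.
text = p.string
print(type(text).__name__, repr(text))

# BeautifulSoup: the whole document; its name is the special "[document]".
print(type(soup).__name__, soup.name)

# Comment: a NavigableString subclass, so check the type to tell them apart.
comment = soup.b.string
print(type(comment).__name__, repr(comment))
```

Because a Comment is also a NavigableString, checking isinstance(x, Comment) before using .string as text avoids treating a comment as page content.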

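Traversing the document tree:

(1) The .string attribute:
        If a tag has exactly one NavigableString child,
        .string returns that child's content;
        if it has more than one child, .string returns None.
(2) The .strings attribute:
        Gets all the text inside a tag, returned as a generator.
        A common idiom is:
        list(soup.div.strings)
(3) Direct children:
        .contents (a list of the tag's direct children only, not deeper generations)
        .children (an iterator over the same direct children)
(4) All descendants:
        .descendants iterates recursively over every child, grandchild, and so on
(5) Parent nodes:
        .parent only the immediate parent
        .parents all ancestors above the node
(6) Sibling nodes:
        .next_sibling the next sibling
        .previous_sibling the previous sibling
(7) Adjacent elements:
        .next_element the element parsed immediately after this one
        .previous_element the element parsed immediately before this one

        .next_elements all following elements
        .previous_elements all preceding elements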
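
The traversal attributes in (1) to (7) can be tried on a tiny tree. This is a sketch on made-up markup, parsed with "html.parser" to keep it dependency-free; the markup has no whitespace between tags, because whitespace would show up as extra string nodes among the siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, used only for illustration.
html = "<div><p>one</p><p>two <b>bold</b></p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents is a list of direct children; .children iterates over the same nodes.
first, second, third = div.contents
child_names = [child.name for child in div.children]

# .string works when there is a single string child, otherwise it is None.
single = first.string        # one string child
multiple = second.string     # a string plus a <b> tag, so None

# .strings yields every piece of text under the tag.
all_text = list(div.strings)

# .descendants walks children, grandchildren, and so on (tags and strings).
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]

# .parent is one level up; .parents walks up to the document itself.
bold = second.b
ancestors = [p.name for p in bold.parents]

# Siblings share a parent; elements follow parse order regardless of depth.
next_sib = first.next_sibling
prev_elem = third.previous_element
following = list(bold.next_elements)

print(child_names, single, multiple, all_text)
print(descendant_tags, ancestors, prev_elem, following)
```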


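(8) Searching the document tree:
    find_all()  searches all descendants of the current tag
    soup.find_all()
    soup.find('div',class_='all').find_all('a')
(9) Searching with regular expressions:
    import re
    soup.find_all(re.compile('.*p'))    tags whose names contain p

    Searching by a list of tags
    soup.find_all(['p','div'])       finds every p tag and every div tag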
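
The search calls in (8) and (9) can be compared on one document. This is a sketch on made-up markup, parsed with "html.parser"; the pattern "^p" is my own choice for the illustration and matches tag names that start with p.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<div class="all"><a href="/a">A</a><a href="/b">B</a></div>
<div class="other"><a href="/c">C</a></div>
<p>para</p><pre>code</pre><span>s</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <a> tag anywhere in the document.
all_links = soup.find_all("a")

# Narrow the search: first find one <div class="all">, then search inside it.
scoped_links = soup.find("div", class_="all").find_all("a")
scoped_hrefs = [a["href"] for a in scoped_links]

# A compiled regex is matched against tag names; "^p" keeps p and pre, not span.
p_like = [tag.name for tag in soup.find_all(re.compile("^p"))]

# A list matches any of the given tag names, in document order.
p_or_div = [tag.name for tag in soup.find_all(["p", "div"])]

print(len(all_links), scoped_hrefs, p_like, p_or_div)
```

Note that find() returns None when nothing matches, so chaining .find_all() onto it is only safe when the first lookup is known to succeed.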
