Publish Date: 2019-06-06

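The previous post gave a rough introduction to Selector, but in real-world development we almost never need to create a Selector object by hand.

The first time the selector attribute of a Response object is accessed, the Response automatically creates a Selector object internally and caches it.

A comparison

In part 4, we built a Selector from an HtmlResponse object, like this: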

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

text = """
<ul>
    <li>Python</li>
    <li>Java</li>
    <li>JavaScript</li>
</ul>
"""
response = HtmlResponse(url="baidu.com", body=text, encoding="utf8")
selector = Selector(response=response)
print(selector)
# <Selector xpath=None data='<html><body><ul>\n    <li>Python</li>\n   '>

Next, instead of creating a Selector ourselves, we use the Selector built into the Response:

from scrapy.http import HtmlResponse

text = """
<ul>
    <li>Python</li>
    <li>Java</li>
    <li>JavaScript</li>
</ul>
"""
response = HtmlResponse(url="baidu.com", body=text, encoding="utf8")
selector = response.selector
print(selector)
# <Selector xpath=None data='<html><body><ul>\n    <li>Python</li>\n   '>

The output is identical. In other words, the Response creates the Selector object internally on its own.

Let's trace the source code:

① In PyCharm, hold Ctrl and click HtmlResponse in the code above to jump to its definition. You will see:

class HtmlResponse(TextResponse):
    pass

② HtmlResponse simply inherits from TextResponse. Following that class in turn, we see:

class TextResponse(Response):

    def __init__(self, *args, **kwargs):
        ...
        self._cached_selector = None
        ...
    @property
    def selector(self):
        from scrapy.selector import Selector
        if self._cached_selector is None:
            self._cached_selector = Selector(self)
        return self._cached_selector

(Excerpt only.)
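The create-on-first-access-then-cache pattern in this excerpt can be tried out without Scrapy. The sketch below is not Scrapy code: FakeSelector and LazyResponse are hypothetical stand-ins that keep only the caching logic, with a counter added to show how many selectors get built.

```python
class FakeSelector:
    """Hypothetical stand-in for scrapy.selector.Selector."""

    created = 0  # counts how many instances have been built

    def __init__(self, response):
        FakeSelector.created += 1
        self.response = response


class LazyResponse:
    """Hypothetical stand-in for TextResponse, keeping only the caching logic."""

    def __init__(self, body):
        self.body = body
        self._cached_selector = None  # nothing is built yet

    @property
    def selector(self):
        # Build the selector on first access, then reuse it.
        if self._cached_selector is None:
            self._cached_selector = FakeSelector(self)
        return self._cached_selector


response = LazyResponse("<ul><li>Python</li></ul>")
created_before = FakeSelector.created  # no access yet
first = response.selector
second = response.selector
created_after = FakeSelector.created   # built once, then served from the cache

print(created_before, created_after, first is second)
```

Because the selector is only built on demand, a response whose selector is never touched pays no parsing cost for it.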

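An XPath example

As the source shows, the Response creates the Selector by passing itself as the argument.
This means we can take the Selector built into the Response and call its xpath and css methods, as follows: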

from scrapy.http import HtmlResponse

text = """
<ul>
    <li>Python</li>
    <li>Java</li>
    <li>JavaScript</li>
</ul>
"""
response = HtmlResponse(url="baidu.com", body=text, encoding="utf8")
selector = response.selector
# Raw string so that \w reaches the regex engine unchanged.
res = selector.xpath(".//li/text()").re(r"J\w+")
print(res)
# ['Java', 'JavaScript']

Another way

For convenience, the Response object also provides its own xpath and css methods, which simply call the xpath and css methods of the built-in Selector. For example:

response = HtmlResponse(url="baidu.com", body=text, encoding="utf8")
res = response.xpath(".//li/text()").re(r"J\w+")
print(res)
# ['Java', 'JavaScript']

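(The import and the definition of text are omitted above; they are the same as in the previous example.)
Let's trace the source once more.
In TextResponse, the xpath and css methods are defined as follows: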

def xpath(self, query, **kwargs):
    return self.selector.xpath(query, **kwargs)

def css(self, query):
    return self.selector.css(query)

Tip: CSS and XPath are both selector languages used to pick out the data to extract.

Next, we will look at these two selectors in turn.

Reprint policy

"scrapy-5 | Response's Built-in Selector" by 梦否 is licensed under a Creative Commons Attribution 4.0 International License.
