Cannot use a string pattern on a bytes-like object (Python)

I'm writing a crawler in Python to list all the links on a website, but I get an error and can't see what is causing it:


Traceback (most recent call last):
  File "vul_scanner.py", line 8, in <module>
    vuln_scanner.crawl(target_url)
  File "C:\Users\Lenovo x240\Documents\website\website\spiders\scanner.py", line 18, in crawl
    href_links= self.extract_links_from(url)
  File "C:\Users\Lenovo x240\Documents\website\website\spiders\scanner.py", line 15, in extract_links_from
    return re.findall('(?:href=")(.*?)"', response.content)
  File "C:\Users\Lenovo x240\AppData\Local\Programs\Python\Python38\lib\re.py", line 241, in findall
    return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object

My code is as follows. In scanner.py:


# To ignore numpy errors:
#     pylint: disable=E1101
import urllib
import requests
import re
from urllib.parse import urljoin


class Scanner:
    def __init__(self, url):
        self.target_url = url
        self.target_links = []

    def extract_links_from(self, url):
        response = requests.get(url)
        return re.findall('(?:href=")(.*?)"', response.content)

    def crawl(self, url):
        href_links= self.extract_links_from(url)
        for link in href_links:
            link = urljoin(url, link)

            if "#" in link:
                link = link.split("#")[0]

            if self.target_url in link and link not in self.target_links:
                self.target_links.append(link)
                print(link)
                self.crawl(link)

In vul_scanner.py:


import scanner
# To ignore numpy errors:
#     pylint: disable=E1101

target_url = "https://www.amazon.com"
vuln_scanner = scanner.Scanner(target_url)
vuln_scanner.crawl(target_url)

The command I run is: python vul_scanner.py


慕桂英3389331
1 Answer

撒科打诨

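return re.findall('(?:href=")(.*?)"', response.content)

In this case response.content is bytes, while the pattern is a str, and re does not allow mixing the two. So either use response.text, which gives you decoded text that you can process the way you already planned, or, if you want to keep going down the bytes road, make the pattern a bytes pattern as well (a b'...' literal).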
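As a minimal sketch of the two options, the snippet below applies the question's pattern to a hard-coded bytes value. The `content` variable and its sample HTML are assumptions that stand in for `response.content`, so the example runs without a network request.

```python
import re

# Hypothetical stand-in for response.content (bytes), so no request is needed.
content = b'<a href="/about">About</a> <a href="https://example.com/#top">Top</a>'

# Option 1: work with text. Decoding here plays the role of response.text,
# after which the original str pattern works unchanged.
text = content.decode("utf-8")
links_from_text = re.findall(r'(?:href=")(.*?)"', text)

# Option 2: stay with bytes and make the pattern bytes too (note the b prefix).
links_from_bytes = re.findall(rb'(?:href=")(.*?)"', content)

print(links_from_text)   # list of str
print(links_from_bytes)  # list of bytes
```

In the Scanner class from the question, the text route is probably the smaller change: replacing response.content with response.text in extract_links_from keeps the links as str, which is what crawl then passes to urljoin together with a str URL. With the bytes route, each link would likely need a .decode() before that call.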