Getting 'urllib.error.HTTPError' from urlReq(url)

Hey guys, what's up? :)

I'm trying to scrape a website using some URL parameters. If I use url1, url2, or url3, it works properly and prints the regular output I want (HTML) ->


import bs4
from urllib.request import urlopen as urlReq
from bs4 import BeautifulSoup as soup

# create urls
url1 = 'https://en.titolo.ch/sale'
url2 = 'https://en.titolo.ch/sale?limit=108'
url3 = 'https://en.titolo.ch/sale?category_styles=29838_21212'
url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'

# opening up connection on each url, grabbing the page
uClient = urlReq(url4)
page_html = uClient.read()
uClient.close()

# parsing the downloaded html
page_soup = soup(page_html, "html.parser")

# print the html
print(page_soup.body.prettify())

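-> But when I try "url4", url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108', it gives me the error below. What am I doing wrong?

- Maybe it has something to do with cookies? -> But then why does it work for the other URLs...

- Maybe they are just blocking scraping attempts?

- How can I use multiple parameters in the URL without getting this error?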


urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.

The last 30x error message was:

Moved Temporarily


Thanks in advance for your help! Cheers, Allen


What I have already tried: I tried the requests library


import requests


url = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'

r = requests.get(url)

html = r.text

print(html)


<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

<html><head>

<title>403 Forbidden</title>

</head><body>

<h1>Forbidden</h1>

<p>You don't have permission to access /sale

on this server.</p>

</body></html>


[Finished in 0.375s]


aluckdog
1 answer

繁花不似锦

If you use the requests package and add a User-Agent to the headers, it looks like all 4 links return a 200 response. So try adding a User-Agent header:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}

import requests
from bs4 import BeautifulSoup as soup

# create urls
url1 = 'https://en.titolo.ch/sale'
url2 = 'https://en.titolo.ch/sale?limit=108'
url3 = 'https://en.titolo.ch/sale?category_styles=29838_21212'
url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}

url_list = [url1, url2, url3, url4]

for url in url_list:
    # opening up connection on each url, grabbing the page
    response = requests.get(url, headers=headers)
    print(response.status_code)

Output:

200
200
200
200

So:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}

url = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'
r = requests.get(url, headers=headers)
html = r.text
print(html)
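If you would rather stay with urllib, as in the question, here is a minimal sketch of the same idea. I only checked the requests version, so treat this as untested against the site. The helper names build_request and fetch_html are made up for illustration. The cookie handling is an assumption on my part: the 302 loop might be caused by cookies being dropped between redirects.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Same browser-like User-Agent string as in the requests example.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
)


def build_request(base_url, params=None, user_agent=USER_AGENT):
    """Build a urllib Request with an encoded query string and a User-Agent header."""
    url = base_url
    if params:
        # urlencode joins several parameters with '&' and escapes the values.
        url = base_url + "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(request, timeout=10):
    """Open the request with an opener that keeps cookies across redirects."""
    # Assumption: storing cookies may help with the redirect loop;
    # this has not been verified against the site.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Building the request does not touch the network.
request = build_request(
    "https://en.titolo.ch/sale",
    {"category_styles": "31066", "limit": 108},
)
print(request.full_url)
```

To download the page you would call fetch_html(request) and pass the result to BeautifulSoup as before. Passing the parameters as a dict also answers the "multiple parameters" part: the URL itself was fine, so the default urllib request headers are the more likely reason the server rejects the request.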