Hey guys, what's up? :)
I'm trying to scrape a website using some URL parameters. If I use url1, url2, or url3, it works properly and prints the regular output I want (the HTML):
from urllib.request import urlopen as urlReq
from bs4 import BeautifulSoup as soup
# create urls
url1 = 'https://en.titolo.ch/sale'
url2 = 'https://en.titolo.ch/sale?limit=108'
url3 = 'https://en.titolo.ch/sale?category_styles=29838_21212'
url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'
# open a connection to the chosen url and grab the page
uClient = urlReq(url4)
page_html = uClient.read()
uClient.close()
# parsing the downloaded html
page_soup = soup(page_html, "html.parser")
# print the html
print(page_soup.body.prettify())
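But when I try url4, i.e. url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108', it gives me the error below. What am I doing wrong?
- Maybe it has something to do with cookies? But then why does it work for the other URLs...
- Maybe they are simply blocking scraping attempts?
- How can I use multiple parameters in the URL without getting this error?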
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Temporarily
Thanks in advance for your help! Cheers, Alan
What I have already tried: I tried the requests library.
import requests
url = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'
r = requests.get(url)
html = r.text
print(html)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /sale
on this server.</p>
</body></html>
[Finished in 0.375s]
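One thing that might be worth testing for the cookie question above: urllib does not keep cookies by default, so if the server sets a cookie and then redirects back to the same URL, the client can end up in exactly this kind of redirect loop. The sketch below is only a guess at a workaround, not a confirmed fix for this site. It assumes the loop is cookie-related and that a browser-like User-Agent is accepted; the helper names `build_url`, `make_opener`, and `fetch`, as well as the User-Agent string, are made up for illustration.

```python
import http.cookiejar
import urllib.parse
import urllib.request

BASE_URL = 'https://en.titolo.ch/sale'


def build_url(base, params):
    """Append query parameters to base, joined with '&' and URL-encoded."""
    return base + '?' + urllib.parse.urlencode(params)


def make_opener():
    """Build an opener that stores cookies between redirects.

    Assumption: the redirect loop is caused by a cookie the server sets
    and expects to get back on the next request.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Assumption: the default Python User-Agent may be rejected by the server.
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (compatible; example-scraper)')]
    return opener


def fetch(url, timeout=10):
    """Download the page body as bytes using the cookie-aware opener."""
    with make_opener().open(url, timeout=timeout) as response:
        return response.read()


# Same query string as url4, built from a dict instead of by hand.
url4 = build_url(BASE_URL, {'category_styles': '31066', 'limit': 108})
print(url4)
```

Calling `page_html = fetch(url4)` would then replace the `urlReq(url4)` / `read()` / `close()` steps, and the result can be passed to `soup(page_html, "html.parser")` as before. If the server still answers with 302 or 403, the site is probably blocking automated clients in some other way.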