
Scraping the main image instead of the thumbnail

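import re

import requests
from bs4 import BeautifulSoup

# Assumed values: the original snippet uses these names without defining them.
main_url = "https://phys.org"
headers = {"User-Agent": "Mozilla/5.0"}

root_tag = ["article", {"class": "sorted-article"}]
image_tag = ["img", {}, "src"]  # tag name, attribute filter (none), attribute to read

session = requests.Session()
response = session.get("https://phys.org/earth-news/", headers=headers)
webContent = response.content

# Assumed: the original never shows how all_tab_data is built.
soup = BeautifulSoup(webContent, "html.parser")
all_tab_data = soup.find_all(root_tag[0], root_tag[1])

for div in all_tab_data:
    image_url = None
    div_img = str(div)
    # Take the first absolute image URL that appears anywhere in the element's HTML.
    match = re.search(r"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png|jpeg)", div_img)
    if match is not None:
        image_url = match.group(0)
    else:
        # Fall back to the src attribute of the first <img>, if there is one.
        img = div.find(image_tag[0], image_tag[1])
        image_url = img.get(image_tag[2]) if img else None
    if image_url is not None:
        # Prefix site-relative paths ("/x.jpg") but not protocol-relative ones ("//host/x.jpg").
        if image_url.startswith('/') and not image_url.startswith('//'):
            image_url = main_url + image_url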

My code outputs output_url as the image URL, but the actual URL of the image is actual_url. How can I scrape the main image?
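As a side note on the last step of the code above, the manual check for a leading '/' could probably be replaced by urljoin from the standard library, which also handles protocol-relative and already-absolute URLs. This is a minimal sketch: the base URL and the sample paths are assumptions, since the question never shows the value of main_url.

```python
from urllib.parse import urljoin

# Assumed base URL; the question does not show what main_url holds.
MAIN_URL = "https://phys.org/earth-news/"


def absolutize(image_url, base=MAIN_URL):
    """Return an absolute URL for a src value that may be relative."""
    return urljoin(base, image_url)


# Hypothetical src values covering the three cases the question's code cares about.
print(absolutize("/img/a.jpg"))                     # site-relative path
print(absolutize("//cdn.example.com/a.jpg"))        # protocol-relative URL
print(absolutize("https://cdn.example.com/a.jpg"))  # already absolute
```

This only normalizes the URL; it does not by itself turn a thumbnail URL into the full-size one.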



一只斗牛犬
137 views, 2 answers
2 Answers

吃鸡游戏

Use BeautifulSoup to open each news article and read the image from the article page itself:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

with requests.Session() as session:
    # Send the browser-like User-Agent with every request in this session.
    session.headers.update(headers)
    soup = BeautifulSoup(session.get("https://phys.org/earth-news/").text, "lxml")
    # Collect the link to every article on the listing page.
    news_list = [news_div.get("href") for news_div in soup.select('.news-link')]
    for url in news_list:
        soup = BeautifulSoup(session.get(url).text, "lxml")
        # The main image lives inside the .article-img container on the article page.
        img = soup.select_one(".article-img")
        if img:
            print(url, img.select_one('img').get("src"))
        else:
            print(url, "This news doesn't contain image")
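The two steps in the answer above (collect article links, then pick the image on each article page) can be tried offline before hitting the site. The sketch below runs on hypothetical inline HTML: only the class names .news-link and .article-img come from the answer, while the markup, the example.com URLs, and the helper names article_links and main_image are made up for illustration. It uses the built-in html.parser backend so lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names are taken from the answer.
LISTING_HTML = """
<article class="sorted-article">
  <a class="news-link" href="https://example.com/news/a.html">A</a>
  <img src="https://example.com/thumbs/a.jpg">
</article>
<article class="sorted-article">
  <a class="news-link" href="https://example.com/news/b.html">B</a>
</article>
"""

ARTICLE_HTML = """
<div class="article-img"><img src="https://example.com/full/a.jpg"></div>
"""


def article_links(listing_html):
    """Return the href of every .news-link element on a listing page."""
    soup = BeautifulSoup(listing_html, "html.parser")
    return [a.get("href") for a in soup.select(".news-link") if a.get("href")]


def main_image(article_html):
    """Return the src of the image inside .article-img, or None if absent."""
    soup = BeautifulSoup(article_html, "html.parser")
    img = soup.select_one(".article-img img")
    return img.get("src") if img else None


print(article_links(LISTING_HTML))
print(main_image(ARTICLE_HTML))
print(main_image("<p>no image here</p>"))
```

Selecting ".article-img img" in one step also covers the case where the container exists but holds no img element, which would raise an AttributeError in the two-step lookup.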

慕神8447489

Use BeautifulSoup to extract the image links from the listing page and rewrite the thumbnail size segment to the full-size one:

import requests
from bs4 import BeautifulSoup

url = 'https://phys.org/earth-news/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

# The thumbnails are lazy-loaded, so the URL is in data-src rather than src.
for img in soup.select('.sorted-article img[data-src]'):
    print(img['data-src'].replace('/175u/', '/800/'))

Prints:

https://scx1.b-cdn.net/csz/news/800/2020/biofuels.jpg
https://scx1.b-cdn.net/csz/news/800/2020/waterscarcity.jpg
https://scx1.b-cdn.net/csz/news/800/2020/soilerosion.jpg
https://scx1.b-cdn.net/csz/news/800/2020/hydropowerdam.jpg
https://scx1.b-cdn.net/csz/news/800/2019/flood.jpg
https://scx1.b-cdn.net/csz/news/800/2018/1-emissions.jpg
https://scx1.b-cdn.net/csz/news/800/2020/globalforest.jpg
https://scx1.b-cdn.net/csz/news/800/2020/fleeingthecl.jpg
https://scx1.b-cdn.net/csz/news/800/2020/watersecurity.jpg
https://scx1.b-cdn.net/csz/news/800/2019/2-water.jpg
https://scx1.b-cdn.net/csz/news/800/2020/japaneseexpe.jpg
https://scx1.b-cdn.net/csz/news/800/2020/6-scientistsco.jpg
https://scx1.b-cdn.net/csz/news/800/2020/housescollap.jpg
https://scx1.b-cdn.net/csz/news/800/2020/soil.jpg
https://scx1.b-cdn.net/csz/news/800/2020/32-researcherst.jpg
https://scx1.b-cdn.net/csz/news/800/2020/2-nasatracking.jpg
https://scx1.b-cdn.net/csz/news/800/2020/thelargersec.jpg
https://scx1.b-cdn.net/csz/news/800/2020/4-nasasterrasa.jpg
https://scx1.b-cdn.net/csz/news/800/2020/howtorecycle.jpg
https://scx1.b-cdn.net/csz/news/800/2020/newtoolstrac.jpg
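The URL rewrite in this answer can be isolated into a small function and checked without any network access. This is only a sketch: the helper name to_full_size is hypothetical, and the thumbnail URL in the example is inferred from the answer's replace call and its printed output rather than copied from the site. The '/175u/' and '/800/' segments are whatever the site used when the answer was written and may change.

```python
def to_full_size(thumb_url, thumb_segment="/175u/", full_segment="/800/"):
    """Swap the thumbnail size segment for the full-size one, if present.

    URLs that do not contain the thumbnail segment are returned unchanged.
    """
    if thumb_segment not in thumb_url:
        return thumb_url
    # Replace only the first occurrence so the rest of the path is untouched.
    return thumb_url.replace(thumb_segment, full_segment, 1)


# Thumbnail URL inferred from the answer; treat it as an illustrative input.
thumb = "https://scx1.b-cdn.net/csz/news/175u/2020/biofuels.jpg"
print(to_full_size(thumb))
```

Keeping the segments as parameters makes it easy to adjust the function if the site changes its size folders.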