How to parse information with Python from a web page that uses PHP and JavaScript

I am trying to get all the events, and the other metadata for those events, from this page: https://alando-palais.de/events


My problem is that the result (the HTML) does not contain the information I am looking for. I suspect it is "hidden" behind some PHP script, namely this URL: 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'
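
For what it's worth, a quick way to confirm this suspicion is to fetch the raw HTML and search it for a known event title. A minimal sketch, assuming "Maiwai" is one of the currently rendered events:

import requests

# If the string is missing from the raw HTML, the events must be injected
# later by JavaScript/AJAX rather than served with the initial page.
html = requests.get('https://alando-palais.de/events').text
print('Maiwai' in html)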


Any idea how to wait until the page has fully loaded, or what kind of approach I have to use to get at the event information?


This is my script so far :-):


from bs4 import BeautifulSoup
from urllib.request import urlopen, urljoin
from urllib.parse import urlparse
import re
import requests

if __name__ == '__main__':
    target_url = 'https://alando-palais.de/events'
    #target_url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

    soup = BeautifulSoup(requests.get(target_url).text, 'html.parser')
    print(soup)

    links = soup.find_all('a', href=True)
    for x, link in enumerate(links):
        print(x, link['href'])

#    for image in images:
#        print(urljoin(target_url, image))

The expected output would be something like:


date: 08.03.2019
title: Penthouse Club Special: Maiwai & Friends
img: https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg

And here is the part of the rendered result that contains this information:


<div class="vc_gitem-zone vc_gitem-zone-b vc_custom_1547045488900 originalbild vc-gitem-zone-height-mode-auto vc_gitem-is-link" style="background-image: url(https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg) !important;">
    <a href="https://alando-palais.de/event/penthouse-club-special-maiwai-friends" title="Penthouse Club Special: Maiwai &#038; Friends" class="vc_gitem-link vc-zone-link"></a>
    <img src="https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg" class="vc_gitem-zone-img" alt="">
    <div class="vc_gitem-zone-mini">
        <div class="vc_gitem_row vc_row vc_gitem-row-position-top">
            <div class="vc_col-sm-6 vc_gitem-col vc_gitem-col-align-left">
                <div class="vc_gitem-post-meta-field-Datum eventdatum vc_gitem-align-left"> 08.03.2019
                </div>
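
Once you have the rendered HTML (see the answers below), picking the three fields out of a fragment like the one above is straightforward. A minimal sketch, with selectors copied from the sample fragment (not verified against the full live page):

from bs4 import BeautifulSoup

# The fragment shown above, trimmed to the elements that carry the data.
fragment = '''
<div class="vc_gitem-zone vc_gitem-is-link">
  <a href="https://alando-palais.de/event/penthouse-club-special-maiwai-friends"
     title="Penthouse Club Special: Maiwai &#038; Friends"
     class="vc_gitem-link vc-zone-link"></a>
  <img src="https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg"
       class="vc_gitem-zone-img" alt="">
  <div class="vc_gitem-post-meta-field-Datum eventdatum vc_gitem-align-left"> 08.03.2019 </div>
</div>
'''
soup = BeautifulSoup(fragment, 'html.parser')
print('date:', soup.select_one('.eventdatum').get_text(strip=True))
print('title:', soup.select_one('a.vc_gitem-link')['title'])
print('img:', soup.select_one('img.vc_gitem-zone-img')['src'])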


2 Answers

慕容3067478

You can mimic the XHR POST the page makes:

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

# Form data copied from the XHR request the page sends (visible in the
# browser's network tab); the nonce and shortcode id are page-specific.
data = {
    'action': 'vc_get_vc_grid_data',
    'vc_action': 'vc_get_vc_grid_data',
    'tag': 'vc_basic_grid',
    'data[visible_pages]': 5,
    'data[page_id]': 30,
    'data[style]': 'all',
    'data[action]': 'vc_get_vc_grid_data',
    'data[shortcode_id]': '1551112413477-5fbaaae1-0622-2',
    'data[tag]': 'vc_basic_grid',
    'vc_post_id': '30',
    '_vcnonce': 'cc8cc954a4'
}

res = requests.post(url, data=data)
soup = BeautifulSoup(res.content, 'lxml')

dates = [item.text.strip() for item in soup.select('.vc_gitem-zone[style*="https://alando-palais.de"]')]
textInfo = [item for item in soup.select('.vc_gitem-link')][::2]
imageLinks = [item['src'].strip() for item in soup.select('img')]

titles = []
links = []
for item in textInfo:
    titles.append(item['title'])
    links.append(item['href'])

results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)), columns=['title', 'date', 'link', 'imageLink'])
print(results)

Or selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = 'https://alando-palais.de/events#'
driver = webdriver.Chrome()
driver.get(url)

# Wait until the JavaScript-rendered grid items are present in the DOM.
dates = [item.text.strip() for item in WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".vc_gitem-zone[style*='https://alando-palais.de']"))) if len(item.text)]
textInfo = [item for item in driver.find_elements_by_css_selector('.vc_gitem-link')][::2]
textInfo = textInfo[: int(len(textInfo) / 2)]
imageLinks = [item.get_attribute('src').strip() for item in driver.find_elements_by_css_selector('a + img')][::2]

titles = []
links = []
for item in textInfo:
    titles.append(item.get_attribute('title'))
    links.append(item.get_attribute('href'))

results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)), columns=['title', 'date', 'link', 'imageLink'])
print(results)
driver.quit()
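
One caveat with the XHR approach: the `_vcnonce` and `data[shortcode_id]` values are tied to the page and can expire. Assuming they appear verbatim somewhere in the events page markup (an assumption; the exact markup may differ), a sketch that scrapes fresh values before posting:

import re
import requests

# Assumption: the nonce and shortcode id appear verbatim in the page source
# (e.g. in a data attribute or inline script); adjust the patterns if the
# markup on the live site differs.
page = requests.get('https://alando-palais.de/events').text
nonce = re.search(r'_vcnonce["\']?\s*[:=]\s*["\']([0-9a-f]+)', page)
shortcode_id = re.search(r'\d{13}-[0-9a-f]{8}-\d{4}-\d', page)
if nonce and shortcode_id:
    print('nonce:', nonce.group(1))
    print('shortcode_id:', shortcode_id.group(0))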

智慧大石

My best suggestion is to use selenium to get around any server-side restrictions. Edited:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://alando-palais.de/events")
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
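
A note for anyone reading this later: the `find_elements_by_xpath` helper used above was deprecated and later removed in Selenium 4, so with a current Selenium install the equivalent would be:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://alando-palais.de/events")
# Same XPath query, expressed through the Selenium 4 find_elements API.
for elem in driver.find_elements(By.XPATH, "//a[@href]"):
    print(elem.get_attribute("href"))
driver.quit()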