Scraping: find the last 5 scores of each match in the HTML

I hope you can help me get the last 5 scores for each match. I have not been able to work out how to get them, so any help is appreciated.

from selenium import webdriver
import pandas as pd
from pandas import ExcelWriter
from openpyxl.workbook import Workbook
import time as t
import xlsxwriter

pd.set_option('display.max_rows', 5, 'display.max_columns', None, 'display.width', None)

browser = webdriver.Firefox()

https://img3.mukewang.com/65113bf30001e90d06530757.jpg

browser.get('https://www.mismarcadores.com/futbol/espana/laliga/resultados/')
print("Current Page Title is : %s" %browser.title)

aux_ids = browser.find_elements_by_css_selector('.event__match.event__match--static.event__match--oneLine')

ids=[]
i = 0
for aux in aux_ids:
    if i < 1:
        ids.append( aux.get_attribute('id') )
        i+=1

data=[]
for idt in ids:
    id_clean = idt.split('_')[-1]
    browser.execute_script("window.open('');")
    browser.switch_to.window(browser.window_handles[1])
    browser.get(f'https://www.mismarcadores.com/partido/{id_clean}/#h2h;overall')
    t.sleep(5)
    p_ids = browser.find_elements_by_css_selector('h2h-wrapper')
    #here the code of the last 5 score of each match


白板的微信
1 answer

动漫人物

I believe this also works with Firefox, but I have not tested it. I use Chrome, so if you want to use chromedriver, check your browser version, download the matching driver, and add it to your system path:

https://chromedriver.chromium.org/downloads

The only drawback of this approach is that a browser window stays open until the page has loaded, because we have to wait for the page's JavaScript to generate the match data. If you need any other information, let me know. Good luck!

Known issue: retrieving the match data sometimes throws an "index out of range" error. I am still looking into this; it seems the XPath changes slightly on some links.

from selenium import webdriver
from lxml import html
from lxml.html import HtmlElement


def test():
    # URLs used for testing.
    urls = ['https://www.mismarcadores.com/partido/noIPZ3Lj/#h2h;overall']
    # Loop over the URLs and print the result of get_last_5(url) for each one.
    for url in urls:
        print("Scores after this match {u}".format(u=url), get_last_5(url))


def get_last_5(url):
    print("processing {u}, please wait...".format(u=url))
    # Get an instance of the webdriver and load the URL.
    browser = webdriver.Chrome()
    browser.get(url)
    # Store the HTML as a string. We read it from the browser because the page
    # has to load and run its JavaScript before the match data exists.
    innerHTML = browser.execute_script("return document.body.innerHTML")
    # The browser is no longer needed once we have the HTML.
    browser.quit()
    # Parse the string into an HtmlElement.
    tree: HtmlElement = html.fromstring(innerHTML)

    # first_team, second_team, match_date and rows are obtained with xpath().
    # To find an XPath, open one of the URLs in Chrome, right-click the
    # element -> Inspect -> right-click the selected node in the inspect
    # panel -> Copy -> Copy XPath.
    # first_team, second_team and match_date come from the "title" section.
    # rows come from the tbody of the table of last matches.
    # xpath() returns a list of HtmlElement objects (every element that
    # matches), so we use [0] to take the first one and then read its text.
    first_team = tree.xpath('//*[@id="flashscore"]/div[1]/div[1]/div[2]/div/div/a')[0].text
    second_team = tree.xpath('//*[@id="flashscore"]/div[1]/div[3]/div[2]/div/div/a')[0].text
    # The title holds the date and time of the match, e.g. "10.08.2020 13:00",
    # while each table row holds only "10.08.20". Keep "dd.mm." plus the last
    # two digits of the year so the two can be compared.
    title_date = tree.xpath('//*[@id="utime"]')[0].text
    match_date = title_date[0:6] + title_date[8:10]
    # [0] gives the HtmlElement for the "table" body that holds all matches;
    # getchildren() returns its rows as a list of HtmlElement objects.
    # [:-1] drops the last child, which is the "Mostrar más partidos" button.
    rows = tree.xpath('//*[@id="tab-h2h-overall"]/div[1]/table/tbody')[0].getchildren()[:-1]

    # match_position is the position of the match shown in the title.
    match_position = None
    # Iterate over the rows until we find that match.
    for i in range(len(rows)):
        # is_match compares first_team, second_team and match_date with the
        # current row. We store i + 1 because we want the rows that come
        # after the match row.
        if is_match(first_team, second_team, match_date, rows[i]):
            match_position = i + 1
            # No need to go further once the match is found.
            break

    # If the match was not found in the table there is nothing to return.
    if match_position is None:
        print("finished processing {u}.".format(u=url))
        return []

    # We only want the 5 matches beneath our match. Slicing past the end of a
    # list is safe in Python, so if fewer than 5 rows remain we simply get
    # those (0, 1, 2, 3 or 4 rows).
    rows = rows[match_position:match_position + 5]

    # Each row (<tr>) has 6 <td> elements; the fifth one is the score. Inside
    # the score cell there is a span and then a strong element:
    # <tr>
    # <td></td>
    # <td></td>
    # <td></td>
    # <td></td>
    # <td><span><strong>1:2</strong></span></td>
    # <td></td>
    # </tr>
    scores = []
    for row in rows:
        # Not the best way, but take all the text inside the span element.
        data = row.getchildren()[4].getchildren()[0].text_content()
        # If the text is exactly 5 characters long, e.g. "1 : 2", it is the
        # score. Otherwise we assume the form "1 : 2(0 : 1)" and take the 5
        # characters just before the closing ")", here "0 : 1".
        score = data if len(data) == 5 else data[-6:-1]
        scores.append(score)
    print("finished processing {u}.".format(u=url))
    return scores


def is_match(t1, t2, match_date, row):
    # Compare t1, t2 and match_date (taken from the title) with the row's
    # team 1, team 2 and date. Each row has 6 elements: the date is at
    # position 0, team 1 at 2 and team 2 at 3. Read get_last_5 first.
    # <td><span>10.03.20</span></td>
    date = row.getchildren()[0].getchildren()[0].text
    # <td><span>TeamName</span></td> (when the team lost) or
    # <td><span><strong>TeamName</strong></span></td> (when the team won)
    team1element = row.getchildren()[2].getchildren()[0]  # this is the span element
    # Ternary expression (value_if_true if condition else value_if_false):
    # https://book.pythontips.com/en/latest/ternary_operators.html
    # If the span has children, the team name is the text of the strong
    # element; if not, it is the text of the span itself.
    mt1 = team1element.getchildren()[0].text if len(team1element.getchildren()) > 0 else team1element.text
    # Same as for team 1.
    team2element = row.getchildren()[3].getchildren()[0]
    mt2 = team2element.getchildren()[0].text if len(team2element.getchildren()) > 0 else team2element.text
    # Comparing the date alone would probably be enough, but to be sure we
    # compare the names as well. If all three match, this is our match row.
    return match_date == date and t1 == mt1 and t2 == mt2


if __name__ == "__main__":
    test()
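
The row slicing and score-string handling in get_last_5 do not need a browser, so they can be tried on plain values first. Below is a minimal sketch using only the standard library. The helper names next_rows and last_score are hypothetical and are not part of the answer. The regular expression is one possible alternative to the fixed-width slice (it is not the answer's method), and it assumes scores always look like "digits : digits", optionally followed by a second score in parentheses.

```python
import re

# Hypothetical helpers, not part of the original answer.

def next_rows(rows, match_position, count=5):
    """Return up to `count` rows that follow the match row.

    `match_position` is the index just after the match row (i + 1 in the
    answer's loop), or None when the match was not found.
    """
    if match_position is None:
        return []
    # Slicing past the end of a list is safe and simply returns fewer items.
    return rows[match_position:match_position + count]


def last_score(text):
    """Return the last "d : d" pattern found in `text`, or None.

    For "1 : 2(0 : 1)" this returns "0 : 1", the same part the answer's
    data[-6:-1] slice selects. Unlike the slice, it also copes with
    scores that have more than one digit (an assumption about the data).
    """
    found = re.findall(r"\d+\s*:\s*\d+", text)
    return found[-1] if found else None


if __name__ == "__main__":
    print(next_rows(["a", "b", "c", "d", "e", "f", "g"], 1))
    print(last_score("1 : 2(0 : 1)"))
```

If the page layout changes, only the regular expression would need adjusting, while the XPath lookups in get_last_5 stay as they are.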