Differences between two pages when scraping with Beautiful Soup


I am just getting started with Python and Beautiful Soup, and I am scraping app metadata from the Google Play Store into a JSON file. Here is my code:

import json

from bs4 import BeautifulSoup
from requests import get

def createjson(app_link):
    url = 'https://play.google.com/store/apps/details?id=' + app_link
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    bs = BeautifulSoup(response.text, "lxml")
    # Text of every detail block, in the order it appears on the page
    result = [e.text for e in bs.find_all("div", {"class": "hAyfc"})]
    apptype = [e.text for e in bs.find_all("div", {"class": "hrTbp R8zArc"})]
    data = {}
    data['appdata'] = []
    data['appdata'].append({
        'name': html_soup.find(class_="AHFaub").text,
        # Positional lookups: the slices strip the leading label text
        'updated': result[1][7:],
        'apkSize': result[2][4:],
        'offeredBy': result[9][10:],
        'currentVersion': result[4][15:]
    })
    jsonfile = "allappsdata.json"  # Collect the info for all apps in one JSON file
    with open(jsonfile, 'a+') as outfile:
        json.dump(data, outfile)

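My "result" variable collects the strings found on a given app page. The problem is that Google changes the order of these entries between two different pages. Sometimes result[1] is the app name, sometimes it is result[2]; I have the same problem with the other metadata I need ("updated", "apkSize", etc.). How can I handle these variations? Is it possible to scrape the page in a different way? Thanks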

小唯快跑啊


1 answer

紫衣仙女

The problem is that the detail blocks do not appear in the same order on every page, so looking them up by list position is unreliable. Save them in a dictionary keyed by label instead of a list. Change your result = [e....] to:

result = {}
details = bs.find_all("div", {"class": "hAyfc"})
for item in details:
    label = item.findChild('div', {'class': 'BgcNfc'})
    value = item.findChild('span', {'class': 'htlgb'})
    result[label.text] = value.text

and change data['appdata']... to:

data['appdata'].append({
    'name': html_soup.find(class_="AHFaub").text,
    'updated': result['Updated'],
    'apkSize': result['Size'],
    'offeredBy': result['Offered By'],
    'currentVersion': result['Current Version']
})
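To see why keying by label helps, here is a minimal sketch that can be run offline. It parses two hand-written HTML fragments that list the same fields in a different order. The class names hAyfc, BgcNfc and htlgb come from the answer above; the fragments themselves, the sample values and the parse_details helper are assumptions made up for illustration, and the live Play Store markup may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: each "hAyfc" block holds a "BgcNfc"
# label and an "htlgb" value. The two pages order the blocks differently.
PAGE_A = """
<div class="hAyfc"><div class="BgcNfc">Updated</div><span class="htlgb">May 1, 2019</span></div>
<div class="hAyfc"><div class="BgcNfc">Size</div><span class="htlgb">12M</span></div>
"""
PAGE_B = """
<div class="hAyfc"><div class="BgcNfc">Size</div><span class="htlgb">48M</span></div>
<div class="hAyfc"><div class="BgcNfc">Updated</div><span class="htlgb">June 3, 2019</span></div>
<div class="hAyfc"><div class="BgcNfc">Current Version</div><span class="htlgb">2.1</span></div>
"""


def parse_details(html):
    """Return a {label: value} dict for every complete detail block."""
    soup = BeautifulSoup(html, "html.parser")
    details = {}
    for item in soup.find_all("div", {"class": "hAyfc"}):
        label = item.find("div", {"class": "BgcNfc"})
        value = item.find("span", {"class": "htlgb"})
        # Skip blocks that lack either part instead of raising AttributeError
        if label is None or value is None:
            continue
        details[label.get_text(strip=True)] = value.get_text(strip=True)
    return details


details_a = parse_details(PAGE_A)
details_b = parse_details(PAGE_B)

# The same key works on both pages, whatever the block order
print(details_a["Size"], details_b["Size"])

# .get() returns None for a label that a page does not have
print(details_a.get("Current Version"))
```

Looking fields up with .get() rather than square brackets may be worth considering in the real scraper too, since a page that omits a label would otherwise raise KeyError.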
