Excluding part of a string from a regular-expression split in Python when web scraping

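I'm trying to scrape some data from an e-commerce website for a personal project. I want to build a nested list of strings from the HTML, but one part of the HTML is giving me trouble. Each list item looks like this: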


<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy","price":12.95,"brand":"Lush","category":"Bubble Bar","variant":"7 oz.","quantity":1,"list":"/bath/bubble-bars/sleepy/9999901920.html","dimension11":"","dimension12":"Naked,Self Preserving,Vegan","dimension13":1,"dimension14":1,"dimension15":true}\'></div>

What I have now is a regular expression that pulls everything out of the data-impressions attribute and splits it at the commas, like this:


list_return = [re.findall('\{([^{]+[^}\'></div>])', i) for i in bathshower_impressions]

list_return = [re.split(',', list_return[i][0]) for i in range(0, len(list_return))]

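This gives me a list of lists for each item, which will become the key: value pairs of a dictionary. For the example above, the second-level items are: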


[['"id"', '"01920"'],
  ['"name"', '"Sleepy"'],
  ['"price"', '12.95'],
  ['"brand"', '"Lush"'],
  ['"category"', '"Bubble Bar"'],
  ['"variant"', '"7 oz."'],
  ['"quantity"', '1'],
  ['"list"', '"/bath/bubble-bars/sleepy/9999901920.html"'],
  ['"dimension11"', '""'],
  ['"dimension12"', '"Naked'],
  ['Self Preserving'],
  ['Vegan"'],
  ['"dimension13"', '1'],
  ['"dimension14"', '1'],
  ['"dimension15"', 'true']]

My problem is dimension12. I can't figure out how to keep its value from being split at the commas, so that its entry reads:


['"dimension12"', '"Naked,Self Preserving,Vegan"']

Any help would be appreciated, thanks.


MMMHUHU
1 answer

繁花如伊

I'd suggest a different approach. The attribute value looks like JSON, so why not use the json module? That way you get a ready-made data structure that you can modify further.

import json
from bs4 import BeautifulSoup

html_list = [
    """<div class="impressions" data-impressions=\'{"id":"01920","name":"Sleepy","price":12.95,"brand":"Lush","category":"Bubble Bar","variant":"7 oz.","quantity":1,"list":"/bath/bubble-bars/sleepy/9999901920.html","dimension11":"","dimension12":"Naked,Self Preserving,Vegan","dimension13":1,"dimension14":1,"dimension15":true}\'></div>""",
]

data_structures = []
for html_item in html_list:
    # Locate the div and decode its data-impressions attribute as JSON.
    soup = BeautifulSoup(html_item, "html.parser").find("div", {"class": "impressions"})
    data_structures.append(json.loads(soup["data-impressions"]))

print(data_structures)

This outputs a list of dictionaries:

[{'id': '01920', 'name': 'Sleepy', 'price': 12.95, 'brand': 'Lush', 'category': 'Bubble Bar', 'variant': '7 oz.', 'quantity': 1, 'list': '/bath/bubble-bars/sleepy/9999901920.html', 'dimension11': '', 'dimension12': 'Naked,Self Preserving,Vegan', 'dimension13': 1, 'dimension14': 1, 'dimension15': True}]

To access the key you need, just do:

for data_item in data_structures:
    print(data_item["dimension12"])

Prints:

Naked,Self Preserving,Vegan
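If you still want the nested list of quoted strings from the question rather than dictionaries, the decoded JSON can be turned back into pairs. The following is only a sketch under a few assumptions: it swaps BeautifulSoup for the standard library's html.parser so it runs without extra installs, the names ImpressionParser and to_pairs are made up for illustration, and the sample markup is an abbreviated version of the div in the question.

```python
import json
from html.parser import HTMLParser


class ImpressionParser(HTMLParser):
    """Hypothetical helper: decodes every data-impressions attribute it sees."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples with the quotes already removed.
        for name, value in attrs:
            if name == "data-impressions" and value:
                self.items.append(json.loads(value))


def to_pairs(item):
    """Hypothetical helper: re-encode each key and value as JSON text."""
    return [[json.dumps(key), json.dumps(value)] for key, value in item.items()]


# Abbreviated sample of the markup from the question (assumption: real pages
# contain many such divs with more fields).
html_item = (
    "<div class=\"impressions\" data-impressions='"
    '{"id":"01920","price":12.95,'
    '"dimension12":"Naked,Self Preserving,Vegan","dimension15":true}'
    "'></div>"
)

parser = ImpressionParser()
parser.feed(html_item)
parser.close()

pairs = [to_pairs(item) for item in parser.items]
print(pairs[0][2])  # ['"dimension12"', '"Naked,Self Preserving,Vegan"']
```

Because the commas inside the dimension12 value are handled by the JSON parser and not by a split on ",", the value stays in one piece. If you don't need the surrounding quote characters, use list(item.items()) instead of to_pairs.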