Getting a variable inside a JavaScript function using BeautifulSoup, Python, and regex

4 Answers

白衣非少年

You can use a shorter lazy regular expression together with the hjson library, which handles the unquoted keys:

    import re
    import hjson

    html = '''<html>
    <head>
    <script type="text/javascript">
        $(document).ready(function(){
            var images = [
                {
                    src: "http://example.com/bar/001.jpg",
                    title: "FooBar One"
                },
                {
                    src: "http://example.com/bar/002.jpg",
                    title: "FooBar Two"
                },
            ]
            ;
            var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
    </script>'''

    # Lazily capture everything between "var images = " and the next semicolon
    p = re.compile(r'var images = (.*?);', re.DOTALL)
    # hjson accepts the unquoted keys that json.loads would reject
    data = hjson.loads(p.findall(html)[0])
    print(data)
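One thing to keep in mind with the lazy pattern is that it stops at the first semicolon it meets. The sketch below illustrates this under an assumption: the input is a hypothetical variant of the page in which one title contains a semicolon. The second pattern, `anchored`, is not from the answer; it is one possible adjustment that requires the closing bracket right before the semicolon.

```python
import re

# Hypothetical input: the title contains a semicolon (not in the original page)
js = '''
var images = [
    { src: "http://example.com/bar/001.jpg", title: "Foo; Bar" },
];
var other_data = [];
'''

# The lazy pattern from the answer: capture up to the first semicolon
lazy = re.compile(r'var images = (.*?);', re.DOTALL)
captured = lazy.findall(js)[0]
print(captured)  # the capture is cut off inside the title string

# Assumed alternative: require "]" (plus optional whitespace) before the semicolon
anchored = re.compile(r'var images = (\[.*?\])\s*;', re.DOTALL)
full = anchored.findall(js)[0]
print(full)
```

The anchored variant would still be fooled by a `]` followed by `;` inside a string value, so it is a mitigation rather than a real parser.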


桃花长相依

Method 1

Perhaps the following expression would work to some extent:

    \bvar\s+images\s*=\s*(\[[^\]]*\])

Test

    import re
    from bs4 import BeautifulSoup

    # Example of a HTML source code containing `images` array
    html = '''<html>
    <head>
    <script type="text/javascript">
        $(document).ready(function(){
            var images = [
                {
                    src: "http://example.com/bar/001.jpg",
                    title: "FooBar One"
                },
                {
                    src: "http://example.com/bar/002.jpg",
                    title: "FooBar Two"
                },
            ]
            ;
            var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
    </script>
    <body>
    <p>Some content</p>
    </body>
    </head>
    </html>'''

    soup = BeautifulSoup(html, 'html.parser')
    scripts = soup.find_all('script')  # successfully captures the <script> element

    for script in scripts:
        # script.string is None for tags without inline text, so fall back to ''
        data = re.findall(
            r'\bvar\s+images\s*=\s*(\[[^\]]*\])', script.string or '', re.DOTALL)
        if data:
            print(data[0])

Output

    [
                {
                    src: "http://example.com/bar/001.jpg",
                    title: "FooBar One"
                },
                {
                    src: "http://example.com/bar/002.jpg",
                    title: "FooBar Two"
                },
            ]

If you want to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also watch how it matches against some sample inputs.

Method 2

Another option is to capture the src and title values directly:

    import re

    string = '''<html>
    <head>
    <script type="text/javascript">
        $(document).ready(function(){
            var images = [
                {
                    src: "http://example.com/bar/001.jpg",
                    title: "FooBar One"
                },
                {
                    src: "http://example.com/bar/002.jpg",
                    title: "FooBar Two"
                },
            ]
            ;
            var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
    </script>
    <body>
    <p>Some content</p>
    </body>
    </head>
    </html>'''

    # Group 1 is the src value, group 2 is the title value
    expression = r'src:\s*"([^"]*)"\s*,\s*title:\s*"([^"]*)"'
    matches = re.findall(expression, string, re.DOTALL)

    output = []
    for match in matches:
        output.append({"src": match[0], "title": match[1]})

    print(output)

Output

    [{'src': 'http://example.com/bar/001.jpg', 'title': 'FooBar One'}, {'src': 'http://example.com/bar/002.jpg', 'title': 'FooBar Two'}]
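As a possible variation on the second approach, named groups let the regex build the dictionaries without indexing into tuples. This is only a sketch: the shortened `js` input and the names `pair` and `images` are assumptions made for illustration, and it relies on `src` always coming directly before `title`, just as the original expression does.

```python
import re

# Shortened, assumed stand-in for the script contents
js = '''
var images = [
    { src: "http://example.com/bar/001.jpg", title: "FooBar One" },
    { src: "http://example.com/bar/002.jpg", title: "FooBar Two" },
];
'''

# Same expression as above, with the two groups given names
pair = re.compile(r'src:\s*"(?P<src>[^"]*)"\s*,\s*title:\s*"(?P<title>[^"]*)"')

# groupdict() returns {'src': ..., 'title': ...} for each match
images = [m.groupdict() for m in pair.finditer(js)]
print(images)
```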


慕容708150

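Here is one way to get there with no regex and not even BeautifulSoup, just plain Python string manipulation, in 4 simple steps :)

Here html is the same HTML source string used in the answers above.

    # 1. Cut off everything up to and including the opening bracket
    step_1 = html.split('var images = [')
    # 2. Collapse all runs of whitespace into single spaces
    step_2 = " ".join(step_1[1].split())
    # 3. Cut off everything from the closing bracket onward
    step_3 = step_2.split('] ; var other_data = ')
    # 4. Split the remaining text into one string per object
    step_4 = step_3[0].replace('}, {', '}xxx{').split('xxx')
    print(step_4)

Output:

    ['{ src: "http://example.com/bar/001.jpg", title: "FooBar One" }', '{ src: "http://example.com/bar/002.jpg", title: "FooBar Two" }, ']

Note that step 3 depends on the exact text that follows the array in this page (the closing bracket, the semicolon, and the other_data declaration).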
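The same idea can be sketched with `str.partition`, which avoids hard-coding the text that follows the array. This is an assumption-laden variant, not the answer's own code: the shortened `html` sample is made up for illustration, and the approach only works while no string value contains `]` or `}`.

```python
# Shortened, assumed stand-in for the page source
html = '''<script type="text/javascript">
    var images = [
        { src: "http://example.com/bar/001.jpg", title: "FooBar One" },
        { src: "http://example.com/bar/002.jpg", title: "FooBar Two" },
    ]
    ;
    var other_data = [{"name": "Tom", "type": "cat"}];
</script>'''

# Keep only the text after the opening bracket ...
_, _, rest = html.partition('var images = [')
# ... and before the first closing bracket
body, _, _ = rest.partition(']')

# Collapse whitespace, then split into one string per object
body = " ".join(body.split())
items = [part.strip(" ,") + " }" for part in body.split("}") if part.strip(" ,")]
print(items)
```

Unlike step 4 in the answer, the last item here carries no trailing comma, because each piece is stripped of surrounding commas and spaces before the closing brace is put back.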


RISEBY

re.match only matches at the beginning of the string, so your regex has to account for the whole string, including everything before the part you want. Use

    pattern = re.compile('.*var images = (.*?);.*', re.DOTALL)

The captured string is still not in a valid Python list format. You have to do some manipulation before you can apply ast.literal_eval:

    import ast
    import re

    # scripts is the result of soup.find_all('script')
    for script in scripts:
        data = pattern.match(str(script.string))
        if data:
            list_str = data.groups()[0]
            # Remove last comma
            last_comma_index = list_str.rfind(',')
            list_str = list_str[:last_comma_index] + list_str[last_comma_index+1:]
            # Modify src to 'src' and title to 'title'
            list_str = re.sub(r'\s([a-z]+):', r'"\1":', list_str)
            # Strip surrounding whitespace before evaluating
            list_str = list_str.strip()
            final_list = ast.literal_eval(list_str)
            print(final_list)

Output

    [{'src': 'http://example.com/bar/001.jpg', 'title': 'FooBar One'}, {'src': 'http://example.com/bar/002.jpg', 'title': 'FooBar Two'}]
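To make the difference between matching at the start and scanning forward concrete, here is a minimal sketch that uses re.search instead of padding the pattern with `.*`. The `script_text` sample and the names `bare`, `found`, and `images` are assumptions for illustration. The sketch also skips the comma-removal step, since Python list literals accept a trailing comma.

```python
import ast
import re

# Assumed stand-in for the text of the <script> element
script_text = '''
    $(document).ready(function(){
        var images = [
            {
                src: "http://example.com/bar/001.jpg",
                title: "FooBar One"
            },
            {
                src: "http://example.com/bar/002.jpg",
                title: "FooBar Two"
            },
        ]
        ;
'''

bare = re.compile(r'var images = (.*?);', re.DOTALL)

# match() only tries position 0, where the text does not begin with "var images"
print(bare.match(script_text))

# search() scans forward until the pattern fits
found = bare.search(script_text)
list_str = found.group(1)

# Quote the bare keys so the text becomes a valid Python literal
list_str = re.sub(r'\s([a-z]+):', r'"\1":', list_str)
images = ast.literal_eval(list_str.strip())
print(images)
```

The key-quoting substitution relies on each key being preceded by whitespace, which is why the `http:` inside the URLs is left alone.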
