Getting a variable inside a Javascript function using BeautifulSoup, Python, and Regex

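An array named images is defined inside a Javascript function. It needs to be extracted from the string and converted into a Python list object.

Python's BeautifulSoup is being used for the parsing.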


        var images = [
            {
                src: "http://example.com/bar/001.jpg",
                title: "FooBar One"
            },
            {
                src: "http://example.com/bar/002.jpg",
                title: "FooBar Two"
            },
        ]
        ;

Question: Why does my code below fail to capture this images array, and how can we fix it?

Thanks!

Desired output: a Python list object.


[
    {
        src: "http://example.com/bar/001.jpg",
        title: "FooBar One"
    },
    {
        src: "http://example.com/bar/002.jpg",
        title: "FooBar Two"
    },
]

Actual code


import re
from bs4 import BeautifulSoup

# Example of a HTML source code containing `images` array
html = '''
<html>
<head>
<script type="text/javascript">

    $(document).ready(function(){
        var images = [
            {
                src: "http://example.com/bar/001.jpg",
                title: "FooBar One"
            },
            {
                src: "http://example.com/bar/002.jpg",
                title: "FooBar Two"
            },
        ]
        ;
        var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];

</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''

pattern = re.compile('var images = (.*?);')
soup = BeautifulSoup(html, 'lxml')
scripts = soup.find_all('script')  # successfully captures the <script> element
for script in scripts:
    data = pattern.match(str(script.string))  # NOT extracting the array!!
    if data:
        print('Found:', data.groups()[0])     # NOT being printed

噜噜哒
4 answers

白衣非少年

You can use a shorter lazy regex and the hjson library to handle the unquoted keys:

import re
import hjson

html = '''
<html>
<head>
<script type="text/javascript">

    $(document).ready(function(){
        var images = [
            {
                src: "http://example.com/bar/001.jpg",
                title: "FooBar One"
            },
            {
                src: "http://example.com/bar/002.jpg",
                title: "FooBar Two"
            },
        ]
        ;
        var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];

</script>
'''

# re.DOTALL lets "." match newlines, so the lazy group can span the multi-line array
p = re.compile(r'var images = (.*?);', re.DOTALL)
data = hjson.loads(p.findall(html)[0])
print(data)
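If installing hjson is not an option, a rough standard-library alternative is to patch the captured text into valid JSON first. The sketch below is not part of the answer above: `js_literal_to_python` is a hypothetical helper, and it assumes the array is as simple as the one in the question (bare identifier keys, double-quoted string values, and no string value that itself contains something like `, word:`).

```python
import json
import re

# The text captured by the regex, copied from the question.
js_array = '''[
    {
        src: "http://example.com/bar/001.jpg",
        title: "FooBar One"
    },
    {
        src: "http://example.com/bar/002.jpg",
        title: "FooBar Two"
    },
]'''


def js_literal_to_python(text):
    """Hypothetical helper: convert a simple JS array literal to Python objects."""
    # Quote bare keys: an identifier that follows "{" or "," and precedes ":".
    quoted = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', text)
    # Remove trailing commas that sit right before a closing "}" or "]".
    cleaned = re.sub(r',\s*([}\]])', r'\1', quoted)
    return json.loads(cleaned)


images = js_literal_to_python(js_array)
print(images)
```

For anything beyond simple literals like this one, a tolerant parser such as hjson is likely the safer choice, since these two substitutions know nothing about string contents.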

桃花长相依

Method 1

Maybe \bvar\s+images\s*=\s*(\[[^\]]*\]) might work to some extent:

Test

import re
from bs4 import BeautifulSoup

# Example of a HTML source code containing `images` array
html = '''
<html>
<head>
<script type="text/javascript">

    $(document).ready(function(){
        var images = [
            {
                src: "http://example.com/bar/001.jpg",
                title: "FooBar One"
            },
            {
                src: "http://example.com/bar/002.jpg",
                title: "FooBar Two"
            },
        ]
        ;
        var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];

</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script')  # successfully captures the <script> element
for script in scripts:
    data = re.findall(
        r'\bvar\s+images\s*=\s*(\[[^\]]*\])', script.string, re.DOTALL)
    print(data[0])

Output

[
            {
                src: "http://example.com/bar/001.jpg",
                title: "FooBar One"
            },
            {
                src: "http://example.com/bar/002.jpg",
                title: "FooBar Two"
            },
        ]

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you like, you can also step through how it matches some sample inputs there.

Method 2

Another option would be:

import re

string = '''
<html>
<head>
<script type="text/javascript">

    $(document).ready(function(){
        var images = [
            {
                src: "http://example.com/bar/001.jpg",
                title: "FooBar One"
            },
            {
                src: "http://example.com/bar/002.jpg",
                title: "FooBar Two"
            },
        ]
        ;
        var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];

</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''

# Capture each src/title pair directly instead of capturing the whole array
expression = r'src:\s*"([^"]*)"\s*,\s*title:\s*"([^"]*)"'
matches = re.findall(expression, string, re.DOTALL)

output = []
for match in matches:
    output.append({"src": match[0], "title": match[1]})

print(output)

Output

[{'src': 'http://example.com/bar/001.jpg', 'title': 'FooBar One'}, {'src': 'http://example.com/bar/002.jpg', 'title': 'FooBar Two'}]
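The "to some extent" caveat on Method 1 can be made concrete. The negated class [^\]]* stops at the first closing bracket, so the pattern only captures the full array when no "]" appears inside it. A small sketch with two made-up inputs (neither comes from the question):

```python
import re

# Same expression as in Method 1.
pattern = re.compile(r'\bvar\s+images\s*=\s*(\[[^\]]*\])', re.DOTALL)

# Hypothetical inputs for illustration only.
flat = 'var images = [{src: "a.jpg"}, {src: "b.jpg"}];'
nested = 'var images = [{src: "a.jpg", tags: ["x", "y"]}, {src: "b.jpg"}];'

flat_match = pattern.search(flat).group(1)
nested_match = pattern.search(nested).group(1)

print(flat_match)    # the complete array
print(nested_match)  # cut off at the first "]", inside the nested tags array
```

For the flat array in the question this is not a problem, but a page whose objects contain nested arrays would need a different approach.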

慕容708150

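Here is one way to get there with no regex and not even BeautifulSoup, just plain Python string manipulation, in 4 simple steps :)

# html is the string from the question
step_1 = html.split('var images = [')            # everything after the opening bracket
step_2 = " ".join(step_1[1].split())             # collapse all whitespace runs to single spaces
step_3 = step_2.split('] ; var other_data = ')   # cut at the end of the array
step_4 = step_3[0].replace('}, {', '}xxx{').split('xxx')  # split into one string per object
print(step_4)

Output:

['{ src: "http://example.com/bar/001.jpg", title: "FooBar One" }', '{ src: "http://example.com/bar/002.jpg", title: "FooBar Two" }, ']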

RISEBY

re.match matches from the beginning of the string, so your regex must account for the entire string. Use

pattern = re.compile('.*var images = (.*?);.*', re.DOTALL)

The captured string is still not in a valid Python list format. You have to do some processing first before you can apply ast.literal_eval:

import ast

for script in scripts:
    data = pattern.match(str(script.string))
    if data:
        list_str = data.groups()[0]
        # Remove last comma
        last_comma_index = list_str.rfind(',')
        list_str = list_str[:last_comma_index] + list_str[last_comma_index+1:]
        # Modify src to 'src' and title to 'title'
        list_str = re.sub(r'\s([a-z]+):', r'"\1":', list_str)
        # Strip
        list_str = list_str.strip()
        final_list = ast.literal_eval(list_str)
        print(final_list)

Output

[{'src': 'http://example.com/bar/001.jpg', 'title': 'FooBar One'}, {'src': 'http://example.com/bar/002.jpg', 'title': 'FooBar Two'}]
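The two reasons the original pattern finds nothing can be seen in isolation: re.match is anchored at position 0, and without re.DOTALL the "." never crosses a newline. The sketch below uses a shortened, made-up stand-in for the script text and shows re.search as an alternative to padding the pattern with ".*" on both sides.

```python
import re

# Hypothetical, shortened stand-in for the contents of the <script> element.
script_text = '''
    $(document).ready(function(){
        var images = [
            { src: "http://example.com/bar/001.jpg", title: "FooBar One" },
        ]
        ;
'''

# Without re.DOTALL, "." stops at newlines, so the lazy group cannot reach ";".
single_line = re.compile(r'var images = (.*?);')
# With re.DOTALL, "." also matches newlines.
multi_line = re.compile(r'var images = (.*?);', re.DOTALL)

match_result = multi_line.match(script_text)        # anchored at the start: fails
search_no_dotall = single_line.search(script_text)  # scans, but cannot span lines: fails
search_result = multi_line.search(script_text)      # scans and spans lines: succeeds

print(match_result)
print(search_no_dotall)
print(search_result.group(1).strip())
```

Both fixes are needed together; either one alone still yields no match for a multi-line array that does not start the string.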