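A JavaScript function defines an array named images. I need to extract it from the page source (a string) and convert it into a Python list object.
I am using Python with BeautifulSoup to parse the page.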
var images = [
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
;
Question: why does my code below fail to capture the images array, and how can it be fixed?
Thanks!
Desired output: a Python list object.
[
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
Actual code
import re
from bs4 import BeautifulSoup
# Example of a HTML source code containing `images` array
html = '''
<html>
<head>
<script type="text/javascript">
$(document).ready(function(){
var images = [
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
;
var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
</script>
<body>
<p>Some content</p>
</body>
</head>
</html>
'''
pattern = re.compile('var images = (.*?);')
soup = BeautifulSoup(html, 'lxml')
scripts = soup.find_all('script') # successfully captures the <script> element
for script in scripts:
    data = pattern.match(str(script.string)) # NOT extracting the array!!
    if data:
        print('Found:', data.groups()[0]) # NOT being printed
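A likely explanation for the code above: `pattern.match` only tries to match at the very start of the string, and the script text starts with a newline and `$(document).ready(...)`, not with `var images`. In addition, `.` does not match newlines unless `re.DOTALL` is set, so `(.*?)` cannot span the multi-line array. Using `search` with `re.DOTALL` addresses both. The captured text is a JavaScript literal with unquoted keys and a trailing comma, so it is not valid JSON as it stands. The sketch below quotes the keys and drops the trailing comma before calling `json.loads`. It assumes each key sits at the start of its own line, as in the sample, and that no string value contains a comma directly before `]` or `}`. The names `script_text`, `IMAGES_RE`, and `extract_images` are made up for illustration, and `script_text` stands in for `script.string` so the example runs without BeautifulSoup.

```python
import json
import re

# Stand-in for script.string from the question (same shape as the sample).
script_text = '''
$(document).ready(function(){
var images = [
{
src: "http://example.com/bar/001.jpg",
title: "FooBar One"
},
{
src: "http://example.com/bar/002.jpg",
title: "FooBar Two"
},
]
;
var other_data = [{"name": "Tom", "type": "cat"}, {"name": "Jerry", "type": "dog"}];
'''

# re.DOTALL lets "." cross line breaks; the group captures from "[" to the
# first "]" that is followed by optional whitespace and ";".
IMAGES_RE = re.compile(r'var images\s*=\s*(\[.*?\])\s*;', re.DOTALL)


def extract_images(text):
    """Return the images array as a Python list, or None if it is absent."""
    # search() scans the whole string; match() would only try position 0.
    found = IMAGES_RE.search(text)
    if found is None:
        return None
    literal = found.group(1)
    # Quote bare keys that begin a line: src: -> "src":
    literal = re.sub(r'^(\s*)(\w+)\s*:', r'\1"\2":', literal, flags=re.MULTILINE)
    # Remove a trailing comma before a closing bracket or brace.
    literal = re.sub(r',\s*([\]}])', r'\1', literal)
    return json.loads(literal)


images = extract_images(script_text)
print(images)
```

In the original loop, this would be called as `extract_images(script.string or '')` for each `<script>` element, stopping at the first result that is not `None`. For JavaScript literals that do not follow the one-key-per-line layout, a dedicated parser for JavaScript object syntax would be more robust than these regular expressions.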