Scraping the full dynamic HTML content with Java, JSoup and Selenium

I am trying to scrape this website

https://www.dailystrength.org/search?query=aspirin&type=discussion

to build a dataset for a project of mine (using aspirin as a placeholder search term).

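I decided to write a crawler with Jsoup. The problem is that the posts are loaded dynamically through an Ajax request, which is triggered by a "Show more" button.

This button causes the problem

When the entire content is displayed, it should look like this, with the text "All messages loaded"

Final result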

import java.io.IOException;
import java.util.ArrayList;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.*;
import org.openqa.selenium.chrome.*;

/**
 * @author Ahmed
 */
public class Crawler {

    // Newsfeed_item is my own class (not shown here); it has the String
    // fields replysLink, description and subject.
    public static void main(String[] args) {
        Document search_result;
        String[] requested = new String[]{"aspirin"/*, "Fentanyl"*/};
        ArrayList<Newsfeed_item> threads = new ArrayList<>();
        String query = "https://www.dailystrength.org/search?query=";
        try {
            for (int i = 0; i < requested.length; i++) {
                // Fetch the static search page; only the initially visible posts are in it.
                search_result = Jsoup.connect(query + requested[i] + "&type=discussion").get();
                Elements posts = search_result.getElementsByClass("newsfeed__item");
                for (Element item : posts) {
                    Elements link = item.getElementsByClass("newsfeed__btn-container posts__discuss-btn");
                    Newsfeed_item currentItem = new Newsfeed_item();
                    // Resolve the discussion link to an absolute URL and fetch the thread page.
                    currentItem.replysLink = link.attr("abs:href");
                    Document reply_result = Jsoup.connect(currentItem.replysLink).get();
                    Elements description = reply_result.getElementsByClass("posts__content");
                    currentItem.description = description.text();
                    currentItem.subject = requested[i];
                    System.out.println(currentItem);
                }
            }
        } catch (IOException ex) {
            Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

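This code only gives me the few posts that are displayed, not the hidden ones. I understand that JSoup alone cannot solve this, so I am looking into using Selenium to render the full content and obtain its page source, which I can then download and crawl.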

白猪掌柜的

1 Answer

慕桂英546537

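I would try to continue with this approach without Selenium. Using your web browser's debugger and its Network tab, you can see every request the browser sends. It is useful to watch what happens when you click "Show more". You can see that the next page is loaded from this URL: https://www.dailystrength.org/search/ajax?query=aspirin&type=discussion&page=2&_=1549130275261 You can fetch further pages by changing the page=2 parameter. Unfortunately, the result is JSON containing escaped HTML, so you will have to use a JSON library to parse it, extract the HTML, and then parse that HTML with Jsoup. This works out well, because the JSON also contains a variable "has_more":true, so you will know whether there is more content to load.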
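To make the paging idea concrete, here is a minimal sketch of the loop that keeps requesting pages until "has_more" is false. It uses only the standard library so that it stays self-contained. The names AjaxPager, Page, PageSource, pageUrl and collectAll are hypothetical and exist only for this illustration. I am also assuming that the trailing _=1549130275261 parameter is just a cache-busting timestamp and can be left out; I have not verified that against the site.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

class AjaxPager {

    // Endpoint seen in the browser's network tab when "Show more" is clicked.
    static final String AJAX_URL = "https://www.dailystrength.org/search/ajax";

    /** One decoded response: the unescaped HTML fragment plus the "has_more" flag. */
    static final class Page {
        final String html;
        final boolean hasMore;

        Page(String html, boolean hasMore) {
            this.html = html;
            this.hasMore = hasMore;
        }
    }

    /** Hypothetical hook: downloads page N and decodes its JSON into a Page. */
    interface PageSource {
        Page fetch(int page) throws IOException;
    }

    /** Builds the request URL for one page (the "_" parameter is omitted, see assumption above). */
    static String pageUrl(String query, int page) {
        try {
            return AJAX_URL + "?query=" + URLEncoder.encode(query, "UTF-8")
                    + "&type=discussion&page=" + page;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /**
     * Requests pages firstPage, firstPage + 1, ... and collects their HTML
     * fragments. Stops when a response reports hasMore == false or after
     * maxPages requests, whichever comes first.
     */
    static List<String> collectAll(PageSource source, int firstPage, int maxPages) {
        List<String> fragments = new ArrayList<>();
        for (int page = firstPage; page < firstPage + maxPages; page++) {
            Page current;
            try {
                current = source.fetch(page);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            fragments.add(current.html);
            if (!current.hasMore) {
                break; // the server says everything has been loaded
            }
        }
        return fragments;
    }
}
```

In a real PageSource, the response body could be downloaded with Jsoup.connect(AjaxPager.pageUrl("aspirin", page)).ignoreContentType(true).execute().body(), decoded with whichever JSON library you prefer, and each returned fragment passed to Jsoup.parse(html, "https://www.dailystrength.org/") so that the existing getElementsByClass("newsfeed__item") loop can be reused. The name of the JSON field that holds the HTML has to be read from the actual response; I am not guessing it here. Starting at firstPage = 2 seems reasonable, since the first batch of posts is already in the regular search page, and maxPages is just a safety cap.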
