
How do I extract data from the HTML page source of a web page (inside a tab)?

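I have tried several solutions from other answers, such as experimenting with different user agents (Chrome, Safari, etc.) and fetching the HTML directly with HTTPClient and BufferedReader, but none of them worked. How can I make the Android output match the web output? This is the web output I am looking for (view the page source of https://finance.yahoo.com/quote/AAPL/financials?p=AAPL for the full output). The page contains an AJAX tab named "Quarterly" that holds a table. I need that data; it is present in the page source in a desktop browser but missing from the HTML source I get on Android.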

Do you have any suggestions? Thanks. My code:

    Document doc = Jsoup.connect(requestURL)
            .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/56.0.3051.43")
            .timeout(600000)
            .get();
    Elements tableDivs = doc.getElementsByAttributeValue("class", myClassName);
    Elements scriptTags = doc.getElementsByTag("script");
    for (Element script : scriptTags) {
        //System.out.println(script.data());
        Log.e("ONE", script.data());
    }


MYYA
1 Answer

肥皂起泡泡

Yahoo Finance redirects to guce.oath.com, which informs us about the use of cookies and other data and requires clicking "Accept" before any content is served. We can observe the same thing in a browser if we clear the cookies and refresh the page. We could scrape the link from guce.oath.com, but I noticed that the final URL has a guccounter=2 parameter, and if we request that URL we get the desired response.

    String requestURL = "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2";
    String userAgent = "My UAString";
    Document doc = Jsoup.connect(requestURL).userAgent(userAgent).get();

Since the data is JavaScript code rather than HTML, we cannot parse it with jsoup, but we can use a regular expression.

    Elements scriptTags = doc.getElementsByTag("script");
    String re = "root\\.App\\.main\\s*\\=\\s*(.*?);\\s*\\}\\(this\\)\\)\\s*;";
    // Compile once; DOTALL lets '.' match the newlines inside the script
    Pattern pattern = Pattern.compile(re, Pattern.DOTALL);
    String data = null;
    for (Element script : scriptTags) {
        Matcher matcher = pattern.matcher(script.html());
        if (matcher.find()) {
            data = matcher.group(1);
            break;
        }
    }

The data string should contain the dictionary from the JavaScript code, which is a valid JSON string and can be parsed with JSONObject.

In Android Studio, however, as far as I can tell there is no redirect. I tried several user agent strings, but the page seems to load directly. Even so, the JavaScript dictionary that holds the data is still there, and we can extract it and parse it with JSONObject.

Android Studio code:

    // Imports used by the snippets in this answer:
    // org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements,
    // java.util.regex.Matcher, java.util.regex.Pattern,
    // org.json.JSONArray, org.json.JSONException, org.json.JSONObject,
    // android.text.TextUtils, android.util.Log
    String requestURL = "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL";
    String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/56.0.3051.43";
    String row = "totalRevenue";
    try {
        Document doc = Jsoup.connect(requestURL).userAgent(userAgent).get();
        String html = doc.html();
        //Log.d("html", html);
        Elements scriptTags = doc.getElementsByTag("script");
        String re = "root\\.App\\.main\\s*\\=\\s*(.*?);\\s*\\}\\(this\\)\\)\\s*;";
        Pattern pattern = Pattern.compile(re, Pattern.DOTALL);
        for (Element script : scriptTags) {
            Matcher matcher = pattern.matcher(script.html());
            if (matcher.find()) {
                String data = matcher.group(1);
                //Log.d("data", data);
                JSONObject jo = new JSONObject(data);
                JSONArray table = getTable(jo);
                //Log.d("table", table.toString());
                String[] tableRow = getRow(table, row);
                String values = TextUtils.join(", ", tableRow);
                Log.d("values", values);
            }
        }
    } catch (Exception e) {
        Log.e("err", "err", e);
    }

This should parse the data and select the "Total Revenue" values. The getTable, getRow and getDates helper methods I used:

    // Walks the nested JSON down to the quarterly income statement array
    private JSONArray getTable(JSONObject json) throws JSONException {
        return json.getJSONObject("context")
                .getJSONObject("dispatcher")
                .getJSONObject("stores")
                .getJSONObject("QuoteSummaryStore")
                .getJSONObject("incomeStatementHistoryQuarterly")
                .getJSONArray("incomeStatementHistory");
    }

    // Returns one value per quarter for the given JSON key, or "-" when missing
    private String[] getRow(JSONArray table, String name) throws JSONException {
        String[] values = new String[table.length()];
        for (int i = 0; i < table.length(); i++) {
            JSONObject jo = table.getJSONObject(i);
            if (jo.has(name)) {
                jo = jo.getJSONObject(name);
                values[i] = jo.has("longFmt") ? jo.get("longFmt").toString() : "-";
            } else {
                values[i] = "-";
            }
        }
        return values;
    }

    // Returns the formatted end date of each quarter (the column headers)
    private String[] getDates(JSONArray table) throws JSONException {
        String[] values = new String[table.length()];
        for (int i = 0; i < table.length(); i++) {
            values[i] = table.getJSONObject(i).getJSONObject("endDate")
                    .get("fmt").toString();
        }
        return values;
    }

I think the best way to get the table data is to map each HTML row name to a JSON key. In addition, the main table has five subtables, so we can map each nested table to the rows it contains.

    Map<String, Map<String, String>> getTableNames() {
        final Map<String, String> revenue = new LinkedHashMap<String, String>() {
            { put("Total Revenue", "totalRevenue"); }
            { put("Cost of Revenue", "costOfRevenue"); }
            { put("Gross Profit", "grossProfit"); }
        };
        final Map<String, String> operatingExpenses = new LinkedHashMap<String, String>() {
            { put("Research Development", "researchDevelopment"); }
            { put("Selling General and Administrative", "sellingGeneralAdministrative"); }
            { put("Non Recurring", "nonRecurring"); }
            { put("Others", "otherOperatingExpenses"); }
            { put("Total Operating Expenses", "totalOperatingExpenses"); }
            { put("Operating Income or Loss", "operatingIncome"); }
        };
        Map<String, Map<String, String>> allTableNames = new LinkedHashMap<String, Map<String, String>>() {
            { put("Revenue", revenue); }
            { put("Operating Expenses", operatingExpenses); }
        };
        return allTableNames;
    }

We can use this map to select a single cell, for example "Total Revenue" for 6/30/2018 (first row, first column):

    // jsData is the JSON string extracted with the regular expression (named data above)
    JSONObject jo = new JSONObject(jsData);
    JSONArray table = getTable(jo);
    Map<String, Map<String, String>> tableNames = getTableNames();
    String totalRevenueKey = tableNames.get("Revenue").get("Total Revenue");
    String[] totalRevenueValues = getRow(table, totalRevenueKey);
    String value = totalRevenueValues[0];

Or we can iterate over the table names and build a list or string that holds all of the table data.

    List<String> tableData = new ArrayList<>();
    Map<String, Map<String, String>> tableNames = getTableNames();
    String[] dates = getDates(table);
    for (Map.Entry<String, Map<String, String>> tableEntry : tableNames.entrySet()) {
        tableData.add(tableEntry.getKey());
        tableData.addAll(Arrays.asList(dates));
        for (Map.Entry<String, String> row : tableEntry.getValue().entrySet()) {
            String[] tableRow = getRow(table, row.getValue());
            tableData.add(row.getKey());
            for (String column : tableRow) {
                tableData.add(column);
            }
        }
    }
    String tableDataString = TextUtils.join(", ", tableData);

I tried to match the HTML table as closely as possible, so the tableData list and the resulting string have the form "table name, date, date, date, date" followed by "row name, price, price, price, price". It may be better to include only the numbers, though (in that case we should add only the tableRow items to tableData).
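The regular-expression step in the answer can be tried on its own, without jsoup or Android, to see what the pattern captures. The following is only a sketch: the class name AppMainExtractor, the method name extractAppMain, and the sample script text are made up for illustration, and the real page may wrap the assignment differently than this sample does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class AppMainExtractor {

    // Same pattern as in the answer; DOTALL lets '.' span line breaks.
    private static final Pattern APP_MAIN = Pattern.compile(
            "root\\.App\\.main\\s*=\\s*(.*?);\\s*\\}\\(this\\)\\)\\s*;",
            Pattern.DOTALL);

    // Returns the text assigned to root.App.main, or null if the
    // script text does not contain the expected assignment.
    static String extractAppMain(String scriptText) {
        Matcher matcher = APP_MAIN.matcher(scriptText);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up stand-in for the contents of one <script> tag.
        String script = "(function (root) {\n"
                + "root.App.main = {\"context\":{\"a\":1}};\n"
                + "}(this));";
        System.out.println(extractAppMain(script));
    }
}
```

Because the capture group is lazy and the pattern requires the closing `}(this));` right after the semicolon, a semicolon inside a JSON string value does not end the match early. A null result is a sign that the page layout no longer matches the pattern.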