肥皂起泡泡
Yahoo Finance redirects to guce.oath.com, which informs us about the use of cookies and other data and asks us to click "Accept" before it serves the content. We can also observe this in a browser if we clear the cookies and refresh the page. We could scrape the link from guce.oath.com, but I noticed that the final URL has a guccounter=2 parameter, and if we use that URL we get the desired response.

    String requestURL = "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&guccounter=2";
    String userAgent = "My UAString";
    Document doc = Jsoup.connect(requestURL).userAgent(userAgent).get();

Since the data is not HTML but JavaScript code, we can't parse it with jsoup, but we can use a regular expression.

    Elements scriptTags = doc.getElementsByTag("script");
    String re = "root\\.App\\.main\\s*\\=\\s*(.*?);\\s*\\}\\(this\\)\\)\\s*;";
    // Compile once, outside the loop; DOTALL lets '.' match across line breaks.
    Pattern pattern = Pattern.compile(re, Pattern.DOTALL);
    String data = null;

    for (Element script : scriptTags) {
        Matcher matcher = pattern.matcher(script.html());
        if (matcher.find()) {
            data = matcher.group(1);
            break;
        }
    }

The data string should contain the dictionary from the JavaScript code, which is a valid JSON string and can be parsed with JSONObject.

On Android Studio, however, there is no redirect as far as I can tell. I tried a couple of user agent strings, but the page seems to load directly. Still, the JavaScript dictionary that holds the data is there, so we can extract it and parse it with JSONObject.

Android Studio code:

    import android.text.TextUtils;
    import android.util.Log;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.json.JSONArray;
    import org.json.JSONException;
    import org.json.JSONObject;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    String requestURL = "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL";
    String userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/56.0.3051.43";
    String row = "totalRevenue";

    try {
        Document doc = Jsoup.connect(requestURL).userAgent(userAgent).get();
        String html = doc.html();
        //Log.d("html", html);

        Elements scriptTags = doc.getElementsByTag("script");
        String re = "root\\.App\\.main\\s*\\=\\s*(.*?);\\s*\\}\\(this\\)\\)\\s*;";
        Pattern pattern = Pattern.compile(re, Pattern.DOTALL);

        for (Element script : scriptTags) {
            Matcher matcher = pattern.matcher(script.html());

            if (matcher.find()) {
                String data = matcher.group(1);
                //Log.d("data", data);

                JSONObject jo = new JSONObject(data);
                JSONArray table = getTable(jo);
                //Log.d("table", table.toString());

                String[] tableRow = getRow(table, row);
                String values = TextUtils.join(", ", tableRow);
                Log.d("values", values);
            }
        }
    } catch (Exception e) {
        Log.e("err", "err", e);
    }

This should parse the data and select the "Total Revenue" values. The getTable and getRow methods I used:

    // Walks down to the quarterly income statement array.
    private JSONArray getTable(JSONObject json) throws JSONException {
        return json.getJSONObject("context")
                .getJSONObject("dispatcher")
                .getJSONObject("stores")
                .getJSONObject("QuoteSummaryStore")
                .getJSONObject("incomeStatementHistoryQuarterly")
                .getJSONArray("incomeStatementHistory");
    }

    // Returns one value per column for the given JSON key, or "-" where it is missing.
    private String[] getRow(JSONArray table, String name) throws JSONException {
        String[] values = new String[table.length()];
        for (int i = 0; i < table.length(); i++) {
            JSONObject jo = table.getJSONObject(i);
            if (jo.has(name)) {
                jo = jo.getJSONObject(name);
                values[i] = jo.has("longFmt") ? jo.get("longFmt").toString() : "-";
            } else {
                values[i] = "-";
            }
        }
        return values;
    }

    // Returns the formatted end date of each column.
    private String[] getDates(JSONArray table) throws JSONException {
        String[] values = new String[table.length()];
        for (int i = 0; i < table.length(); i++) {
            values[i] = table.getJSONObject(i).getJSONObject("endDate")
                    .get("fmt").toString();
        }
        return values;
    }

I think the best way to get the table data is to map each HTML row name to a JSON key. Also, the main table has five sub-tables, so we can map each nested table to the rows it contains.

    Map<String, Map<String, String>> getTableNames() {
        // LinkedHashMap keeps the rows in the same order as the HTML table.
        final Map<String, String> revenue = new LinkedHashMap<String, String>() {
            {
                put("Total Revenue", "totalRevenue");
                put("Cost of Revenue", "costOfRevenue");
                put("Gross Profit", "grossProfit");
            }
        };
        final Map<String, String> operatingExpenses = new LinkedHashMap<String, String>() {
            {
                put("Research Development", "researchDevelopment");
                put("Selling General and Administrative", "sellingGeneralAdministrative");
                put("Non Recurring", "nonRecurring");
                put("Others", "otherOperatingExpenses");
                put("Total Operating Expenses", "totalOperatingExpenses");
                put("Operating Income or Loss", "operatingIncome");
            }
        };
        Map<String, Map<String, String>> allTableNames = new LinkedHashMap<String, Map<String, String>>() {
            {
                put("Revenue", revenue);
                put("Operating Expenses", operatingExpenses);
            }
        };
        return allTableNames;
    }

We can use this map to select a single cell, for example the "Total Revenue" for 6/30/2018 (located in the first row and first column),

    // jsData is the JSON string extracted with the regex above.
    JSONObject jo = new JSONObject(jsData);
    JSONArray table = getTable(jo);
    Map<String, Map<String, String>> tableNames = getTableNames();
    String totalRevenueKey = tableNames.get("Revenue").get("Total Revenue");
    String[] totalRevenueValues = getRow(table, totalRevenueKey);
    String value = totalRevenueValues[0];

or we can iterate over the table names and build a list or string that contains all the table data.

    List<String> tableData = new ArrayList<>();
    Map<String, Map<String, String>> tableNames = getTableNames();
    String[] dates = getDates(table);

    for (Map.Entry<String, Map<String, String>> tableEntry : tableNames.entrySet()) {
        // Header: sub-table name followed by the column dates.
        tableData.add(tableEntry.getKey());
        tableData.addAll(Arrays.asList(dates));

        for (Map.Entry<String, String> row : tableEntry.getValue().entrySet()) {
            String[] tableRow = getRow(table, row.getValue());
            tableData.add(row.getKey());
            for (String column : tableRow) {
                tableData.add(column);
            }
        }
    }
    String tableDataString = TextUtils.join(", ", tableData);

I tried to match the HTML table as closely as possible, so the tableData list and the resulting string are formatted as "table name, date, date, date, date" and "row name, price, price, price, price", but it may be better to include only the numbers (in that case we should add only the tableRow items to tableData).
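If you want to check the regex step on its own, without jsoup or a network request, something like the following sketch may help. It runs the same pattern over a hand-written script body. The class name EmbeddedJsonExtractor, the method extractAppData, and the sample script text are all made up for illustration; the sample only imitates the shape of the wrapper function and is not Yahoo's actual page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of jsoup or org.json.
class EmbeddedJsonExtractor {

    // Captures everything between "root.App.main =" and the closing
    // ";}(this));" of the wrapper function. DOTALL lets '.' span line breaks.
    private static final Pattern APP_MAIN = Pattern.compile(
            "root\\.App\\.main\\s*=\\s*(.*?);\\s*\\}\\(this\\)\\)\\s*;",
            Pattern.DOTALL);

    // Returns the captured JSON text, or null when the script does not match.
    static String extractAppData(String scriptBody) {
        Matcher matcher = APP_MAIN.matcher(scriptBody);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Invented sample that mimics the structure of the inline script.
        String script = "(function (root) {\n"
                + "root.App || (root.App = {});\n"
                + "root.App.main = {\"context\":{\"dispatcher\":{\"stores\":{}}}};\n"
                + "}(this));";

        System.out.println(extractAppData(script));       // {"context":{"dispatcher":{"stores":{}}}}
        System.out.println(extractAppData("var x = 1;")); // null
    }
}
```

Returning null for a non-matching script keeps the caller's loop over the script tags simple: skip the tag and move on to the next one.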