
Web scraping in Java using Scanner

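I am supposed to use the URL and Scanner classes to scrape a web page and pull only the energy consumption figures for the past 8 days out of the site's HTML. I have a 24x8 array to hold all the numbers. I am using .findInLine to locate the hour label; the snippet below finds the block of numbers for the first hour.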


while (in.findInLine("00-01") == null) in.nextLine();

in.nextLine(); // skip rest of the line containing "00-01"


<td>00-01</td>

<td align="right"> 11872</td>

<td align="right"> 12146</td>

<td align="right"> 12861</td>

<td align="right"> 12561</td>

<td align="right"> 13493</td>

<td align="right"> 13386</td>

<td align="right"> 12732</td>

<td align="right"> <b>12249</b></td>

My problem is that I don't know how to extract these numbers and put them into the array, given that there are 24 of these blocks.


HUH函数
157 views, 2 answers

皈依舞

Given that input, the following extracts the numbers from each line.

    // Requires java.util.regex.Matcher and java.util.regex.Pattern.
    // Note: \d+ also matches the digits of an hour label such as 00-01,
    // so skip those lines first, as the question already does.
    Pattern pattern = Pattern.compile("\\d+");
    while (in.hasNextLine())
    {
      String str = in.nextLine();
      Matcher m = pattern.matcher(str);
      while (m.find())
      {
        // Change this to add the value to an array
        System.out.println(m.group());
      }
    }
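To connect the regex approach above to the 24x8 array from the question, here is a minimal sketch. It assumes the markup looks exactly like the excerpt in the question, with one cell per line. The class name EnergyTable, the parse method, and both patterns are made up for illustration; real markup may need different patterns.

```java
import java.util.Arrays;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EnergyTable {
    static final int HOURS = 24;
    static final int DAYS = 8;

    // Hour label cell such as <td>00-01</td> (assumed format).
    private static final Pattern HOUR_CELL =
            Pattern.compile("<td>\\d{2}-\\d{2}</td>");
    // Data cell; captures the number with or without surrounding <b> tags.
    private static final Pattern VALUE_CELL =
            Pattern.compile("<td align=\"right\">\\s*(?:<b>)?(\\d+)(?:</b>)?\\s*</td>");

    // Reads line by line; each hour label starts a new row,
    // and each following data cell fills the next column of that row.
    static int[][] parse(Scanner in) {
        int[][] table = new int[HOURS][DAYS];
        int hour = -1;
        int day = 0;
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (HOUR_CELL.matcher(line).find()) {
                hour++;
                day = 0;
                if (hour >= HOURS) {
                    break; // ignore anything after the 24th block
                }
                continue;
            }
            Matcher m = VALUE_CELL.matcher(line);
            if (hour >= 0 && day < DAYS && m.find()) {
                table[hour][day++] = Integer.parseInt(m.group(1));
            }
        }
        return table;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
                "<td>00-01</td>",
                "<td align=\"right\"> 11872</td>",
                "<td align=\"right\"> 12146</td>",
                "<td align=\"right\"> 12861</td>",
                "<td align=\"right\"> 12561</td>",
                "<td align=\"right\"> 13493</td>",
                "<td align=\"right\"> 13386</td>",
                "<td align=\"right\"> 12732</td>",
                "<td align=\"right\"> <b>12249</b></td>");
        int[][] table = parse(new Scanner(html));
        // prints [11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249]
        System.out.println(Arrays.toString(table[0]));
    }
}
```

Matching the whole data cell rather than a bare \d+ keeps the digits in the hour labels out of the array. Rows for hours that never appear in the input stay filled with zeros.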

慕哥6287543

Given your limited input, I did it using only the plain Scanner interface:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;
    import java.util.regex.Pattern;

    public class Scrap {
        private final static String HOUR_PATTERN = "<td>\\d{2}-\\d{2}</td>";
        private final static String TD_DELIMETER = "\\s|<|>";

        public static void main(String[] args) {
            Scanner in = new Scanner(Scrap.class.getResourceAsStream("/input"));
            List<Integer> res = new ArrayList<>();
            while (in.hasNext()) {
                if (!in.hasNext(HOUR_PATTERN)) {
                    in.next(); // skip tokens until the next hour label
                    continue;
                }
                in.next(HOUR_PATTERN); // consume the hour label
                Pattern delim = in.delimiter();
                in.useDelimiter(TD_DELIMETER);
                // you wrote it is going to be 8 entries
                int count = 0;
                while (count < 8 && in.hasNext()) {
                    if (in.hasNextInt()) {
                        res.add(in.nextInt());
                        count++;
                    } else {
                        in.next(); // skip tag fragments and empty tokens
                    }
                }
                in.useDelimiter(delim); // back to whitespace-separated tokens
            }
            System.out.println(res);
        }
    }

Given an input that starts with a line of junk text (blelblebll) followed by this block repeated four times:

    <td>00-01</td>
    <td align="right"> 11872</td>
    <td align="right"> 12146</td>
    <td align="right"> 12861</td>
    <td align="right"> 12561</td>
    <td align="right"> 13493</td>
    <td align="right"> 13386</td>
    <td align="right"> 12732</td>
    <td align="right"> <b>12249</b></td>

it produces

    [11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249, 11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249, 11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249, 11872, 12146, 12861, 12561, 13493, 13386, 12732, 12249]

It is based on your input sample, so it may not work on the live markup as-is. Alternatively, you could use <.*?> as the delimiter and focus only on the number pattern.
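The tag-as-delimiter idea mentioned at the end of this answer could look roughly like the sketch below. It is a variant rather than the literal <.*?> pattern: the delimiter here matches a whole run of tags and whitespace, because a pattern that matches only one tag at a time can make Scanner return empty tokens between adjacent tags. The class name TagDelimiterScan, the parse method, and the constants are hypothetical, and the hour label format is assumed from the question's excerpt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class TagDelimiterScan {
    // Any run of whitespace and/or HTML tags counts as one delimiter.
    static final String TAGS_OR_SPACE = "(?:\\s|<[^>]*>)+";
    // Hour label text such as 00-01 (assumed format).
    static final String HOUR_LABEL = "\\d{2}-\\d{2}";
    static final int DAYS = 8;

    // Returns one list of values per hour label found in the input.
    static List<List<Integer>> parse(Scanner in) {
        in.useDelimiter(TAGS_OR_SPACE);
        List<List<Integer>> rows = new ArrayList<>();
        List<Integer> current = null;
        while (in.hasNext()) {
            if (in.hasNext(HOUR_LABEL)) {
                in.next();                   // consume the label
                current = new ArrayList<>(); // start a new row
                rows.add(current);
            } else if (current != null && current.size() < DAYS && in.hasNextInt()) {
                current.add(in.nextInt());
            } else {
                in.next();                   // skip unrelated text
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "blelblebll\n"
                + "<td>00-01</td>\n"
                + "<td align=\"right\"> 11872</td>\n"
                + "<td align=\"right\"> <b>12249</b></td>\n"
                + "<td>01-02</td>\n"
                + "<td align=\"right\"> 12146</td>\n";
        // prints [[11872, 12249], [12146]]
        System.out.println(parse(new Scanner(html)));
    }
}
```

Because tags are consumed as delimiters, the code never has to look at attributes such as align="right". The cap of 8 values per row keeps numbers that appear after the table from being appended to the last row, but any other text shaped like an hour label would still start a new row.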