
How to get the href and text content from HTML content

I want to get the text content and the URL of each link, along with all the other td data.


My code:


$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);

$htmlContent = file_get_contents("https://www.iana.org/domains/root/db", false, $context);

$DOM = new DOMDocument();
$DOM->loadHTML($htmlContent);

$FirstdTable = $DOM->getElementsByTagName('table')->item(0);

$Header = $FirstdTable->getElementsByTagName('th');
$Detail = $FirstdTable->getElementsByTagName('td');

//#Get header name of the table
foreach($Header as $NodeHeader)
{
    $aDataTableHeaderHTML[] = trim($NodeHeader->textContent);
}

//#Get row data/detail table without header name as key
$i = 0;
$j = 0;
foreach($Detail as $sNodeDetail)
{
    $aDataTableDetailHTML[$j][] = trim($sNodeDetail->textContent);
    $i = $i + 1;
    $j = $i % count($aDataTableHeaderHTML) == 0 ? $j + 1 : $j;
}

Current output:


Array
(
    [0] => Array
        (
            [0] => .aaa
            [1] => generic
            [2] => American Automobile Association, Inc.
        )

    [1] => Array
        (
            [0] => .aarp
            [1] => generic
            [2] => AARP
        )

    [2] => Array
        (
            [0] => .abarth
            [1] => generic
            [2] => Fiat Chrysler Automobiles N.V.
        )
}

What I want:


Array
(
    [0] => Array
        (
            [0] => .aaa
            [1] => generic
            [2] => American Automobile Association, Inc.
            [3] => https://www.iana.org/domains/root/db/aaa.html
        )

    [1] => Array
        (
            [0] => .aarp
            [1] => generic
            [2] => AARP
            [3] => https://www.iana.org/domains/root/db/aarp.html
        )

    [2] => Array
        (
            [0] => .abarth
            [1] => generic
            [2] => Fiat Chrysler Automobiles N.V.
            [3] => https://www.iana.org/domains/root/db/abarth.html
        )
}


一只甜甜圈
93 views, 1 answer
1 answer

凤凰求蛊

Currently, you are only grabbing the text content of every <td>, and that does not include the link inside the anchor tag. To get it, you need to dig deeper into the <td>. Here is one way to do that using xpath:

$xpath = new DOMXPath($DOM);
$base = 'https://www.iana.org/';
foreach($Detail as $sNodeDetail)
{
    $aDataTableDetailHTML[$j][] = trim($sNodeDetail->textContent);
    // Only the <td> that wraps its link in <span class="domain ..."> yields a non-empty href
    if ($link = $xpath->evaluate("string(./span[contains(@class, 'domain')]/a/@href)", $sNodeDetail)) {
        // Drop any leading slash so a root-relative href does not produce "//" after $base
        $aDataTableDetailHTML[$j][] = $base . ltrim($link, '/');
    }
    $i = $i + 1;
    $j = $i % count($aDataTableHeaderHTML) == 0 ? $j + 1 : $j;
}

Basically, the query looks for <span class="domain tld"><a href="xxxx">xxx</a></span> inside the <td> of the current iteration and extracts that href value.

Another way is to iterate over each <tr> instead of each <td>:

$aDataTableDetailHTML = [];
$DOM = new DOMDocument();
$DOM->loadHTML($htmlContent);
$xpath = new DOMXPath($DOM);
$base = 'https://www.iana.org/';
foreach($xpath->query('//table[@id="tld-table"]/tbody/tr') as $row) {
    $domain = trim($xpath->evaluate("string(./td[1])", $row));
    $type = $xpath->evaluate("string(./td[2])", $row);
    $tld_manager = $xpath->evaluate("string(./td[3])", $row);
    $url = $xpath->evaluate("string(./td[1]/span/a/@href)", $row);
    // Drop any leading slash so a root-relative href does not produce "//" after $base
    $aDataTableDetailHTML[] = [$domain, $type, $tld_manager, $base . ltrim($url, '/')];
}
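To try the row-by-row approach without hitting the network, here is a minimal sketch that runs against an inline HTML sample. The sample markup is an assumption: a trimmed-down imitation of the table structure the XPath queries above expect, not a copy of the live page. The joinUrl() helper is hypothetical and only there to join the base URL and the href with exactly one slash. The libxml_use_internal_errors() call is an optional extra that keeps parser warnings out of the output.

```php
<?php
// Assumed sample markup, modeled on the structure the answer's XPath targets.
$html = <<<'HTML'
<table id="tld-table">
  <thead>
    <tr><th>Domain</th><th>Type</th><th>TLD Manager</th></tr>
  </thead>
  <tbody>
    <tr>
      <td><span class="domain tld"><a href="/domains/root/db/aaa.html">.aaa</a></span></td>
      <td>generic</td>
      <td>American Automobile Association, Inc.</td>
    </tr>
    <tr>
      <td><span class="domain tld"><a href="/domains/root/db/aarp.html">.aarp</a></span></td>
      <td>generic</td>
      <td>AARP</td>
    </tr>
  </tbody>
</table>
HTML;

// Hypothetical helper: join a base URL and a path with exactly one slash.
function joinUrl(string $base, string $path): string
{
    return rtrim($base, '/') . '/' . ltrim($path, '/');
}

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];

foreach ($xpath->query('//table[@id="tld-table"]/tbody/tr') as $tr) {
    $row = [];
    // Text of every cell in this row, in column order.
    foreach ($xpath->query('./td', $tr) as $td) {
        $row[] = trim($td->textContent);
    }
    // href of the first link found anywhere in the row (empty string if none).
    $href = $xpath->evaluate('string(.//a/@href)', $tr);
    if ($href !== '') {
        $row[] = joinUrl('https://www.iana.org', $href);
    }
    $rows[] = $row;
}

print_r($rows);
```

With this sample, each entry in $rows holds the three cell texts followed by the absolute URL, which matches the shape asked for in the question. Looping over ./td rather than hard-coding td[1] to td[3] keeps the sketch working if the table gains a column.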