C# Substring: parsing all the text in between

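I am trying to parse all the text (mainly URLs) out of the HTML below. However, I only want the URLs that sit between the div tag (result-firstline-title) and the (result-url js-result-url) marker, for each (all) occurrences.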


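To be clear, I am able to get all the URLs from the HTML source below, but the problem is that it also picks up each URL almost 3 times. I have a fix for that which removes duplicate URLs; however, if you look closely at the HTML source, you will see that it also picks up the third URL.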


<div class="result js-result card-mobile ">

<div class="result-firstline-container">

    <div class="result-firstline-title">

        <a

            class="result-title js-result-title"


            href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554"


        >

            The Top Social Networking Sites People Are Using

        </a>

    </div>


</div>


<a

    class="result-url js-result-url"


    href="https://www.lifewire.com/top-social-networking-sites-people-are-using-3486554">https://www.lifewire.com/top-<b>social-networking-sites</b>-people-are...

</a>

<p class="result-snippet">

    The Top

</p>

</div>


<div class="result js-result card-mobile ">

    <div class="result-firstline-container">

        <div class="result-firstline-title">

            <a

                class="result-title js-result-title"


                href="http://www.ebizmba.com/articles/social-networking- websites"


            >

                Top 15 Most Popular Social Networking Sites | January 2019

            </a>

        </div>


    </div>


    <a

        class="result-url js-result-url"


        href="http://www.ebizmba.com/articles/social-networking- websites">www.ebizmba.com/articles/<b>social-networking</b>-<b>websites</b>

    </a>

    <p class="result-snippet">

        Top 15 Most 

    </p>


</div>     

I tried the following C# code to get the text between the div tags, but it grabs everything, which is not what I want.


        int urlTagFrom = rawHTMLFromSource.IndexOf("result-firstline-title") + "result-firstline-title".Length;

        int urlTagTo = rawHTMLFromSource.LastIndexOf("result-url js-result-url");

        urlTagCollection = rawHTMLFromSource.Substring(urlTagFrom, urlTagTo - urlTagFrom);

扬帆大鱼

qq_遁去的一_1

You can use HtmlAgilityPack to make this easier; just add it to your project with NuGet.

To add HtmlAgilityPack with NuGet, go to the Package Manager Console and type:

        Install-Package HtmlAgilityPack -Version 1.11.3

Once it is installed, you can extract the URLs like this:

        using System;
        using System.Collections.Generic;
        using System.Linq;

        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(@"put html string here");
        var listOfUrls = new List<string>();

        // Note: SelectNodes returns null when no node matches the XPath.
        doc.DocumentNode.SelectNodes("//a").ToList()
            .ForEach(x =>
            {
                // Use the HasClass method to filter elements
                if (!string.IsNullOrEmpty(x.GetAttributeValue("href", ""))
                    && x.HasClass("result-title") && x.HasClass("js-result-title"))
                {
                    listOfUrls.Add(x.GetAttributeValue("href", ""));
                }
            });

        listOfUrls.ForEach(x => Console.WriteLine(x));

Edit

Added && x.HasClass("result-title") && x.HasClass("js-result-title") so that only elements having both the result-title and js-result-title classes are returned.

Another approach

A shorter way to get the filtered values:

        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(@"put html string here");

        // Keep only anchors whose class attribute is exactly "result-title js-result-title"
        var listOfUrls = doc.DocumentNode.Descendants("a")
            .Where(x => x.Attributes["class"] != null
                        && x.Attributes["class"].Value == "result-title js-result-title")
            .Select(x => x.GetAttributeValue("href", "")).ToList();
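For reference, the Substring attempt in the question grabs everything because IndexOf returns the first occurrence of the start marker and LastIndexOf returns the last occurrence of the end marker, so the slice spans the whole result list. If adding a library is not an option, a minimal string-only sketch is to walk the document one occurrence at a time. The names ResultUrlExtractor and ExtractHrefs are hypothetical, and the sketch assumes (as in the sample HTML) that href="..." appears after the class attribute inside the same tag and that the URL contains no '>' character. An HTML parser remains the more robust choice.

```csharp
using System;
using System.Collections.Generic;

public static class ResultUrlExtractor
{
    // Hypothetical helper: returns the href of every tag that contains classMarker.
    // Assumes href="..." follows the class attribute inside the same tag.
    public static List<string> ExtractHrefs(string html, string classMarker)
    {
        var urls = new List<string>();
        const string hrefToken = "href=\"";
        int pos = 0;

        while (true)
        {
            // Find the next occurrence of the marker, starting where the last tag ended.
            int markerAt = html.IndexOf(classMarker, pos, StringComparison.Ordinal);
            if (markerAt < 0) break;

            // The tag that carries the marker ends at the next '>'.
            int tagEnd = html.IndexOf('>', markerAt);
            if (tagEnd < 0) break;

            // Only accept an href that lies inside this same tag.
            int hrefAt = html.IndexOf(hrefToken, markerAt, StringComparison.Ordinal);
            if (hrefAt >= 0 && hrefAt < tagEnd)
            {
                int valueStart = hrefAt + hrefToken.Length;
                int valueEnd = html.IndexOf('"', valueStart);
                if (valueEnd > valueStart && valueEnd < tagEnd)
                {
                    urls.Add(html.Substring(valueStart, valueEnd - valueStart));
                }
            }

            pos = tagEnd + 1;
        }

        return urls;
    }
}

public static class Program
{
    public static void Main()
    {
        // Shortened stand-in for the HTML in the question (example.com URLs are placeholders).
        string html =
            "<div class=\"result-firstline-title\">" +
            "<a class=\"result-title js-result-title\" href=\"https://example.com/a\">A</a></div>" +
            "<a class=\"result-url js-result-url\" href=\"https://example.com/a\">example.com/a</a>" +
            "<div class=\"result-firstline-title\">" +
            "<a class=\"result-title js-result-title\" href=\"https://example.com/b\">B</a></div>";

        var urls = ResultUrlExtractor.ExtractHrefs(html, "class=\"result-title js-result-title\"");
        urls.ForEach(Console.WriteLine);
    }
}
```

Matching on the full class="result-title js-result-title" string skips the result-url js-result-url anchors, so each URL is collected once and the shortened display text is never read.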