XML parsing returns a string with newlines

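I'm trying to parse the XML of a sitemap index in Go and then loop over the addresses to fetch the details of each post. But I'm getting this strange error: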


: first path segment in URL cannot contain colon


Here is the code snippet:


package main

import (
    "encoding/xml"
    "fmt"
    "io/ioutil"
    "net/http"
)

// SitemapIndex maps the <sitemap> entries of the sitemap index.
type SitemapIndex struct {
    Locations []Location `xml:"sitemap"`
}

// Location holds the text content of a <loc> element.
type Location struct {
    Loc string `xml:"loc"`
}

func (l Location) String() string {
    return l.Loc
}

func main() {
    resp, _ := http.Get("https://www.washingtonpost.com/news-sitemaps/index.xml")
    bytes, _ := ioutil.ReadAll(resp.Body)
    var s SitemapIndex
    xml.Unmarshal(bytes, &s)

    for _, Location := range s.Locations {
        fmt.Printf("Location: %s", Location.Loc)
        // Location.Loc is passed to http.Get unchanged; this is the call that fails.
        resp, err := http.Get(Location.Loc)
        fmt.Println("resp", resp)
        fmt.Println("err", err)
    }
}

Output:


Location:
https://www.washingtonpost.com/news-sitemaps/politics.xml
resp <nil>
err parse
https://www.washingtonpost.com/news-sitemaps/politics.xml
: first path segment in URL cannot contain colon
Location:
https://www.washingtonpost.com/news-sitemaps/opinions.xml
resp <nil>
err parse
https://www.washingtonpost.com/news-sitemaps/opinions.xml
: first path segment in URL cannot contain colon
...
...

My guess is that Location.Loc contains a newline before and after the actual address. For example: \nLocation: https://www.washingtonpost.com/news-sitemaps/politics.xml\n


I suspect this because hard-coding the URL works as expected:


    for _, Location := range s.Locations {
        fmt.Printf("Location: %s", Location.Loc)
        test := "https://www.washingtonpost.com/news-sitemaps/politics.xml"
        resp, err := http.Get(test)
        fmt.Println("resp", resp)
        fmt.Println("err", err)
    }


But I'm new to Go, so I can't tell what is wrong. Could you tell me where I went wrong?


MM们
2 answers

子衿沉夜

You are right, the problem comes from the newlines. As you can see, your Printf call does not add any \n, yet one appears at the beginning of the printed location and another at the end.

You can use strings.Trim to remove these newlines. Here is an example using the sitemap you are trying to parse. Once the string is trimmed, you will be able to call http.Get on it without any error.

func main() {
    // bytes holds the sitemap XML, read as in the question.
    var s SitemapIndex
    xml.Unmarshal(bytes, &s)

    for _, Location := range s.Locations {
        loc := strings.Trim(Location.Loc, "\n")
        fmt.Printf("Location: %s\n", loc)
    }
}

As expected, this code prints the locations without any surrounding newlines:

Location: https://www.washingtonpost.com/news-sitemaps/politics.xml
Location: https://www.washingtonpost.com/news-sitemaps/opinions.xml
Location: https://www.washingtonpost.com/news-sitemaps/local.xml
Location: https://www.washingtonpost.com/news-sitemaps/sports.xml
Location: https://www.washingtonpost.com/news-sitemaps/national.xml
Location: https://www.washingtonpost.com/news-sitemaps/world.xml
Location: https://www.washingtonpost.com/news-sitemaps/business.xml
Location: https://www.washingtonpost.com/news-sitemaps/technology.xml
Location: https://www.washingtonpost.com/news-sitemaps/lifestyle.xml
Location: https://www.washingtonpost.com/news-sitemaps/entertainment.xml
Location: https://www.washingtonpost.com/news-sitemaps/goingoutguide.xml

The Location.Loc field contains these newlines because of the XML returned by this URL. The entries follow this form:

<sitemap>
<loc>
https://www.washingtonpost.com/news-sitemaps/goingoutguide.xml
</loc>
</sitemap>

As you can see, there is a newline before and after the content of the loc element, and xml.Unmarshal copies that character data into the string field as-is.

BIG阳

See the comments embedded in the modified code below, which describe and fix the problem.

func main() {
    resp, _ := http.Get("https://www.washingtonpost.com/news-sitemaps/index.xml")
    bytes, _ := ioutil.ReadAll(resp.Body)
    var s SitemapIndex
    xml.Unmarshal(bytes, &s)

    for _, Location := range s.Locations {
        // Note that %v shows that there are indeed newlines at beginning and end of Location.Loc
        fmt.Printf("Location: (%v)", Location.Loc)
        // solution: use strings.TrimSpace to remove newlines from Location.Loc
        resp, err := http.Get(strings.TrimSpace(Location.Loc))
        fmt.Println("resp", resp)
        fmt.Println("err", err)
    }
}