
Unmarshal HTML nested in XML

I receive an XML file from a third party, and one of its XML tags contains HTML elements. I can't figure out how to unmarshal it to get the href URL.


XML example:


<SOME_HTML>
  <a href="http://www.google.com" target="_blank">
  google</a>
</SOME_HTML>

This is what I have so far, but nothing gets added to the struct:


type Href struct {
    Link string `xml:"href"`
}

type Link struct {
    URL []Href `xml:"a"`
}

type XmlFile struct {
    HTMLTag []Link `xml:"SOME_HTML"`
}

myFile := []byte(`<?xml version="1.0" encoding="utf-8"?>
<SOME_HTML>
    <a href="http://www.google.com" target="_blank">
    google</a>
</SOME_HTML>`)

var output XmlFile
err := xml.Unmarshal(myFile, &output)
fmt.Println(output) // {[]}


长风秋雁
3 answers

青春有我

You can do it like this (https://play.golang.org/p/MJzAVLBFfms):

package main

import (
    "encoding/xml"
    "fmt"
    "log"
)

// aElement maps the <a> element; ",attr" reads href as an attribute rather than a child element.
type aElement struct {
    Href string `xml:"href,attr"`
}

// content maps the root element (<SOME_HTML>); its fields describe the root's children.
type content struct {
    A aElement `xml:"a"`
}

func main() {
    test := `<SOME_HTML><a href="http://www.google.com" target="_blank">google</a></SOME_HTML>`
    var result content
    if err := xml.Unmarshal([]byte(test), &result); err != nil {
        log.Fatal(err)
    }
    fmt.Println(result)
}

潇湘沐

This parses everything in the XML, assuming there may be multiple a tags in the HTML, or other tags as well (such as div). If you don't need that, just replace XmlFile.Links with an XmlFile.Link field of type Link (not []Link).

package main

import (
    "encoding/xml"
    "fmt"
    "log"
)

func main() {
    // Link maps an <a> element: its attributes and its text content.
    type Link struct {
        XMLName xml.Name `xml:"a"`
        URL     string   `xml:"href,attr"`
        Target  string   `xml:"target,attr"`
        Content string   `xml:",chardata"`
    }
    // Div maps a <div> element.
    type Div struct {
        XMLName xml.Name `xml:"div"`
        Classes string   `xml:"class,attr"`
        Content string   `xml:",chardata"`
    }
    // XmlFile maps the root <SOME_HTML> element and collects its children by tag name.
    type XmlFile struct {
        XMLName xml.Name `xml:"SOME_HTML"`
        Links   []Link   `xml:"a"`
        Divs    []Div    `xml:"div"`
    }

    myFile := []byte(`<?xml version="1.0" encoding="utf-8"?>
<SOME_HTML>
    <a href="http://www.google.com" target="_blank">google</a>
    <a href="http://www.facebook.com" target="_blank">facebook</a>
    <div class="someclass">text</div>
</SOME_HTML>`)

    var output XmlFile
    err := xml.Unmarshal(myFile, &output)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(output)
}

Edit: added more tags to the XML to show how different tag types can be parsed.
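If SOME_HTML is only one element inside a larger document, another option is to capture its contents untouched with the `,innerxml` tag and deal with the markup in a separate step. The following is a minimal sketch under assumptions: the enclosing RESPONSE element, the ID field, and the helper name extractHTML are hypothetical and do not come from the question.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// rawHTML keeps the markup found inside an element exactly as it was written.
type rawHTML struct {
	Inner string `xml:",innerxml"`
}

// response models a hypothetical enclosing document.
// RESPONSE and ID are made-up names used only for illustration.
type response struct {
	XMLName xml.Name `xml:"RESPONSE"`
	ID      string   `xml:"ID"`
	HTML    rawHTML  `xml:"SOME_HTML"`
}

// extractHTML returns the raw markup nested inside <SOME_HTML>.
func extractHTML(doc []byte) (string, error) {
	var r response
	if err := xml.Unmarshal(doc, &r); err != nil {
		return "", err
	}
	return r.HTML.Inner, nil
}

func main() {
	doc := []byte(`<RESPONSE><ID>42</ID><SOME_HTML><a href="http://www.google.com" target="_blank">google</a></SOME_HTML></RESPONSE>`)
	inner, err := extractHTML(doc)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(inner)
}
```

The decoder still has to read through the nested markup, so this only works when the embedded HTML is itself well-formed XML.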

萧十郎

You can parse the sample you posted with a regular XML parser, but there are many exceptions to XML syntax that are commonly accepted as valid HTML. The simplest example I can think of: every HTML interpreter I know of understands that <br> (an unclosed <br> tag) is the same as the self-closing tag <br />.

If you don't know how the HTML on the other side of the service is generated, you are better off using an HTML parser. For example, there is the golang.org/x/net/html package, which provides several functions to parse HTML: https://play.golang.org/p/3hUogiwdRPO

package main

import (
    "fmt"
    "strings"

    "golang.org/x/net/html"
)

// findFirstHref walks the node tree depth-first and returns the href of the first <a> it finds.
func findFirstHref(n *html.Node, indent string) string {
    if n.Type == html.ElementNode {
        fmt.Println("  * scanning:" + indent + n.Data)
    }

    if n.Type == html.ElementNode && n.Data == "a" {
        for _, a := range n.Attr {
            if a.Key == "href" {
                return a.Val
            }
        }
    }

    for c := n.FirstChild; c != nil; c = c.NextSibling {
        href := findFirstHref(c, indent+"  ")
        if href != "" {
            return href
        }
    }
    return ""
}

func main() {
    doc1, err := html.Parse(strings.NewReader(sample1))
    if err != nil {
        fmt.Println(err)
    } else {
        fmt.Println("href in sample1:", findFirstHref(doc1, ""))
    }

    doc2, err := html.Parse(strings.NewReader(sample2))
    if err != nil {
        fmt.Println(err)
    } else {
        fmt.Println("href in sample2:", findFirstHref(doc2, ""))
    }
}

const (
    sample1 = `<?xml version="1.0" encoding="utf-8"?>
<SOME_HTML>
    <a href="http://www.google.com" target="_blank">
    google</a>
</SOME_HTML>`

    // sample2 is an invalid XML document (it has unclosed "<br>" tags):
    sample2 = `
    <p> line1 <br> line2
<a href="foobar" target="_blank">  Some <br> text</a></p>`
)