
Decoding HTML character sets in Go

I am trying to decode HTML pages that are not UTF-8 encoded.

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

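Is there a library that can do this? I could not find one online.

PS: Of course, I could extract the charset and decode the HTML page myself with goquery and iconv-go, but I would rather not reinvent the wheel.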


拉丁的传说
217 views, 2 answers
2 Answers

扬帆大鱼

Go officially provides extension packages for this: charset (golang.org/x/net/html/charset) and encoding (golang.org/x/text/encoding). The code below converts the document to UTF-8 so that the html package can parse it correctly:

    import (
        "bufio"
        "io"

        "golang.org/x/net/html"
        "golang.org/x/net/html/charset"
        "golang.org/x/text/encoding/htmlindex"
    )

    // detectContentCharset inspects up to the first 1024 bytes of r without
    // consuming them and returns the name of the detected encoding.
    func detectContentCharset(r *bufio.Reader) string {
        // Peek does not advance the reader. On short input it returns the
        // available bytes along with an error, so only give up on empty input.
        data, _ := r.Peek(1024)
        if len(data) == 0 {
            return "utf-8"
        }
        // The third result only reports whether the guess is certain; it is
        // ignored here so that a charset declared in a <meta> tag is used.
        _, name, _ := charset.DetermineEncoding(data, "")
        return name
    }

    // Decode parses the HTML body using the specified encoding and
    // returns the parsed HTML document. An empty charset triggers detection.
    func Decode(body io.Reader, charset string) (*html.Node, error) {
        // Buffer the body once so the bytes peeked during detection are
        // still available to the parser.
        r := bufio.NewReader(body)
        if charset == "" {
            charset = detectContentCharset(r)
        }
        e, err := htmlindex.Get(charset)
        if err != nil {
            return nil, err
        }
        var src io.Reader = r
        if name, _ := htmlindex.Name(e); name != "utf-8" {
            // Transcode anything that is not already UTF-8.
            src = e.NewDecoder().Reader(r)
        }
        return html.Parse(src)
    }

交互式爱情

goquery can do what you need. For example:

    import (
        "log"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        d, err := goquery.NewDocument("http://www.google.com")
        if err != nil {
            log.Fatal(err)
        }
        dh := d.Find("head")
        dc := dh.Find("meta[http-equiv]")
        // Attr returns the value and whether the attribute exists;
        // the value carries the charset, e.g. "text/html; charset=gb2312".
        c, ok := dc.Attr("content")
        log.Println(c, ok)
        // ...
    }

More operations are available on the Document struct.
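The content attribute retrieved above has the form text/html; charset=gb2312, so the charset name still has to be pulled out of it. A minimal sketch using the standard library's mime.ParseMediaType; the helper name charsetFromContent is illustrative and not part of goquery:

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContent is a hypothetical helper that extracts the charset
// parameter from a Content-Type style value such as the <meta> content
// attribute. It returns "" when the value is malformed or has no charset.
func charsetFromContent(content string) string {
	_, params, err := mime.ParseMediaType(content)
	if err != nil {
		return ""
	}
	// Parameter names are lowercased by ParseMediaType; the value is
	// lowercased here so that "GB2312" and "gb2312" compare equal.
	return strings.ToLower(params["charset"])
}

func main() {
	fmt.Println(charsetFromContent("text/html; charset=gb2312"))
}
```

The returned name can then be handed to a lookup such as htmlindex.Get to obtain a decoder for the page.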