
go-colly: How do I get the HTML title in c.OnResponse so I can populate a struct?

How can I get the HTML title inside c.OnResponse? Or is there a better way to populate a struct with url/title/content?

In the end, I need to fill in the following struct and post it to Elasticsearch.

type WebPage struct {
    Url     string `json:"url"`
    Title   string `json:"title"`
    Content string `json:"content"`
}

    // Print the response
    c.OnResponse(func(r *colly.Response) {
        pageCount++
        log.Println(r.Headers)

        webpage := WebPage{
            Url:     r.Ctx.Get("url"), //- can be put in ctx c.OnRequest, and r.Ctx.Get("url")
            Title:   "my title",       //string(r.title), // Where to get this?
            Content: string(r.Body),   //string(r.Body) - can be done in c.OnResponse
        }

        enc := json.NewEncoder(os.Stdout)
        enc.SetIndent("", "  ")
        enc.Encode(webpage) // SEND it to elasticsearch

        log.Printf("%d  DONE Visiting : %s", pageCount, urlVisited)
    })



I can get the title with the approach below, but Ctx is not available there, so I cannot put the title value into Ctx. Are there other options?


    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
        e.Ctx.Put("title", e.Text) // NOT ACCESSIBLE!
    })

Log


2020/05/07 17:42:37 7  DONE Visiting : https://www.coursera.org/learn/build-portfolio-website-html-css
{
  "url": "https://www.coursera.org/learn/build-portfolio-website-html-css",
  "title": "my page title",
  "content": "page html body bla "
}
2020/05/07 17:42:37 8  DONE Visiting : https://www.coursera.org/learn/build-portfolio-website-html-css
{
  "url": "https://www.coursera.org/browse/social-sciences",
  "title": "my page title",
  "content": "page html body bla "
}


鸿蒙传说
2 Answers

汪汪一只猫

I created a global variable for the struct and keep filling it in from the different callbacks. I'm not sure this is the best approach.

func main() {
    ....
    webpage := WebPage{} // Is this a right way to declare a mutable struct?

    c.OnRequest(func(r *colly.Request) { // url
        webpage.Url = r.URL.String() // Is this the right way to mutate?
    })

    c.OnResponse(func(r *colly.Response) { // get body
        pageCount++
        log.Printf("%d  DONE Visiting : %s", pageCount, webpage.Url)
    })

    c.OnHTML("head title", func(e *colly.HTMLElement) { // Title
        webpage.Title = e.Text
    })

    c.OnHTML("html body", func(e *colly.HTMLElement) { // Body / content
        webpage.Content = e.Text // Can url title body be misrepresented in multithread scenario?
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) { // href, callback
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.OnError(func(r *colly.Response, err error) { // Set error handler
        log.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
    })

    c.OnScraped(func(r *colly.Response) { // DONE
        enc := json.NewEncoder(os.Stdout)
        enc.SetIndent("", "  ")
        enc.Encode(webpage)
    })
}
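On the multithreading question in the code comments: a single shared struct can get mixed up as soon as more than one page is in flight. As far as I understand colly, that probably includes the synchronous case here too, because e.Request.Visit inside an OnHTML callback runs the nested request before the outer page reaches OnScraped. colly's per-request context is one way around this; it should be reachable as e.Request.Ctx inside OnHTML and as r.Ctx inside OnScraped. The sketch below shows the same idea using only the standard library: one WebPage per request, keyed by an ID such as colly's Request.ID. The names pageStore, update, and take are hypothetical, and the example.com URLs are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// WebPage mirrors the struct from the question.
type WebPage struct {
	Url     string `json:"url"`
	Title   string `json:"title"`
	Content string `json:"content"`
}

// pageStore (hypothetical name) keeps one WebPage per request ID, so
// callbacks belonging to different pages never write into the same struct.
type pageStore struct {
	mu    sync.Mutex
	pages map[uint32]*WebPage
}

func newPageStore() *pageStore {
	return &pageStore{pages: make(map[uint32]*WebPage)}
}

// update applies fn to the page stored under id, creating it on first use.
func (s *pageStore) update(id uint32, fn func(p *WebPage)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.pages[id]
	if !ok {
		p = &WebPage{}
		s.pages[id] = p
	}
	fn(p)
}

// take removes the page stored under id and returns a copy of it.
func (s *pageStore) take(id uint32) (WebPage, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.pages[id]
	if !ok {
		return WebPage{}, false
	}
	delete(s.pages, id)
	return *p, true
}

func main() {
	store := newPageStore()

	// Simulated callbacks for two interleaved requests (IDs 1 and 2).
	store.update(1, func(p *WebPage) { p.Url = "https://example.com/a" })
	store.update(2, func(p *WebPage) { p.Url = "https://example.com/b" })
	store.update(2, func(p *WebPage) { p.Title = "Page B" })
	store.update(1, func(p *WebPage) { p.Title = "Page A" })

	// What an OnScraped-style step would do for request 1.
	page, _ := store.take(1)
	out, _ := json.MarshalIndent(page, "", "  ")
	fmt.Println(string(out))
}
```

In a real collector, the update calls would presumably sit in OnRequest and OnHTML with r.ID or e.Request.ID as the key, and take would be called in OnScraped with r.Request.ID, which is also the natural place to send the finished struct to Elasticsearch.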

心有法竹

Building on Espresso's answer...

c2.OnHTML("html", func(html *colly.HTMLElement) {
    slug := strings.Split(html.Request.URL.String(), "/")[4]
    title := ""
    descr := ""
    h1 := ""

    html.ForEach("head", func(_ int, head *colly.HTMLElement) {
        title += head.ChildText("title")
        head.ForEach("meta", func(_ int, meta *colly.HTMLElement) {
            if meta.Attr("name") == "description" {
                descr += meta.Attr("content")
            }
        })
    })

    html.ForEach("h1", func(_ int, h1El *colly.HTMLElement) {
        h1 += h1El.Text
    })

    // Now you can do stuff with your elements from head and body
})
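To come back to the literal question of getting the title inside c.OnResponse: colly runs OnResponse before any OnHTML callback, so at that point only the raw bytes in r.Body are available. A rough sketch of pulling the title out of those bytes with the standard library alone might look like the following. extractTitle is a hypothetical helper, not part of colly, and a regular expression is only an approximation; a real HTML parser such as goquery (which colly uses internally) would be more robust.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// titleRe matches the first <title> element.
// (?is) makes the match case-insensitive and lets '.' span newlines.
var titleRe = regexp.MustCompile(`(?is)<title(?:\s[^>]*)?>(.*?)</title>`)

// extractTitle (hypothetical helper) returns the trimmed, entity-decoded
// text of the first <title> element, or "" if none is found.
func extractTitle(body []byte) string {
	m := titleRe.FindSubmatch(body)
	if m == nil {
		return ""
	}
	return strings.TrimSpace(html.UnescapeString(string(m[1])))
}

func main() {
	// In a collector this would be r.Body inside c.OnResponse.
	body := []byte("<html><head><title> Tom &amp; Jerry </title></head><body></body></html>")
	fmt.Println(extractTitle(body)) // prints "Tom & Jerry"
}
```

With a helper like this, the Title field in the question's OnResponse callback could be filled with extractTitle(r.Body) instead of a placeholder string.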