
Programming language: Go


Goribot

A web spider (crawler) framework for Go.

[Chinese documentation](README_zh.md)


Features

  • Clean API
  • Caching
  • Extensions
  • Pipeline-style handle logic
  • Robots.txt support (via the RobotsTxt extension)
  • Request deduplication (via the ReqDeduplicate extension)
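To illustrate what request deduplication means in a spider, here is a minimal sketch that uses only the standard library. It is not goribot's implementation: the `seenSet` type, the `firstVisit` method, and the "method plus URL" key are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// seenSet is a hypothetical concurrency-safe set of request keys.
// It is an illustration only, not a type from goribot.
type seenSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenSet() *seenSet {
	return &seenSet{seen: make(map[string]struct{})}
}

// firstVisit records key and reports whether it had not been seen before.
func (s *seenSet) firstVisit(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[key]; ok {
		return false
	}
	s.seen[key] = struct{}{}
	return true
}

func main() {
	s := newSeenSet()
	urls := []string{
		"https://httpbin.org/get",
		"https://httpbin.org/ip",
		"https://httpbin.org/get", // duplicate of the first entry
	}
	for _, u := range urls {
		// The key format (method plus URL) is an assumption for this sketch.
		if s.firstVisit("GET " + u) {
			fmt.Println("fetch:", u)
		} else {
			fmt.Println("skip duplicate:", u)
		}
	}
}
```

With goribot itself, the bilibili example further down enables this behavior by passing goribot.ReqDeduplicate() to goribot.NewSpider.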

Example

A basic example:

package main

import (
    "fmt"
    "github.com/zhshch2002/goribot"
)

func main() {
    s := goribot.NewSpider()
    s.NewTask(
        goribot.MustNewGetReq("https://httpbin.org/get?Goribot%20test=hello%20world"),
        func(ctx *goribot.Context) {
            fmt.Println("got resp data", ctx.Text)
        })
    s.Run()
}

A complete bilibili.com video spider example is given under "Another Example".

Getting started

Install

go get -u github.com/zhshch2002/goribot

Basic use

Create a spider

s := goribot.NewSpider()

You can also initialize the spider with extensions, such as the RandomUserAgent extension:

s := goribot.NewSpider(goribot.RandomUserAgent())

New task

Create a request:

req := goribot.MustNewGetReq("https://httpbin.org/get?Goribot%20test=hello")
// or: req, err := goribot.NewGetReq("https://httpbin.org/get?Goribot%20test=hello")

// configure the request (http.Cookie requires importing "net/http")
req.Header.Set("test", "test")
req.Cookie = append(req.Cookie, &http.Cookie{
    Name:  "test",
    Value: "test",
})
req.Proxy = "http://127.0.0.1:1080"

Add the request to the spider's task queue:

var thirdHandler func(*goribot.Context)
thirdHandler = func(ctx *goribot.Context) {
    // do something with the response
}

s.NewTask(
    req, // the request you have created
    func(ctx *goribot.Context) {
        // first handler
        fmt.Println("got resp data", ctx.Text)
    },
    func(ctx *goribot.Context) {
        // second handler: handlers can be chained, and the same func can be reused for different request tasks
        fmt.Println("got resp data", ctx.Text)
    },
    thirdHandler,
)
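The Context type documents a ctx.Drop() call that breaks the handler chain. As a rough sketch of how such a chain might behave, the stand-alone example below runs handlers in order and stops once one of them drops the context. The `Context`, `Handler`, and `runChain` definitions here are hypothetical stand-ins written for this illustration, not goribot's actual code.

```go
package main

import "fmt"

// Context is a hypothetical stand-in for goribot.Context, reduced to the
// fields this sketch needs. It is not the library's real type.
type Context struct {
	Text string
	drop bool
}

// Drop marks the context so that no further handlers in the chain run.
func (c *Context) Drop() { c.drop = true }

// Handler mirrors the func(ctx *Context) shape used by the task handlers.
type Handler func(ctx *Context)

// runChain calls each handler in order and stops once one of them has
// dropped the context. It returns how many handlers were invoked.
func runChain(ctx *Context, handlers ...Handler) int {
	called := 0
	for _, h := range handlers {
		if ctx.drop {
			break
		}
		h(ctx)
		called++
	}
	return called
}

func main() {
	ctx := &Context{Text: "hello"}
	n := runChain(ctx,
		func(ctx *Context) { fmt.Println("first:", ctx.Text) },
		func(ctx *Context) { ctx.Drop() }, // stop the chain here
		func(ctx *Context) { fmt.Println("third: never printed") },
	)
	fmt.Println("handlers called:", n)
}
```

In this sketch the third handler is skipped because the second one calls Drop, so only two handlers run.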

Context

Context is the only parameter a handler receives. From it you can read the HTTP response or the original request, and you can use ctx to send new request tasks to the spider.

type Context struct {
    Text string                 // the response text
    Html *goquery.Document      // the spider tries to parse the response as HTML
    Json map[string]interface{} // the spider tries to parse the response as JSON

    Request  *Request  // the original request
    Response *Response // the response object

    Tasks []*Task                // new request tasks to send to the spider
    Items []interface{}          // new result items to send to the spider, e.g. for storage
    Meta  map[string]interface{} // key-value pairs attached to tasks created with NewTaskWithMeta

    drop bool // set by ctx.Drop() to break the handler chain and stop handling
}
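Because the Json field is a map[string]interface{}, reading a nested value takes a type assertion at each level. The sketch below shows that pattern with the standard library only. The `parseJSON` and `argValue` helpers and the sample body (shaped like an httpbin.org/get response) are assumptions made for this example, not part of goribot.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseJSON decodes a response body into the same map[string]interface{}
// shape that the Json field uses. Hypothetical helper for this sketch.
func parseJSON(body string) (map[string]interface{}, error) {
	var m map[string]interface{}
	if err := json.Unmarshal([]byte(body), &m); err != nil {
		return nil, err
	}
	return m, nil
}

// argValue reads m["args"][key] as a string. It reports false if any step
// of the lookup is missing or has an unexpected type.
func argValue(m map[string]interface{}, key string) (string, bool) {
	args, ok := m["args"].(map[string]interface{})
	if !ok {
		return "", false
	}
	v, ok := args[key].(string)
	return v, ok
}

func main() {
	// Hypothetical body, shaped like an httpbin.org/get response.
	body := `{"args": {"Goribot test": "hello world"}, "url": "https://httpbin.org/get"}`

	m, err := parseJSON(body)
	if err != nil {
		fmt.Println("not JSON:", err)
		return
	}
	if v, ok := argValue(m, "Goribot test"); ok {
		fmt.Println("arg:", v)
	}
}
```

Inside a handler, the same assertions would presumably apply to ctx.Json directly, for example passing ctx.Json to a helper like argValue.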

Create a new task with metadata, either on the spider or from inside a handler:

s.NewTaskWithMeta(goribot.MustNewGetReq("https://httpbin.org/get"), map[string]interface{}{
    "test": 1,
}, func(ctx *goribot.Context) {
    fmt.Println(ctx.Meta["test"]) // read the metadata

    // warning: this is ctx.NewTaskWithMeta, not s.NewTaskWithMeta!
    ctx.NewTaskWithMeta(goribot.MustNewGetReq("https://httpbin.org/get"), map[string]interface{}{
        "test": 2,
    }, func(ctx *goribot.Context) {
        fmt.Println(ctx.Meta["test"]) // read the metadata
    })
})

Tip: s.NewTaskWithMeta and ctx.NewTaskWithMeta behave differently when you use extensions or spider hook functions.

Run it!

Call s.Run() to run the spider.

Use hook functions and make extensions

To be written.

Another Example

A bilibili video spider:

package main

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/zhshch2002/goribot"
    "log"
    "strings"
)

type BiliVideoItem struct {
    Title, Url string
}

func main() {
    s := goribot.NewSpider(goribot.HostFilter("www.bilibili.com"), goribot.ReqDeduplicate(), goribot.RandomUserAgent())
    s.DepthFirst = false
    s.ThreadPoolSize = 1

    var biliVideoHandler, getNewLinkHandler func(ctx *goribot.Context)

    getNewLinkHandler = func(ctx *goribot.Context) {
        ctx.Html.Find("a[href]").Each(func(i int, selection *goquery.Selection) {
            rawurl, _ := selection.Attr("href")
            if !strings.HasPrefix(rawurl, "/video/av") {
                return
            }
            u, err := ctx.Request.Url.Parse(rawurl)
            if err != nil {
                return
            }
            u.RawQuery = ""
            if strings.HasSuffix(u.Path, "/") {
                u.Path = u.Path[0 : len(u.Path)-1]
            }
            //log.Println(u.String())
            if r, err := goribot.NewGetReq(u.String()); err == nil {
                ctx.NewTask(r, getNewLinkHandler, biliVideoHandler)
            }
        })
    }

    biliVideoHandler = func(ctx *goribot.Context) {
        ctx.AddItem(BiliVideoItem{
            Title: ctx.Html.Find("title").Text(),
            Url:   ctx.Request.Url.String(),
        })
    }

    s.NewTask(goribot.MustNewGetReq("https://www.bilibili.com/video/av66703342"), getNewLinkHandler, biliVideoHandler)


    s.OnItem(func(ctx *goribot.Context, i interface{}) interface{} {
        log.Println(i) // data storage work can be done here
        return i
    })

    s.Run()
}

