- HTTP tracing support
- New callback: OnResponseHeader
- Queue fixes
- New collector option: Collector.CheckHead
- Proxy fixes
- Fixed POST revisit checking
- Updated dependencies
v2.0.1 (January 03, 2020)
- Breaking change: Change Collector.RedirectHandler member to Collector.SetRedirectHandler function
- Go module support
- Collector.HasVisited method added to check whether a URL has been visited
- Collector.SetClient method introduced
- HTMLElement.ChildTexts method added
- New user agents
- Multiple bugfixes
- Compatibility with the latest htmlquery package
- New request shortcut for HEAD requests
- Check URL availability before visiting
- Fix proxy URL value
- Request counter fix
- Minor fixes in examples
- Appengine integration takes context.Context instead of http.Request (API change)
- ➕ Added "Accept" http header by default to every request
- 👌 Support slices of pointers and structs in unmarshal
- 🛠 Fixed a race condition in queues
- ForEachWithBreak method added to HTMLElement
- ➕ Added a local file example
- 👌 Support gzip decompression of response bodies
- Don't share waitgroup when cloning a collector
- 🛠 Fixed instagram example
🚀 We are happy to announce that the first major release of Colly is here. Our goal was to create a scraping framework to speed up development and let its users concentrate on collecting relevant data. There is no need to reinvent the wheel when writing a new collector. Scrapers built on top of Colly support different storage backends, dynamic configuration and running requests in parallel out of the box. It is also possible to run your scrapers in a distributed manner.
Facts about the development
Development started in September 2017 and has not stopped since. Colly has attracted numerous developers who helped by providing valuable feedback and contributing new features. Let's see the numbers. In the last seven months, 30 contributors have created 338 commits. Users have opened 78 issues, 74 of which were resolved within a few days. Contributors have opened 59 pull requests, and all but one have been either merged or closed. We would like to thank all of our supporters who contributed code, wrote blog posts about Colly, or helped development in some other way. We would not be here without you.
🚀 You might ask why it is released now. Our experience with various production deployments shows that Colly provides a stable and robust platform for developing and running scrapers, both locally and in multi-server configurations. The feature set is complete and ready to support even complex use cases. What are those features?
Rate limiting
Controlling the number of requests sent to the scraped site can be crucial during scraping. We do not want to disrupt the service by overloading it with too many requests. That is bad for the operators of the site and also for us, because the data we would like to collect becomes inaccessible. The number of requests must therefore be limited. The collector provided by Colly can be configured to send only a limited number of requests in parallel.
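To make the idea of limited parallelism concrete, here is a minimal sketch that uses only the Go standard library. It is not Colly's implementation: the function name runLimited and the example URLs are assumptions made for illustration, and the "request" is simulated with a short sleep.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runLimited (hypothetical helper) calls work once per URL and never runs
// more than parallelism calls at the same time. It returns the highest
// number of calls that were in flight together.
func runLimited(urls []string, parallelism int, work func(url string)) int {
	if parallelism < 1 {
		parallelism = 1
	}
	sem := make(chan struct{}, parallelism) // one slot per allowed request
	var wg sync.WaitGroup
	var mu sync.Mutex
	inFlight, peak := 0, 0

	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while all slots are taken
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when the call ends

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			work(u)

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return peak
}

func main() {
	urls := []string{"https://example.com/1", "https://example.com/2", "https://example.com/3"}
	peak := runLimited(urls, 2, func(u string) {
		time.Sleep(10 * time.Millisecond) // stands in for an HTTP request
	})
	fmt.Println("peak parallel requests:", peak)
}
```

The buffered channel acts as a counting semaphore, so the limit holds no matter how many URLs are queued.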
Request caching
To relieve the load on external services and decrease the number of outgoing requests, response caching is supported.
🔧 Configurable via environment variables
To avoid rebuilding your scraper during fine-tuning, Colly can read configuration options from environment variables. This lets you modify its settings without a Go development environment.
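The general pattern of overriding defaults from the environment can be sketched as follows. This is an illustration only: the variable names SCRAPER_USER_AGENT and SCRAPER_MAX_DEPTH, the settings struct, and loadSettings are all hypothetical and are not the names Colly itself reads.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// settings holds the options this sketch allows the environment to override.
type settings struct {
	UserAgent string
	MaxDepth  int
}

// loadSettings (hypothetical helper) starts from built-in defaults and
// overrides them with any values found through getenv. The lookup function
// is a parameter so it can be swapped out in tests.
func loadSettings(getenv func(string) string) settings {
	s := settings{UserAgent: "example-scraper", MaxDepth: 0}
	if v := getenv("SCRAPER_USER_AGENT"); v != "" {
		s.UserAgent = v
	}
	if v := getenv("SCRAPER_MAX_DEPTH"); v != "" {
		// Ignore values that are not valid integers and keep the default.
		if n, err := strconv.Atoi(v); err == nil {
			s.MaxDepth = n
		}
	}
	return s
}

func main() {
	fmt.Printf("%+v\n", loadSettings(os.Getenv))
}
```

Running the same binary with different environment values then changes its behavior without a rebuild.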
Proxies/proxy switchers
If the address of a scraper has to be hidden, proxies can be added to make requests on behalf of the machine running the scraping job. Furthermore, to scale Colly without running multiple scraper instances, proxy switchers can be used. Collectors support proxy switchers, which distribute requests among multiple servers. Processing of the collected pages is still done on the machine running the scrapers, but the network traffic is moved to different hosts.
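One common switching strategy is round robin. The sketch below shows it with the standard library only, using the Proxy hook of http.Transport. It is not Colly's own switcher: the function name roundRobinProxy and the proxy addresses are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// roundRobinProxy (hypothetical helper) returns a function suitable for
// http.Transport.Proxy that hands out the given proxies in rotation.
func roundRobinProxy(rawURLs ...string) (func(*http.Request) (*url.URL, error), error) {
	if len(rawURLs) == 0 {
		return nil, errors.New("no proxies given")
	}
	proxies := make([]*url.URL, 0, len(rawURLs))
	for _, raw := range rawURLs {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		proxies = append(proxies, u)
	}
	var next uint32 // incremented atomically, so concurrent requests are safe
	return func(*http.Request) (*url.URL, error) {
		i := atomic.AddUint32(&next, 1) - 1
		return proxies[i%uint32(len(proxies))], nil
	}, nil
}

func main() {
	// Placeholder proxy addresses; replace them with real proxy servers.
	pick, err := roundRobinProxy("http://127.0.0.1:8081", "http://127.0.0.1:8082")
	if err != nil {
		fmt.Println("invalid proxy list:", err)
		return
	}
	// Every request made through this client asks pick for its proxy.
	client := &http.Client{Transport: &http.Transport{Proxy: pick}}
	_ = client

	for i := 0; i < 3; i++ {
		u, _ := pick(nil) // the request argument is unused in this sketch
		fmt.Println(u)
	}
}
```

Because only the transport changes, parsing and processing of responses still happen on the local machine while the traffic leaves through the selected proxy.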
Storage backend and storage interface
During scraping, various data needs to be stored and sometimes shared. To access this data, Colly provides a storage interface. You can create your own storage and use it in your scraper by implementing the required interface. By default, Colly keeps everything in memory. Additional storage backend implementations are available for Redis and SQLite3.
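The shape of a pluggable storage backend can be sketched like this. The VisitStore interface below is a simplified, hypothetical stand-in used only for illustration; it is not Colly's actual storage interface, which covers more than visit tracking.

```go
package main

import (
	"fmt"
	"sync"
)

// VisitStore is a hypothetical, reduced storage interface: it only records
// which request IDs have already been visited.
type VisitStore interface {
	Visited(requestID uint64) error
	IsVisited(requestID uint64) (bool, error)
}

// memoryStore keeps everything in process memory. A Redis or SQLite backed
// store would implement the same two methods.
type memoryStore struct {
	mu      sync.RWMutex
	visited map[uint64]bool
}

func newMemoryStore() *memoryStore {
	return &memoryStore{visited: make(map[uint64]bool)}
}

// Visited marks the request ID as seen.
func (s *memoryStore) Visited(requestID uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.visited[requestID] = true
	return nil
}

// IsVisited reports whether the request ID was marked before.
func (s *memoryStore) IsVisited(requestID uint64) (bool, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.visited[requestID], nil
}

func main() {
	var store VisitStore = newMemoryStore()
	before, _ := store.IsVisited(42)
	_ = store.Visited(42)
	after, _ := store.IsVisited(42)
	fmt.Println(before, after)
}
```

Code that depends only on the interface does not need to change when the in-memory store is replaced by a shared backend.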
Request queue
Scraping pages in parallel and asynchronously is a must-have feature. Colly maintains a request queue where URLs found during scraping are collected. Worker threads of your collector take these URLs and create requests.
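The queue-and-workers pattern described above can be sketched with goroutines and channels from the standard library. This is an illustration under assumptions, not Colly's queue package: processQueue and the handle callback are hypothetical names, and the callback stands in for creating and sending a request.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// processQueue (hypothetical helper) feeds urls into a queue, lets a fixed
// number of workers consume it, and returns the sorted results.
func processQueue(urls []string, workers int, handle func(url string) string) []string {
	if workers < 1 {
		workers = 1
	}
	queue := make(chan string)
	results := make(chan string)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range queue { // each worker takes the next queued URL
				results <- handle(u)
			}
		}()
	}

	// Producer: enqueue every URL, then signal that no more will follow.
	go func() {
		for _, u := range urls {
			queue <- u
		}
		close(queue)
	}()

	// Close results once all workers have finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	sort.Strings(out) // completion order is not deterministic, so sort
	return out
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/c"}
	done := processQueue(urls, 2, func(u string) string {
		return "fetched " + u // stands in for a real HTTP request
	})
	for _, line := range done {
		fmt.Println(line)
	}
}
```

In a real scraper the handler could enqueue newly discovered URLs as well, which requires a queue that stays open until all pending work is done.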
Goodies
The package named extensions provides multiple helpers for collectors. These are common functions implemented in advance, so you don't have to bloat your scraper code with general implementations. An example extension is RandomUserAgent, which generates a random User-Agent for every request. You can find the full list of goodies here: https://godoc.org/github.com/gocolly/colly/extensions
Debuggers
Debugging can be painful. Colly tries to ease the pain by providing debuggers to inspect your scraper. You can simply write debug messages to the console by using LogDebugger. If you prefer web interfaces, we've got you covered: Colly comes with a web debugger, which you can use by initializing a WebDebugger. See how debuggers can be used here: https://godoc.org/github.com/gocolly/colly/debug
👍 We, the team behind Colly, believe that it has become a stable and mature scraping framework capable of supporting complex use cases. We are hoping for an even more productive future. Last but not least, thank you for your support and contributions.