htmlquery alternatives and similar packages
Based on the "Specific Formats" category.
- bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
- html-to-markdown: ⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.
- omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.
- mxj: Decode / encode XML to/from map[string]interface{} (or JSON); extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages.
- go-pkg-rss: DISCONTINUED. This package reads RSS and Atom feeds and provides a caching mechanism that adheres to the feed specs.
- goq: A declarative struct-tag-based HTML unmarshaling or scraping package for Go built on top of the goquery library
- github_flavored_markdown: GitHub Flavored Markdown renderer with fenced code block highlighting, clickable header anchor links.
- go-pkg-xmlx: DISCONTINUED. Extension to the standard Go XML package. Maintains a node tree that allows forward/backwards browsing and exposes some simple single/multi-node search functions.
- pagser: Pagser is a simple, extensible, configurable parse and deserialize html page to struct based on goquery and struct tags for golang crawler
- csvplus: csvplus extends the standard Go encoding/csv package with fluent interface, lazy stream operations, indices and joins.
README
htmlquery
Overview
htmlquery is an XPath query package for HTML that lets you extract data from, or evaluate expressions against, HTML documents using XPath expressions.
htmlquery has a built-in query object cache based on LRU, which caches recently used XPath query strings. Enabling query caching avoids recompiling the XPath expression on every query.
See https://github.com/antchfx/xpath for the supported XPath (1.0/2.0) syntax.
XPath query packages for Go
Name | Description |
---|---|
htmlquery | XPath query package for the HTML document |
xmlquery | XPath query package for the XML document |
jsonquery | XPath query package for the JSON document |
Installation
go get github.com/antchfx/htmlquery
Getting Started
QueryAll returns the matched elements, or an error if the XPath expression is invalid.
nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
	panic(`not a valid XPath expression.`)
}
Load HTML document from URL.
doc, err := htmlquery.LoadURL("http://example.com/")
Load HTML document from file.
filePath := "/home/user/sample.html"
doc, err := htmlquery.LoadDoc(filePath)
Load HTML document from string.
s := `<html>....</html>`
doc, err := htmlquery.Parse(strings.NewReader(s))
Find all A elements.
list := htmlquery.Find(doc, "//a")
Find all A elements that have an href attribute.
list := htmlquery.Find(doc, "//a[@href]")
Find all A elements with an href attribute and return only the href value.
list := htmlquery.Find(doc, "//a/@href")
for _, n := range list {
	fmt.Println(htmlquery.SelectAttr(n, "href")) // output @href value
}
Find the third A element.
a := htmlquery.FindOne(doc, "//a[3]")
Find the child IMG element under the first A element and print its src attribute.
a := htmlquery.FindOne(doc, "//a")
img := htmlquery.FindOne(a, "//img")
fmt.Println(htmlquery.SelectAttr(img, "src")) // output @src value
Count all IMG elements.
expr, _ := xpath.Compile("count(//img)")
v := expr.Evaluate(htmlquery.CreateXPathNavigator(doc)).(float64)
fmt.Printf("total count is %f", v)
Quick Starts
package main

import (
	"fmt"

	"github.com/antchfx/htmlquery"
)

func main() {
	doc, err := htmlquery.LoadURL("https://www.bing.com/search?q=golang")
	if err != nil {
		panic(err)
	}
	// Find all news items.
	list, err := htmlquery.QueryAll(doc, "//ol/li")
	if err != nil {
		panic(err)
	}
	for i, n := range list {
		// Print the text and target of the first link in each item, if any.
		a := htmlquery.FindOne(n, "//a")
		if a != nil {
			fmt.Printf("%d %s(%s)\n", i, htmlquery.InnerText(a), htmlquery.SelectAttr(a, "href"))
		}
	}
}
FAQ
Find() vs QueryAll(), which is better?
Find and QueryAll do the same thing: both search for all matching HTML nodes. They differ in error handling: Find panics if you give it an invalid XPath query, whereas QueryAll returns an error.
Can I save my query expression object for the next query?
Yes, you can. The QuerySelector and QuerySelectorAll methods accept a compiled query expression object.
Caching (or reusing) a query expression object avoids recompiling the XPath expression and improves query performance.
XPath query object cache performance
goos: windows
goarch: amd64
pkg: github.com/antchfx/htmlquery
BenchmarkSelectorCache-4 20000000 55.2 ns/op
BenchmarkDisableSelectorCache-4 500000 3162 ns/op
How to disable caching?
htmlquery.DisableSelectorCache = true
Questions
Please let me know if you have any questions.