Programming language: Go
License: MIT License
Latest version: v1.2.1
html2data
Library and cli-utility for extracting data from HTML via CSS selectors
Install
Install package and command line utility:
go get -u github.com/msoap/html2data/cmd/html2data
Install package only:
go get -u github.com/msoap/html2data
Methods
FromReader(io.Reader)
- create a document for parsing
FromURL(URL, [config URLCfg])
- create a document from an http(s) URL
FromFile(file)
- create a document from a local file
doc.GetData(css map[string]string)
- get texts by CSS selectors
doc.GetDataFirst(css map[string]string)
- get texts by CSS selectors, get first entry for each selector or ""
doc.GetDataNested(outerCss string, css map[string]string)
- extract nested data by CSS selectors from another CSS selector
doc.GetDataNestedFirst(outerCss string, css map[string]string)
- extract nested data by CSS selectors from another CSS selector, get first entry for each selector or ""
doc.GetDataSingle(css string)
- get one result by one CSS selector
or with config:
doc.GetData(css map[string]string, html2data.Cfg{DontTrimSpaces: true})
doc.GetDataNested(outerCss string, css map[string]string, html2data.Cfg{DontTrimSpaces: true})
doc.GetDataSingle(css string, html2data.Cfg{DontTrimSpaces: true})
Pseudo-selectors
:attr(attr_name)
- getting attribute instead of text, for example getting urls from links: a:attr(href)
:html
- getting HTML instead of text
:get(N)
- getting n-th element from list
Example
package main

import (
	"fmt"
	"log"

	"github.com/msoap/html2data"
)

func main() {
	doc := html2data.FromURL("http://example.com")
	// or with config
	// doc := html2data.FromURL("http://example.com", html2data.URLCfg{UA: "userAgent", TimeOut: 10, DontDetectCharset: false})
	if doc.Err != nil {
		log.Fatal(doc.Err)
	}

	// get title
	title, _ := doc.GetDataSingle("title")
	fmt.Println("Title is:", title)

	title, _ = doc.GetDataSingle("title", html2data.Cfg{DontTrimSpaces: true})
	fmt.Println("Title as is, with spaces:", title)

	texts, _ := doc.GetData(map[string]string{"h1": "h1", "links": "a:attr(href)"})
	// get all H1 headers:
	if textOne, ok := texts["h1"]; ok {
		for _, text := range textOne {
			fmt.Println(text)
		}
	}
	// get all urls from links
	if links, ok := texts["links"]; ok {
		for _, text := range links {
			fmt.Println(text)
		}
	}
}
Command line utility
Usage
html2data [options] URL "css selector"
html2data [options] URL :name1 "css1" :name2 "css2"...
html2data [options] file.html "css selector"
cat file.html | html2data "css selector"
Options
-user-agent="Custom UA"
-- set custom user-agent
-find-in="outer.css.selector"
-- search in the specified elements instead of the whole document
-json
-- get result as JSON
-dont-trim-spaces
-- get text as is
-dont-detect-charset
-- don't detect charset and convert text
-timeout=10
-- set timeout for loading the URL
Install
Download binaries from: releases (OS X/Linux/Windows/RaspberryPi)
Or install from homebrew (MacOS):
brew tap msoap/tools
brew install html2data
# update:
brew upgrade html2data
Using snap (Ubuntu or any Linux distribution with snap):
# install stable version:
sudo snap install html2data
# install the latest version:
sudo snap install --edge html2data
# update
sudo snap refresh html2data
From source:
go get -u github.com/msoap/html2data/cmd/html2data
Examples
Get title of page:
html2data https://golang.org/ title
Last blog posts:
html2data https://blog.golang.org/ h3
Getting RSS URL:
html2data https://blog.golang.org/ 'link[type="application/atom+xml"]:attr(href)'
More examples from wiki.