Programming language: Go
License: MIT License
Latest version: v1.2.1
README
html2data
Library and cli-utility for extracting data from HTML via CSS selectors
Install
Install package and command line utility:
go get -u github.com/msoap/html2data/cmd/html2data
Install package only:
go get -u github.com/msoap/html2data
Methods
FromReader(io.Reader) - create a document for parsing from a reader
FromURL(URL, [config URLCfg]) - create a document from an http(s) URL
FromFile(file) - create a document from a local file
doc.GetData(css map[string]string) - get texts by CSS selectors
doc.GetDataFirst(css map[string]string) - get texts by CSS selectors, taking the first entry for each selector or ""
doc.GetDataNested(outerCss string, css map[string]string) - extract nested data by CSS selectors from within another CSS selector
doc.GetDataNestedFirst(outerCss string, css map[string]string) - extract nested data by CSS selectors from within another CSS selector, taking the first entry for each selector or ""
doc.GetDataSingle(css string) - get one result by one CSS selector
or with config:
doc.GetData(css map[string]string, html2data.Cfg{DontTrimSpaces: true})
doc.GetDataNested(outerCss string, css map[string]string, html2data.Cfg{DontTrimSpaces: true})
doc.GetDataSingle(css string, html2data.Cfg{DontTrimSpaces: true})
Pseudo-selectors
:attr(attr_name) - getting attribute instead of text, for example getting urls from links: a:attr(href)
:html - getting HTML instead of text
:get(N) - getting n-th element from list
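To make the selector syntax above concrete, here is a rough sketch of how a selector string could be split into a plain CSS selector and one trailing pseudo-selector. It is not the library's actual parser: the helper splitPseudo and its regular expression are hypothetical, use only the standard library, and handle a single trailing pseudo-selector.

```go
package main

import (
	"fmt"
	"regexp"
)

// pseudoRe matches one trailing pseudo-selector in the forms listed above:
// ":attr(name)", ":get(N)" or ":html".
var pseudoRe = regexp.MustCompile(`:(attr|get)\(([^)]*)\)$|:(html)$`)

// splitPseudo is a hypothetical helper. It returns the plain CSS selector,
// the pseudo-selector kind ("attr", "get", "html" or "") and its argument.
func splitPseudo(sel string) (base, kind, arg string) {
	loc := pseudoRe.FindStringSubmatchIndex(sel)
	if loc == nil {
		// No pseudo-selector: the whole string is a plain CSS selector.
		return sel, "", ""
	}
	m := pseudoRe.FindStringSubmatch(sel)
	base = sel[:loc[0]]
	if m[3] == "html" {
		return base, "html", ""
	}
	return base, m[1], m[2]
}

func main() {
	for _, sel := range []string{"a:attr(href)", "div.content:html", "li:get(2)", "h1"} {
		base, kind, arg := splitPseudo(sel)
		fmt.Printf("selector=%q kind=%q arg=%q\n", base, kind, arg)
	}
}
```

The point of the sketch is only that the part before the pseudo-selector is an ordinary CSS selector, while the suffix chooses what is extracted from each matched element.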
Example
package main

import (
	"fmt"
	"log"

	"github.com/msoap/html2data"
)

func main() {
	doc := html2data.FromURL("http://example.com")
	// or with config
	// doc := html2data.FromURL("http://example.com", html2data.URLCfg{UA: "userAgent", TimeOut: 10, DontDetectCharset: false})
	if doc.Err != nil {
		log.Fatal(doc.Err)
	}

	// get title
	title, _ := doc.GetDataSingle("title")
	fmt.Println("Title is:", title)

	title, _ = doc.GetDataSingle("title", html2data.Cfg{DontTrimSpaces: true})
	fmt.Println("Title as is, with spaces:", title)

	texts, _ := doc.GetData(map[string]string{"h1": "h1", "links": "a:attr(href)"})
	// get all H1 headers:
	if textOne, ok := texts["h1"]; ok {
		for _, text := range textOne {
			fmt.Println(text)
		}
	}
	// get all urls from links
	if links, ok := texts["links"]; ok {
		for _, text := range links {
			fmt.Println(text)
		}
	}
}
Command line utility
Usage
html2data [options] URL "css selector"
html2data [options] URL :name1 "css1" :name2 "css2"...
html2data [options] file.html "css selector"
cat file.html | html2data "css selector"
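The named form in the usage lines above (:name1 "css1" :name2 "css2") pairs each result name with a selector, which resembles the map[string]string taken by doc.GetData. The sketch below shows one way such pairs could be turned into that map; the function namedSelectors is hypothetical and is not the utility's actual argument handling.

```go
package main

import (
	"fmt"
	"strings"
)

// namedSelectors is a hypothetical helper that converts arguments of the
// form `:name1 css1 :name2 css2 ...` into a name-to-selector map.
func namedSelectors(args []string) (map[string]string, error) {
	if len(args)%2 != 0 {
		return nil, fmt.Errorf("expected name/selector pairs, got %d arguments", len(args))
	}
	out := make(map[string]string, len(args)/2)
	for i := 0; i < len(args); i += 2 {
		name, css := args[i], args[i+1]
		// Each name must start with ":" and contain at least one more character.
		if !strings.HasPrefix(name, ":") || len(name) == 1 {
			return nil, fmt.Errorf("argument %q is not of the form :name", name)
		}
		out[strings.TrimPrefix(name, ":")] = css
	}
	return out, nil
}

func main() {
	selectors, err := namedSelectors([]string{":title", "title", ":links", "a:attr(href)"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(selectors["title"], selectors["links"])
}
```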
Options
-user-agent="Custom UA" -- set custom user-agent
-find-in="outer.css.selector" -- search in the specified elements instead of the whole document
-json -- get result as JSON
-dont-trim-spaces -- get text as is
-dont-detect-charset -- don't detect charset and convert text
-timeout=10 -- setting timeout when loading the URL
Install
Download binaries from: releases (OS X/Linux/Windows/RaspberryPi)
Or install from Homebrew (macOS):
brew tap msoap/tools
brew install html2data
# update:
brew upgrade html2data
Using snap (Ubuntu or any Linux distribution with snap):
# install stable version:
sudo snap install html2data
# install the latest version:
sudo snap install --edge html2data
# update
sudo snap refresh html2data
From source:
go get -u github.com/msoap/html2data/cmd/html2data
Examples
Get title of page:
html2data https://golang.org/ title
Last blog posts:
html2data https://blog.golang.org/ h3
Getting RSS URL:
html2data https://blog.golang.org/ 'link[type="application/atom+xml"]:attr(href)'
More examples are available in the project wiki.