Programming language: Go
License: MIT License
Latest version: v1.2.1
html2data
Library and cli-utility for extracting data from HTML via CSS selectors
Install
Install package and command line utility:
go get -u github.com/msoap/html2data/cmd/html2data
Install package only:
go get -u github.com/msoap/html2data
Methods
FromReader(io.Reader) - create document for parsing
FromURL(URL, [config URLCfg]) - create document from http(s) URL
FromFile(file) - create document from local file
doc.GetData(css map[string]string) - get texts by CSS selectors
doc.GetDataFirst(css map[string]string) - get texts by CSS selectors, get first entry for each selector or ""
doc.GetDataNested(outerCss string, css map[string]string) - extract nested data by CSS selectors from another CSS selector
doc.GetDataNestedFirst(outerCss string, css map[string]string) - extract nested data by CSS selectors from another CSS selector, get first entry for each selector or ""
doc.GetDataSingle(css string) - get one result by one CSS selector
or with config:
doc.GetData(css map[string]string, html2data.Cfg{DontTrimSpaces: true})
doc.GetDataNested(outerCss string, css map[string]string, html2data.Cfg{DontTrimSpaces: true})
doc.GetDataSingle(css string, html2data.Cfg{DontTrimSpaces: true})
Pseudo-selectors
:attr(attr_name) - getting attribute instead of text, for example getting urls from links: a:attr(href)
:html - getting HTML instead of text
:get(N) - getting n-th element from list
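To make the syntax concrete, here is a standard-library sketch that splits a selector into its plain CSS part and the trailing pseudo-selectors listed above. It is not the library's own parser: the `selectorParts` type, the `splitSelector` function and the idea that several pseudo-selectors may be chained are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// selectorParts is a hypothetical type for this sketch.
type selectorParts struct {
	Base  string // plain CSS selector with the pseudo-selectors removed
	Attr  string // attribute name from :attr(name), "" if absent
	HTML  bool   // true if :html was present
	Index int    // N from :get(N) exactly as written, -1 if absent
}

// pseudoRe matches one html2data-style pseudo-selector at the end of a string.
var pseudoRe = regexp.MustCompile(`:(attr\(([^)]+)\)|html|get\((\d+)\))$`)

// splitSelector strips trailing :attr(...), :html and :get(N) suffixes.
// Standard CSS pseudo-classes such as :hover are left in Base untouched.
func splitSelector(raw string) selectorParts {
	parts := selectorParts{Index: -1}
	for {
		m := pseudoRe.FindStringSubmatch(raw)
		if m == nil {
			break
		}
		switch {
		case m[2] != "":
			parts.Attr = m[2]
		case m[3] != "":
			parts.Index, _ = strconv.Atoi(m[3])
		default:
			parts.HTML = true
		}
		raw = raw[:len(raw)-len(m[0])]
	}
	parts.Base = raw
	return parts
}

func main() {
	for _, s := range []string{"a:attr(href)", "div.post:html", "li:get(2)"} {
		fmt.Printf("%-15s %+v\n", s, splitSelector(s))
	}
}
```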
Example
package main

import (
	"fmt"
	"log"

	"github.com/msoap/html2data"
)

func main() {
	doc := html2data.FromURL("http://example.com")
	// or with config
	// doc := html2data.FromURL("http://example.com", html2data.URLCfg{UA: "userAgent", TimeOut: 10, DontDetectCharset: false})
	if doc.Err != nil {
		log.Fatal(doc.Err)
	}

	// get title
	title, _ := doc.GetDataSingle("title")
	fmt.Println("Title is:", title)

	title, _ = doc.GetDataSingle("title", html2data.Cfg{DontTrimSpaces: true})
	fmt.Println("Title as is, with spaces:", title)

	texts, _ := doc.GetData(map[string]string{"h1": "h1", "links": "a:attr(href)"})
	// get all H1 headers:
	if textOne, ok := texts["h1"]; ok {
		for _, text := range textOne {
			fmt.Println(text)
		}
	}
	// get all urls from links
	if links, ok := texts["links"]; ok {
		for _, text := range links {
			fmt.Println(text)
		}
	}
}
Command line utility
Usage
html2data [options] URL "css selector"
html2data [options] URL :name1 "css1" :name2 "css2"...
html2data [options] file.html "css selector"
cat file.html | html2data "css selector"
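The `:name "css"` form in the second usage line pairs each name with a selector. The sketch below shows one way such arguments could be folded into the `map[string]string` that `doc.GetData` takes. It is an illustration only: `pairNamedSelectors` is a hypothetical helper, and the actual argument handling of the utility may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// pairNamedSelectors (hypothetical) turns [":name1", "css1", ":name2", "css2"]
// into {"name1": "css1", "name2": "css2"}.
func pairNamedSelectors(args []string) (map[string]string, error) {
	if len(args)%2 != 0 {
		return nil, fmt.Errorf("expected :name/selector pairs, got %d arguments", len(args))
	}

	selectors := make(map[string]string, len(args)/2)
	for i := 0; i < len(args); i += 2 {
		name, css := args[i], args[i+1]
		if !strings.HasPrefix(name, ":") || len(name) < 2 {
			return nil, fmt.Errorf("argument %q is not of the form :name", name)
		}
		selectors[strings.TrimPrefix(name, ":")] = css
	}

	return selectors, nil
}

func main() {
	selectors, err := pairNamedSelectors([]string{":title", "title", ":links", "a:attr(href)"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(selectors)
}
```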
Options
-user-agent="Custom UA" -- set custom user-agent
-find-in="outer.css.selector" -- search in the specified elements instead of the whole document
-json -- get result as JSON
-dont-trim-spaces -- get text as is
-dont-detect-charset -- don't detect charset and convert text
-timeout=10 -- set timeout for loading the URL
Install
Download binaries from: releases (OS X/Linux/Windows/RaspberryPi)
Or install from Homebrew (macOS):
brew tap msoap/tools
brew install html2data
# update:
brew upgrade html2data
Using snap (Ubuntu or any Linux distribution with snap):
# install stable version:
sudo snap install html2data
# install the latest version:
sudo snap install --edge html2data
# update
sudo snap refresh html2data
From source:
go get -u github.com/msoap/html2data/cmd/html2data
Examples
Get title of page:
html2data https://golang.org/ title
Last blog posts:
html2data https://blog.golang.org/ h3
Getting RSS URL:
html2data https://blog.golang.org/ 'link[type="application/atom+xml"]:attr(href)'
More examples are available in the project wiki.