Programming language: Go

License: MIT License

Tags: HTML Css Utility Text Processing

Latest version: v1.2.1

html2data

Library and command-line utility for extracting data from HTML via CSS selectors.

Install

Install package and command line utility:

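go get -u github.com/msoap/html2data/cmd/html2data

Since Go 1.18, go get no longer builds or installs binaries; on Go 1.16 or later use go install instead:

go install github.com/msoap/html2data/cmd/html2data@latest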

Install package only:

go get -u github.com/msoap/html2data

Methods

FromReader(io.Reader) - create a document for parsing from an io.Reader
FromURL(URL, [config URLCfg]) - create a document from an http(s) URL
FromFile(file) - create a document from a local file
doc.GetData(css map[string]string) - get texts by CSS selectors
doc.GetDataFirst(css map[string]string) - get texts by CSS selectors, returning the first entry for each selector or ""
doc.GetDataNested(outerCss string, css map[string]string) - extract nested data by CSS selectors within each element matched by the outer CSS selector
doc.GetDataNestedFirst(outerCss string, css map[string]string) - same as GetDataNested, but returning the first entry for each selector or ""
doc.GetDataSingle(css string) - get one result by one CSS selector

or with config:

doc.GetData(css map[string]string, html2data.Cfg{DontTrimSpaces: true})
doc.GetDataNested(outerCss string, css map[string]string, html2data.Cfg{DontTrimSpaces: true})
doc.GetDataSingle(css string, html2data.Cfg{DontTrimSpaces: true})

Pseudo-selectors

:attr(attr_name) - get the attribute value instead of the text, for example to get URLs from links: a:attr(href)
:html - get the HTML instead of the text
:get(N) - get the N-th element from the list
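
To make the syntax of these suffixes concrete, here is a minimal sketch of how a selector string could be split into a plain CSS selector and a trailing pseudo-selector. It uses only the standard library and is not html2data's actual implementation; the parsedSelector type, the parseSelector function, and the regular expression are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// parsedSelector is a hypothetical type for this sketch, not part of html2data.
type parsedSelector struct {
	CSS    string // selector with the trailing pseudo-selector removed
	Attr   string // name from :attr(name), empty if absent
	HTML   bool   // true when :html was given
	Get    int    // N from :get(N)
	HasGet bool   // true when :get(N) was given
}

// pseudoRe matches one trailing :attr(name), :html or :get(N).
var pseudoRe = regexp.MustCompile(`^(.*?):(?:attr\(([^()]+)\)|(html)|get\((\d+)\))$`)

// parseSelector splits raw into the CSS part and the pseudo-selector part.
func parseSelector(raw string) (parsedSelector, error) {
	raw = strings.TrimSpace(raw)
	m := pseudoRe.FindStringSubmatch(raw)
	if m == nil {
		// No recognised suffix: pass the selector through unchanged,
		// so ordinary pseudo-classes such as :first-child still work.
		return parsedSelector{CSS: raw}, nil
	}
	p := parsedSelector{CSS: m[1], Attr: m[2], HTML: m[3] != ""}
	if m[4] != "" {
		n, err := strconv.Atoi(m[4])
		if err != nil {
			return parsedSelector{}, fmt.Errorf("bad :get index in %q: %w", raw, err)
		}
		p.Get, p.HasGet = n, true
	}
	return p, nil
}

func main() {
	for _, s := range []string{"a:attr(href)", "div.content:html", "li:get(2)", "li:first-child"} {
		p, err := parseSelector(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> %+v\n", s, p)
	}
}
```

Standard CSS pseudo-classes are left in the CSS part because the pattern only recognises the three suffixes listed above.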

Example

package main

import (
    "fmt"
    "log"

    "github.com/msoap/html2data"
)

func main() {
    doc := html2data.FromURL("http://example.com")
    // or with config
    // doc := html2data.FromURL("http://example.com", html2data.URLCfg{UA: "userAgent", TimeOut: 10, DontDetectCharset: false})
    if doc.Err != nil {
        log.Fatal(doc.Err)
    }

    // get the page title, with surrounding spaces trimmed (the default)
    title, err := doc.GetDataSingle("title")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("Title is:", title)

    // get the title as is, without trimming spaces
    title, err = doc.GetDataSingle("title", html2data.Cfg{DontTrimSpaces: true})
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("Title as is, with spaces:", title)

    // run several selectors at once; results are keyed by the names in the map
    texts, err := doc.GetData(map[string]string{"h1": "h1", "links": "a:attr(href)"})
    if err != nil {
        log.Fatal(err)
    }
    // print all H1 headers
    for _, text := range texts["h1"] {
        fmt.Println(text)
    }
    // print all URLs from links
    for _, text := range texts["links"] {
        fmt.Println(text)
    }
}

Command line utility

Usage

html2data [options] URL "css selector"
html2data [options] URL :name1 "css1" :name2 "css2"...
html2data [options] file.html "css selector"
cat file.html | html2data "css selector"
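
The named form (:name1 "css1" :name2 "css2") pairs each result name with a selector. A minimal sketch of turning such arguments into the kind of map that doc.GetData accepts could look like the following; parseNamedSelectors is a hypothetical helper written for this example and does not reproduce the utility's own argument handling.

```go
package main

import (
	"fmt"
	"strings"
)

// parseNamedSelectors is a hypothetical helper: it converts arguments of the
// form [":name1", "css1", ":name2", "css2"] into a name -> selector map.
func parseNamedSelectors(args []string) (map[string]string, error) {
	if len(args) == 0 || len(args)%2 != 0 {
		return nil, fmt.Errorf("expected :name \"css\" pairs, got %d arguments", len(args))
	}
	selectors := make(map[string]string, len(args)/2)
	for i := 0; i < len(args); i += 2 {
		name, css := args[i], args[i+1]
		if !strings.HasPrefix(name, ":") || len(name) == 1 {
			return nil, fmt.Errorf("argument %q is not a :name", name)
		}
		selectors[strings.TrimPrefix(name, ":")] = css
	}
	return selectors, nil
}

func main() {
	selectors, err := parseNamedSelectors([]string{":title", "title", ":links", "a:attr(href)"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(selectors["title"], selectors["links"])
}
```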

Options

-user-agent="Custom UA" -- set a custom user-agent
-find-in="outer.css.selector" -- search in the specified elements instead of the whole document
-json -- get the result as JSON
-dont-trim-spaces -- get text as is
-dont-detect-charset -- don't detect the charset and convert the text
-timeout=10 -- set the timeout for loading the URL

Install

Download binaries from: releases (OS X/Linux/Windows/RaspberryPi)

Or install from Homebrew (macOS):

brew tap msoap/tools
brew install html2data
# update:
brew upgrade html2data

Using snap (Ubuntu or any Linux distribution with snap):

# install stable version:
sudo snap install html2data

# install the latest version:
sudo snap install --edge html2data

# update
sudo snap refresh html2data

From source (Go 1.16 or later):

go install github.com/msoap/html2data/cmd/html2data@latest

Examples

Get title of page:

html2data https://golang.org/ title

Last blog posts:

html2data https://blog.golang.org/ h3

Getting RSS URL:

html2data https://blog.golang.org/ 'link[type="application/atom+xml"]:attr(href)'

More examples are available in the project wiki.
