Popularity

6.9

Stable

Activity

0.0

Stable

Stars 636

Watchers 24

Forks 80

Last Commit about 1 year ago

Description

Dataflow kit (DFK) is a scraping framework for Gophers. It extracts structured data from web pages, following the specified extractors.

It can be used in many ways for data mining, data processing or archiving.

Programming language: Go

License: BSD 3-clause "New" or "Revised" License

Tags: HTML Text Processing Parsing

Dataflow kit alternatives and similar packages

Based on the "Text Processing" category.

micro-editor

9.9 8.9 Dataflow kit VS micro-editor

A modern and intuitive terminal-based text editor
GoQuery

9.7 6.6 Dataflow kit VS GoQuery

A little like that j-thing, only in Go.

sh

9.2 7.8 Dataflow kit VS sh

A shell parser, formatter, and interpreter with bash support; includes shfmt
blackfriday

9.1 0.0 Dataflow kit VS blackfriday

Blackfriday: a markdown processor for Go
toml

9.0 7.6 Dataflow kit VS toml

TOML parser for Golang with reflection.
go-humanize

8.8 3.2 Dataflow kit VS go-humanize

Go Humans! (formatters for units to human friendly sizes)
goldmark

8.6 6.9 Dataflow kit VS goldmark

:trophy: A markdown parser written in Go. Easy to extend, standard(CommonMark) compliant, well structured.
bluemonday

8.5 5.6 Dataflow kit VS bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
gofeed

8.3 6.1 Dataflow kit VS gofeed

Parse RSS, Atom and JSON feeds in Go
inject

7.9 0.0 Dataflow kit VS inject

Package inject provides a reflect based injector.
xurls

7.5 6.4 Dataflow kit VS xurls

Extract urls from text
slug

7.4 5.5 Dataflow kit VS slug

URL-friendly slugify with multiple languages support.
commonregex

7.2 0.0 Dataflow kit VS commonregex

🍫 A collection of common regular expressions for Go
html-to-markdown

6.9 5.3 Dataflow kit VS html-to-markdown

⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.
htmlquery

6.9 5.2 Dataflow kit VS htmlquery

htmlquery is golang XPath package for HTML query.
mxj

6.9 4.5 Dataflow kit VS mxj

Decode / encode XML to/from map[string]interface{} (or JSON); extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages.
xpath

6.8 6.8 Dataflow kit VS xpath

XPath package for Golang, supports HTML, XML, JSON document query.
go-runewidth

6.8 2.6 Dataflow kit VS go-runewidth

wcwidth for golang
omniparser

6.7 4.6 Dataflow kit VS omniparser

omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.
gographviz

6.6 1.4 Dataflow kit VS gographviz

Parses the Graphviz DOT language in golang
Koazee

6.5 0.0 Dataflow kit VS Koazee

A StreamLike, Immutable, Lazy Loading and smart Golang Library to deal with slices.
go-pkg-rss

6.4 0.0 Dataflow kit VS go-pkg-rss

This package reads RSS and Atom feeds and provides a caching mechanism that adheres to the feed specs.
gotext

6.3 5.1 Dataflow kit VS gotext

Go (Golang) GNU gettext utilities package
go-edlib

6.2 1.8 Dataflow kit VS go-edlib

📚 String comparison and edit distance algorithms library, featuring : Levenshtein, LCS, Hamming, Damerau levenshtein (OSA and Adjacent transpositions algorithms), Jaro-Winkler, Cosine, etc...
gotabulate

5.8 0.0 Dataflow kit VS gotabulate

Gotabulate - Easily pretty-print your tabular data with Go
go-nmea

5.7 3.0 Dataflow kit VS go-nmea

A NMEA parser library in pure Go
goribot

5.5 6.1 Dataflow kit VS goribot

A simple golang spider/scraping framework,build a spider in 3 lines.
strutil-go

5.5 5.7 Dataflow kit VS strutil-go

Golang metrics for calculating string similarity and other string utility functions
goq

5.4 0.0 Dataflow kit VS goq

A declarative struct-tag-based HTML unmarshaling or scraping package for Go built on top of the goquery library
xquery

5.3 0.0 Dataflow kit VS xquery

XQuery lets you extract data from HTML/XML documents using XPath expression.
gospider

5.2 3.6 Dataflow kit VS gospider

⚡ Light weight Golang spider framework | 轻量的 Golang 爬虫框架
github_flavored_markdown

5.2 0.0 Dataflow kit VS github_flavored_markdown

GitHub Flavored Markdown renderer with fenced code block highlighting, clickable header anchor links.
go-pkg-xmlx

5.1 0.0 Dataflow kit VS go-pkg-xmlx

Extension to the standard Go XML package. Maintains a node tree that allows forward/backwards browsing and exposes some simple single/multi-node search functions.
radix

5.0 0.0 Dataflow kit VS radix

A fast string sorting algorithm (MSD radix sort)
sdp

4.9 0.0 Dataflow kit VS sdp

RFC 4566 SDP implementation in go
shell2telegram

4.9 3.8 Dataflow kit VS shell2telegram

Telegram bot constructor from command-line
editorconfig-core-go

4.9 6.5 Dataflow kit VS editorconfig-core-go

EditorConfig Core written in Go
podcast

4.8 0.0 Dataflow kit VS podcast

iTunes and RSS 2.0 Podcast Generator in Golang
go-vcard

4.6 4.4 Dataflow kit VS go-vcard

A Go library to parse and format vCard
did

4.5 0.0 Dataflow kit VS did

A golang package to work with Decentralized Identifiers (DIDs)
regroup

4.5 4.2 Dataflow kit VS regroup

Match regex group into go struct using struct tags and automatic parsing
go-fixedwidth

4.3 4.4 Dataflow kit VS go-fixedwidth

Encoding and decoding for fixed-width formatted data
go-zero-width

4.2 0.0 Dataflow kit VS go-zero-width

Zero-width character detection and removal for Go
goregen

4.2 0.0 Dataflow kit VS goregen

randexp for Go.
cat

4.1 3.8 Dataflow kit VS cat

Extract text from plaintext, .docx, .odt and .rtf files. Pure go.
frontmatter

3.9 4.2 Dataflow kit VS frontmatter

Go library for detecting and decoding various content front matter formats
Ren'Py graph vizualiser

3.9 4.8 Dataflow kit VS Ren'Py graph vizualiser

Draws a flowchart graph of any Visual Novel from Renpy .rpy files !
pagser

3.9 2.7 Dataflow kit VS pagser

Pagser is a simple, extensible, configurable parse and deserialize html page to struct based on goquery and struct tags for golang crawler
go-slugify

3.9 0.0 Dataflow kit VS go-slugify

Pretty Slug.
align

3.8 1.8 Dataflow kit VS align

A general purpose application and library for aligning text.

README

Dataflow kit

Dataflow kit ("DFK") is a Web Scraping framework for Gophers. It extracts data from web pages, following the specified CSS Selectors.

You can use it in many ways for data mining, data processing or archiving.

The Web Scraping Pipeline

A web-scraping pipeline consists of three general components:

Downloading an HTML web page (Fetch service).
Parsing the HTML page and retrieving the data we are interested in (Parse service).
Encoding the parsed data to CSV, MS Excel, JSON, JSON Lines or XML format.
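
The three stages above can be sketched in a few lines of Go using only the standard library. This is an illustration of the pipeline shape, not DFK's code: the function names (fetch, parse, encode, runPipeline) are hypothetical, a naive regular expression stands in for real CSS-selector extraction, and a local test server replaces a real website so the example runs offline.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"regexp"
	"strings"
)

// fetch downloads the raw HTML of a page (stage 1).
func fetch(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// titleRe is a deliberately naive stand-in for a CSS-selector engine.
var titleRe = regexp.MustCompile(`(?is)<h1[^>]*>(.*?)</h1>`)

// parse extracts the fields of interest from the HTML (stage 2).
func parse(html string) map[string]string {
	out := map[string]string{}
	if m := titleRe.FindStringSubmatch(html); m != nil {
		out["Title"] = strings.TrimSpace(m[1])
	}
	return out
}

// encode serializes the parsed record, here as JSON (stage 3).
func encode(rec map[string]string) (string, error) {
	b, err := json.Marshal(rec)
	return string(b), err
}

// runPipeline chains the three stages for a single URL.
func runPipeline(url string) (string, error) {
	html, err := fetch(url)
	if err != nil {
		return "", err
	}
	return encode(parse(html))
}

func main() {
	// A local test server keeps the sketch runnable without network access.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body><h1> Sample Book </h1></body></html>")
	}))
	defer srv.Close()

	out, err := runPipeline(srv.URL)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out) // prints {"Title":"Sample Book"}
}
```

In DFK the stages run as separate services (fetch.d and parse.d), so each one can be swapped or scaled independently; the single-process version here only shows the data flow.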

Fetch service

The fetch.d server downloads the content of HTML web pages. Depending on the fetcher type, page content is downloaded using either the Base fetcher or the Chrome fetcher.

The Base fetcher uses the standard Go HTTP client to fetch pages as is. It is faster than the Chrome fetcher, but it cannot render dynamic, JavaScript-driven web pages.

The Chrome fetcher is intended for rendering dynamic, JavaScript-based content. It sends requests to Chrome running in headless mode.

A fetched web page is passed to the parse.d service.

Parse service

parse.d is the service that extracts data from a downloaded web page, following the rules listed in a JSON configuration file. Extracted data is returned in CSV, MS Excel, JSON or XML format.

Note: Sometimes the Parse service cannot extract data from pages retrieved by the default Base fetcher; parsing JavaScript-generated pages may return empty results. In that case the Parse service automatically attempts to force the Chrome fetcher to render the same dynamic content. See https://scrape.dataflowkit.com/persons/page-0 for a sample JavaScript-driven web page.
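
The fallback described in the note can be sketched as follows. This is a minimal illustration under stated assumptions, not DFK's implementation: the Fetcher interface, stubFetcher, parseWithFallback and extractPersons are all hypothetical names, and the stub fetchers return canned HTML so the example runs without a network connection or a Chrome instance.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Fetcher is a hypothetical interface; DFK's real types may differ.
type Fetcher interface {
	Fetch(url string) (string, error)
}

// stubFetcher returns canned HTML instead of performing a request.
type stubFetcher struct{ html string }

func (s stubFetcher) Fetch(string) (string, error) { return s.html, nil }

// extractPersons is a naive placeholder for selector-based extraction:
// it returns the text of every <li class="person"> element.
func extractPersons(html string) []string {
	const openTag, closeTag = `<li class="person">`, `</li>`
	var out []string
	for {
		i := strings.Index(html, openTag)
		if i < 0 {
			return out
		}
		html = html[i+len(openTag):]
		j := strings.Index(html, closeTag)
		if j < 0 {
			return out
		}
		out = append(out, strings.TrimSpace(html[:j]))
		html = html[j+len(closeTag):]
	}
}

// parseWithFallback tries the cheap base fetcher first and re-fetches with
// the rendering fetcher only when extraction yields nothing. The second
// return value names the fetcher that produced the result.
func parseWithFallback(url string, base, chrome Fetcher, extract func(string) []string) ([]string, string, error) {
	html, err := base.Fetch(url)
	if err != nil {
		return nil, "base", err
	}
	if items := extract(html); len(items) > 0 {
		return items, "base", nil
	}
	html, err = chrome.Fetch(url)
	if err != nil {
		return nil, "chrome", err
	}
	items := extract(html)
	if len(items) == 0 {
		return nil, "chrome", errors.New("no data extracted")
	}
	return items, "chrome", nil
}

func main() {
	// The base fetcher sees the page before scripts run, so the list is empty.
	base := stubFetcher{html: `<ul id="persons"></ul><script>/* fills the list */</script>`}
	// The rendering fetcher sees the list after scripts have populated it.
	chrome := stubFetcher{html: `<ul id="persons"><li class="person">Ada</li><li class="person">Linus</li></ul>`}

	items, used, err := parseWithFallback("https://example.com/persons", base, chrome, extractPersons)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(items, used)
}
```

Trying the Base fetcher first keeps the common case fast, and the slower headless-Chrome path is paid for only on pages that need rendering.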

Dataflow kit benefits:

Scraping of JavaScript-generated pages;
Data extraction from paginated websites;
Processing of infinite-scroll pages;
Scraping of websites behind a login form;
Cookies and sessions handling;
Following links and processing detail pages;
Managing delays between requests per domain;
Following robots.txt directives;
Saving intermediate data in Diskv or MongoDB. The storage interface is flexible enough to add more storage types easily;
Encoding results to CSV, MS Excel, JSON (Lines) and XML formats;
Dataflow kit is fast. It takes about 4-6 seconds to fetch and then parse 50 pages;
Dataflow kit is suitable for processing quite large volumes of data. Our tests show that parsing approximately 4 million pages takes about 7 hours.

Installation

go get -u github.com/slotix/dataflowkit

Usage

Docker

Install Docker and Docker Compose.
Start the services:

cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose up

This command fetches docker images automatically and starts services.

Launch parsing in a second terminal window by sending a POST request to the parse daemon. Some JSON configuration files for testing are available in the /examples folder.

curl -XPOST 127.0.0.1:8001/parse --data-binary "@$GOPATH/src/github.com/slotix/dataflowkit/examples/books.toscrape.com.json"

Here is a sample JSON configuration file:

{
    "name":"collection",
    "request":{
       "url":"https://example.com"
    },
    "fields":[
       {
          "name":"Title",
          "selector":".product-container a",
          "extractor":{
             "types":["text", "href"],
             "filters":[
                "trim",
                "lowerCase"
             ],
             "params":{
                "includeIfEmpty":false
             }
          }
       },
       {
          "name":"Image",
          "selector":"#product-container img",
          "extractor":{
             "types":["alt","src","width","height"],
             "filters":[
                "trim",
                "upperCase"
             ]
          }
       },
       {
          "name":"Buyinfo",
          "selector":".buy-info",
          "extractor":{
             "types":["text"],
             "params":{
                "includeIfEmpty":false
             }
          }
       }
    ],
    "paginator":{
       "selector":".next",
       "attr":"href",
       "maxPages":3
    },
    "format":"json",
    "fetcherType":"chrome",
    "paginateResults":false
}

Read more information about scraper configuration JSON files at our GoDoc reference

Extractors and filters are described at https://godoc.org/github.com/slotix/dataflowkit/extract
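
As a rough illustration of how a "filters" list such as ["trim", "lowerCase"] might be applied to an extracted value, the sketch below runs named string transforms in order. The behaviour of each filter is an assumption inferred from its name in the sample configuration; the extract package documentation linked above is the authoritative description. The names filters and applyFilters are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// filters maps the filter names used in the sample config to string
// transforms. The mapping is an assumption based on the names alone.
var filters = map[string]func(string) string{
	"trim":      strings.TrimSpace,
	"lowerCase": strings.ToLower,
	"upperCase": strings.ToUpper,
}

// applyFilters runs the named filters over value in the order given.
func applyFilters(value string, names []string) (string, error) {
	for _, n := range names {
		f, ok := filters[n]
		if !ok {
			return "", fmt.Errorf("unknown filter %q", n)
		}
		value = f(value)
	}
	return value, nil
}

func main() {
	out, err := applyFilters("  A Light in the Attic \n", []string{"trim", "lowerCase"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", out) // prints "a light in the attic"
}
```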

To stop the services, press Ctrl+C and run:

cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose down --remove-orphans --volumes

Manual way

Start the Chrome Docker container:

docker run --init -it --rm -d --name chrome --shm-size=1024m -p=127.0.0.1:9222:9222 --cap-add=SYS_ADMIN yukinying/chrome-headless-browser

Headless Chrome is used for fetching web pages to feed a Dataflow kit parser.

Build and run the fetch.d service:

cd $GOPATH/src/github.com/slotix/dataflowkit/cmd/fetch.d && go build && ./fetch.d

In a new terminal window, build and run the parse.d service:

cd $GOPATH/src/github.com/slotix/dataflowkit/cmd/parse.d && go build && ./parse.d

Launch parsing. See step 3 in the previous section.

Run tests

docker-compose -f test-docker-compose.yml up -d
./test.sh

To stop the services, run:

docker-compose -f test-docker-compose.yml down

Front-End

Try the point-and-click front-end to Dataflow kit services at https://dataflowkit.com/dfk. It generates a JSON config file and sends a POST request to the DFK parser.

License

This is Free Software, released under the BSD 3-Clause License.

Contributing

You are welcome to contribute to our project.

Please submit your issues
Fork the project

*Note that all licence references and agreements mentioned in the Dataflow kit README section above are relevant to that project's source code only.
