segment alternatives and similar packages
Based on the "Natural Language Processing" category.
- prose: DISCONTINUED. :book: A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.
- gse: Go efficient multilingual NLP and text segmentation; supports English, Chinese, Japanese, and others.
- universal-translator: :speech_balloon: i18n translator for Go/Golang using CLDR data and pluralization rules.
- locales: :earth_americas: a set of locales generated from the CLDR Project which can be used independently or within an i18n package; built for use with, but not exclusive to, https://github.com/go-playground/universal-translator
- go-nlp: DISCONTINUED. Utilities for working with discrete probability distributions and other tools useful for doing NLP work.
- go-localize: i18n (internationalization and localization) engine written in Go, used for translating locale strings.
- gotokenizer: A tokenizer for Go based on dictionary and bigram language models. (Currently only Chinese segmentation is supported.)
README
segment
A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29
Features
- Currently only segmentation at Word Boundaries is supported.
License
Apache License Version 2.0
Usage
The functionality is exposed in two ways:
You can use a bufio.Scanner with the SplitWords implementation of SplitFunc. The SplitWords function identifies the appropriate word boundaries in the input text, and the Scanner returns tokens at those boundaries.

```go
scanner := bufio.NewScanner(...)
scanner.Split(segment.SplitWords)
for scanner.Scan() {
	tokenBytes := scanner.Bytes()
}
if err := scanner.Err(); err != nil {
	t.Fatal(err)
}
```
Sometimes you would also like information about the type of each token. For this, the package introduces a new type named Segmenter. It works just like Scanner, but additionally returns a token type.

```go
segmenter := segment.NewWordSegmenter(...)
for segmenter.Segment() {
	tokenBytes := segmenter.Bytes()
	tokenType := segmenter.Type()
}
if err := segmenter.Err(); err != nil {
	t.Fatal(err)
}
```
Choosing Implementation
By default, segment does NOT use the fastest runtime implementation. The reason is that it adds approximately 5s to compilation time and may require more than 1GB of RAM on the machine performing the compilation.

However, you can choose to build with the fastest runtime implementation by passing the build tag as follows:

```
-tags 'prod'
```
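As a concrete sketch (the package path is the one from this README; the exact invocation depends on your build setup):

```shell
# Default build: slower segmentation at runtime, but quick to compile.
go build github.com/blevesearch/segment

# Opt in to the fastest runtime implementation; expect roughly 5s of
# extra compile time and potentially over 1GB of RAM while compiling.
go build -tags 'prod' github.com/blevesearch/segment
```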
Generating Code
Several components in this package are generated.
- Several Ragel rules files are generated from Unicode properties files.
- The Ragel machine is generated from the Ragel rules.
- Test tables are generated from the Unicode test files.
All of these can be generated by running:
```
go generate
```
Fuzzing
There is support for fuzzing the segment library with go-fuzz.
Install go-fuzz if you haven't already:
```
go get github.com/dvyukov/go-fuzz/go-fuzz
go get github.com/dvyukov/go-fuzz/go-fuzz-build
```
Build the package with go-fuzz:
```
go-fuzz-build github.com/blevesearch/segment
```
Convert the Unicode provided test cases into the initial corpus for go-fuzz:
```
go test -v -run=TestGenerateWordSegmentFuzz -tags gofuzz_generate
```
Run go-fuzz:
```
go-fuzz -bin=segment-fuzz.zip -workdir=workdir
```