segment alternatives and similar packages
Based on the "Natural Language Processing" category.
Alternatively, view segment alternatives based on common mentions on social networks and blogs.
- prose: A library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more.
- gojieba: A Go implementation of jieba, a Chinese word segmentation algorithm.
- gse: Efficient text segmentation for Go; supports English, Chinese, Japanese, and other languages.
- spaGO: Self-contained Machine Learning and Natural Language Processing library in Go.
- whatlanggo: A natural language detection package for Go. Supports 84 languages and 24 scripts (writing systems, e.g. Latin, Cyrillic).
- sentences: A sentence tokenizer: converts text into a list of sentences.
- universal-translator: An i18n translator for Go/Golang using CLDR data and pluralization rules.
- locales: A set of locales generated from the CLDR Project which can be used independently or within an i18n package; built for use with, but not exclusive to, https://github.com/go-playground/universal-translator.
- RAKE.go: A Go port of the Rapid Automatic Keyword Extraction (RAKE) algorithm.
- go-nlp: Utilities for working with discrete probability distributions and other tools useful for NLP work.
- textcat: A Go package for n-gram-based text categorization, with support for UTF-8 and raw text.
- MMSEGO: A Go implementation of MMSEG, a Chinese word segmentation algorithm.
- stemmer: Stemmer packages for the Go programming language. Includes English and German stemmers.
- petrovich: A library that inflects Russian names to a given grammatical case.
- snowball: A Snowball stemmer port (cgo wrapper) for Go. Provides word stem extraction via the native Snowball implementation.
- go-localize: A simple and easy-to-use i18n (internationalization and localization) engine.
- golibstemmer: Go bindings for the Snowball libstemmer library, including Porter 2.
- libtextcat: Cgo binding for the libtextcat C library. Guaranteed compatibility with version 2.2.
- icu: Cgo binding for the icu4c C library's detection and conversion functions. Guaranteed compatibility with version 50.1.
- go-tinydate: A tiny date object in Go; uses only 4 bytes of memory.
- gotokenizer: A tokenizer for Go based on dictionary and bigram language models. (Currently only Chinese segmentation is supported.)
- porter: A fairly straightforward port of Martin Porter's C implementation of the Porter stemming algorithm.
- gosentiwordnet: A sentiment analyzer for Go using the SentiWordNet lexicon.
- go-eco: Similarity, dissimilarity, and distance matrices; diversity, equitability, and inequality measures; species richness estimators; coenocline models.
- detectlanguage: Language Detection API Go client. Supports batch requests and short-phrase or single-word language detection.
README
segment
A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29
Features
- Currently only segmentation at Word Boundaries is supported.
License
Apache License Version 2.0
Usage
The functionality is exposed in two ways:
You can use a bufio.Scanner with the SplitWords implementation of SplitFunc. The SplitWords function will identify the appropriate word boundaries in the input text and the Scanner will return tokens at the appropriate place.
```go
scanner := bufio.NewScanner(...)
scanner.Split(segment.SplitWords)
for scanner.Scan() {
	tokenBytes := scanner.Bytes()
	// process tokenBytes
}
if err := scanner.Err(); err != nil {
	log.Fatal(err)
}
```
Sometimes you would also like information about the type of each token. For this, there is a second type named Segmenter. It works just like Scanner, but additionally returns a token type.
```go
segmenter := segment.NewWordSegmenter(...)
for segmenter.Segment() {
	tokenBytes := segmenter.Bytes()
	tokenType := segmenter.Type()
	// process tokenBytes and tokenType
}
if err := segmenter.Err(); err != nil {
	log.Fatal(err)
}
```
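For intuition about what a token type conveys, here is a minimal, stdlib-only classifier in the same spirit. The constant names and values are illustrative assumptions for this sketch, not the segment library's actual constants (the library derives its types from Unicode properties).

```go
package main

import (
	"fmt"
	"unicode"
)

// Illustrative token types; these names and values are assumptions
// for this sketch, not the segment library's actual constants.
const (
	None   = 0 // punctuation, symbols, whitespace
	Letter = 1 // tokens containing letters
	Number = 2 // tokens containing digits
)

// classify assigns a coarse type to an already-segmented token,
// loosely mimicking what a per-token Type() accessor reports.
func classify(token string) int {
	for _, r := range token {
		switch {
		case unicode.IsLetter(r):
			return Letter
		case unicode.IsDigit(r):
			return Number
		}
	}
	return None
}

func main() {
	for _, tok := range []string{"hello", "世界", "42", "!"} {
		fmt.Printf("%q -> %d\n", tok, classify(tok))
	}
}
```

Because the classification is per token, it composes naturally with any tokenizer loop, including the Segmenter loop above.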
Choosing Implementation
By default, segment does NOT use the fastest runtime implementation. The reason is that it adds approximately 5 seconds to compilation time and may require more than 1 GB of RAM on the machine performing the compilation.
However, you can choose to build with the fastest runtime implementation by passing the build tag as follows:
```
-tags 'prod'
```
Generating Code
Several components in this package are generated.
- Several Ragel rules files are generated from Unicode properties files.
- Ragel machine is generated from the Ragel rules.
- Test tables are generated from the Unicode test files.
All of these can be generated by running:
```
go generate
```
Fuzzing
There is support for fuzzing the segment library with go-fuzz.
Install go-fuzz if you haven't already:
```
go get github.com/dvyukov/go-fuzz/go-fuzz
go get github.com/dvyukov/go-fuzz/go-fuzz-build
```
Build the package with go-fuzz:
```
go-fuzz-build github.com/blevesearch/segment
```
Convert the Unicode provided test cases into the initial corpus for go-fuzz:
```
go test -v -run=TestGenerateWordSegmentFuzz -tags gofuzz_generate
```
Run go-fuzz:
```
go-fuzz -bin=segment-fuzz.zip -workdir=workdir
```
*Note that all licence references and agreements mentioned in the segment README section above
are relevant to that project's source code only.