segment alternatives and similar packages
Based on the "Natural Language Processing" category.
- prose: DISCONTINUED. :book: A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.
- gse: Go efficient multilingual NLP and text segmentation; supports English, Chinese, Japanese, and others.
- universal-translator: :speech_balloon: i18n translator for Go/Golang using CLDR data and pluralization rules.
- locales: :earth_americas: a set of locales generated from the CLDR Project which can be used independently or within an i18n package; built for use with, but not exclusive to, https://github.com/go-playground/universal-translator
- go-nlp: DISCONTINUED. Utilities for working with discrete probability distributions and other tools useful for doing NLP work.
- go-localize: i18n (internationalization and localization) engine written in Go, used for translating locale strings.
- gotokenizer: A tokenizer for Go based on dictionary and bigram language models. (Currently only Chinese segmentation is supported.)
README
segment
A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29
Features
- Currently only segmentation at Word Boundaries is supported.
License
Apache License Version 2.0
Usage
The functionality is exposed in two ways:
You can use a bufio.Scanner with the SplitWords implementation of SplitFunc. The SplitWords function identifies the appropriate word boundaries in the input text, and the Scanner returns tokens at those boundaries.

```go
scanner := bufio.NewScanner(...)
scanner.Split(segment.SplitWords)
for scanner.Scan() {
	tokenBytes := scanner.Bytes()
}
if err := scanner.Err(); err != nil {
	t.Fatal(err)
}
```
Sometimes you would also like information about the type of each token. For this, the package introduces a new type named Segmenter. It works just like Scanner, but additionally returns a token type.

```go
segmenter := segment.NewWordSegmenter(...)
for segmenter.Segment() {
	tokenBytes := segmenter.Bytes()
	tokenType := segmenter.Type()
}
if err := segmenter.Err(); err != nil {
	t.Fatal(err)
}
```
Choosing Implementation
By default, segment does NOT use the fastest runtime implementation. The reason is that it adds approximately 5s to compilation time and may require more than 1GB of RAM on the machine performing the compilation.

However, you can choose to build with the fastest runtime implementation by passing the build tag as follows:

```
-tags 'prod'
```
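As a concrete sketch (the package path is the one from this README; the exact invocation depends on your build setup):

```shell
# Default build: slower segmentation at runtime, but quick to compile.
go build github.com/blevesearch/segment

# Opt in to the fastest runtime implementation; expect roughly 5s of
# extra compile time and potentially over 1GB of RAM while compiling.
go build -tags 'prod' github.com/blevesearch/segment
```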
Generating Code
Several components in this package are generated.
- Several Ragel rules files are generated from Unicode properties files.
- The Ragel machine is generated from the Ragel rules.
- Test tables are generated from the Unicode test files.
All of these can be generated by running:
```
go generate
```
Fuzzing
There is support for fuzzing the segment library with go-fuzz.
Install go-fuzz if you haven't already:
```
go get github.com/dvyukov/go-fuzz/go-fuzz
go get github.com/dvyukov/go-fuzz/go-fuzz-build
```
Build the package with go-fuzz:
```
go-fuzz-build github.com/blevesearch/segment
```
Convert the Unicode provided test cases into the initial corpus for go-fuzz:
```
go test -v -run=TestGenerateWordSegmentFuzz -tags gofuzz_generate
```
Run go-fuzz:
```
go-fuzz -bin=segment-fuzz.zip -workdir=workdir
```