gotokenizer alternatives and similar packages
Based on the "Natural Language Processing" category.
- prose — A library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more.
- gojieba — A Go implementation of jieba, a Chinese word segmentation algorithm.
- gse — Go efficient text segmentation; supports English, Chinese, Japanese, and more.
- go-i18n — A package and an accompanying tool for working with localized text.
- spaGO — Self-contained machine learning and natural language processing library in Go.
- whatlanggo — A natural language detection package for Go. Supports 84 languages and 24 scripts (writing systems, e.g. Latin, Cyrillic).
- sentences — A sentence tokenizer: converts text into a list of sentences.
- locales — 🌎 A set of locales generated from the CLDR Project which can be used independently or within an i18n package; built for use with, but not exclusive to, https://github.com/go-playground/universal-translator
- universal-translator — 💬 An i18n translator for Go using CLDR data and pluralization rules.
- go-nlp — Utilities for working with discrete probability distributions and other tools useful for NLP work.
- RAKE.go — A Go port of the Rapid Automatic Keyword Extraction (RAKE) algorithm.
- gounidecode — A Unicode transliterator (also known as unidecode) for Go.
- segment — A Go library for performing Unicode text segmentation as described in Unicode Standard Annex #29.
- MMSEGO — A Go implementation of MMSEG, a Chinese word segmentation algorithm.
- textcat — A Go package for n-gram-based text categorization, with support for UTF-8 and raw text.
- stemmer — Stemmer packages for Go. Includes English and German stemmers.
- paicehusk — A Go implementation of the Paice/Husk stemming algorithm.
- petrovich — A library that inflects Russian names to a given grammatical case.
- snowball — A Snowball stemmer port (cgo wrapper) for Go that provides native Snowball word-stem extraction.
- go-localize — A simple, easy-to-use i18n (internationalization and localization) engine.
- golibstemmer — Go bindings for the Snowball libstemmer library, including Porter 2.
- icu — A cgo binding for the icu4c C library's detection and conversion functions. Guaranteed compatibility with version 50.1.
- libtextcat — A cgo binding for the libtextcat C library. Guaranteed compatibility with version 2.2.
- go-tinydate — A tiny date object in Go; a tinydate uses only 4 bytes of memory.
- porter — A fairly straightforward port of Martin Porter's C implementation of the Porter stemming algorithm.
- gosentiwordnet — A sentiment analyzer using the SentiWordNet lexicon in Go.
- go-eco — Similarity, dissimilarity, and distance matrices; diversity, equitability, and inequality measures; species richness estimators; coenocline models.
- detectlanguage — A Language Detection API Go client. Supports batch requests and short-phrase or single-word language detection.
README
gotokenizer

A tokenizer for Go based on dictionary and Bigram language models. (Currently only Chinese segmentation is supported.)
Motivation
I wanted a simple tokenizer with no unnecessary overhead, built on the standard library only, following good practices, with well-tested code.
Features
- Supports the maximum matching method
- Supports the minimum matching method
- Supports reverse maximum matching
- Supports reverse minimum matching
- Supports bidirectional maximum matching
- Supports bidirectional minimum matching
- Supports stop-token filtering
- Supports custom word filters
Installation
```
go get -u github.com/xujiajun/gotokenizer
```
Usage
```go
package main

import (
	"fmt"

	"github.com/xujiajun/gotokenizer"
)

func main() {
	text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器,支持6种分词算法。支持stopToken过滤和自定义word过滤功能。"
	dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"
	// NewMaxMatch's default wordFilter is NumAndLetterWordFilter
	mm := gotokenizer.NewMaxMatch(dictPath)
	// load the dictionary
	mm.LoadDict()
	fmt.Println(mm.Get(text)) // [gotokenizer 是 一款 基于 字典 和 Bigram 模型 纯 go 语言 编写 的 分词器 , 支持 6 种 分词 算法 。 支持 stopToken 过滤 和 自定义 word 过滤 功能 。] <nil>

	// enable stop-token filtering
	mm.EnabledFilterStopToken = true
	mm.StopTokens = gotokenizer.NewStopTokens()
	stopTokenDicPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/stop_tokens.txt"
	mm.StopTokens.Load(stopTokenDicPath)
	fmt.Println(mm.Get(text))          // [gotokenizer 一款 字典 Bigram 模型 go 语言 编写 分词器 支持 6 种 分词 算法 支持 stopToken 过滤 自定义 word 过滤 功能] <nil>
	fmt.Println(mm.GetFrequency(text)) // map[6:1 种:1 算法:1 过滤:2 支持:2 Bigram:1 模型:1 编写:1 gotokenizer:1 go:1 分词器:1 分词:1 word:1 功能:1 一款:1 语言:1 stopToken:1 自定义:1 字典:1] <nil>
}
```
For more examples, see the tests.
Contributing
If you'd like to help out with the project, you can open a pull request.
Author
License
gotokenizer is open-source software licensed under the Apache-2.0 license.
Acknowledgements
This package is inspired by the following: