Programming language: Go
License: Apache License 2.0
Latest version: v1.1.0

README

gotokenizer

A tokenizer for Go based on a dictionary and Bigram language models. (Currently only Chinese segmentation is supported.)

Motivation

I wanted a simple tokenizer with no unnecessary overhead that uses only the standard library, follows good practices, and is well tested.

Features

  • Support Maximum Matching Method
  • Support Minimum Matching Method
  • Support Reverse Maximum Matching
  • Support Reverse Minimum Matching
  • Support Bidirectional Maximum Matching
  • Support Bidirectional Minimum Matching
  • Support using Stop Tokens
  • Support Custom word Filter
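
To make the difference between the forward and reverse matching methods concrete, here is a minimal sketch of both. It is not gotokenizer's implementation: the function names `maxMatch` and `reverseMaxMatch`, the in-memory `map[string]bool` dictionary, and the `maxLen` parameter are assumptions made for illustration. Forward matching takes the longest dictionary word starting at the current position; reverse matching takes the longest dictionary word ending at the current position and works backwards.

```go
package main

import "fmt"

// maxMatch is an illustrative forward maximum matching segmenter (hypothetical,
// not the library's code). At each position it takes the longest dictionary
// word of at most maxLen runes, falling back to a single rune.
func maxMatch(text string, dict map[string]bool, maxLen int) []string {
	runes := []rune(text)
	var tokens []string
	for i := 0; i < len(runes); {
		end := i + maxLen
		if end > len(runes) {
			end = len(runes)
		}
		// Shrink the window from the right until it is a word or one rune long.
		for end > i+1 && !dict[string(runes[i:end])] {
			end--
		}
		tokens = append(tokens, string(runes[i:end]))
		i = end
	}
	return tokens
}

// reverseMaxMatch is the same idea scanning from the end of the text.
func reverseMaxMatch(text string, dict map[string]bool, maxLen int) []string {
	runes := []rune(text)
	var tokens []string
	for j := len(runes); j > 0; {
		start := j - maxLen
		if start < 0 {
			start = 0
		}
		// Shrink the window from the left until it is a word or one rune long.
		for start < j-1 && !dict[string(runes[start:j])] {
			start++
		}
		tokens = append(tokens, string(runes[start:j]))
		j = start
	}
	// Tokens were collected right to left; restore reading order.
	for l, r := 0, len(tokens)-1; l < r; l, r = l+1, r-1 {
		tokens[l], tokens[r] = tokens[r], tokens[l]
	}
	return tokens
}

func main() {
	// A tiny hypothetical dictionary; a real one would be loaded from a file.
	dict := map[string]bool{"研究": true, "研究生": true, "生命": true, "起源": true}
	text := "研究生命起源"
	fmt.Println(maxMatch(text, dict, 3))        // [研究生 命 起源]
	fmt.Println(reverseMaxMatch(text, dict, 3)) // [研究 生命 起源]
}
```

The two directions disagree on this input, which is the kind of case a bidirectional method has to resolve by choosing between the two results.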

Installation

go get -u github.com/xujiajun/gotokenizer

Usage

package main

import (
    "fmt"

    "github.com/xujiajun/gotokenizer"
)

func main() {
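    // Sample Chinese input. In English: "gotokenizer is a tokenizer written in pure Go,
    // based on a dictionary and a Bigram model. It supports 6 segmentation algorithms,
    // stopToken filtering, and custom word filtering."
    text := "gotokenizer是一款基于字典和Bigram模型纯go语言编写的分词器,支持6种分词算法。支持stopToken过滤和自定义word过滤功能。"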

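    // Path to the dictionary file; adjust it to your local checkout.
    dictPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/dict.txt"
    // NewMaxMatch uses NumAndLetterWordFilter as its default word filter.
    mm := gotokenizer.NewMaxMatch(dictPath)
    // Load the dictionary before segmenting.
    mm.LoadDict()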

    fmt.Println(mm.Get(text)) //[gotokenizer 是 一款 基于 字典 和 Bigram 模型 纯 go 语言 编写 的 分词器 , 支持 6 种 分词 算法 。 支持 stopToken 过滤 和 自定义 word 过滤 功能 。] <nil>

    // Enable stop-token filtering.
    mm.EnabledFilterStopToken = true
    mm.StopTokens = gotokenizer.NewStopTokens()
    stopTokenDicPath := "/Users/xujiajun/go/src/github.com/xujiajun/gotokenizer/data/zh/stop_tokens.txt"
    mm.StopTokens.Load(stopTokenDicPath)

    fmt.Println(mm.Get(text)) //[gotokenizer 一款 字典 Bigram 模型 go 语言 编写 分词器 支持 6 种 分词 算法 支持 stopToken 过滤 自定义 word 过滤 功能] <nil>
    fmt.Println(mm.GetFrequency(text)) //map[6:1 种:1 算法:1 过滤:2 支持:2 Bigram:1 模型:1 编写:1 gotokenizer:1 go:1 分词器:1 分词:1 word:1 功能:1 一款:1 语言:1 stopToken:1 自定义:1 字典:1] <nil>

}
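
The last two calls above can be pictured as a filter followed by a count. The sketch below shows that idea on an already segmented token slice; the helper names `filterStopTokens` and `frequency` and the in-memory stop set are hypothetical and are not part of gotokenizer's API.

```go
package main

import "fmt"

// filterStopTokens returns the tokens that are not in the stop set
// (hypothetical helper, for illustration only).
func filterStopTokens(tokens []string, stop map[string]bool) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if !stop[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

// frequency counts how often each token occurs.
func frequency(tokens []string) map[string]int {
	freq := make(map[string]int)
	for _, t := range tokens {
		freq[t]++
	}
	return freq
}

func main() {
	tokens := []string{"支持", "stopToken", "过滤", "和", "自定义", "word", "过滤", "功能", "。"}
	// Assumed stop set; the library loads its own from stop_tokens.txt.
	stop := map[string]bool{"和": true, "。": true}

	kept := filterStopTokens(tokens, stop)
	fmt.Println(kept)                    // [支持 stopToken 过滤 自定义 word 过滤 功能]
	fmt.Println(frequency(kept)["过滤"]) // 2
}
```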

For more examples, see the tests.

Contributing

If you'd like to help out with the project, you can open a pull request.

Author

License

gotokenizer is open-source software licensed under the Apache License 2.0.

Acknowledgements

This package is inspired by the following:

https://github.com/ysc/word

