Popularity
7.4
Stable
Activity
3.9
-
848
59
149

Programming language: Go
Latest version: v1.1.0

gojieba alternatives and similar packages

Based on the "Natural Language Processing" category

Do you think we are missing an alternative of gojieba or a related project?

Add another 'Natural Language Processing' Package

README

GoJieba [English](README_EN.md)

Build Status Author Donate Performance License GoDoc Coverage Status codebeat badge Go Report Card Awesome

logo

GoJieba是"结巴"中文分词的Golang语言版本。

简介

  • 支持多种分词方式,包括: 最大概率模式, HMM新词发现模式, 搜索引擎模式, 全模式
  • 核心算法底层由C++实现,性能高效。
  • 字典路径可配置,NewJieba(...string), NewExtractor(...string) 可变形参,当参数为空时使用默认词典(推荐方式)

用法

go get github.com/yanyiwu/gojieba

分词示例

package main

import (
    "fmt"
    "strings"

    "github.com/yanyiwu/gojieba"
)

func main() {
    var s string
    var words []string
    use_hmm := true
    x := gojieba.NewJieba()
    defer x.Free()

    s = "我来到北京清华大学"
    words = x.CutAll(s)
    fmt.Println(s)
    fmt.Println("全模式:", strings.Join(words, "/"))

    words = x.Cut(s, use_hmm)
    fmt.Println(s)
    fmt.Println("精确模式:", strings.Join(words, "/"))
    s = "比特币"
    words = x.Cut(s, use_hmm)
    fmt.Println(s)
    fmt.Println("精确模式:", strings.Join(words, "/"))

    x.AddWord("比特币")
    s = "比特币"
    words = x.Cut(s, use_hmm)
    fmt.Println(s)
    fmt.Println("添加词典后,精确模式:", strings.Join(words, "/"))

    s = "他来到了网易杭研大厦"
    words = x.Cut(s, use_hmm)
    fmt.Println(s)
    fmt.Println("新词识别:", strings.Join(words, "/"))

    s = "小明硕士毕业于中国科学院计算所,后在日本京都大学深造"
    words = x.CutForSearch(s, use_hmm)
    fmt.Println(s)
    fmt.Println("搜索引擎模式:", strings.Join(words, "/"))

    s = "长春市长春药店"
    words = x.Tag(s)
    fmt.Println(s)
    fmt.Println("词性标注:", strings.Join(words, ","))

    s = "区块链"
    words = x.Tag(s)
    fmt.Println(s)
    fmt.Println("词性标注:", strings.Join(words, ","))

    s = "长江大桥"
    words = x.CutForSearch(s, !use_hmm)
    fmt.Println(s)
    fmt.Println("搜索引擎模式:", strings.Join(words, "/"))

    wordinfos := x.Tokenize(s, gojieba.SearchMode, !use_hmm)
    fmt.Println(s)
    fmt.Println("Tokenize:(搜索引擎模式)", wordinfos)

    wordinfos = x.Tokenize(s, gojieba.DefaultMode, !use_hmm)
    fmt.Println(s)
    fmt.Println("Tokenize:(默认模式)", wordinfos)

    keywords := x.ExtractWithWeight(s, 5)
    fmt.Println("Extract:", keywords)
}
我来到北京清华大学
全模式: 我/来到/北京/清华/清华大学/华大/大学
我来到北京清华大学
精确模式: 我/来到/北京/清华大学
比特币
精确模式: 比特/币
比特币
添加词典后,精确模式: 比特币
他来到了网易杭研大厦
新词识别: 他/来到/了/网易/杭研/大厦
小明硕士毕业于中国科学院计算所,后在日本京都大学深造
搜索引擎模式: 小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/,/后/在/日本/京都/大学/日本京都大学/深造
长春市长春药店
词性标注: 长春市/ns,长春/ns,药店/n
区块链
词性标注: 区块链/nz
长江大桥
搜索引擎模式: 长江/大桥/长江大桥
长江大桥
Tokenize: [{长江 0 6} {大桥 6 12} {长江大桥 0 12}]

See example in [jieba_test](jieba_test.go), [extractor_test](extractor_test.go)

性能评测

Jieba中文分词系列性能评测

测试

Unittest

go test ./...

Benchmark

go test -bench "Jieba" -test.benchtime 10s
go test -bench "Extractor" -test.benchtime 10s

客服

  • Email: i@yanyiwu.com


*Note that all licence references and agreements mentioned in the gojieba README section above are relevant to that project's source code only.