sentences alternatives and similar packages
Based on the "Natural Language Processing" category.
prose
DISCONTINUED. A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.
gse
Go efficient multilingual NLP and text segmentation; supports English, Chinese, Japanese, and others.
universal-translator
i18n translator for Go/Golang using CLDR data and pluralization rules.
locales
A set of locales generated from the CLDR Project which can be used independently or within an i18n package; these were built for use with, but are not exclusive to, https://github.com/go-playground/universal-translator
go-nlp
DISCONTINUED. Utilities for working with discrete probability distributions and other tools useful for doing NLP work.
segment
A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29.
go-localize
i18n (internationalization and localization) engine written in Go, used for translating locale strings.
gotokenizer
A tokenizer for Go based on a dictionary and bigram language models. (Currently supports only Chinese segmentation.)
README
Sentences - A command line sentence tokenizer
This command line utility will convert a blob of text into a list of sentences.
Install
go get gopkg.in/neurosnap/sentences.v1
go install gopkg.in/neurosnap/sentences.v1/_cmd/sentences
Binaries
Linux
- [Linux 386](https://storage.cloud.google.com/go-sentences/sentences_linux-386.tar.gz)
- Linux AMD64
Mac
Windows
Command
[Command line](sentences.gif?raw=true)
Get it
go get gopkg.in/neurosnap/sentences.v1
Use it
package main

import (
	"fmt"

	"gopkg.in/neurosnap/sentences.v1"
	"gopkg.in/neurosnap/sentences.v1/data"
)

func main() {
	text := `A perennial also-ran, Stallings won his seat when longtime lawmaker David Holmes
died 11 days after the filing deadline. Suddenly, Stallings was a shoo-in, not
the long shot. In short order, the Legislature attempted to pass a law allowing
former U.S. Rep. Carolyn Cheeks Kilpatrick to file; Stallings challenged the
law in court and won. Kilpatrick mounted a write-in campaign, but Stallings won.`

	// Language-specific data can be compiled into a binary file
	// by using `make <lang>` and then loaded as `json` data:
	b, err := data.Asset("data/english.json")
	if err != nil {
		panic(err)
	}

	// Load the training data.
	training, err := sentences.LoadTraining(b)
	if err != nil {
		panic(err)
	}

	// Create the default sentence tokenizer and print one sentence per line.
	tokenizer := sentences.NewSentenceTokenizer(training)
	for _, s := range tokenizer.Tokenize(text) {
		fmt.Println(s.Text)
	}
}
English
This package attempts to fix some problems I noticed with English.
package main

import (
	"fmt"

	"gopkg.in/neurosnap/sentences.v1/english"
)

func main() {
	text := "Hi there. Does this really work?"

	// Passing nil makes the package fall back to its own English training data.
	tokenizer, err := english.NewSentenceTokenizer(nil)
	if err != nil {
		panic(err)
	}

	// Print one sentence per line.
	for _, s := range tokenizer.Tokenize(text) {
		fmt.Println(s.Text)
	}
}
Contributing
I need help maintaining this library. If you are interested in contributing, please start by looking at the golden-rules branch, which tests the Golden Rules for English sentence tokenization created by the Pragmatic Segmenter library.
Pick a particular failing test, then create an issue or submit a PR for it.
I'm happy to help anyone willing to contribute.
Customizable
Sentences was built around composability: most major components of this package can be extended.
Eager to make ad hoc changes but don't know where to start? Have a look at github.com/neurosnap/sentences/english for a solid example.
Notice
I have not tested this tokenizer in any language other than English. By default, the command line utility loads English. I welcome anyone willing to test the other languages and submit updates as needed.
A primary goal for this package is to be multilingual, so I'm willing to help in any way possible.
This library is a port of NLTK's Punkt tokenizer.
A Punkt Tokenizer
An unsupervised multilingual sentence boundary detection library for Go. The Punkt system accomplishes this by training the tokenizer on text in the given language. Once the likelihoods of abbreviations, collocations, and sentence starters are determined, finding sentence boundaries becomes easier.
Many problems arise when tokenizing text into sentences, the primary one being abbreviations. By training on text in the given language, the Punkt system attempts to determine whether a word is an abbreviation, the end of a sentence, or both. It applies both token-based and type-based analysis to the text in two separate annotation phases.
Unsupervised multilingual sentence boundary detection
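To make the abbreviation problem concrete, here is a toy sketch in Go that uses only the standard library. It is not the Punkt algorithm and not this package's implementation: the abbreviation set (`knownAbbrevs`) is a hard-coded, hypothetical list, whereas Punkt learns abbreviations from training text. The function name `splitSentences` and its rule (break after `.`, `?`, or `!` when the next word starts with an uppercase letter) are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// knownAbbrevs is a hypothetical, hand-picked abbreviation list (lower-cased,
// without the trailing period). Punkt would learn such a list from training text.
var knownAbbrevs = map[string]bool{"u.s": true, "rep": true, "dr": true}

// splitSentences joins whitespace-separated words into sentences. A sentence
// ends at a word that closes with '.', '?' or '!' when it is the last word or
// the next word starts with an uppercase letter, unless the word before the
// period is listed in abbrevs.
func splitSentences(text string, abbrevs map[string]bool) []string {
	var out, cur []string
	words := strings.Fields(text)
	for i, w := range words {
		cur = append(cur, w)
		if !endsSentence(w, abbrevs) {
			continue
		}
		if i == len(words)-1 || startsUpper(words[i+1]) {
			out = append(out, strings.Join(cur, " "))
			cur = nil
		}
	}
	if len(cur) > 0 {
		out = append(out, strings.Join(cur, " "))
	}
	return out
}

// endsSentence reports whether w carries sentence-final punctuation that is
// not explained by a known abbreviation.
func endsSentence(w string, abbrevs map[string]bool) bool {
	if strings.HasSuffix(w, "?") || strings.HasSuffix(w, "!") {
		return true
	}
	if !strings.HasSuffix(w, ".") {
		return false
	}
	base := strings.ToLower(strings.TrimSuffix(w, "."))
	return !abbrevs[base]
}

// startsUpper reports whether the first rune of w is an uppercase letter.
func startsUpper(w string) bool {
	r, _ := utf8.DecodeRuneInString(w)
	return unicode.IsUpper(r)
}

func main() {
	sample := "Former U.S. Rep. Carolyn Cheeks Kilpatrick tried to file. " +
		"Stallings challenged the law in court and won."

	fmt.Println("without abbreviations:")
	for _, s := range splitSentences(sample, map[string]bool{}) {
		fmt.Println("  ", s)
	}

	fmt.Println("with abbreviations:")
	for _, s := range splitSentences(sample, knownAbbrevs) {
		fmt.Println("  ", s)
	}
}
```

With an empty abbreviation set the sample splits into four pieces, because `U.S.` and `Rep.` look like sentence endings; with the hard-coded set it splits into two. The sketch still cannot handle an abbreviation that also ends a sentence, which is one of the cases a trained model is meant to address.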
Performance
Using the Brown Corpus, which is annotated American English text, we compare this package with other libraries across multiple programming languages.
Library | Avg Speed (s, 10 runs) | Accuracy (%) |
---|---|---|
Sentences | 1.96 | 98.95 |
NLTK | 5.22 | 99.21 |
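
The benchmark harness behind these numbers is not shown here. As a rough sketch of how an "average over 10 runs" figure could be measured in Go, the hypothetical helper `averageRuntime` below times a function repeatedly and returns the mean. The workload is a placeholder (`strings.Fields` on repeated text), not the tokenizer or the Brown Corpus, so its output is not comparable to the table.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// averageRuntime calls fn the given number of times and returns the mean
// wall-clock duration of a single call. It returns 0 when runs is not positive.
func averageRuntime(runs int, fn func()) time.Duration {
	if runs <= 0 {
		return 0
	}
	var total time.Duration
	for i := 0; i < runs; i++ {
		start := time.Now()
		fn()
		total += time.Since(start)
	}
	return total / time.Duration(runs)
}

func main() {
	// Placeholder workload: strings.Fields stands in for a real tokenizer call,
	// and the repeated text stands in for a real corpus.
	corpus := strings.Repeat("Hi there. Does this really work? ", 10000)

	avg := averageRuntime(10, func() {
		_ = strings.Fields(corpus)
	})
	fmt.Printf("average over 10 runs: %.4f s\n", avg.Seconds())
}
```

To time the tokenizer itself, the closure would call `tokenizer.Tokenize` on the corpus text, with the training data loaded once outside the timed function.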