BudouX is a standalone, small, and language-neutral phrase segmenter tool that provides beautiful and legible line breaks.
For more details about the project, please refer to the project README.
https://google.github.io/budoux
You can get a list of phrases by feeding a sentence to the parser. The easiest way to get a parser is to load the default parser for a language.
import com.google.budoux.Parser;

public class App
{
    public static void main( String[] args )
    {
        Parser parser = Parser.loadDefaultJapaneseParser();
        System.out.println(parser.parse("今日は良い天気ですね。"));
        // [今日は, 良い, 天気ですね。]
    }
}

The default parsers for the supported languages are:

- Japanese: Parser.loadDefaultJapaneseParser()
- Simplified Chinese: Parser.loadDefaultSimplifiedChineseParser()
- Traditional Chinese: Parser.loadDefaultTraditionalChineseParser()
- Thai: Parser.loadDefaultThaiParser()
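Once you have the phrase list, you can use it to choose line breaks yourself. The following is a minimal sketch, not part of BudouX: it assumes the parser returns a List<String>, as the printed result in the example suggests, and it hard-codes those phrases so it runs without the library. The class name PhraseWrapDemo, the method wrap, and the maxChars limit are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class PhraseWrapDemo {
    // Greedily packs phrases into lines of at most maxChars UTF-16 code units.
    // A phrase is never split; one longer than maxChars gets a line of its own.
    public static List<String> wrap(List<String> phrases, int maxChars) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String phrase : phrases) {
            // Start a new line if this phrase would overflow the current one.
            if (current.length() > 0 && current.length() + phrase.length() > maxChars) {
                lines.add(current.toString());
                current.setLength(0);
            }
            current.append(phrase);
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Phrases copied from the example output; in real use they would
        // come from parser.parse(...) (assumed to return a List<String>).
        List<String> phrases = List.of("今日は", "良い", "天気ですね。");
        for (String line : wrap(phrases, 6)) {
            System.out.println(line);
        }
    }
}
```

With a limit of 6 characters, the sketch produces two lines, 今日は良い and 天気ですね。, and never breaks in the middle of a phrase. Counting UTF-16 code units is a simplification; a real layout would measure rendered width.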
If you want to use the result in a website, you can use the translateHTMLString
method to get an HTML string in which phrase boundaries are marked with
zero-width spaces (U+200B) and the text is wrapped in markup that keeps each phrase unbroken.
System.out.println(parser.translateHTMLString("今日は<strong>良い天気</strong>ですね。"));
// <span style="word-break: keep-all; overflow-wrap: anywhere;">今日は<strong>\u200b良い\u200b天気</strong>ですね。</span>

Please note that separators are denoted as \u200b in the example above for
illustrative purposes. In the actual output the separator is invisible, because it is a
zero-width space.
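Because the separators are invisible, it can be hard to confirm where they ended up when you log or debug the output. The following is a small sketch, not part of BudouX, that rewrites each zero-width space as visible text. The class name ZwspDebug and the method showZeroWidthSpaces are hypothetical, and the sample string is hand-built from the example above rather than produced by the library.

```java
class ZwspDebug {
    // Replaces every zero-width space (U+200B) with the six visible
    // characters backslash, u, 2, 0, 0, b so the separators show up in logs.
    public static String showZeroWidthSpaces(String html) {
        return html.replace("\u200b", "\\u200b");
    }

    public static void main(String[] args) {
        // Hand-built sample containing real zero-width spaces; in real use
        // this would be the string returned by translateHTMLString.
        String html = "今日は<strong>\u200b良い\u200b天気</strong>ですね。";
        System.out.println(showZeroWidthSpaces(html));
    }
}
```

Use this only for inspection; the string you send to the browser should keep the real zero-width spaces.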
BudouX supports HTML inputs and outputs HTML strings with markup applied to wrap phrases, but it's not meant to be used as an HTML sanitizer. BudouX doesn't sanitize any inputs. Malicious HTML inputs yield malicious HTML outputs. Please use it with an appropriate sanitizer library if you don't trust the input.
This is not an officially supported Google product.