Automated Language Detection

IP.com Disclosure Number: IPCOM000241745D
Publication Date: 2015-May-27
Document File: 4 page(s) / 29K

Publishing Venue

The IP.com Prior Art Database

Related People

Linux Defenders: AUTHOR

Abstract

Disclosed is a method for automated language detection for a given string. It expands on a 2-part heuristic that first determines the scripts of the string and second compares a model of the string with trigram frequency models of the languages that use those scripts; it then falls back on the language for which a dictionary contains the most words used in the string.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

Automated language detection

December 12, 2014

Keywords. natural language - detection - spelling - KDE

Problem/Opportunity

The number of multilingual people in the world is growing, and spell-checking should adapt to this.

Spell-checking software has traditionally been limited to checking a single language at a time, however, so people must manually check each part of a document that is written in several languages.

An improvement would be to let the spell-checking software automatically detect the various languages used in a document and adapt to them.

Description of your solution

1. This is a hybrid algorithm combining statistical methods and traditional spell-checking methods for identifying the language used in a text.

2. Spell-checking methods currently assume that the user has supplied the correct language to be used for checking. Our method detects the language automatically and does not assume that the user has supplied any information, or correct information, about which language is used.
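The dictionary fallback mentioned in the abstract (choosing the language whose dictionary contains the most words of the string) could be sketched as follows. This is only an illustration: the function name `dictionary_fallback`, the in-memory word sets, and the whitespace-based word splitting are assumptions, not details given in the disclosure.

```python
from typing import Dict, Optional, Set


def dictionary_fallback(text: str, dictionaries: Dict[str, Set[str]]) -> Optional[str]:
    """Return the language whose word set contains the most words of `text`.

    `dictionaries` maps a language code to a set of lowercase words
    (a hypothetical stand-in for a real spell-checking dictionary).
    Returns None when no dictionary recognizes any word.
    """
    words = text.lower().split()  # naive whitespace splitting (assumption)
    best_language, best_hits = None, 0
    for language, known_words in dictionaries.items():
        hits = sum(1 for word in words if word in known_words)
        if hits > best_hits:
            best_language, best_hits = language, hits
    return best_language
```

With two toy dictionaries, `dictionary_fallback("Die Katze ist hier", {"en": {"the", "cat", "is"}, "de": {"die", "katze", "ist"}})` selects `"de"`, because three of the four words appear in the German set and none in the English one.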

Steps


1. Create an n-gram frequency model for as many languages as possible.

Start with a large corpus of text in the target language and a hash-map that maps each n-gram to an integer count.

Go through the corpus and store each n-gram in the hash-map together with the number of times it occurs in the corpus: move through the corpus one character at a time, read n characters from the current position (this is the n-gram), and either store the n-gram in the hash-map with a count of 1 or increment the count already in the hash-map.

Optionally filter out all n-grams that have more than X spaces (where X < n, for example X = 2 for a trigram model).

Afterwards, sort the hash-map entries by n-gram count, then choose and store the Y most common n-grams (to limit resource usage).

This is ideally done once when developing the spell-checking software, and then distributed with the software.
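A minimal sketch of step 1, assuming the corpus fits in memory as a single string. The function name `build_ngram_model` and the default of 300 for Y are illustrative assumptions; `max_spaces` and `top_y` correspond to X and Y in the text.

```python
from collections import Counter
from typing import List


def build_ngram_model(corpus: str, n: int = 3, max_spaces: int = 2, top_y: int = 300) -> List[str]:
    """Return the `top_y` most frequent n-grams of `corpus`, most common first."""
    counts = Counter()  # the hash-map from n-gram to occurrence count
    # Move through the corpus one character at a time, reading n characters.
    for i in range(len(corpus) - n + 1):
        ngram = corpus[i:i + n]
        # Optional filter: drop n-grams with more than `max_spaces` spaces.
        if ngram.count(" ") > max_spaces:
            continue
        counts[ngram] += 1  # missing keys start at 0, so the first hit stores 1
    # Sort by count and keep only the most common n-grams to limit resource usage.
    return [ngram for ngram, _ in counts.most_common(top_y)]
```

For example, `build_ngram_model("ababa", n=3, top_y=1)` returns `["aba"]`, since "aba" occurs twice and "bab" once.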


2. Tokenize text to be spell-checked.

Given a text to be spell-checked, split it into separate segments that can each be tagged with a different language: for example paragraphs, sentences, or phrases.

To do this, go through the text and split it on special characters or sets of characters (such as punctuation or line endings), and store the pieces in a data structure that maps each partial text to a language.
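The splitting described in step 2 might look like the sketch below. The `Segment` class, the function name `split_into_segments`, and the choice of delimiters (sentence-ending punctuation followed by whitespace, or line endings) are assumptions made for illustration; the disclosure does not fix a particular delimiter set.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    text: str
    language: Optional[str] = None  # filled in later by the detection steps


# Assumed delimiters: whitespace after '.', '!' or '?', or one or more line endings.
_SPLIT_PATTERN = re.compile(r"(?<=[.!?])\s+|\n+")


def split_into_segments(text: str) -> List[Segment]:
    """Split `text` into partial texts, each with a language slot that starts empty."""
    parts = _SPLIT_PATTERN.split(text)
    return [Segment(part.strip()) for part in parts if part.strip()]
```

A list of segments is used here rather than a plain mapping from text to language so that two identical partial texts can still be tagged separately.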

For the following steps, do each of them individually on the text in each of these, and store the language decided, and fin...