Method for detecting a homographic attack in a webpage by means of language identification and comparison

IP.com Disclosure Number: IPCOM000010253D
Publication Date: 2002-Nov-13
Document File: 6 page(s) / 93K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method for detecting a homographic attack in a webpage by means of language identification and comparison. Benefits include improved functionality.

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 42% of the total text.

Background

              A homograph is a word or string that can have two different interpretations.  In common usage, a homograph is a word with two different meanings, such as “fair” (either an adjective meaning “just, impartial” or a noun meaning “competitive exhibition usually with accompanying entertainment and amusements”).  In communications, a homograph can also arise from the use of the UNICODE character set.  UNICODE defines codesets (the numeric codes that represent characters) and corresponding glyphs (the visual representation of a character) for many different languages, so that a document written in a particular language can be written in the codeset corresponding to the language and viewed with the language-appropriate glyphs.  But because different languages use the same glyph for different characters, the same glyph can have different codes; for example, glyph “C” could be “cee” in Roman characters (i.e., the ASCII codeset in UNICODE), but “es” in Cyrillic characters (i.e., the Cyrillic codeset in UNICODE).  This, too, is a homograph.

 

              The Internet Engineering Task Force (IETF), the Internet Assigned Numbers Authority (IANA) and others have ruled that domain names and uniform resource locators (URLs) may be spelled in UNICODE.  These decisions mean that a malefactor can spoof a domain name or URL by spelling it with characters from a different UNICODE codeset than the end-user expects.  The web browser and the HTTP protocol will use the codes to identify the domain name or URL, but the user will reasonably expect the domain name or URL that the glyphs seem to spell.  The malefactor could thus steer the user to the malefactor’s website rather than the website that the user intended to visit.  This is a homograph attack.
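The following minimal Python sketch illustrates the point that two strings can look the same while carrying different codes. The domain name "cisco.example" and the helper name describe are hypothetical and chosen only for illustration; the spoofed string replaces the first letter with CYRILLIC SMALL LETTER ES (U+0441).

```python
import unicodedata

# Hypothetical domain name, used only for illustration.
latin = "cisco.example"
# Same glyphs on screen, but the first letter is CYRILLIC SMALL LETTER ES (U+0441).
spoof = "\u0441isco.example"

def describe(text):
    """Return (code point, Unicode character name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The strings compare unequal even though they may render identically.
print(latin == spoof)
print(describe(latin)[0])
print(describe(spoof)[0])
```

A browser resolving the second string would look up a different name than the one the user believes they are reading.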

              No conventional method effectively protects naïve Internet users from homographic attacks, which can occur on PC-based platforms and information servers.  A defense can be created, however, by identifying the language used in an HTML page and comparing the identified language with the UNICODE codeset used in the page, to determine whether a homograph attack may be occurring.
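The comparison step described above might be sketched as follows. This is an assumption-laden illustration, not the disclosed implementation: the language code is taken as already supplied by some language identifier, the EXPECTED_SCRIPT table and the function names are hypothetical, and a letter's script is only approximated by the first word of its Unicode character name.

```python
import unicodedata

# Hypothetical mapping from an identified language to its usual script.
EXPECTED_SCRIPT = {"en": "LATIN", "fr": "LATIN", "ru": "CYRILLIC", "el": "GREEK"}

def script_of(ch):
    """Approximate a letter's script by the first word of its Unicode name."""
    if not ch.isalpha():
        return None  # digits, dots, and other punctuation carry no script here
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no name in the Unicode database
        return None

def mismatched_letters(text, language):
    """List (index, character, script) for letters outside the expected script."""
    expected = EXPECTED_SCRIPT[language]
    return [
        (i, ch, script_of(ch))
        for i, ch in enumerate(text)
        if script_of(ch) not in (None, expected)
    ]

# A page identified as English should not contain Cyrillic letters in its links.
print(mismatched_letters("www.\u0441isco.example", "en"))
print(mismatched_letters("www.cisco.example", "en"))
```

A non-empty result indicates that a homograph attack may be occurring; it is not proof of one, since legitimate pages can mix scripts.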

              A variety of techniques exist for performing language identification from written text, including:

•             Trigram models

•             Hidden Markov models (HMMs)

•             Parsing

•             Dictionary lookup
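As an illustration of the first technique in the list, the sketch below identifies a language by comparing character-trigram counts using cosine similarity. The two training sentences, the SAMPLES and PROFILES names, and the choice of cosine similarity are assumptions made for this example; a practical identifier would train on far larger corpora.

```python
from collections import Counter
from math import sqrt

# Tiny illustrative training samples (assumed; real profiles need much more text).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the dog chases the fox through the field",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und dann "
          "jagt der hund den fuchs durch das feld",
}

def trigrams(text):
    """Count overlapping three-character sequences in whitespace-normalized text."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}

def identify(text):
    """Return the language whose trigram profile is most similar to the text."""
    probe = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(probe, PROFILES[lang]))

print(identify("the dog and the fox"))
print(identify("der hund und der fuchs"))
```

The language code returned by such a function is what would then be compared against the codeset actually used in the page.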

General description

              The disclosed method uses language identification in a firewall or router to provide security against homographic attacks.

Advantages

              The disclosed method provides advantages, including improved functionality due to security processing for homo...