Combining key distance and language specific n-gram distribution for input validation
Publication Date: 2010-Nov-10
The IP.com Prior Art Database
Users often enter intentionally wrong data when filling in forms on web sites, for example, when asked for company names or addresses on web sites that allow free downloading of software. The incorrect data may become a problem later, e.g., when companies use the data for sending newsletters, mailings, or flyers. Manually validating all data entered in such web forms is not feasible. However, data that is intentionally created wrongly usually consists of characters that lie very close to each other on a standard computer keyboard. This finding enables detection of such intentionally wrong data.
When users register on a web site like developerWorks to download free software, documentation, tutorials and other resources, the data they enter is often of very poor quality. Part of this is caused by reasons known from other scenarios, such as typos, misspellings, or entering information into the wrong field. In the given scenario, however, two additional reasons come into play:
The user needs to satisfy some input rules of the form but doesn't really want to give away personal data. To continue working with the form, the user types an arbitrary character sequence to fill the input field. (Example: before you are able to download a tutorial from IBM developerWorks, you have to register with your email address and give away personal data like your name, company, business address, etc.)
The user is asked for information that he/she doesn't know or doesn't understand. (Example: when downloading a Linux distribution, you are sometimes asked for your company's name. Sometimes this field is mandatory, although Linux is of course also free for private use.)
The wrong data becomes a problem when companies use it for sending newsletters, mailings, flyers and advertising material. Sending newsletters to invalid addresses and non-existent companies means burning money. However, manually validating each of these addresses before sending a flyer costs even more money.
The idea is to automatically validate the data and eliminate fake data that was entered only to complete the registration process but is of no use to the company. In contrast to other data cleansing algorithms, the approach described below is easy to implement, fully self-contained (no dictionaries or similar resources are needed), cheap to calculate, and can be applied in real time.
The core idea of this disclosure is based on the observation that data that is intentionally created wrongly usually consists of characters that lie very close to each other on a standard computer keyboard. A good example of this is the download statistics by company name from the IBM developerWorks site, as in the example given in the abstract.
The download statistics can be found here: http://w3.alphaworks.ibm.com/stats/techcompany.
Examples of fake company names taken from this list:
As you can see, all these names consist of characters that are placed very close together on a computer keyboard. For humans reading these names, it is quite obvious that they are fake, as the distribution of the characters doesn't match the letter distribution of a natural language.
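To make the keyboard-proximity observation concrete, the following is a minimal sketch, not the disclosure's exact measure. It assumes a US QWERTY layout restricted to letter keys, approximate row offsets for the physical stagger, and Euclidean distance averaged over consecutive character pairs. The names `ROWS`, `KEY_POS`, `bigrams` and `mean_key_distance` are hypothetical and introduced only for illustration.

```python
from math import hypot

# ASSUMPTION: US QWERTY letter rows. The column offsets roughly approximate
# the physical stagger between rows; the disclosure does not specify a layout.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.25), ("zxcvbnm", 0.75)]

# Map each letter to an (x, y) position on the assumed key grid.
KEY_POS = {
    ch: (col + offset, row)
    for row, (keys, offset) in enumerate(ROWS)
    for col, ch in enumerate(keys)
}


def bigrams(term):
    """Slide a two-character window over the term.

    Characters that are not on the assumed layout (digits, spaces,
    punctuation) are skipped before the window is applied.
    """
    letters = [c for c in term.lower() if c in KEY_POS]
    return list(zip(letters, letters[1:]))


def mean_key_distance(term):
    """Average Euclidean key distance over all bigrams of the term.

    ASSUMPTION: averaging is an illustrative choice, not necessarily the
    aggregation used in the disclosure. Returns None if the term yields
    no bigram.
    """
    pairs = bigrams(term)
    if not pairs:
        return None
    total = 0.0
    for a, b in pairs:
        (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
        total += hypot(x2 - x1, y2 - y1)
    return total / len(pairs)
```

Under these assumptions, a hypothetical input such as "asdf" scores exactly 1.0, because every consecutive pair consists of neighbouring keys on the same row, whereas a term whose letters are spread across the keyboard scores higher. Any threshold separating the two cases would have to be chosen empirically.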
The key distance of a term is calculated by creating the term's bigrams (moving a window of two characters over the term) and calculating the s...