
Process for Removal of Duplication of company names when Data Mining

IP.com Disclosure Number: IPCOM000022095D
Original Publication Date: 2004-Feb-24
Included in the Prior Art Database: 2004-Feb-24
Document File: 1 page(s) / 30K

Publishing Venue

IBM

Abstract

Removing duplicate names as part of data cleansing is a common task, especially in data mining. This disclosure describes a procedure for performing it.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 68% of the total text.



1. Describe your invention, stating the problem solved (if appropriate), and indicating the advantages of using the invention.

When conducting a data mining exercise, we may find that applicants to the companies scheme have included a company name in the application. A method to group these applicants by company is essential for deciding whether a corporate deal should be struck, and it may also be used to rank a company as Large, Medium or Small by the number of members associated with it. The key problem is that company names are written differently on the application form even though, in a business context, they refer to the same company. Example:

IBM UK Ltd.
IBM Corp.
IBM Microelectronics
I.B.M.

These are all the same company, so to make "business sense" of them we need to be able to consider them as one "IBM".

2. How does the invention solve the problem or achieve an advantage (a description of "the invention", including figures inline as appropriate)?

The algorithm attempts to normalise the free-form names using a set of rules and a dictionary to achieve a reasonable result. The rules are as follows:

Skip leading blanks

Eliminate C/O or c/o or c o in the first position (Abbreviation for Care of)

Remove spaces when more than one is in a sequence

Eliminate words found in the dictionary. The dictionary contains words such as bill, bob, builders, aircraft, development, european, china etc. From study, these words typically augment core names which are of...
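
The four rules listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosure's own implementation. The function name normalise_company_name, the case folding, the stripping of full stops and commas, the upper-cased result, and the dictionary entries uk, ltd, corp and microelectronics are all hypothetical additions, made only so that the example names from section 1 reduce to a single key. Any rules in the part of the text that is cut off are not covered.

```python
import re

# Words to eliminate. The first seven come from the disclosure's dictionary
# examples; the last four are ASSUMED entries added for this illustration.
STOP_WORDS = {
    "bill", "bob", "builders", "aircraft", "development", "european", "china",
    "uk", "ltd", "corp", "microelectronics",  # assumed, not from the source
}

# Matches "C/O", "c/o" or "c o" at the very start of the name only.
_CARE_OF = re.compile(r"^(?:c/o|c o)\b\s*", re.IGNORECASE)


def normalise_company_name(raw: str) -> str:
    """Reduce a free-form company name to a grouping key (hypothetical helper)."""
    name = raw.lstrip()                 # rule 1: skip leading blanks
    name = _CARE_OF.sub("", name)       # rule 2: drop a leading "care of" marker
    name = re.sub(r" {2,}", " ", name)  # rule 3: collapse runs of spaces

    kept = []
    for word in name.split(" "):
        # ASSUMPTION: fold case and strip dots and commas so that "Ltd." and
        # "I.B.M." can be compared; the listed rules do not state this step.
        key = word.replace(".", "").replace(",", "").lower()
        if key and key not in STOP_WORDS:  # rule 4: eliminate dictionary words
            kept.append(key.upper())
    return " ".join(kept)


if __name__ == "__main__":
    for raw in ["IBM UK Ltd.", "IBM Corp.", "IBM Microelectronics", "I.B.M."]:
        print(repr(raw), "->", repr(normalise_company_name(raw)))
```

Under these assumptions, each of the four example names yields the same key, so applications can be grouped and counted per company by that key. How well this works in practice depends almost entirely on the contents of the dictionary, which the disclosure only samples.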