A method to identify non-contextual words based on word weightage, domain-specific glossary, dictionary, and thesaurus.
Publication Date: 2015-Aug-04
The IP.com Prior Art Database
Disclosed is a method for correcting real English words that are used out of context in a document. The method suggests creating subject-specific glossaries of typical words, each assigned a weightage based on its frequency of occurrence. Documents are checked against the glossary and, depending on the word weightage scores, words are marked as out-of-context.
Advancements in computing and word processing have helped authors to correct errors in the documents that they write or edit. However, even with the best spelling and grammar checkers that are available, correct English words that are used in the wrong context are not easy to identify. Some spell checkers are able to detect out-of-context words based on different methodologies. But the methods are far too generic and try to apply a solution to the entire language.
This method proposes a context-based solution to avoid the usage of out-of-context but real words and suggests creating domain or subject-specific glossaries with typical words for a subject or industry. In the glossary, the typical words are assigned a weightage. Common word pairs are also added to the glossary. Any associated words for a particular subject area are defined in the glossary. Words with low weightage are also added to the glossary. The assumption here is that the weightage score will be higher if a word occurs more frequently in documents of a particular company. A dictionary is also used to check whether a word exists and a thesaurus is suggested as an option to further check a word.
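The glossary and dictionary checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the disclosed implementation: the weightage is assumed to be the fraction of domain documents that contain a word, and the function names, the `min_weight` threshold, and the sample sentences are all hypothetical. Word pairs and the optional thesaurus check are left out.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_glossary(domain_docs):
    """Assumed weightage: fraction of domain documents that contain the word."""
    doc_freq = Counter()
    for doc in domain_docs:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    return {word: count / len(domain_docs) for word, count in doc_freq.items()}

def check_document(text, glossary, dictionary, min_weight=0.5):
    """Flag words missing from the dictionary or weighted below the threshold."""
    flags = {}
    for word in tokenize(text):
        if word not in dictionary:
            flags[word] = "unknown"          # not a real word at all
        elif glossary.get(word, 0.0) < min_weight:
            flags[word] = "out-of-context"   # real word, atypical for the domain
    return flags

# Hypothetical oil-and-gas documents and a tiny stand-in dictionary.
domain_docs = [
    "the flow line carries oil from the well",
    "inspect the flow line near the well head",
]
dictionary = {"the", "flow", "line", "carries", "oil", "from", "well",
              "inspect", "near", "head", "elephant"}

glossary = build_glossary(domain_docs)
flags = check_document("the elephant carries oil from the wel",
                       glossary, dictionary)
print(flags)  # {'elephant': 'out-of-context', 'wel': 'unknown'}
```

In this sketch "elephant" passes the dictionary check but has no weightage in the domain glossary, so it is marked as out-of-context, while "wel" fails the dictionary check outright.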
Typical industry documents
Written communication encompasses a vast area and different types of documents are produced for different needs. For example, the health industry needs brochures on medicines while the semi-conductor industry needs specific diagrams of chips. Therefore, depending on the audience, industry, and subject, different document types are required and produced.
Commonality and uniqueness between industry documents
The one commonality between all these documents is that they are written in English, the global standard.
However, apart from the common constructs of language, these documents are unique in their own way. There are words, sentences, and technical jargon that are specific to the documents of each industry. For example, the term "flow line" is used by groundwater experts to describe the flow of underground water to wells. However, in the oil and gas industries, "flow line" refers to a large metal pipe in the area of drilling. "Flow line" would probably never find a place in documents that are produced by software or insurance companies.
Note that a pattern has emerged - none of the industries use words that are not relevant to their documents. You don't ever find words such as flowers, cherry, potato, chameleon, bog, pencil, crook, bottle, hairclip, or elephant in these technical documents.
Therefore, we can conclude that the documents of each industry, domain, or subject area have typical words that occur only in those documents.
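One simple way to picture this conclusion is to compare the vocabularies of documents from different domains and keep the words that appear in only one of them. The Python sketch below is a rough illustration only: the domain names, the sample sentences, and the `typical_words` helper are hypothetical, and a plain set difference is an assumption made here rather than something the disclosure prescribes.

```python
import re

def vocabulary(docs):
    """Collect the set of lowercase alphabetic words used in the documents."""
    words = set()
    for doc in docs:
        words.update(re.findall(r"[a-z]+", doc.lower()))
    return words

def typical_words(corpora):
    """Map each domain to the words found in its documents and in no other domain's."""
    vocab = {domain: vocabulary(docs) for domain, docs in corpora.items()}
    typical = {}
    for domain, words in vocab.items():
        # Union of every other domain's vocabulary.
        others = set().union(*(v for d, v in vocab.items() if d != domain))
        typical[domain] = words - others
    return typical

# Hypothetical one-sentence corpora for two domains.
corpora = {
    "oil_and_gas": ["the flow line runs to the drilling rig"],
    "insurance": ["the policy premium is due to the insurer"],
}
typical = typical_words(corpora)
print(sorted(typical["oil_and_gas"]))  # ['drilling', 'flow', 'line', 'rig', 'runs']
```

Words shared by both domains, such as "the" and "to", drop out, which leaves candidate entries for each domain-specific glossary.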
Unique words even within industry or subject
"Bus" in computer parlance could be a network, table could mean the database, cookies can mean browser-related technology, and mouse can m...