An Ensemble Method for Mining Translations of Clinical Terms
Publication Date: 2015-Jul-22
The IP.com Prior Art Database
Many Natural Language Processing (NLP) platforms rely on dictionaries that list lexical variants of clinical concepts. For most languages, the coverage of these dictionaries is low, which limits the capability of NLP to support products in international markets. This paper describes methods for mining translations of clinical terms from parallel (sentence-aligned) and comparable (not sentence-aligned) bilingual corpora. In combination with standard machine translation technology, these methods facilitate automatic extension of concept dictionaries beyond the English language.
A multitude of functionalities in NLP platforms rely on identification of clinical concepts in electronic health records (EHRs). The backbone of concept identification is provided by "concept dictionaries" from the Unified Medical Language System (UMLS), which list concepts together with the ways they tend to be expressed in clinical narrative. These concept dictionaries make it possible to identify natural language expressions occurring in EHRs with UMLS concept identifications (IDs), and ultimately with IDs from the health data dictionaries. Internationalization of NLP technologies is severely hampered by the fact that UMLS concept dictionaries have very limited coverage for languages other than English, and this is particularly true for SNOMED CT, which is essential for successful clinical NLP. Today, SNOMED versions are available only in English, Spanish, Danish, and Swedish, and their creation required large expenditures of labor and money.
Effective and mature technologies already exist for mining word translations from parallel corpora, i.e., collections of text in which a sentence in one language appears together with a manual translation of the same sentence in another language.
Unfortunately, available parallel corpora do not provide sufficient coverage for the vocabulary contained in clinical concept dictionaries.
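To make the notion of mining word translations from a parallel corpus concrete, the following is a minimal sketch, not the method proposed in this paper. It scores word pairs by the Dice coefficient of their sentence-level co-occurrence, a simple stand-in for the mature alignment technologies mentioned above. The function names, the English-German toy corpus `PARALLEL`, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def mine_translations(pairs):
    """Score (source word, target word) pairs by the Dice coefficient
    of their co-occurrence in aligned sentence pairs."""
    src_count = Counter()   # number of sentence pairs containing each source word
    tgt_count = Counter()   # number of sentence pairs containing each target word
    joint = Counter()       # number of sentence pairs containing both words
    for src_sent, tgt_sent in pairs:
        src_words = set(src_sent.lower().split())
        tgt_words = set(tgt_sent.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return {
        (s, t): 2.0 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint.items()
    }


def best_translation(scores, src_word):
    """Return the highest-scoring target word for src_word, or None."""
    candidates = {t: v for (s, t), v in scores.items() if s == src_word}
    return max(candidates, key=candidates.get) if candidates else None


# Hypothetical toy corpus (illustrative only, not from the paper).
PARALLEL = [
    ("the patient has fever", "der patient hat fieber"),
    ("the patient has asthma", "der patient hat asthma"),
    ("fever and cough", "fieber und husten"),
]

scores = mine_translations(PARALLEL)
print(best_translation(scores, "fever"))
```

A word that never appears in the aligned sentences receives no candidate at all, which is the coverage limitation noted above: clinical vocabulary that is absent from the available parallel corpora cannot be mined this way.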
This paper investigates methods for automatic discovery of word translation pairs from comparable corpora, i.e., collections of text in two languages that deal with the same range of topics (in this case, topics where the concepts of interest are mentioned), but do not necessarily contain translations of the same documents.
Wikipedia was used for experimentation here due to its availability, but the comparable corpora should ideally be composed of Electronic Health Records (EHRs). The proposed methods allow the two halves of a comparable corpus to be processed independently, and thus do not require moving patient records across national boundaries, which is usually subject to strict regulation.
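The independence of the two halves can be illustrated with a small sketch. It uses a generic seed-dictionary context-vector approach, which is an assumption chosen for brevity and not the KCCA-based method of this paper. Each half of the corpus is reduced, on its own, to numeric co-occurrence vectors over a shared seed dictionary, and only those vectors are compared. All names, the seed word lists, and the toy English and German sentences are hypothetical.

```python
import math


def context_vectors(sentences, seed_words, targets):
    """For each target word, count how often each seed word occurs in the
    same sentence. Uses text from one language only."""
    index = {w: i for i, w in enumerate(seed_words)}
    vectors = {t: [0] * len(seed_words) for t in targets}
    for sent in sentences:
        words = sent.lower().split()
        for t in targets:
            if t in words:
                for w in words:
                    if w in index and w != t:
                        vectors[t][index[w]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match(src_vectors, tgt_vectors):
    """Pair each source word with the target word whose vector is most similar."""
    return {
        s: max(tgt_vectors, key=lambda t: cosine(sv, tgt_vectors[t]))
        for s, sv in src_vectors.items()
    }


# Seed dictionary: position i in both lists denotes the same concept (assumed known).
SEED_EN = ["patient", "lung", "skin"]
SEED_DE = ["patient", "lunge", "haut"]

# Hypothetical comparable corpus: same topics, but not translations of each other.
EN_DOCS = [
    "pneumonia is an infection of the lung",
    "the patient with pneumonia had fluid in the lung",
    "eczema causes itchy skin",
    "the patient reported eczema on the skin",
]
DE_DOCS = [
    "pneumonie ist eine entzuendung der lunge",
    "der patient mit pneumonie hatte fluessigkeit in der lunge",
    "ekzem verursacht juckende haut",
    "der patient berichtete ueber ein ekzem auf der haut",
]

# Each call sees only one language; the two steps could run at different sites.
en_vecs = context_vectors(EN_DOCS, SEED_EN, ["pneumonia", "eczema"])
de_vecs = context_vectors(DE_DOCS, SEED_DE, ["pneumonie", "ekzem"])
print(match(en_vecs, de_vecs))
```

In this sketch only the aggregated vectors are passed to `match`, never the sentences themselves, which is the property that makes independent processing attractive when the text consists of patient records.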
The method combines multiple types of base models:
1. Kernel Canonical Correlation Analysis (KCCA) using comparable corpora that are optimized to discover orthographically similar word pairs.
2. Kernel Canonical Correl...