Feature Identification for Code Classification Using Suffix Arrays and Correlation Metrics
Disclosure Number: IPCOM000242255D
Publication Date: 2015-Jun-29
Document File: 4 page(s) / 71K

Publishing Venue

The Prior Art Database


A modification to the suffix array data structure is described that allows efficient, in-memory ranking of relevance phrases learned from a corpus of medical or clinical notes. The text and billing codes of the clinical notes are used as input, and a ranked list of arbitrary-length phrases is created. These phrases can be used as features in a subsequent code classification system.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 37% of the total text.

A suffix array is a computer science data structure that can compactly represent the order of tokens in text. Tokens, also known as strings, are defined by separating the characters or terms of streams of data.

Each cell in the suffix array consists of two pointers, typically integers: 1) is an index into a lookup table, which maps to the surface form of the token and 2) points to the subsequent token in a particular sentence. The final token in a sentence typically points to a special "NULL" value, which indicates the end of the sentence.
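
The cell layout described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the disclosed implementation: the names Cell, encode, and read_sentence are hypothetical, tokens are assumed to be pre-split and lower-cased, and lookup index 0 is assumed to be reserved for the "NULL" end-of-sentence marker. The input is the pair of example sentences used in this disclosure.

```python
from typing import List, NamedTuple, Tuple

NULL_TERM = 0  # assumption: lookup index 0 is reserved for the "NULL" end marker


class Cell(NamedTuple):
    term_index: int  # index into the lookup table (surface form of the token)
    next_cell: int   # position on the stack of the cell for the next token


def encode(sentences: List[List[str]]) -> Tuple[List[str], List[Cell]]:
    """Build the lookup table and the stack of cells for tokenized sentences."""
    lookup = ["NULL"]
    index_of = {}
    stack: List[Cell] = []
    for tokens in sentences:
        for token in tokens:
            term = token.lower()  # assumption: terms are case-insensitive
            if term not in index_of:
                index_of[term] = len(lookup)
                lookup.append(term)
            stack.append(Cell(index_of[term], len(stack) + 1))
        # Close the sentence with a NULL cell.
        stack.append(Cell(NULL_TERM, len(stack) + 1))
    return lookup, stack


def read_sentence(lookup: List[str], stack: List[Cell], start: int) -> List[str]:
    """Follow next-cell pointers from `start` until the NULL cell is reached."""
    words = []
    cell = stack[start]
    while cell.term_index != NULL_TERM:
        words.append(lookup[cell.term_index])
        cell = stack[cell.next_cell]
    return words


SENTENCES = [
    ["An", "aardvark", "ate", "an", "apple", "pie"],
    ["Amy", "loves", "apple", "pie"],
]
lookup, stack = encode(SENTENCES)
```

Calling read_sentence(lookup, stack, 7) walks the second sentence from its first cell, which shows how any suffix can be recovered from a single cell position.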

As an example, consider the following two sentences:

1) An aardvark ate an apple pie.

2) Amy loves apple pie.

The terms would be represented as pointers in a lookup table as displayed in Table 1.

Table 1: Term to Pointer Correlation

Term        Pointer
An          1
Aardvark    2
Ate         3
Apple       4
Pie         5
Amy         6
Loves       7

Encoding of cells, for example, may be defined by {"term index","pointer to next cell"} and is represented in Table 2.

Table 2: Suffix Array Cell Encoding

Index on        Cell Encoding                           Term
Memory Stack    {Term Index, Pointer to Next Cell}
0               {1,1}                                   An
1               {2,2}                                   Aardvark
2               {3,3}                                   Ate
3               {1,4}                                   An
4               {4,5}                                   Apple
5               {5,6}                                   Pie
6               {0,7}                                   NULL
7               {6,8}                                   Amy
8               {7,9}                                   Loves
9               {4,10}                                  Apple
10              {5,11}                                  Pie
11              {0,12}                                  NULL

The pointer points onto the memory stack where the cell is stored. In an actual (or live) implementation, the operating system keeps track of the pointer onto the stack.

In addition to the memory stack, the cells are also stored in an array of cells. After every sentence in the corpus has been added to the array, the cells are sorted lexically, from 'a' to 'z'. Crucially, the order on the memory stack is not affected by sorting the array, and thus the fidelity of the "pointers to the next cell" is maintained throughout this sorting operation. If the cells for two tokens are identical, then they are ordered by the token of the cell they point to, and so on.
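
The sorting step can be sketched as follows, using the lookup table and cell encodings from the two example sentences. This is a hedged illustration rather than the disclosed implementation: suffix_key and sorted_cell_positions are hypothetical names, the sorted array is assumed to hold positions into the memory stack, and NULL cells are assumed to be left out of the sorted array. The stack itself is never reordered, so the next-cell pointers stay valid.

```python
# Lookup table and memory stack for the two example sentences.
# Each stack entry is (term index, pointer to next cell); term index 0 is NULL.
LOOKUP = ["NULL", "an", "aardvark", "ate", "apple", "pie", "amy", "loves"]
STACK = [(1, 1), (2, 2), (3, 3), (1, 4), (4, 5), (5, 6), (0, 7),
         (6, 8), (7, 9), (4, 10), (5, 11), (0, 12)]


def suffix_key(pos):
    """Tokens from the cell at `pos` up to, but not including, the NULL cell."""
    tokens = []
    while STACK[pos][0] != 0:
        term_index, pos = STACK[pos]
        tokens.append(LOOKUP[term_index])
    return tokens


def sorted_cell_positions():
    """Return stack positions ordered lexically by the suffix starting there.

    Cells with the same token are ordered by the token of the cell they
    point to, and so on, because the key is the whole remaining suffix.
    """
    positions = [p for p, (term, _) in enumerate(STACK) if term != 0]
    return sorted(positions, key=suffix_key)


order = sorted_cell_positions()
```

Because Python's sort is stable, two cells whose remaining suffixes are identical (for example the two "apple pie" cells) keep their original stack order.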


... containing the note-level supervision for a clinical note. Typically, this supervision is a set of billing codes for medical or clinical notes or documentation.

As an example, consider the two sentences:

1) An aardvark ate an apple pie.

2) Amy loves apple pie.

Given the following assignment of supervision codes, apple pie = AP and aardvark = AA, the first sentence would carry the set {AA, AP} and the second sentence would carry the set {AP}.
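
The code sets in this example can be reproduced with a short sketch. It is illustrative only: in the setting described here the supervision arrives with each note, so deriving it from a phrase-to-code table by naive substring matching is an assumption made purely to mirror the example, and CODE_FOR_PHRASE and supervision_set are hypothetical names.

```python
# Hypothetical phrase-to-code assignment taken from the example above.
CODE_FOR_PHRASE = {"apple pie": "AP", "aardvark": "AA"}


def supervision_set(sentence):
    """Return the set of codes whose phrase occurs in the sentence."""
    text = sentence.lower()
    return {code for phrase, code in CODE_FOR_PHRASE.items() if phrase in text}


first = supervision_set("An aardvark ate an apple pie.")
second = supervision_set("Amy loves apple pie.")
```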

Instead of merely counting the occurrences of different phrases, statistics are tallied about the associations of the phrases to t...