Dismiss
InnovationQ will be updated on Sunday, Oct. 22, from 10am ET - noon. You may experience brief service interruptions during that time.
Browse Prior Art Database

Temporal Correlation Analysis for Keywords in Text Stream

IP.com Disclosure Number: IPCOM000191503D
Original Publication Date: 2010-Jan-06
Included in the Prior Art Database: 2010-Jan-06
Document File: 6 page(s) / 111K

Publishing Venue

IBM

Abstract

A program is disclosed that processes textual data with timestamps and detects highly-correlated pairs of keywords in the data. The background of the invention is: the amount of streaming textual data has been drastically increasing. Typical examples of such data are news feeds, online discussions, blog and email. It is a challenge to analyze correlation of occurrences of keywords, taking into consideration both the underlying context and temporal dependency of keywords. More specifically, the problem is defined as follows: For a given "seed" keyword A, compute a correlation score r(A,B) for each B appearing in the current stream data under consideration. r(A,B) should reflect the strength of A to affect the occurrences of B.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 53% of the total text.

Page 1 of 6

Temporal Correlation Analysis for Keywords in Text Stream

Figure 1 shows conceptual diagram of a text stream that is processed by the disclosed program.

Each incoming data unit in the stream is represented by a keyword vector in addition to a timestamp, and is denoted by xi = (vi, ti). (i = 1,2,...n)
vi is a set of keywords, for example vi={stock, weather, plunge}. Let V be a set of all (possible) keywords.

These keywords may be produced by preprocessing of raw text data before implementation of this program.
ti is a timestamp corresponding to the occurrence of vi in the stream.

The objective is to find relevant pairs of keywords (a, b) for a fixed a, where the occurrences of a in the stream affects the occurrences of
b.

Figure 1: Conceptual diagram of a text stream

In addition to xi=(vi, ti) (i=1,2,...,n) we define two similarity functions.

Let be any function for decaying factor for time intervals.

Typically (t) = exp(-at) is useful (a>0 is a constant).

Let be any function for measuring similarities of keyword vectors.

Typically the normal inner product of two vectors is available for f: For x=(x1,x2,...,xm), y=(y1,y2,...,ym), f(x,y):=x1y1+x2y2+...+xmym where xj is j-th keyword in V (keywords are conceptually indexed).

1

[This page contains 4 pictures or other non-text objects]

Page 2 of 6

For a time interval I and a keyword a, define

Fix a keyword a, and two time intervals IA and IB. S(a, IA) and S(b, IB) are shown in Figure 1. For each keyword b the correlation value r(a, IA;b, IB) is defined as follows:

Greater value of r means higher correlation between a and b.

Figure 2 shows the details of calculation of r(a, IA;b, IB) for all possible b.

Output r is an array which contains correlation value of each keyword b.

2

[This page contains 3 pictures or other non-text objects]

Page 3 of 6

Figure 2: Algorithm of calculation of r(a, IA;b, IB).

Time complexity of calculation of r in Figure 2 is O(k|IA||IB|) where k is the average number of keywords in a vector, and |IA| and |IB| are the number of vectors in IA and IB respectively.

In case of (t) is an exponential function of t and f is in a form of inner product, the delta of r can be calculated as follows.

Assume that the program has already calculated r(a, IA;b, I...