Surety is performing system maintenance this weekend. Electronic date stamps on new Prior Art Database disclosures may be delayed.
Browse Prior Art Database

Temporal Correlation Analysis for Keywords in Text Stream

IP.com Disclosure Number: IPCOM000191503D
Original Publication Date: 2010-Jan-06
Included in the Prior Art Database: 2010-Jan-06
Document File: 6 page(s) / 111K

Publishing Venue



A program is disclosed that processes textual data with timestamps and detects highly-correlated pairs of keywords in the data. The background of the invention is: the amount of streaming textual data has been drastically increasing. Typical examples of such data are news feeds, online discussions, blog and email. It is a challenge to analyze correlation of occurrences of keywords, taking into consideration both the underlying context and temporal dependency of keywords. More specifically, the problem is defined as follows: For a given "seed" keyword A, compute a correlation score r(A,B) for each B appearing in the current stream data under consideration. r(A,B) should reflect the strength of A to affect the occurrences of B.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 53% of the total text.

Page 1 of 6

Temporal Correlation Analysis for Keywords in Text Stream

Figure 1 shows conceptual diagram of a text stream that is processed by the disclosed program.

Each incoming data unit in the stream is represented by a keyword vector in addition to a timestamp, and is denoted by xi = (vi, ti). (i = 1,2,...n)
vi is a set of keywords, for example vi={stock, weather, plunge}. Let V be a set of all (possible) keywords.

These keywords may be produced by preprocessing of raw text data before implementation of this program.
ti is a timestamp corresponding to the occurrence of vi in the stream.

The objective is to find relevant pairs of keywords (a, b) for a fixed a, where the occurrences of a in the stream affects the occurrences of

Figure 1: Conceptual diagram of a text stream

In addition to xi=(vi, ti) (i=1,2,...,n) we define two similarity functions.

Let be any function for decaying factor for time intervals.

Typically (t) = exp(-at) is useful (a>0 is a constant).

Let be any function for measuring similarities of keyword vectors.

Typically the normal inner product of two vectors is available for f: For x=(x1,x2,...,xm), y=(y1,y2,...,ym), f(x,y):=x1y1+x2y2+...+xmym where xj is j-th keyword in V (keywords are conceptually indexed).


[This page contains 4 pictures or other non-text objects]

Page 2 of 6

For a time interval I and a keyword a, define

Fix a keyword a, and two time intervals IA and IB. S(a, IA) and S(b, IB) are shown in Figure 1. For each keyword b the correlation value r(a, IA;b, IB) is defined as follows:

Greater value of r means higher correlation between a and b.

Figure 2 shows the details of calculation of r(a, IA;b, IB) for all possible b.

Output r is an array which contains correlation value of each keyword b.


[This page contains 3 pictures or other non-text objects]

Page 3 of 6

Figure 2: Algorithm of calculation of r(a, IA;b, IB).

Time complexity of calculation of r in Figure 2 is O(k|IA||IB|) where k is the average number of keywords in a vector, and |IA| and |IB| are the number of vectors in IA and IB respectively.

In case of (t) is an exponential function of t and f is in a form of inner product, the delta of r can be calculated as follows.

Assume that the program has already calculated r(a, IA;b, I...