
Method and System for Clustering Articles in an Online Environment based on Mathematical Models and Clustering Algorithms

IP.com Disclosure Number: IPCOM000238573D
Publication Date: 2014-Sep-04
Document File: 2 page(s) / 21K

Publishing Venue

The IP.com Prior Art Database

Related People

Maxim Sviridenko: INVENTOR [+6]

Abstract

A method and system is disclosed for clustering articles in an online environment using one or more mathematical models and one or more clustering algorithms. One or more articles are clustered based on similarity in their content and displayed to a user. The online environment can be, but is not limited to, a finance environment dealing with one or more stock tickers.

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 52% of the total text.


Description

Disclosed is a method and system for clustering articles in an online environment using one or more mathematical models and one or more clustering algorithms.  One or more articles are clustered based on similarity in their content and displayed to a user.  The online environment is, but need not be limited to, a finance environment dealing with one or more stock tickers.

The method and system receives a continuously arriving stream of articles, each carrying zero or more stock ticker symbols and an arrival timestamp. The method and system then applies basic natural language processing techniques to generate a feature vector with positive rational coordinates for each article. The natural language processing techniques include, but need not be limited to, stemming, a bag-of-words model and a term frequency-inverse document frequency (TF-IDF) model. The method and system then generates values for one or more features corresponding to one or more words in the article; these features include, but need not be limited to, word pairs and word triples. The method and system limits the total number of features generated to a reasonable number and uses a sparse vector representation to reflect the sparsity of the feature vector.
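The feature-generation step described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: `crude_stem` is a hypothetical toy stand-in for a real stemmer, the smoothed IDF formula and the `max_features` cap of 1000 are illustrative choices, weights are stored as floats rather than exact rationals, and the articles are processed as a batch rather than as a stream.

```python
import math
import re
from collections import Counter


def crude_stem(word):
    # Hypothetical toy stemmer; a real system would use e.g. a Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text, split on non-letters, and stem each word."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def ngrams(tokens, max_n=3):
    """Yield single words, word pairs, and word triples as tuple features."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def tfidf_vectors(articles, max_features=1000):
    """Return one sparse {feature: weight} dict per article."""
    counts = [Counter(ngrams(tokenize(a))) for a in articles]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    # Assumed capping rule: keep the features that occur in the most articles.
    kept = {f for f, _ in doc_freq.most_common(max_features)}
    n_docs = len(articles)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vec = {}
        for feature, tf in c.items():
            if feature in kept:
                # Smoothed IDF keeps every stored weight strictly positive.
                idf = math.log((1 + n_docs) / (1 + doc_freq[feature])) + 1.0
                vec[feature] = (tf / total) * idf
        vectors.append(vec)
    return vectors
```

Each dictionary stores only the features that actually occur in the article, which is one simple way to obtain the sparse representation the text calls for.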

The method and system then utilizes the one or more clustering algorithms to dynamically decide the cluster to which an article is to be attached. The cluster is either an already existing cluster or a new cluster. The method and system excludes existing clusters from consideration based on a predefined time period supplied as input to the clustering algorithm. For example, if the most recent article in an existing cluster is older than the predefined time period (say, one day), the method and system excludes that cluster from consideration.
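The assignment step, including the time-based exclusion, could look roughly like the sketch below. The available text does not specify the similarity measure or the decision rule, so cosine similarity against the closest cluster member and the `threshold` value are assumptions made purely for illustration. The `Cluster` class, the `assign` function, and the use of seconds for timestamps are likewise hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Cluster:
    articles: list = field(default_factory=list)  # (timestamp, sparse_vector) pairs
    latest: float = 0.0                           # timestamp of the newest member


def cosine(u, v):
    """Cosine similarity between two sparse {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def assign(clusters, timestamp, vector, max_age=86400.0, threshold=0.5):
    """Attach the article to the best recent cluster, or start a new one."""
    # Exclude clusters whose most recent article is older than the time window.
    active = [c for c in clusters if timestamp - c.latest <= max_age]
    best, best_sim = None, threshold
    for c in active:
        # Assumed rule: compare against the most similar member of the cluster.
        sim = max((cosine(vector, v) for _, v in c.articles), default=0.0)
        if sim >= best_sim:
            best, best_sim = c, sim
    if best is None:
        best = Cluster()
        clusters.append(best)
    best.articles.append((timestamp, vector))
    best.latest = max(best.latest, timestamp)
    return best
```

With `max_age` set to one day (86400 seconds), an article arriving more than a day after a cluster's latest member can never join that cluster, which mirrors the example given in the text.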

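Let $k_t$ be the number of clusters remaining in iteration $t$ of the clustering algorithm. For the set of articles $A_t$ that arrived before iteration $t$, there is a partition $C_1, \dots, C_{k_t}$ such that $\bigcup_{i=1}^{k_t} C_i = A_t$ and $C_i \cap C_j = \emptyset$ for all $i \neq j$. Each article $a_j$ for $a_j \in A_t$ has an associated ticker set $T_j$, timestamp $\tau_j$ and feature vector $v_j$. In the beginning of th...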