Method and System for Validating Collective Classification using Cohorts

IP.com Disclosure Number: IPCOM000237433D
Publication Date: 2014-Jun-18
Document File: 7 page(s) / 235K

Publishing Venue

The IP.com Prior Art Database

Related People

Eric Bax: INVENTOR [+2]

Abstract

A method and system is disclosed for validating collective classification using cohorts. The method and system holds out a cohort that includes validation nodes with known labels and working nodes with unknown labels. The validation nodes are used to estimate the error rate of a held-out classifier. The working nodes are used to estimate the rate of disagreement between the held-out classifier and a classifier based on all nodes. The sum of the two estimates is an estimated bound on the error rate of the full classifier based on all nodes.

This is the abbreviated version, containing approximately 24% of the total text.

Description

Disclosed is a method and system for validating collective classification using cohorts. The method and system estimates the generalization capability of a classifier that operates on a network, for example, a classifier that operates on a social network to estimate which types of music users prefer. The method and system is computationally efficient and accommodates network nodes whose joining depends on which users are already in the network. The method is valid regardless of the functional form of the classifier.

The method and system holds out a cohort that includes validation nodes with known labels and working nodes with unknown labels. The validation nodes are used to estimate the error rate of a held-out classifier. Thereafter, the working nodes are used to estimate the rate of disagreement between the held-out classifier and a classifier based on all nodes. The sum of the two estimates is an estimated bound on the error rate of the full classifier based on all nodes.
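The reason a sum of these two rates can bound the full classifier's error rate follows from a standard union-bound argument, sketched below. The symbols are notation assumed here for illustration and do not come from the disclosure: g_held is the held-out classifier, g_full is the classifier based on all nodes, x is a node, and y is its true label.

```latex
% Whenever g_full misclassifies x, either g_held misclassifies x as well,
% or the two classifiers disagree on x. Hence:
\[
\Pr\bigl[g_{\mathrm{full}}(x) \neq y\bigr]
\;\le\;
\Pr\bigl[g_{\mathrm{held}}(x) \neq y\bigr]
+
\Pr\bigl[g_{\mathrm{full}}(x) \neq g_{\mathrm{held}}(x)\bigr]
\]
```

The first term on the right is what the validation nodes estimate, and the second is what the working nodes estimate.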

The method and system presents probably approximately correct (PAC) bounds for network classifiers based on cohorts. A PAC bound based on probabilities over different subsets of a cohort being selected for labeling (Theorem 1) is proved. A bound based on probabilities over randomly generated cohorts (Corollary 5), similar to the usual setting for PAC bounds, is presented.
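To give a feel for what a PAC-style statement looks like, the following is a minimal sketch that uses the generic one-sided Hoeffding inequality. It is not the disclosure's Theorem 1 or Corollary 5, whose exact forms are not reproduced in this abbreviated text. The function names `hoeffding_margin` and `pac_upper_bound` are hypothetical, and the sketch assumes the empirical rate is an average of n samples bounded in [0, 1].

```python
import math


def hoeffding_margin(n, delta):
    """Margin eps such that, with probability at least 1 - delta, a true rate
    is at most its empirical rate over n samples plus eps (one-sided Hoeffding).

    Assumption: generic Hoeffding bound, not the disclosure's Theorem 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def pac_upper_bound(empirical_rate, n, delta):
    """Upper bound on a true rate, holding with probability at least 1 - delta."""
    return min(1.0, empirical_rate + hoeffding_margin(n, delta))


# Hypothetical numbers: 12 errors observed on 200 validation nodes.
bound = pac_upper_bound(12 / 200, n=200, delta=0.05)
```

The margin shrinks in proportion to one over the square root of n, so a larger set of validation nodes gives a tighter bound at the same confidence level.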

The method and system validates a network classifier in two steps. First, a portion of the labeled nodes that are in the same cohort as the nodes to be labeled is withheld from the network. Collective classification is performed on the withheld nodes using the network without them, and the known labels of the withheld nodes are used to evaluate the accuracy of that collective classification. Second, the method and system evaluates the rate of disagreement between collective classification with and without the withheld nodes. This gives a bound on the difference in error rates between the validated withheld classifier and the full classifier based on all nodes.
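The two steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the disclosed implementation: the stand-in classifier `classify` is a simple majority vote over labeled neighbors (the disclosure does not fix any functional form), and the graph, node names, and labels are invented toy data.

```python
from collections import Counter


def classify(adj, labels, targets, default=0):
    """Stand-in collective classifier (assumption): label each target node by
    majority vote over those of its neighbors whose labels are available."""
    out = {}
    for node in targets:
        votes = Counter(labels[n] for n in adj.get(node, ()) if n in labels)
        out[node] = votes.most_common(1)[0][0] if votes else default
    return out


def validate(adj, known, validation, working):
    """Return (held-out error rate, disagreement rate, their sum)."""
    cohort = set(validation) | set(working)
    # The held-out classifier sees only labels from outside the cohort.
    held_labels = {n: y for n, y in known.items() if n not in cohort}

    # Step 1: error rate of the held-out classifier on the validation nodes.
    held_val = classify(adj, held_labels, validation)
    err = sum(held_val[v] != known[v] for v in validation) / len(validation)

    # Step 2: disagreement between held-out and full classifiers on working nodes.
    held_work = classify(adj, held_labels, working)
    full_work = classify(adj, known, working)
    dis = sum(held_work[w] != full_work[w] for w in working) / len(working)
    return err, dis, err + dis


# Toy data (hypothetical): a, b, c lie outside the cohort; v* are validation
# nodes with known labels; w* are working nodes with unknown labels.
adj = {
    "v1": ["a", "b"],
    "v2": ["a", "b", "c"],
    "v3": ["c"],
    "w1": ["a", "v1"],
    "w2": ["a", "v2", "v3"],
}
known = {"a": 1, "b": 1, "c": 0, "v1": 1, "v2": 0, "v3": 0}
err, dis, bound = validate(adj, known, ["v1", "v2", "v3"], ["w1", "w2"])
```

Because the stand-in classifier only reads neighbor labels, hiding the cohort's labels is enough to emulate withholding the cohort here. A classifier that also uses graph structure would need the withheld nodes removed from the network itself.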

Consider F to be the full set of nodes in a network, with the labels of some nodes known and those of others unknown. A cohort is defined to be a subset of F for which whether labels are known or unknown is determined at random, independently and with the same probability for each...