Browse Prior Art Database

Method and System for Validating Nearest Neighbor Classifiers Using McDiarmid Bounds

IP.com Disclosure Number: IPCOM000234072D
Publication Date: 2014-Jan-09
Document File: 4 page(s) / 128K

Publishing Venue

The IP.com Prior Art Database

Related People

Eric Bax: INVENTOR

Abstract

A method and system is disclosed for validating nearest neighbor classifiers using McDiarmid bounds. The method and system develops an error bound for k nearest neighbor classifiers by applying McDiarmid's inequality to the leave-one-out estimate. The nearest neighbor classifiers are based on a set of in-sample examples, each with an input and a label. The classifier labels out-of-sample examples, whose inputs are known but whose labels are not. To classify an out-of-sample example, the k nearest neighbor classifier identifies the k in-sample examples that are "closest" to the input of the out-of-sample example. It then assigns the out-of-sample example the label shared by the majority of those k neighbors.

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 50% of the total text.

Description

Disclosed is a method and system for validating nearest neighbor classifiers using McDiarmid bounds.  The nearest neighbor classifiers are based on a set of in-sample examples, each with an input and a label.  The input can be any data such as, but not limited to, a person's contacts and recent browsing behavior.  An example of a label is whether that person likes jazz.  The classifier labels out-of-sample examples, whose inputs are known but whose labels are not.  To classify an out-of-sample example, the k nearest neighbor classifier identifies the k in-sample examples that are "closest" to the input of the out-of-sample example.  It then assigns the out-of-sample example the label shared by the majority of those k neighbors.
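The classification step above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the Euclidean distance metric, the function name, and the toy "likes jazz" data are assumptions introduced here for clarity.

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """Label x with the majority label among its k nearest in-sample examples."""
    # Distance from x to every in-sample input (Euclidean, an assumption here)
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k closest examples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # majority vote among the k neighbors

# Toy in-sample data: inputs in the plane, label 1 = "likes jazz", 0 = "does not"
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_classify(X, y, np.array([1.0, 0.9]), k=3))  # prints 1
```

The out-of-sample input [1.0, 0.9] sits near the two label-1 examples, so two of its three nearest neighbors carry label 1 and the majority vote assigns label 1.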

The method assesses the accuracy of k-nearest neighbor classifiers based on the balance of influence among the in-sample examples.  The method and system gives a tighter error bound if no single in-sample example is likely to be the nearest neighbor of a large fraction of out-of-sample examples.  Under that condition, the bound is superior to previously known bounds.  The method and system also includes a version of the validation method tailored to the big data setting.  That version uses only a subsample of the in-sample data to reduce the computation needed for the error bound.  As a side effect, the subsampling can actually improve the bound.

Consider a scenario in which n in-sample examples are drawn independently and identically distributed (i.i.d.) from a joint input-label distribution D, and let g* be the k nearest neighbor classifier based on all n in-sample examples.  The goal of the method and system is to bound r*, the expected error rate of g* on out-of-sample examples drawn from D.

For each example i, let gi be the k nearest neighbor classifier based on the other n - 1 in-sample examples.  Let r̂ be the leave-one-out error rate estimate for g*: the fraction of leave-one-out classifiers gi that misclassify their held-out example i.  The following theorem gives an error bound for g* in terms of the leave-one-out estimate:
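The leave-one-out estimate r̂ defined above can be computed directly: build gi by withholding example i, and count how often gi mislabels it. A minimal sketch, again assuming Euclidean distance and toy data not taken from the disclosure:

```python
import numpy as np

def loo_error_rate(X, y, k=1):
    """Leave-one-out error estimate r-hat: the fraction of examples i that are
    misclassified by the k-NN classifier g_i built on the other n - 1 examples."""
    n = len(X)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i                       # leave example i out
        dists = np.linalg.norm(X[keep] - X[i], axis=1)
        nearest = np.argsort(dists)[:k]                # k closest of the remaining n - 1
        labels, counts = np.unique(y[keep][nearest], return_counts=True)
        if labels[np.argmax(counts)] != y[i]:          # does g_i misclassify example i?
            errors += 1
    return errors / n

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(loo_error_rate(X, y, k=1))  # 0.0 on this well-separated toy set
```

On this toy set each example's nearest remaining neighbor shares its label, so every leave-one-out classifier gi is correct and r̂ = 0.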

Theorem 1: For ε > 0,

... (1)

where m is the maximum ov...
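Equation (1) and the definition of m are truncated in this abbreviated version and are left as given above. For background only, McDiarmid's bounded-differences inequality, which the disclosure applies to the leave-one-out estimate, takes the following general form (this is the standard statement of the inequality, not the disclosure's specific bound (1)):

```latex
% McDiarmid's bounded-differences inequality: if changing any single
% argument X_i changes f by at most c_i, then for epsilon > 0,
\Pr\!\left[ f(X_1,\dots,X_n) - \mathbb{E}\, f(X_1,\dots,X_n) \ge \epsilon \right]
\le \exp\!\left( -\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2} \right)
```

Applying such an inequality to the leave-one-out estimate requires bounding how much r̂ can change when a single in-sample example is replaced, which is where the balance of influence among in-sample examples discussed above enters the bound.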