Lazy method for fast voting-based ensemble scoring
Publication Date: 2015-May-27
The IP.com Prior Art Database
The present invention seeks to overcome the deficiency described above by providing a lazy method for fast voting-based ensemble scoring, in which each base model's relative contribution to the final score is considered on the fly by updating a shared data structure that holds the partial score for the record, until a termination condition is met. Experiments conducted for the present invention show that the average scoring time is reduced by 20%-40% in the random case. In general an even lower time cost can be expected, because real base models do not generate predictions randomly.
Ensemble data mining methods are machine learning methods that leverage multiple models to achieve better prediction accuracy than any of the individual models could achieve on its own. Ensemble methods are widely and effectively used in the financial and banking industries (e.g. credit scoring), in bioinformatics (e.g. protein/drug design), and in other fields.
Majority voting and weighted majority voting are frequently used for classification problems. Other ensemble methods exist, such as stacking (Wolpert 1992), mixture of experts (Jacobs 1991), and algebraic methods based on probabilities. However, many empirical studies have shown that (weighted) majority voting often works remarkably well (Kittler 1998).
To score an ensemble model, each base model is scored and the individual scores are combined to produce a final score. This achieves better prediction accuracy, but ensemble scoring takes more time. The deficiency has been amplified in the Big Data era, especially when real-time scoring is required, yet there has been surprisingly little work on improving ensemble scoring speed.
Conventional ensemble scoring consists of two steps. First, each base model is scored to give its own prediction. Second, the generated predictions are combined in a final ensemble step according to the specific ensemble method used (e.g., voting for classification, averaging for regression). The total time cost is therefore the sum of the two parts, and the first step usually takes up most of the time.
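As an illustration of the conventional two-step flow, the following minimal Python sketch scores every base model first and only then combines the predictions by weighted majority voting. The function name conventional_weighted_vote, the (weight, predict) pair representation, and the three toy models are assumptions made for this example and do not come from the disclosure.

```python
from collections import defaultdict


def conventional_weighted_vote(models, record):
    """Score all base models, then combine them by weighted majority voting.

    `models` is a list of (weight, predict) pairs, where `predict` is a
    callable that maps a record to a class label (hypothetical representation).
    """
    # Step 1: every base model is scored, regardless of its weight.
    predictions = [(weight, predict(record)) for weight, predict in models]

    # Step 2: the predictions are combined by summing the weight per label.
    totals = defaultdict(float)
    for weight, label in predictions:
        totals[label] += weight
    return max(totals, key=totals.get)


# Toy example: the two lighter models together outvote the heaviest one.
toy_models = [
    (0.40, lambda record: "A"),
    (0.35, lambda record: "B"),
    (0.25, lambda record: "B"),
]
toy_winner = conventional_weighted_vote(toy_models, record=None)
print(toy_winner)
```

In this flow the cost of step 1 grows with the number of base models, because none of them can be skipped.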
The present invention provides a lazy method for voting-based ensemble scoring, which usually does not need to score all base models. Instead, it fuses the two conventional steps together, using a well-designed data structure to give a fast ensemble prediction. The detailed process is depicted in FIGURE 1.
Preparation work is done in step 101, including setting up the data structure and normalizing the base model weights. For example, consider a classification problem with 5 target values (A,B,C,D,E) and 10 base models with the following weights and predictions: [(1,A), (2,B), (3,C), (4,D), (5,E), (6,A), (7,B), (8,A), (9,C), (10,A)]. For each incoming record, the initial status of the data structure is shown in FIGURE 2.
The figure contains 3 variables (top, second, weightLeft), which are specified below. After weight normalization, the normalized weights are as follows:
[(normalized weight, prediction)] = [(0.018,A), (0.036,B), (0.055,C), (0.073,D), (0.091,E), (0.109,A), (0.127,B), (0.145,A), (0.164,C), (0.182,A)]
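The normalization in step 101 amounts to dividing each weight by the sum of all weights (here 55). The short Python sketch below reproduces the list above; the function name normalize_weights is an assumption made for this example, and the predictions are paired with the weights as they appear in the normalized list.

```python
def normalize_weights(models):
    """Divide each raw weight by the total so that the weights sum to 1."""
    total = sum(weight for weight, _ in models)
    return [(weight / total, label) for weight, label in models]


# Raw weights 1..10, with predictions paired as in the normalized list above.
raw_models = [(1, "A"), (2, "B"), (3, "C"), (4, "D"), (5, "E"),
              (6, "A"), (7, "B"), (8, "A"), (9, "C"), (10, "A")]

normalized = normalize_weights(raw_models)
rounded = [(round(weight, 3), label) for weight, label in normalized]
print(rounded)
```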
In step 102, the 10 base models are sorted in descending order of normalized weight:
[(normalized weight, prediction)] = [(0.182,A), (0.164,C), (0.145,A), (0.127,B), (0.109,A), (0.091,E), (0.073,D), (0.055,C), (0.036,B), (0.018,A)]
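The sorting in step 102, together with one possible reading of the lazy loop, can be sketched in Python as follows. This is a sketch under stated assumptions, not the disclosed procedure: the stopping rule (stop once top minus second exceeds weightLeft, so that the unscored models can no longer change the winner) is inferred from the variable names top, second and weightLeft, and the function names, the callable representation of base models, and the use of Fraction for exact arithmetic are illustrative choices.

```python
from fractions import Fraction


def constant_model(label):
    """Hypothetical stand-in for a base model that always predicts `label`."""
    return lambda record: label


def lazy_weighted_vote(models, record):
    """Return (winning label, number of base models actually scored)."""
    total = sum(weight for weight, _ in models)
    # Normalize exactly, then sort by descending weight (step 102).
    ordered = sorted(
        ((Fraction(weight, total), predict) for weight, predict in models),
        key=lambda pair: pair[0],
        reverse=True,
    )
    partial = {}               # label -> accumulated normalized weight
    weight_left = Fraction(1)  # total weight of the models not yet scored
    scored = 0
    for weight, predict in ordered:
        label = predict(record)  # only this base model is scored now
        scored += 1
        partial[label] = partial.get(label, Fraction(0)) + weight
        weight_left -= weight
        ranked = sorted(partial.values(), reverse=True)
        top = ranked[0]
        second = ranked[1] if len(ranked) > 1 else Fraction(0)
        # Assumed termination condition: the leader can no longer be caught.
        if top - second > weight_left:
            break
    winner = max(partial, key=partial.get)
    return winner, scored


# Raw weights and predictions, paired as in the sorted list above.
example = [(1, "A"), (2, "B"), (3, "C"), (4, "D"), (5, "E"),
           (6, "A"), (7, "B"), (8, "A"), (9, "C"), (10, "A")]
models = [(weight, constant_model(label)) for weight, label in example]
winner, scored = lazy_weighted_vote(models, record=None)
print(winner, scored)
```

Under the assumed rule, this sketch stops on the example after scoring six of the ten base models. A strict inequality is used so that a possible tie still forces further scoring, and exact fractions avoid floating-point noise when top minus second equals weightLeft.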
A loop starts in step 103. First, the base model with the highest weight, (0.182,A), is selected...