Method for Efficient Query and Processing upon Huge Data Set
Original Publication Date: 2009-Oct-15
Included in the Prior Art Database: 2009-Oct-15
This article describes a method for efficient ranking queries upon a huge multi-dimension dataset. The method is tailored for huge data set analysis, which is in growing demand as more and more logs are generated. Its four main features are:
a. simultaneous query upon multiple fields/attributes
b. merging multiple outputs into a single dimension vector
c. avoiding time-consuming "join"-like operations
d. reducing data set traversal to a single pass
Method for Efficient Query and Processing upon Huge Data Set
With the fast development of telecom services and web applications, huge amounts of log files are generated. Many business opportunities rely on the analysis of these records. However, how to process such huge record sets effectively under strict time requirements is still an open question.
Here we take an SMS application as an example; it is a typical application in huge record analysis. An SMS record has several fields, including sender and receiver. For a single city, 5 million SMS records are generated every day. There is a relatively strict time requirement for detecting spam sources, typically several hours or even minutes. If the latency cannot meet this requirement, it is too late to block sending from the spam source.
Figure 1 shows two typical implementations of a spam detection program. The first implementation is database-based: as shown, it needs three "SELECT" operations and one "JOIN" operation, and the JOIN is very time-consuming on a database. The second implementation is based on Hadoop, using the MapReduce framework; here four MapReduce jobs are needed. In the first job, duplicate sender-to-receiver pairs are filtered out. The second job emits each sender together with the number of distinct receivers it sent to. The third job sorts the results. The last job generates pairs of sending time and receiving time. These two implementations achieve the same function on top of different frameworks.
Figure 1 Two Typical Implementations of Spam Detection
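The chain of MapReduce jobs in the second implementation can be approximated in plain Python (a sketch with made-up toy records; the real jobs would run on a Hadoop cluster, and the record layout here is an assumption):

```python
from collections import defaultdict

# Hypothetical toy SMS records: (sender, receiver, send_time, recv_time).
records = [
    ("alice",   "bob",   1, 2),
    ("alice",   "bob",   3, 4),   # duplicate sender->receiver pair
    ("alice",   "carol", 5, 6),
    ("spammer", "u1",    1, 1),
    ("spammer", "u2",    1, 1),
    ("spammer", "u3",    1, 1),
]

# Job 1: filter out duplicate sender-to-receiver pairs.
distinct_pairs = {(s, r) for s, r, *_ in records}

# Job 2: count how many distinct receivers each sender reached.
fanout = defaultdict(int)
for s, r in distinct_pairs:
    fanout[s] += 1

# Job 3: sort senders by fan-out, highest first (likely spam sources on top).
ranked = sorted(fanout.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])
```

Each stage traverses its input again, which is exactly the repeated-pass cost the proposed approach aims to remove.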
By deeply analyzing the above two implementations, we identified the bottleneck and performed extensive experiments to form our approach. Here we propose an approach to speed up query and processing upon huge data. This approach is composed of four steps. The first step is to process multiple fields/attributes at once. The original implementation emits one item for each record; our method emits multiple items for one record, which saves much time in the later functions. Step 2 is to merge the multiple outputs into a single dimension vector. This work is done in the reduce function: it sorts according to the key, then maps different values to different value ranges. For example, if the value is 1 or -1,
we can use the upper 32 bits to represent the value range of 1 and the lower 32 bits to represent the value range of -1.
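One way to realize this packing (a minimal sketch, assuming a 64-bit word where the upper 32 bits count the +1 observations and the lower 32 bits count the -1 observations; the function names are illustrative):

```python
MASK32 = (1 << 32) - 1  # mask selecting the lower 32 bits

def fold_value(packed, value):
    """Fold one +1/-1 observation into a single 64-bit packed counter."""
    if value == 1:
        return packed + (1 << 32)   # bump the count in the upper 32 bits
    return packed + 1               # bump the count in the lower 32 bits

def unpack(packed):
    """Split the packed word back into (count of +1, count of -1)."""
    return packed >> 32, packed & MASK32

packed = 0
for v in [1, 1, -1, 1, -1]:
    packed = fold_value(packed, v)
print(unpack(packed))
```

Because both counts live in one integer, the reduce function can carry them through a single pass instead of joining two separately aggregated result sets.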