
Intelligent Reduction of Noise in Big Data with Report Based Filtering Technique

IP.com Disclosure Number: IPCOM000238014D
Publication Date: 2014-Jul-25
Document File: 4 page(s) / 72K

Publishing Venue

The IP.com Prior Art Database

Abstract

1. ABSTRACT

Today, Big Data innovation is running up against some formidable challenges: unchecked growth in data volumes leading to storage cost overruns, the immaturity and complexity of big data platforms, and the need to get insights from all the data much faster. Storage costs are increasing for companies engaging in Big Data analytics initiatives. Even though the cost of storage hardware has been declining year over year, those declines are not keeping pace with data growth. There are several ways to address this storage problem today: some companies move the data to low-cost tape, some use an advanced data compression technique to store more data in less space, and some prune old data and keep only the relevant data. One way to reduce the storage cost of big data is to reduce or mitigate the noise. Many studies indicate that in most cases the signal-to-noise ratio of big data is very low; that is, most of the data is noise (irrelevant data) and only a tiny fraction is signal (relevant data). Although there are several ways to address the big data storage problem, there are no proven techniques that efficiently reduce or mitigate big data noise.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.



2. SOLUTION ARCHITECTURE

Before getting into the details of the approach, here is a brief overview of how a big data analytics platform stores data today and how the indexed data is analyzed to build an analytics dashboard, followed by a sample scenario.

Existing Data Indexing Approach:

Figure 1: Building Blocks of Analytics Dashboard in Big Data Platform

Figure 1 explains the process a big data analytics platform follows to index the data and generate analytics from the indexed data. Once the data is indexed, users can write pipes that analyze the data and generate charts to visualize it.

Sample Scenario for Consideration:

As a sample scenario, assume we are building an analytics dashboard that monitors key system metrics such as CPU, memory and I/O activity. Below is a sample input record received from one of the monitored systems. Since we are not sure in advance which data will be useful for analytics, each system sends all potentially relevant data every 3 minutes.

{
  "%user": 40,
  "IFACE": "",
  "txkB/s": 0,
  "datetime": 1360038600000,
  "rxpck/s": 15,
  "%commit": 0,
  "%memused": 60,
  "source": "vmhost2230",
  "%nice": 10,
  "txpck/s": 10,
  "ldavg-15": 0,
  "rxkB/s": 0,
  "%swpused": 0,
  "bread/s": 10,
  "bwrtn/s": 432,
  "processes": [
    { "%MEM": 9.3, "COMMAND": "/opt/IBM/WebSphere/AppServer/java/bin/java", "PID": 28769, "USER": "root", "%CPU": 0.3 },
    { "%MEM": 0.7, "COMMAND": "nautilus", "PID": 2555, "USER": "root", "%CPU": 0.1 },
    { "%MEM": 0, "COMMAND": "[ksoftirqd/0]", "PID": 4, "USER": "root", "%CPU": 0.1 }
  ],
  "txerr/s": 0,
  "disks": [
    { "Available": 31794234, "Used": 8223634, "Capacity": 0.88, "Mounted_on": "/boot", "Filesystem": "/dev/vda1", "1024-blocks": 99150 },
    { "Available": 44163968, "Used": 14530252, "Capacity": 0.25, "Mounted_on": "/", "Filesystem": "/dev/vda2", "1024-blocks": 61834620 },
    { "Available": 39612566, "Used": 2763454, "Capacity": 0.01, "Mounted_on": "/dev/shm", "Filesystem": "tmpfs", "1024-blocks": 1961532 }
  ],
  "%iowait": 55,
  "ldavg-5": 0,
  "rxerr/s": 0,
  "ldavg-1": 0,
  "%system": 31
}
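To make the shape of such a record concrete, the following is a minimal sketch of reading one record and picking out the CPU fields with Python's standard library. The function name `extract_cpu_sample` and the constant `CPU_FACETS` are hypothetical, the record is a trimmed copy of the sample above, and the `datetime` value is assumed (not stated in the source) to be epoch milliseconds.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the sample record; only a few of its fields are kept here.
SAMPLE = """
{"%user": 40, "%system": 31, "%nice": 10, "%iowait": 55, "%memused": 60,
 "datetime": 1360038600000, "source": "vmhost2230",
 "processes": [{"%MEM": 0.7, "COMMAND": "nautilus", "PID": 2555,
                "USER": "root", "%CPU": 0.1}]}
"""

# Hypothetical name: the fields a CPU chart would need.
CPU_FACETS = ["%user", "%system", "%nice"]


def extract_cpu_sample(raw):
    """Return the source, timestamp and CPU fields of one monitoring record."""
    record = json.loads(raw)
    # Assumption: 'datetime' holds milliseconds since the Unix epoch.
    timestamp = datetime.fromtimestamp(record["datetime"] / 1000, tz=timezone.utc)
    return {
        "source": record["source"],
        "timestamp": timestamp,
        "cpu": {facet: record[facet] for facet in CPU_FACETS},
    }


sample = extract_cpu_sample(SAMPLE)
print(sample["source"], sample["timestamp"].isoformat(), sample["cpu"])
```

Only three of the record's fields end up in the `cpu` dictionary; everything else in the record is carried along but unused by this particular extraction.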

Let's see how a pipe is designed to retrieve data from this sample record and generate a visualization of CPU utilization over time.

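def cpu_usage(index, configuration):
    # query, interval and keys are not defined in this excerpt;
    # presumably they are derived from the configuration argument.
    events = search_datetimefacets(index, 'sysmonitor', query,
                                   ['%user', '%system', '%nice'], interval)
    # Plot the selected facets as a multi-area chart.
    return chart_multiarea(events, keys)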

A pipe is basically a user-defined entity that searches the indexed data based on facets and plots the search results as a visualization. Because the pipe is the building block of an analytics dashboard, the pipe takes the final decision on...
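The platform functions used by the pipe are not shown in this excerpt, so the following is only a stand-alone sketch of the idea of a facet search over time. It assumes an in-memory list of events in place of the index, and `search_datetime_facets` is a hypothetical stand-in written here for illustration, not the platform's API. The charting step is left out; the function returns the rows a chart would plot. The first event uses values from the sample record, the other two are made up.

```python
from collections import defaultdict


def search_datetime_facets(events, facets, interval_ms):
    """Hypothetical stand-in for a facet search: group events into fixed
    time buckets and average each requested facet per bucket."""
    buckets = defaultdict(list)
    for event in events:
        # Align each event to the start of its time bucket.
        bucket_start = event["datetime"] - event["datetime"] % interval_ms
        buckets[bucket_start].append(event)

    rows = []
    for bucket_start in sorted(buckets):
        group = buckets[bucket_start]
        row = {"datetime": bucket_start}
        for facet in facets:
            row[facet] = sum(e[facet] for e in group) / len(group)
        rows.append(row)
    return rows


def cpu_usage(events, interval_ms):
    """Sketch of a pipe: request only the CPU facets, return chart-ready rows."""
    return search_datetime_facets(events, ["%user", "%system", "%nice"], interval_ms)


# Illustrative events, 3 and 10 minutes apart.
events = [
    {"datetime": 1360038600000, "%user": 40, "%system": 31, "%nice": 10, "%iowait": 55},
    {"datetime": 1360038780000, "%user": 50, "%system": 29, "%nice": 10, "%iowait": 40},
    {"datetime": 1360039200000, "%user": 20, "%system": 10, "%nice": 5, "%iowait": 12},
]

rows = cpu_usage(events, interval_ms=10 * 60 * 1000)
for row in rows:
    print(row)
```

In this sketch `%iowait` is present in every event but absent from the returned rows, because the pipe does not request it as a facet.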