CONVERTING STRING DATA TO FIXED LENGTH NUMERICAL VECTORS

IP.com Disclosure Number: IPCOM000245549D
Publication Date: 2016-Mar-16
Document File: 8 page(s) / 179K

Publishing Venue

The IP.com Prior Art Database

Related People

Tomas Pevny: AUTHOR [+2]

Abstract

Techniques are provided for converting Uniform Resource Locators (URLs), or other strings in network security and management domains, to fixed length numerical vectors. This enables anomaly detectors, classifiers, and other such machine learning techniques to be applied to data strings in an intrusion detection system (IDS) or intrusion prevention system (IPS). Furthermore, the techniques are computationally efficient, memory efficient, and simple, such that they can be applied, without modification, to a wide variety of string data commonly encountered in network security and management operations. By comparison, other techniques must utilize external data sources, such as up-to-date language dictionaries, to be applied to different types of string data.

This is the abbreviated version, containing approximately 19% of the total text.

AUTHORS:

Tomas Pevny

Martin Kopp

CISCO SYSTEMS, INC.

DETAILED DESCRIPTION

    One issue that frequently prevents machine learning methods from being used to detect anomalies or already known threats in string representations of data (referred to herein as "string data") is that string data is not uniformly represented. Such string data is common in networking; examples include email protocol headers and Uniform Resource Locators ("URLs") in Hypertext Transfer Protocol (Secure) ("HTTP(S)") requests. For example, a URL consists of multiple parts with different meanings (e.g., domain, path, query), and each part can have a different length due to a different number of tokens (e.g., sub-domains, parameters, directory levels). Consequently, it is very difficult or impossible to detect similarities (or differences) between different strings.
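
    The non-uniformity described above can be made concrete with a minimal sketch. The sketch assumes Python's standard urllib.parse module; the helper name url_token_counts and the two example URLs are hypothetical and are used only for illustration. It counts the tokens in the domain, path, and query parts of each URL and shows that the counts differ from one URL to the next.

```python
from urllib.parse import urlsplit, parse_qsl


def url_token_counts(url):
    """Count tokens in the domain, path, and query parts of a URL.

    Hypothetical helper for illustration only; not part of the disclosure.
    """
    parts = urlsplit(url)
    # Domain tokens: labels separated by dots (sub-domains, domain, TLD).
    host_tokens = parts.hostname.split(".") if parts.hostname else []
    # Path tokens: non-empty directory levels separated by slashes.
    path_tokens = [seg for seg in parts.path.split("/") if seg]
    # Query tokens: key/value parameters, keeping parameters with empty values.
    query_tokens = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "domain": len(host_tokens),
        "path": len(path_tokens),
        "query": len(query_tokens),
    }


# Hypothetical example URLs with different numbers of tokens per part.
EXAMPLE_URLS = [
    "http://example.com/index.html",
    "https://a.b.example.org/x/y/z?id=1&lang=en&debug=",
]

if __name__ == "__main__":
    for url in EXAMPLE_URLS:
        print(url, url_token_counts(url))
```

    Because every part can hold a different number of tokens, simply concatenating the tokens does not yield an input of fixed length.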

Copyright 2016 Cisco Systems, Inc.

    Although it is possible to determine that some strings contain similar substrings, or similar paths, or are similar in the number of subpaths, etc., there is no efficient way to compute a well-defined numerical distance value based directly on these string content observations. Moreover, relying on obvious string similarities ignores possibly crucial structures in the strings that human analysts may be unaware of. Since most machine-learning methods require numerical vectors of fixed length as input, converting string data to this representation without loss of information is desirable. For example, a well-defined transformation from the input string space into a standard vector space reduces the similarity problem to standard distance evaluation on vector spaces, for which, e.g., Euclidean distance or other norms can be used.
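
    The idea of reducing string similarity to a distance in a vector space can be sketched as follows. This is not the technique of this disclosure, which is not reproduced in this excerpt. As a stand-in, the sketch hashes character bigrams into a vector of an assumed length DIM = 64, which is a lossy mapping chosen only because it is short. The names to_fixed_vector, euclidean, and DIM are hypothetical.

```python
import math
import zlib

DIM = 64  # assumed vector length, chosen arbitrarily for this sketch


def to_fixed_vector(text, dim=DIM):
    """Map a string of any length to a vector with exactly `dim` entries.

    Stand-in transformation (hashed character-bigram counts); it is NOT
    the method of the disclosure and it does lose information.
    """
    vec = [0.0] * dim
    for i in range(len(text) - 1):
        bigram = text[i:i + 2].encode("utf-8")
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(bigram) % dim] += 1.0
    return vec


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    # Hypothetical URLs of different lengths map to vectors of equal length.
    a = to_fixed_vector("http://example.com/a")
    b = to_fixed_vector("https://sub.example.org/a/b/c?x=1&y=2")
    print(len(a), len(b), euclidean(a, b))
```

    Once every string is mapped to a vector of the same length, any standard norm can be substituted for the Euclidean distance without changing the rest of the pipeline.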

    In many instances, attempts to convert string data to uniform representations are based on natural language processing (NLP). NLP offers various approaches to convert words/documents to a real-valued vector of fixed length (e.g., word2vec). However, NLP approaches have several drawbacks that effectively prohibit their use in an intrusion detection system (IDS) or intrusion prevention system (IPS), which must operate in or ne...