METHOD TO ANONYMIZE AND ANALYZE DATA FIELDS USING LEVEL HASHED TREES

IP.com Disclosure Number: IPCOM000233799D
Publication Date: 2013-Dec-20

Publishing Venue

The IP.com Prior Art Database

Related People

Sashank Dara: AUTHOR

Abstract

Techniques are presented herein for splitting data fields that have delimiters into members, calculating hashes of subsets of those members, and constructing a tree from the hashes. Each level of the tree holds the hashes of substrings of a corresponding degree. Such "level hashed" trees (LHTs) are used as anonymized counterparts of the data fields. Comparing hashes in LHTs yields partial, prefix, and suffix matches on the data fields.

AUTHOR:

Sashank Dara

CISCO SYSTEMS, INC.

DETAILED DESCRIPTION

    Data field anonymization is an important privacy requirement for customers of cloud-based solutions.

    Data fields need to be anonymized before they are uploaded to the cloud for use by any service offered by a cloud service provider. The anonymization process needs to occur within the enterprise network. When conventional anonymization techniques such as hashing or encryption are used, the cloud service provider loses the ability to perform meaningful analysis or correlation on the anonymized data.

    For example, enterprise network devices upload network telemetry to the cloud, where further threat correlation is performed. If the data fields are anonymized in a conventional way, the cloud may not be able to perform any correlation on those fields.

    There is a need for techniques that anonymize the data fields while still permitting a minimal set of operations on the anonymized fields for analysis and correlation. This requirement raises further problems, illustrated by the following cases.

Copyright 2013 Cisco Systems, Inc.

    Case 1. An anonymized field may need to be correlated with a non-anonymized field. For example, an anonymized email address needs to be partially matched with public domain names.

    Case 2. An anonymized field may need to be correlated with another anonymized field of another record. For example, an anonymized private internal IP address from NetFlow records may need to be compared with an anonymized private internal IP address from a Web Security Appliance (WSA) log.

    Case 3. An anonymized field may need to be correlated with another anonymized field where the anonymization is done by different organizations. For example, a filename malware.exe anonymized by Org A may need to be compared with filename malware.exe anonymized by Org B.

    There are challenges with conventional anonymization techniques. If anonymization is based on encryption, only Case 2 above can be solved. Case 1 cannot be solved because the cloud service provider may not know the encryption key used. Case 3 cannot be solved because Org A and Org B may use different encryption keys. If anonymization is based on hashing, only full-string matches are possible, by comparing hashes of the complete strings. Substring, prefix, and suffix matches are generally not possible with plain hashing of the whole field.
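
    The limitation of plain hashing, and the way the level hashes described in the abstract could address it, can be sketched as follows. This is a minimal illustration rather than the disclosed construction: the use of SHA-256, the delimiter set, the internal separator, the helper names h, split_members, level_hashes and match_kind, and the choice to hash every contiguous run of members are all assumptions made for the example.

```python
import hashlib
import re
from typing import Dict, List, Optional

SEP = "\x00"  # assumed canonical separator placed between members before hashing


def h(text: str) -> str:
    """Hex SHA-256 digest of a string (the hash function is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_members(field: str) -> List[str]:
    """Split a field into members on an assumed delimiter set: . @ / :"""
    return [m for m in re.split(r"[.@/:]", field) if m]


def level_hashes(field: str) -> Dict[int, List[str]]:
    """Level k holds the hashes of every run of k contiguous members, in order."""
    members = split_members(field)
    n = len(members)
    return {
        k: [h(SEP.join(members[i:i + k])) for i in range(n - k + 1)]
        for k in range(1, n + 1)
    }


def match_kind(tree: Dict[int, List[str]], query: str) -> Optional[str]:
    """Classify how a clear-text query matches an anonymized field, if at all."""
    q = split_members(query)
    row = tree.get(len(q), [])
    target = h(SEP.join(q))
    if target not in row:
        return None
    if len(row) == 1:
        return "full"      # the query covers every member of the field
    if row[0] == target:
        return "prefix"
    if row[-1] == target:
        return "suffix"
    return "partial"


# Plain hashing: the digest of the whole address says nothing about its domain.
print(h("alice@example.com") == h("example.com"))   # False

# Level hashes: only the digests are uploaded, yet a public domain name
# can still be matched against the anonymized address (Case 1).
tree = level_hashes("alice@example.com")
print(match_kind(tree, "example.com"))   # suffix
print(match_kind(tree, "evil.com"))      # None
```

    Because the sketch uses an unkeyed hash, two organizations that anonymize the same filename independently would produce identical level hashes, which is the property Case 3 requires.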

    The foregoing techniques work well for data fields that have delimiters i...