
Identifying patterns in string data
Disclosure Number: IPCOM000010942D
Original Publication Date: 2003-Feb-03
Included in the Prior Art Database: 2003-Feb-03
Document File: 2 page(s) / 41K

Publishing Venue



An approach for transforming an arbitrary string into a floating point quantity is presented. This technique is important because it allows arbitrary string data to be plotted together with other numeric data, which facilitates the search for patterns and relationships through graphical means.

This is the abbreviated version, containing approximately 50% of the total text.


Identifying patterns in string data

Using graphical techniques - e.g., for correlation or trend analysis - is straightforward when applied to data of primitive types (e.g., int, long, double, ...) or wrapper types (e.g., Integer, Long, ...). For example, it is not hard to produce a graph showing values changing over time. This technique is useful for detecting patterns and relationships in method argument values, return values and execution times. However, strings play a key role in the arguments and return values of Java* and other OO languages, and looking for patterns in string values is not so straightforward.

     This disclosure describes a general mechanism for coping with strings in analysis situations.

     The first approach is to consider attributes of the string, such as length(). This is quite effective because, when looking for factors affecting performance, string length is often a key factor. For example, a plot of string length versus method execution time could reveal a correlation between the two, which might suggest a tuning or improvement approach for the method in question.
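
     As an illustration of the length-based approach, the following is a minimal sketch in Java. It records the length of each argument string together with the elapsed time of one call, then computes a Pearson correlation coefficient over the pairs. The class name LengthVsTime, the helper names, and the string-reversing method standing in for the "method in question" are all hypothetical and are not part of the disclosure. Single timings taken this way are noisy, so the result is only indicative.

```java
import java.util.Arrays;
import java.util.function.Consumer;

// Hypothetical sketch: relate string length to method execution time.
class LengthVsTime {

    // Pearson correlation coefficient of two equally sized samples.
    // Returns NaN if either sample has zero variance.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two samples of equal size >= 2");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Elapsed wall-clock time, in nanoseconds, of one call with the given argument.
    static long timeNanos(Consumer<String> method, String arg) {
        long start = System.nanoTime();
        method.accept(arg);
        return System.nanoTime() - start;
    }

    // Builds a test string of the requested length.
    static String ofLength(int n) {
        char[] cs = new char[n];
        Arrays.fill(cs, 'x');
        return new String(cs);
    }

    public static void main(String[] args) {
        // Stand-in for the method being analysed (an assumption for this sketch).
        Consumer<String> methodUnderAnalysis = s -> new StringBuilder(s).reverse().toString();

        int[] lengths = {1_000, 10_000, 100_000, 1_000_000};
        double[] x = new double[lengths.length];
        double[] y = new double[lengths.length];
        for (int i = 0; i < lengths.length; i++) {
            String arg = ofLength(lengths[i]);
            x[i] = arg.length();                         // attribute of the string
            y[i] = timeNanos(methodUnderAnalysis, arg);  // observed execution time
        }
        System.out.println("correlation(length, time) = " + pearson(x, y));
    }
}
```

     A coefficient near +1 or -1 would suggest that string length is worth investigating as a performance factor. The same (length, time) pairs could equally be handed to a plotting tool.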

     A different approach is to characterise strings as unique values - e.g., through the hashCode() that can be computed for any Java string. Again, this is valuable, but it is not completely satisfactory. One problem is that there is no guarantee that the strings "abc" and "abcd" are considered close to each other in value, because the hash function is not defined to preserve any notion of closeness between strings. Also, with this approach the hashCode() values of two distinct strings are not necessarily distinct, which could cause apparent patterns to appear that are false patterns in terms of the actual data.
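
     Both limitations can be observed directly with the standard java.lang.String.hashCode(). The following short sketch is illustrative only; the class name HashCodeLimits and the helper gap() are hypothetical.

```java
// Hypothetical sketch: two limitations of plotting strings by hashCode().
class HashCodeLimits {

    // Absolute distance between the hash codes of two strings.
    static long gap(String a, String b) {
        return Math.abs((long) a.hashCode() - b.hashCode());
    }

    public static void main(String[] args) {
        // Closeness is not preserved: "abcd" extends "abc" by one character,
        // yet its hash code lies far from that of "abc".
        System.out.println("abc".hashCode());   // prints 96354
        System.out.println("abd".hashCode());   // prints 96355
        System.out.println("abcd".hashCode());  // prints 2987074

        // Distinct strings can collide, so two different values would be
        // plotted as one point.
        System.out.println("Aa".hashCode() == "BB".hashCode());  // prints true
    }
}
```

     Here "abc" and "abd" happen to land next to each other, while "abc" and "abcd" are separated by nearly three million, so distance on a hashCode() axis says little about how similar two strings are.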

     The proposed approach is to consider strings from the perspective of their actual values.
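
     The abbreviated text available here does not give the details of the mapping used by this disclosure, so the following Java sketch should be read only as one possible way of turning a string's actual value into a floating point quantity; it is not the disclosure's own method. It is assumed, purely for illustration, that characters are 7-bit ASCII and that the string is read as the digits of a base-128 fraction. The names StringAsFraction, BASE and toFraction are hypothetical.

```java
// Hypothetical sketch: map a string to a double so that strings which sort
// near each other also plot near each other. Not the disclosure's own mapping.
class StringAsFraction {

    // Assumed alphabet size: 7-bit ASCII only (an assumption of this sketch).
    static final int BASE = 128;

    // Reads the characters of s as successive digits of a base-128 fraction:
    // value = c0/128 + c1/128^2 + c2/128^3 + ...
    static double toFraction(String s) {
        double value = 0.0;
        double weight = 1.0 / BASE;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= BASE) {
                throw new IllegalArgumentException("non-ASCII character at index " + i);
            }
            value += c * weight;
            weight /= BASE;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(toFraction("abc"));
        System.out.println(toFraction("abcd"));
        System.out.println(toFraction("abd"));
    }
}
```

     Under these assumptions "abc" maps to a value slightly below "abcd", which in turn lies below "abd", so lexicographic order and closeness are reflected on the plot axis. A double carries only 53 bits of precision, so with 7 bits per character only about the first seven characters influence the result; longer strings sharing a prefix of that length would no longer be distinguished by this sketch.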

Every string is assigned a unique f...