Method to detect duplicate files stored in different encodings

Data can be stored in different encoding schemes such as Unicode, EBCDIC, and ASCII. If the same data is stored in a database under different encoding schemes, storage is duplicated and, more importantly, the performance and accuracy of database searches suffer. This duplication can easily be avoided by checking the encoding scheme of all data before storing it in the database.
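As an illustration only, the check before storing could look like the following Python sketch. It assumes that the encoding of each incoming file is already known (for example, declared by the uploader); detecting an unknown encoding is outside the sketch. The names content_key and DedupStore are hypothetical, and the NFC normalization and SHA-256 hash are choices made for the example, not part of the described method.

```python
import hashlib
import unicodedata


def content_key(raw: bytes, encoding: str) -> str:
    """Decode raw bytes to Unicode text and return a hash of that text.

    Assumption: `encoding` is the correct, known encoding of `raw`.
    """
    text = raw.decode(encoding)
    # Normalize so that equivalent Unicode sequences compare equal.
    text = unicodedata.normalize("NFC", text)
    # Hash a single canonical byte form (UTF-8) of the decoded text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class DedupStore:
    """Hypothetical in-memory stand-in for the database."""

    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}

    def add(self, name: str, raw: bytes, encoding: str) -> bool:
        """Store the file unless its decoded content is already present.

        Returns True if the file was stored, False if it was a duplicate.
        """
        key = content_key(raw, encoding)
        if key in self._by_key:
            return False
        self._by_key[key] = name
        return True


if __name__ == "__main__":
    store = DedupStore()
    message = "Hello, world"
    # cp037 is one of Python's built-in EBCDIC codecs.
    print(store.add("a.txt", message.encode("ascii"), "ascii"))
    print(store.add("b.txt", message.encode("cp037"), "cp037"))
    print(store.add("c.txt", message.encode("utf-16"), "utf-16"))
```

The three byte strings in the usage example differ at the binary level, yet only the first is stored, because all three decode to the same Unicode text.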

of the files in a database. With the new mechanism, the database administrator can get rid of much more unwanted data simply by carrying out a conversion to Unicode before the regular comparison.

The flow would be like this.


In this age of data explosion, the amount of data stored in databases grows exponentially year by year. However, many of the stored files are essentially the "same": their binary content differs, but once the content is converted to Unicode they may be exactly the same or 90% the same. This is especially common in databases behind websites that reward subscribers for uploading more data. The situation not only wastes a lot of storage but also affects users who search for files in those databases. With this new method, we check the content of a file, decode it to Unicode, and then compare it with the existing files. If the d...
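The "exactly the same or 90% the same" comparison on decoded content can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the encoding of each file is known, uses Python's standard difflib.SequenceMatcher as one possible similarity measure, and the function names and the 0.9 default threshold are hypothetical choices for the example.

```python
from difflib import SequenceMatcher


def similarity(raw_a: bytes, enc_a: str, raw_b: bytes, enc_b: str) -> float:
    """Return a similarity ratio in [0.0, 1.0] for two files after decoding.

    Assumption: enc_a and enc_b are the correct encodings of the inputs.
    """
    text_a = raw_a.decode(enc_a)
    text_b = raw_b.decode(enc_b)
    # autojunk=False disables difflib's heuristic for long inputs so the
    # ratio reflects every character of both texts.
    return SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()


def is_probable_duplicate(raw_a: bytes, enc_a: str,
                          raw_b: bytes, enc_b: str,
                          threshold: float = 0.9) -> bool:
    """Flag two files as duplicates when their decoded text is similar enough."""
    return similarity(raw_a, enc_a, raw_b, enc_b) >= threshold


if __name__ == "__main__":
    report = "Quarterly sales report for the northern region."
    stored = report.encode("cp037")      # existing file, EBCDIC
    upload = report.encode("utf-16")     # new upload, same text
    print(is_probable_duplicate(stored, "cp037", upload, "utf-16"))
```

Comparing every upload against every stored file this way scales poorly; in practice an exact-match hash of the decoded text could serve as a first pass, with the similarity ratio reserved for the remaining candidates.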