sqlite3_unicode update to UNICODE v5.2.0

Lately I have been hyped about encryption and specifically sqlite3 database encryption,  so I decided to update my test application with the  latest version of sqlite3 (v3.6.23.1 at the time) and started reading segments of sqlite3 code to get a feeling how everything relating to encryption is glued together. Good thing I had located SQLCipher during my search and got a heads up.

In any case, while testing the encryption / decryption functionality, I identified some database integrity problems arising from indexes created with my slightly modified sqlite3_unicode library. A bug in the collation function would cause the string comparison function to return incorrect results both for UTF-8 and UTF-16 comparisons, and as such would report errors for missing rows in the indexes during the integrity check. This was mainly due to the fact that the string size parameter provided by sqlite3 is actually the number of bytes and not the string length.

On another note UNICODE v5.2.0 has been released for some time now, but I kept postponing any update to sqlite3_unicode library since there was no need. So with the release of the above bug fixes I decided to update the library unicode tables as well.

The update to v5.2.0.15 is recommended for users of all prior versions, to eliminate database integrity errors arising by indexing tables using prior versions of this library. Tables having indexes using the sqlite3_unicode collation will have to be reindexed using the revised string comparison function. The updated source is included for grabs in the download section )

sqlite3_unicode updated for SQLite3 v3.6.7

Since v3.6.1, SQLite3 amalgamation package includes the ICU (International Components for Unicode) source code to make it easier to build SDLite3 with the ICU extension enabled, see [check-in].

This change has broken sqlite3_unicode source code and could no longer be automatically loaded by hijacking the ICU start-up function. For this reason sqlite3_unicode has been updated (to build 10) which is fully compatible with SQLite v3.6.7, but must be manually loaded using custom exported functions, which are defined in a separate sqlite3_unicode.h header file, and explained below.

This version has been tested to build under Microsoft Windows and Ubuntu 8.10 Intrepid and work flawlessly either as a static or shared library.

A bug was fixed in the NOCASE collation source that prevented proper sorting of UTF-8 encoded strings.
NOCASE collation now works for both UTF-8 and UTF-16 encoded strings, but unlike ICU it does not take into account any language specific sorting logarithms.

To activate unicode functionality the following function must be called during application initialization sequence: sqlite3_unicode_load();
The function does not take any arguments and simply calls sqlite3_auto_extension(); function to load the extension.

Similarly before application exit the following function must be called to cleanup memory and unload extension: sqlite3_unicode_free();
The function does not take any arguments and simply calls sqlite3_reset_auto_extension(); function to unload the extension.

The library has been thouroughly tested for Greek language through a small greek-english dictionary project, designed to test:

  1. the case folding conversion,
  2. the LIKE operation in a SELECT statement, and
  3. the NOCASE collation (using case folding and unaccenting of greek strings to enable proper sorting).

Download the implementation file sqlite3_unicode.c

SQLite and native UNICODE LIKE support in C/C++

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.

SQLite is the most widely deployed SQL database engine in the world. It is used in countless desktop computer applications as well as consumer electronic devices including cellphones, PDAs, and MP3 players. The source code for SQLite is in the public domain.

Please, rest assured that SQLite does indeed have UNICODE (UTF-8, UTF-16) support, but…

If you have searched and resulted in reading this post, then you have probably realized the true facts of life; the limitations of SQLite in regards to native non case-sensitive UNICODE text comparison and especially the use of the LIKE operator which has been crippled.

Read the rest of this entry »