SQLite and native UNICODE LIKE support in C/C++
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.
SQLite is the most widely deployed SQL database engine in the world. It is used in countless desktop computer applications as well as consumer electronic devices including cellphones, PDAs, and MP3 players. The source code for SQLite is in the public domain.
Please, rest assured that SQLite does indeed have UNICODE (UTF-8, UTF-16) support, but…
If you have searched and resulted in reading this post, then you have probably realized the true facts of life; the limitations of SQLite in regards to native non case-sensitive UNICODE text comparison and especially the use of the LIKE operator which has been crippled.
Traditionally non case-sensitive string comparison, would mean that, strings being compared would be transformed into lowercase before comparison, and then an incremental binary comparison loop wound be made for each byte constituting the strings to determine whether the strings were identical or not.
Unfortunately SQLite uses C/POSIX functions tolower() and toupper() to make string transformations and subsequently non case-sensitive string comparisons, which have no native UNICODE equivalents but instead are locale-specific.
As specified in MSDN library :
The case conversion of
tolower()is locale-specific. Only the characters relevant to the current locale are changed in case. The functions without the _l suffix use the currently set locale. The versions of these functions with the _l suffix take the locale as a parameter and use that instead of the currently set locale.
Therefore, a problem arises where locale-specific case mappings do not agree with UNICODE case mappings for each language family character.
ICU people have faithfully followed the UNICODE standard and produced open-source libraries, which have been also used by SQLite to optionally provide native UNICODE support. The disadvantage of this library is that it is too darn big ~10mb of compiled binary libraries to be used in association with SQLite. Even by omitting features I couldn’t get it to build to a sensible binary size
added the fact that i despise having dependencies in dynamically linked libraries that i have to ship as well.
Therefore i decided to extract the case folding tables from the UNICODE standard, similarly to what SQLite developers have done for lowercase mappings of the ASCII table, and implement the functionality in SQLite.
The following file may be build as a separate dynamic library or a static library to be compiled directly in your SQLite compilation.
It uses the already existing ICU infrastructure built in SQLite in order to unleash its power.
Build with SQLITE_CORE and SQLITE_ENABLE_ICU preprocessor definitions.
Hopefully there are *no significant* errors in the code, if you find one please leave a comment below.
Download the implementation file
sqlite3_unicode.c
v5.1.0.4
Updated today to include a NOCASE collation override, for sorting unicode strings.
nocase collation does not always work corrrectly.
example:
sqlite> create table song (name text,id number);
sqlite> insert into song values(’Paw’,1);
sqlite> insert into song values(’Pallas’,2);
sqlite> select * from song;
Paw|1
Pallas|2
sqlite> select * from song order by name;
Pallas|2
Paw|1
sqlite> select * from song order by name collate nocase;
Paw|1
Pallas|2
this fixes the problem:
SQLITE_PRIVATE int sqlite3StrNICmp(const char *zLeft, const char *zRight, int N){
unsigned char *a = (unsigned char *)zLeft;
unsigned char *b = (unsigned char *)zRight;
signed int ua = 0, ub = 0;
while(N– && ua==ub)
{
ua = unicode_fold(sqlite3Utf8Read(a, 0, &a));
ub = unicode_fold(sqlite3Utf8Read(b, 0, &b));
if(!ua || !ub)
break;
}
return ua - ub;
}
and works even if Zleft and zRoght are 0 terminated and shorter than N.
Thank you for noticing.
I have updated the file that fixes the above issue and incremented the file to rev 7
Do LIKE and GLOB support indexed search optimization for Unicode as well?
Yes indexed search optimizations should work for Unicode as well as long as you don’t mask your column names with any functions (eg. unaccent(column_name) which overrides the optimization by default)
How to use it ?
I have AMALGAMATION version
Just add the file in your project and compile the project with the processor definitions below
SQLITE_CORE and SQLITE_ENABLE_ICU
That should be enough.
(I personally prefer to build SQLite in a separate project that produces a library file which is statically linked by my executable project)