matt ryall’s weblog

Fun for the whole family since 2002.

The infamous Turkish locale bug

11 February 2009

I discovered a quirky comment today in Confluence’s Permission.forName(String) method:

// use the english locale to avoid the infamous turkish locale bug
String upperName = permissionName.toUpperCase(Locale.ENGLISH);

Naturally the question popped into my mind: what is the ‘infamous Turkish locale bug’? Looking into the JIRA issues related to the commit (CONF-5931, CONF-7168), I found a link Agnes put to this article about a common Java bug in the Turkish locale: Turkish Java Needs Special Brewing.

In the Turkish alphabet there are two letters for ‘i’, dotless and dotted. The problem is that each keeps its form when the case changes: the dotless lowercase ‘ı’ becomes the dotless uppercase ‘I’, and the dotted lowercase ‘i’ becomes the dotted uppercase ‘İ’. At first glance this wouldn’t appear to be a problem; however, the problem lies in what programmers do with upper- and lowercase in their code.

The two lowercase letters are \u0069 ‘i’ and \u0131 ‘ı’ (dotless ‘i’) and are totally unrelated. Their uppercase versions are \u0130 ‘İ’ (capital letter ‘I’ with dot above it) and \u0049 ‘I’, respectively. The issue is that this behavior does not occur in English, where the single lowercase dotted ‘i’ becomes an uppercase dotless ‘I’.
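As a rough illustration of these mappings, the sketch below asks Java for the case conversions under an explicit Turkish locale and under Locale.ENGLISH. The class name TurkishCaseDemo and its two helper methods are made up for this example; the only library calls are String.toUpperCase(Locale), String.toLowerCase(Locale) and Locale.forLanguageTag.

```java
import java.util.Locale;

class TurkishCaseDemo {
    // Explicit Turkish locale, so the demo does not depend on the JVM default.
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Hypothetical helpers, named for this example only.
    static String upperTurkish(String s) {
        return s.toUpperCase(TURKISH);
    }

    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    public static void main(String[] args) {
        // Turkish rules: each letter keeps its dot, or its lack of one.
        System.out.println(upperTurkish("i").equals("\u0130")); // true (dotted capital, U+0130)
        System.out.println(upperTurkish("\u0131").equals("I")); // true (dotless pair)
        System.out.println(lowerTurkish("I").equals("\u0131")); // true (dotless small, U+0131)

        // English rules: the dotted small letter pairs with the dotless capital.
        System.out.println("i".toUpperCase(Locale.ENGLISH).equals("I")); // true
        System.out.println("I".toLowerCase(Locale.ENGLISH).equals("i")); // true
    }
}
```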

With String.toUpperCase(), most Java programmers are trying to neutralize case. Consider a HashMap with string keys in which you want to look up a key. If you want to ignore case, you’ll probably uppercase everything going into the map, both its entries and the string you’re doing the lookup with. This works fine for English, but not for Turkish, where the dotted ‘i’ uppercases to the dotted ‘İ’ rather than ‘I’.
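A minimal sketch of that failure mode follows. The command registry, its entries, and the class and method names are invented for illustration; the point is only that a key registered as ‘QUIT’ can no longer be found via ‘quit’ once both sides are uppercased with Turkish rules.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CaseFoldedLookup {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Hypothetical registry: keys are uppercased with the given locale on the way in.
    static Map<String, String> buildCommands(Locale locale) {
        Map<String, String> commands = new HashMap<>();
        commands.put("QUIT".toUpperCase(locale), "exit the program");
        commands.put("Help".toUpperCase(locale), "show help");
        return commands;
    }

    // The lookup key is uppercased with the same locale before the map is consulted.
    static String find(Map<String, String> commands, String name, Locale locale) {
        return commands.get(name.toUpperCase(locale));
    }

    public static void main(String[] args) {
        // English rules: "quit" and "QUIT" both normalize to "QUIT".
        System.out.println(find(buildCommands(Locale.ENGLISH), "quit", Locale.ENGLISH)); // exit the program

        // Turkish rules: "QUIT" is unchanged, but "quit" gains a dotted capital, so the lookup misses.
        System.out.println(find(buildCommands(TURKISH), "quit", TURKISH)); // null

        // Keys without the letter 'i' are unaffected.
        System.out.println(find(buildCommands(TURKISH), "help", TURKISH)); // show help
    }
}
```

In a real application the locale argument is usually implicit (the JVM default), which is why the bug tends to appear only on machines configured for Turkish.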

This is a nice example of where you need to be very careful how you handle upper- and lower-casing in your application. Changing the word ‘quit’ to uppercase in the Turkish locale will result in ‘QUİT’, not ‘QUIT’. I’ve heard of other examples where the German ß (sharp ‘s’) doesn’t behave exactly as English speakers would expect either.
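Both cases can be checked with a few lines of Java. This is only a sketch: the class name SharpSDemo, the helper method and the sample word are my own choices, and the sharp ‘s’ is written as an escape to keep the source ASCII-only.

```java
import java.util.Locale;

class SharpSDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Hypothetical helper: uppercase with fixed English rules.
    static String upperEnglish(String s) {
        return s.toUpperCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // Turkish rules turn the 'i' in "quit" into the dotted capital (U+0130).
        String quit = "quit".toUpperCase(TURKISH);
        System.out.println(quit.equals("QUIT"));      // false
        System.out.println(quit.equals("QU\u0130T")); // true

        // The German sharp s (U+00DF) uppercases to two letters, "SS".
        String street = "stra\u00dfe";
        String upper = upperEnglish(street);
        System.out.println(upper);                                     // STRASSE
        System.out.println(street.length() + " -> " + upper.length()); // 6 -> 7

        // The conversion does not round-trip: lowercasing gives "strasse".
        System.out.println(upper.toLowerCase(Locale.ENGLISH).equals(street)); // false
    }
}
```

So uppercasing can change the length of a string as well as its letters, which matters for any code that assumes the two forms line up character by character.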

There are two ways to properly perform a case-insensitive comparison of Strings in Java in any locale:

  • (preferred) use String.equalsIgnoreCase()
  • use a fixed locale (like Locale.ENGLISH) as an argument to String.toUpperCase(Locale) or String.toLowerCase(Locale).
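Here is a small sketch of both options, run with the JVM default locale temporarily switched to Turkish to simulate an affected machine. The class and method names are hypothetical; String.equalsIgnoreCase, String.toUpperCase(Locale) and Locale.setDefault are the standard calls being exercised.

```java
import java.util.Locale;

class CaseInsensitiveCompare {
    // Option 1 (preferred): compare directly, with no locale involved.
    static boolean sameIgnoringCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    // Option 2: normalize with a fixed locale, e.g. before using the value as a map key.
    static String normalize(String s) {
        return s.toUpperCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        Locale original = Locale.getDefault();
        Locale.setDefault(Locale.forLanguageTag("tr-TR")); // simulate a Turkish machine
        try {
            // The no-argument overload follows the default locale and breaks.
            System.out.println("quit".toUpperCase().equals("QUIT")); // false

            // Both suggested approaches are unaffected by the default locale.
            System.out.println(sameIgnoringCase("quit", "QUIT"));    // true
            System.out.println(normalize("quit").equals("QUIT"));    // true
        } finally {
            Locale.setDefault(original); // always restore the default
        }
    }
}
```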

You can also use Character.toLowerCase() or Character.toUpperCase() to derive a locale-independent case-insensitive String value. This was the solution used in a recent (and still unreleased) fix for the same problem in the Commons Collections CaseInsensitiveMap.
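A per-character conversion along those lines might look like the sketch below. It is my own illustration, not the actual Commons Collections code, and the names PerCharacterFold and foldCase are made up. It walks the string by code point and relies only on Character.toLowerCase(int), which takes no locale.

```java
import java.util.Locale;

class PerCharacterFold {
    // Hypothetical helper: lowercases each code point with the locale-independent Character mapping.
    static String foldCase(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            out.appendCodePoint(Character.toLowerCase(codePoint));
            i += Character.charCount(codePoint); // step over surrogate pairs correctly
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Locale original = Locale.getDefault();
        Locale.setDefault(Locale.forLanguageTag("tr-TR")); // simulate a Turkish machine
        try {
            // The result does not depend on the default locale.
            System.out.println(foldCase("QUIT").equals("quit"));           // true
            System.out.println(foldCase("Quit").equals(foldCase("qUIT"))); // true
        } finally {
            Locale.setDefault(original);
        }
    }
}
```

The folded value is suitable as a map key; it is not meant for display, since it ignores every locale’s casing conventions by design.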

 
Posted by James Roper at 2009-02-11 09:59:07
I reckon the bug is in the character set. If i capitalises to İ, then clearly, it is a different character to the i that capitalises to I, especially if there is also another character ı that capitalises to I. If the 3 different i’s were all treated separately, we wouldn’t have this problem.
 
