Update: I haven’t tested it fully, but @eevee has a shorter suggestion that seems to work:
Tell the mysql driver that you’re getting BINARY back, then use .force_encoding(“UTF-8”)?
In a Rails application you can do this by setting encoding: binary in your database.yml file (I’m also using the mysql2 driver, not sure if that makes a difference). It seems to work (tested in Ruby 1.9.2) and it makes more sense, but I’ll update if I come across any issues!
Original post follows.
Recently I had to deal with a MySQL database where the character encoding was set to latin1 (the default) but the bytes sent to it were UTF-8. When this happens you get horribly messy strings coming out of your database if you try and access it with anything sane. For example “ümlaut” becomes “ümlaut”. What a mess!
In any normal situation you would fix the data storage itself, but I had no control over the database so I had to work out another way of fixing it on my side. And I managed to do it!
Some background
Latin1 is a character encoding used by MySQL. People mistakenly think it is equivalent to ISO-8859-1 but it isn’t, it’s actually CP-1252 (also known as Windows-1252). CP-1252 is a superset of ISO-8859-1 with some additional characters (recently including the € symbol).
We can simulate what happens in MySQL when you forcibly send it UTF-8 with the following Ruby snippet:
1 2 | "ümlaut".force_encoding("cp1252").encode("UTF-8") # => "ümlaut" |
Here, we start with UTF-8, then we tell Ruby that it’s actually CP-1252. It isn’t, so that just results in garbage. We then ask Ruby to give us that encoded back in to UTF-8.
When I realised this (thanks to Yehuda Katz for some pointers in his blog post about Ruby 1.9 Encodings) I tried running the reverse process against the text I was trying to fix:
1 2 | "ümlaut".encode("cp1252").force_encoding("UTF-8") # => "ümlaut" |
Success! However when I put that code into my application and ran it across a much larger set of data, I encountered a new problem:
1 2 3 4 | "(“What do you think?â€\u009D)".encode("cp1252") .force_encoding("UTF-8") # => Encoding::UndefinedConversionError: U+009D to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252 |
Clearly strings coming out of MySQL aren’t strictly CP-1252.
It turns out that Windows API methods (and thus, perhaps, others) will accept certain decimal code values for characters that aren’t strictly accepted in the CP-1252 standard. Just 5 of them: 81, 8D, 8F, 90, and 9D. Looking at the error above, 9D is referred to so that must be what is going on!
I’m not 100% sure, but what I suspect is happening is MySQL is happily storing the 9D byte it receives, which when read by Ruby as UTF-8 comes out as the Unicode byte character \u009D. But Ruby is strict about encoding CP-1252 and won’t encode that byte because it considers it invalid. However if you drop those invalid bytes the resulting text won’t come out properly, because that byte is actually part of a multibyte Unicode character, so needs to be passed across to the encoded string.
To do this, you can use the badly documented :fallback option in the encode method. :fallback takes a hash where the key is the invalid character and the value is the character to replace it with. We want to replace all instances of \u009D etc. with \x9D (an escaped single byte character). One final caveat is that for Ruby to accept that as a replacement character, you must use force_encoding("cp1252")
(essentially overriding Ruby’s lack of conversion). The resulting code looks like this:
1 2 3 4 5 6 7 8 9 10 11 | def fix_cp1252_utf8(text) text.encode('cp1252', :fallback => { "\u0081" => "\x81".force_encoding("cp1252"), "\u008D" => "\x8D".force_encoding("cp1252"), "\u008F" => "\x8F".force_encoding("cp1252"), "\u0090" => "\x90".force_encoding("cp1252"), "\u009D" => "\x9D".force_encoding("cp1252") }) .force_encoding("utf-8") end |
A bit messy, but it gets the job done. I believe Ruby 1.9.2 will not accept a dynamic hash in the :fallback option for some weird reason, but Ruby 1.9.3 will. Just bear that in mind.
In action:
1 2 | fix_cp1252_utf8("(“What do you think?â€\u009D)") # => "(“What do you think?”)" |
Hurrah!
One really important caveat: If you pass that method nice tidy UTF-8 it will screw up the text. I haven’t figured out a way of stopping that; so be careful!