UTF-8 encoding/decoding in C

Published 27 September, 2009 in Programming

I was working on a simple database-to-Excel-XML exporter the other day and decided to write it in C. The problem was that the Swedish language contains non-ASCII characters, so the output needs to be UTF-8 encoded. C doesn’t have a built-in function for this (or so it seems, I should add, since I’m a C rookie), and no matter how much I searched Google I couldn’t find anything useful. So I thought…

Look at PHP

…why not look at the source code of PHP and see how the PHP functions utf8_encode and utf8_decode are implemented? So I downloaded the PHP source and, with a little find . -name "*.c" -print | xargs grep "utf8_encode", I found the functions in xml.c. Thankfully they weren’t too complicated once dug out from the rest of the XML functions, so it didn’t take too long before I had them as standalone functions.

This is how they are used:

  #include "utf8.h"

  int main(int argc, char **argv)
  {
    /* iso_str is only ISO-8859-1 if this source file is saved in that encoding */
    char *iso_str = "Pontus Östlund";
    char *utf8_str;

    utf8_str = utf8_encode(iso_str);  /* ISO-8859-1 -> UTF-8 */
    iso_str = utf8_decode(utf8_str);  /* UTF-8 -> ISO-8859-1 again */

    return 0;
  }

And it seems to be working quite OK!
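
For the curious, the conversion itself is small. What follows is a minimal sketch of the idea, not the code taken from PHP's xml.c: the names latin1_to_utf8 and utf8_to_latin1 are made up for illustration, and it assumes the input is ISO-8859-1, where every byte value is also the Unicode code point. Bytes below 0x80 are copied as they are, and everything else becomes a two-byte UTF-8 sequence. Going the other way, replacing anything that doesn't fit in ISO-8859-1 with '?' is simply a choice made for this sketch.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (illustration only): converts a NUL-terminated
 * ISO-8859-1 string to UTF-8. Returns a malloc'ed string that the
 * caller must free, or NULL if the allocation fails. */
char *latin1_to_utf8(const char *in)
{
    size_t len = strlen(in);
    /* Worst case: every byte expands to two bytes, plus the terminator. */
    char *out = malloc(len * 2 + 1);
    char *p = out;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            *p++ = (char)c;                    /* ASCII: copied unchanged */
        } else {
            *p++ = (char)(0xC0 | (c >> 6));    /* lead byte: 0xC2 or 0xC3 */
            *p++ = (char)(0x80 | (c & 0x3F));  /* continuation: 10xxxxxx */
        }
    }
    *p = '\0';
    return out;
}

/* Hypothetical inverse (illustration only): converts UTF-8 back to
 * ISO-8859-1. Code points above U+00FF and malformed sequences are
 * replaced with '?'. The caller must free the returned string. */
char *utf8_to_latin1(const char *in)
{
    size_t len = strlen(in);
    char *out = malloc(len + 1);   /* output is never longer than input */
    char *p = out;
    size_t i = 0;

    if (out == NULL)
        return NULL;

    while (i < len) {
        unsigned char c = (unsigned char)in[i];
        /* in[i + 1] is at worst the terminating NUL, so this read is safe. */
        unsigned char next = (unsigned char)in[i + 1];

        if (c < 0x80) {
            *p++ = (char)c;                    /* ASCII: copied unchanged */
            i++;
        } else if ((c == 0xC2 || c == 0xC3) && (next & 0xC0) == 0x80) {
            /* Two-byte sequence encoding U+0080..U+00FF. */
            *p++ = (char)(((c & 0x03) << 6) | (next & 0x3F));
            i += 2;
        } else {
            /* Not representable in ISO-8859-1, or malformed input. */
            *p++ = '?';
            i++;
            while (i < len && ((unsigned char)in[i] & 0xC0) == 0x80)
                i++;                           /* skip continuation bytes */
        }
    }
    *p = '\0';
    return out;
}
```

With this sketch, the single ISO-8859-1 byte 0xD6 (Ö) in "Pontus Östlund" becomes the byte pair 0xC3 0x96 in UTF-8, and decoding turns it back into 0xD6. Both functions hand back malloc'ed memory, so the caller is responsible for calling free on the result.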

Sources at GitHub