UTF-8 manipulations using C++

I'm writing a program that manipulates HTML files encoded in UTF-8; in particular, they include Arabic characters. Copying entire files with the char type works just fine, but when I try to extract certain portions of the text, things get messy. I understand why that poses a problem, and I've done some research online, but I haven't been able to find a clear approach to dealing with UTF-8.

I'm writing C++ using the freely available Dev-C++ IDE, and the program is a console application.

I hope someone can help out...
closed account (z05DSL3A)
In what way do 'things get messy'?

By '...clear approach to deal with UTF-8', do you mean encoding and decoding UTF-8? If so, this may help:

http://www.codeproject.com/KB/string/UTF8.aspx

It may also be worth looking up the RFC for UTF-8. I think it is RFC 2279 (since obsoleted by RFC 3629); google it.
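If the mess comes from cutting a string in the middle of a multi-byte character, here is a minimal sketch of extracting text by code point instead of by byte. It relies on the fact that UTF-8 continuation bytes always look like 10xxxxxx. The function names (is_continuation, utf8_length, utf8_offset, utf8_substr) are made up for this example, and the code assumes the input is already valid UTF-8; it does no validation.

```cpp
#include <cassert>   // only needed if you add your own checks
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Number of code points in s, assuming s holds valid UTF-8.
// Every byte that is not a continuation byte starts a new code point.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            ++count;
        }
    }
    return count;
}

// Byte offset at which the code point with index cp_index starts,
// or s.size() if the string has fewer code points than that.
std::size_t utf8_offset(const std::string& s, std::size_t cp_index) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            if (seen == cp_index) {
                return i;
            }
            ++seen;
        }
    }
    return s.size();
}

// Substring of `count` code points starting at code point `start`.
// Both ends fall on character boundaries, so no character is split.
std::string utf8_substr(const std::string& s, std::size_t start, std::size_t count) {
    const std::size_t first = utf8_offset(s, start);
    const std::size_t last = utf8_offset(s, start + count);
    return s.substr(first, last - first);
}
```

For a four-letter Arabic word stored in 8 bytes, utf8_length returns 4 and utf8_substr(s, 1, 2) returns the 4 bytes of the second and third letters, whereas s.substr(1, 2) would cut through two letters. Note that Arabic diacritics are separate code points, so a code point is not always what a reader sees as one character.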
Here's a link that might help you.
http://evanjones.ca/unicode-in-c.html

Essentially, you should use std::wstring to handle all your strings, and set the locale to manage UTF-8 input and output.
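A rough sketch of that idea, using only the standard std::setlocale and std::mbstowcs calls. The helper names (use_utf8_locale, to_wide) and the list of locale names are assumptions for illustration: which UTF-8 locale names exist depends on the platform, and I can't promise any of them is available with the runtime that Dev-C++ ships. Also keep in mind that wchar_t is only 16 bits wide on Windows.

```cpp
#include <cassert>   // only needed if you add your own checks
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Tries a few common UTF-8 locale names and reports whether one was accepted.
// The names are platform-specific guesses, not a complete list.
bool use_utf8_locale() {
    const char* candidates[] = {"C.UTF-8", "en_US.UTF-8", "en_US.utf8", ".UTF-8"};
    for (const char* name : candidates) {
        if (std::setlocale(LC_CTYPE, name) != nullptr) {
            return true;
        }
    }
    return false;
}

// Converts a multibyte string to std::wstring using the current C locale.
// Returns false if the bytes are not valid in that locale's encoding.
bool to_wide(const std::string& bytes, std::wstring& out) {
    // Each wide character consumes at least one byte, so this is enough room.
    std::vector<wchar_t> buffer(bytes.size() + 1);
    const std::size_t n = std::mbstowcs(buffer.data(), bytes.c_str(), buffer.size());
    if (n == static_cast<std::size_t>(-1)) {
        return false;
    }
    out.assign(buffer.data(), n);
    return true;
}
```

Call use_utf8_locale() once at startup; if it returns true, to_wide() should turn UTF-8 bytes into one wchar_t per character for text such as Arabic, and ordinary std::wstring indexing and substr then work on whole characters.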

Hope this helps.