> function to convert text string to utf-8 encoded one?
No conversion is required; in a sequence of bytes (char), whether (a) each byte represents a distinct character, or (b) sub-sequences of one or more bytes represent a single character, is merely a matter of interpretation. In standard C++, this interpretation is usually done by the codecvt facet of the locale in effect.
In C++11, a plain string literal "hello\U000031F3" and a UTF-8 encoded string literal u8"hello\U000031F3" have the same type: both are arrays of const char (const char[N]).
#include <iostream>
#include <string>
#include <locale>
#include <fstream>

int main()
{
    // no conversion is required, just use std::string to hold the bytes in a multi-byte utf-8 string
    const std::string str = "a \U00000062"          // one byte (octet) each
                            " \U000000BE \u011C"    // two bytes each (one byte each for space)
                            " \u20AC \U000031F3" ;  // three bytes each (one byte for space)

    // input and output work as expected, if we set the stream's locale to a utf-8 locale
    // note: std::locale( name ) throws std::runtime_error if the named locale is not available
    std::cout.imbue( std::locale( "C.UTF-8" ) ) ; // set the stream's locale to UTF-8
    std::cout << str << '\n' ;

    std::ifstream this_file( __FILE__ ) ;
    this_file.imbue( std::locale( "C.UTF-8" ) ) ; // set the stream's locale to UTF-8
    std::string line ;
    for( int i = 0 ; i < 5 ; ++i ) // echo the first five lines of this source file
        if( std::getline( this_file, line ) ) std::cout << i << ". " << line << '\n' ;

    // set the default (global) locale if we want utf-8 for all new streams
    std::locale::global( std::locale( "C.UTF-8" ) ) ;
    // the newly constructed stream is imbued with the global locale
    std::ofstream( "test_utf8.txt" ) << "file test_utf8.txt: " << str << '\n' ;

    // however, string operations size(), [], substr() etc. operate on bytes and not utf-8 characters
    // and string iterators iterate over each byte, not each utf-8 character.
    std::cout << "size in bytes: " << str.size() << '\n' ; // size in bytes, not characters

    unsigned char c = str[6] ; // byte, not character
    std::cout << "byte at str[6]: " << std::hex << std::showbase << int(c) << '\n' ;

    // iterates over bytes, not characters (note: byte-order)
    for( unsigned char byte : str ) std::cout << int(byte) << ' ' ;
    std::cout << '\n' ;
}
C++11 does not have convenient mechanisms to access the individual utf-8 characters in a sequence of char, or to take care of byte-ordering and BOM markers seamlessly. There are many libraries floating around that make this possible; a library that uses idiomatic C++ constructs would make things easy.
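To illustrate the bytes-versus-characters point without a library, here is a minimal sketch that counts the characters (code points) in a UTF-8 string. The name utf8_length is made up for this example; it is not a standard function. It assumes the input is already well-formed UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the standard library).
// Counts UTF-8 code points by skipping continuation bytes, which always
// have the bit pattern 10xxxxxx. Assumes well-formed input; no validation.
std::size_t utf8_length( const std::string& s )
{
    std::size_t count = 0 ;
    for( unsigned char byte : s )
        if( ( byte & 0xC0 ) != 0x80 ) ++count ; // ASCII byte or lead byte of a sequence
    return count ;
}
```

For the string str in the program above (when the compiler stores narrow literals as UTF-8), str.size() is 17 bytes while utf8_length(str) would be 11 characters.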
Thanks to all of you for the attention and time you have given to my still-a-true-beginner-at-C++ topic.
I have some understanding of C, and some hobby practice - that's all.
C++ is new to me in all aspects, and I can't expect to run fast after reading one book on it and the tutorials on this site. Operator overloads and templates of functions and classes may be easy to swallow in general terms, but when I look at their real implementation in libraries and references, it is quite difficult at this point.
You can see now that I can hardly take good advantage of the solid and experienced advice you have given so generously. So maybe it is better to just explain what I am doing, so as not to confuse you further and waste your time, and after that to get back to the basics of the libraries and references here; I really can't learn that fast.
I got interested in building a program that helps me compose tracks in Google Earth and similar apps ( Oruxmap on Android ). My resource is a folder of kml files - simple utf-8 encoded tracks, built manually in Google Earth or recorded automatically while moving with a handheld smartphone with gps ( the app mentioned above does that quite well ). Opened together in Google Earth, all those tracks form a spider web. The goal was to automate composing a new track connecting any two nodes of my track net, under the simple criterion of minimizing the distance traveled.
So ... I registered here, downloaded MS Community 2013 and started coding, mostly the C way, without nested classes, just functions operating on a statically reserved database. I used some recursion and was glad of the result. It worked fine with the test database. The problem was importing real data, and exporting it after the manipulation.
I copy-pasted some code to help me read the directory with the files, then some more to decode their utf-8 to text ( big thanks to Duoas here! ), and finally I needed some code to encode the solution track back to utf-8 again ( the reason for this topic ).
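For reference, the encoding step mentioned here (one Unicode code point to UTF-8 bytes) can be sketched as below. This is only an illustration under stated assumptions, not the code used in the program: the name append_utf8 is made up, and the code point is assumed to arrive as a char32_t value.

```cpp
#include <string>

// Hypothetical helper (not part of the standard library).
// Appends the UTF-8 encoding of one Unicode code point to 'out'.
// Returns false and appends nothing for surrogates or values above U+10FFFF.
bool append_utf8( std::string& out, char32_t cp )
{
    if( cp < 0x80 ) {                               // 1 byte: 0xxxxxxx
        out += static_cast<char>( cp ) ;
    } else if( cp < 0x800 ) {                       // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>( 0xC0 | ( cp >> 6 ) ) ;
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) ) ;
    } else if( cp < 0x10000 ) {                     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if( cp >= 0xD800 && cp <= 0xDFFF ) return false ; // surrogates are not characters
        out += static_cast<char>( 0xE0 | ( cp >> 12 ) ) ;
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) ) ;
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) ) ;
    } else if( cp <= 0x10FFFF ) {                   // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>( 0xF0 | ( cp >> 18 ) ) ;
        out += static_cast<char>( 0x80 | ( ( cp >> 12 ) & 0x3F ) ) ;
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) ) ;
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) ) ;
    } else {
        return false ;                              // beyond the Unicode range
    }
    return true ;
}
```

Calling append_utf8( s, 0x20AC ) on an empty string s leaves the three bytes E2 82 AC in it, the same bytes the "\u20AC" literal produces in the program earlier in the thread.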
It turned out that on my machine sizeof() gives 1 for char, unsigned char and signed char, 2 for wchar_t and 4 for unsigned, and I spent some time on type and file conversion tactics. But the really good news was that the simple kml files I use consist of standard ASCII characters after all, that is, single utf-8 bytes that need no encoding or decoding - my program works quite well by my hobby standards, exporting the text solution file directly as a kml file. About two weeks for 500-600 lines of code :)
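The reason no conversion was needed can be checked directly: ASCII text is already valid utf-8, because every ASCII character is a single byte below 0x80. A minimal sketch of that check, with a made-up function name is_ascii:

```cpp
#include <string>

// Hypothetical helper (not part of the standard library).
// Returns true if every byte is below 0x80, i.e. the text is plain ASCII
// and can be written out as UTF-8 without any conversion.
bool is_ascii( const std::string& s )
{
    for( unsigned char byte : s )
        if( byte >= 0x80 ) return false ; // part of a multi-byte utf-8 sequence (or not utf-8 at all)
    return true ;
}
```

If this returns false for a file's contents, the bytes have to be treated as multi-byte sequences rather than one character per byte.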
Finally, I will keep following my technical curiosity ( not being a pro programmer ) and explore your advice and methods for character set conversion and string manipulation; I might need them for something else, and hard thinking is a pleasure sometimes, you know :)