From utf-8 string to words(substrings)

Jan 25, 2009 at 4:15pm

Hello all.
I have a UTF-8 (Unicode) string. The std::string line and strtok don't work with UTF-8 strings. Is there a way to extract the words from a UTF-8 line?

string line;
getline(fs8, line); // fs8 is the open file stream
....

Jan 25, 2009 at 4:25pm

Bazzy (6281)

I think you should use wide character strings ( wstring )

Jan 25, 2009 at 9:24pm

dkaip (196)

But how? I use Code::Blocks and GCC. wstring doesn't seem to have a getline function to read a line from a file, and there is no strtok for wide chars.
With string and getline I can read lines. The file is already UTF-8. What then?

Last edited on Jan 25, 2009 at 9:25pm

Jan 25, 2009 at 11:09pm

Bazzy (6281)

//To get a wide line
wfstream fs8;
wstring line;
getline( fs8, line);

//To store words in a vector from wide string
vector<wstring>words;
wstring::size_type pos;
while (true)
{
	pos = line.find(L' ');
	if ( pos != wstring::npos )
	{
		words.push_back(line.substr(0,pos));               
		line.erase(0,pos+1);//notice that this will modify your starting string                
	}else
	{
		words.push_back(line);
		break;
	}
}

Last edited on Jan 25, 2009 at 11:09pm

Jan 26, 2009 at 1:18am

Duthomhas (13253)

Last I checked, wstring doesn't do UTF-8. While STL streams are specifically designed to handle such things, the prescribed ones don't.

You need to convert the UTF-8 to standard wchar_t strings. It isn't actually too difficult, but if all you want is a quick answer, I recommend the GNU iconv library (libiconv):
http://www.gnu.org/software/libiconv/
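
A rough sketch of what the iconv route might look like. The helper name utf8_to_wstring_iconv is made up for illustration. It assumes that the iconv implementation accepts the encoding name "WCHAR_T" (GNU libiconv and glibc do) and that iconv() takes its input pointer as char** (some older libiconv headers declare const char** instead). With a standalone libiconv you would probably also need to link with -liconv.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from this thread): convert UTF-8 bytes to a
// std::wstring with iconv. Assumes the encoding name "WCHAR_T" is supported.
std::wstring utf8_to_wstring_iconv(const std::string& utf8)
{
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");   // arguments are (to, from)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // One wchar_t per input byte is always enough room.
    std::wstring out(utf8.size(), L'\0');

    char*  in_ptr   = const_cast<char*>(utf8.data());
    size_t in_left  = utf8.size();
    char*  out_ptr  = reinterpret_cast<char*>(&out[0]);
    size_t out_left = out.size() * sizeof(wchar_t);

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("invalid or truncated UTF-8 input");

    // Trim the part of the buffer that iconv did not fill.
    out.resize(out.size() - out_left / sizeof(wchar_t));
    return out;
}
```

If the input starts with a UTF-8 BOM, the result will most likely start with U+FEFF, which the caller would have to strip.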

Once your UTF-8 string data is converted to a wstring, you can then use all the usual find() methods and string functions like getline() over wstringstreams.
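
As a minimal sketch of that last step, assuming the line has already been decoded into a wstring: a wistringstream can split it into words. The name split_words is hypothetical. Only the whitespace known to the stream's locale counts as a separator, so this is a starting point rather than a complete tokenizer.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split an already-decoded wide string into words.
// operator>> skips leading whitespace, so runs of spaces do not produce
// empty words.
std::vector<std::wstring> split_words(const std::wstring& line)
{
    std::vector<std::wstring> words;
    std::wistringstream in(line);
    std::wstring word;
    while (in >> word)
        words.push_back(word);
    return words;
}
```

Unlike a find()/erase() loop, this leaves the original string untouched.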

Hope this helps.

[edit]
Hey, here's something that may be more useful:

http://www.icu-project.org/

[/edit]

Last edited on Jan 26, 2009 at 1:35am

Jan 26, 2009 at 7:15am

dkaip (196)

I have an ancient Greek text file, edited in Notepad on Windows XP. The file can be saved as UTF-8 or as Unicode, and Notepad++ can convert between the two very easily. Here is some code from user helios for the conversion.
But with wfstream fs8; wstring line; getline( fs8, line ); the file fs8 is already UTF-8. Isn't the line UTF-8 as well?
Is there a practical way to do this?

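#define BOM8A 0xEF
#define BOM8B 0xBB
#define BOM8C 0xBF

//Assumed definition: uchar is used below but was not defined in this snippet.
typedef unsigned char uchar;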

wchar_t *UTF8_to_WChar(const char *string){
	long b=0,
		c=0;
	if ((uchar)string[0]==BOM8A && (uchar)string[1]==BOM8B && (uchar)string[2]==BOM8C)
		string+=3;
	for (const char *a=string;*a;a++)
		if (((uchar)*a)<128 || (*a&192)==192)
			c++;
	wchar_t *res=new wchar_t[c+1];
	res[c]=0;
	for (uchar *a=(uchar*)string;*a;a++){
		if (!(*a&128))
			//Byte represents an ASCII character. Direct copy will do.
			res[b]=*a;
		else if ((*a&192)==128)
			//Byte is the middle of an encoded character. Ignore.
			continue;
		else if ((*a&224)==192)
			//Byte represents the start of an encoded character in the range
			//U+0080 to U+07FF
			res[b]=((*a&31)<<6)|(a[1]&63);
		else if ((*a&240)==224)
			//Byte represents the start of an encoded character in the range
			//U+0800 to U+FFFF
			res[b]=((*a&15)<<12)|((a[1]&63)<<6)|(a[2]&63);
		else if ((*a&248)==240){
			//Byte represents the start of an encoded character beyond the
			//U+FFFF limit of 16-bit integers
			res[b]='?';
		}
		b++;
	}
	return res;
}

Last edited on Jan 26, 2009 at 7:40am

Jan 26, 2009 at 10:42pm

Duthomhas (13253)

C++ streams have no concept of encoding characteristics --each element is considered an independent entity.

Hence, when you use any of the STL iostreams to read a UTF-8 sequence, it is not decoded into the proper characters. (Even the stinkin' wstream objects can't do that.)
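
To make that concrete, here is a small sketch with made-up names (utf8_length, sample). A std::string read from a UTF-8 file holds raw bytes, so size() reports bytes rather than characters. Counting every byte that is not a continuation byte gives the number of encoded characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the characters (code points) in UTF-8 text by
// counting every byte that is not a continuation byte (10xxxxxx).
std::size_t utf8_length(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++n;
    return n;
}

// The text "¿Qué?" spelled out as UTF-8 bytes, so the example does not
// depend on the encoding of the source file.
const std::string sample = std::string("\xC2\xBF") + "Qu" + "\xC3\xA9" + "?";
```

Here sample.size() is 7, while utf8_length(sample) is 5: the two accented characters take two bytes each.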

For example, if you save the following, using Notepad (or Notepad++, presumably) with "UTF-8" in the encoding combobox of the Save As dialogue, you will get a little UTF-8 file, including the obnoxious BOM that Windows programs add to UTF-8 files.

Hello world! What's up? ¡Hola mundo! ¿Qué pasa?

Here is an example of how to use C++ to convert such a file into a wchar_t stream (string or file).

// utf8-to-wchar_t.cpp
//
// This program is an example of how to read a UTF-8 encoded file into a 
// wchar_t sequence (be it a string or, as in this case, another file).
//

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
using namespace std;

//----------------------------------------------------------------------------
// Here's a little consumer-transformer following the STL design philosophy.
// Notice how, since UTF-8 is bound to specific bit-patterns, our types are
// only generic in what the input and output containers are.
//
// For more on the UTF-8 layout, see
//
//   http://en.wikipedia.org/wiki/Utf8
//
// Specifically,
//   0xxxxxxx                            --> 00000000 00000000 0xxxxxxx
//   110yyyyy 10xxxxxx                   --> 00000000 00000yyy yyxxxxxx
//   1110zzzz 10yyyyyy 10xxxxxx          --> 00000000 zzzzyyyy yyxxxxxx
//   11110www 10zzzzzz 10yyyyyy 10xxxxxx --> 000wwwzz zzzzyyyy yyxxxxxx
//
// Notice how the first form is identical to ASCII.
//
// This algorithm does NOT consider whether or not your wchar_t is large
// enough to hold a 21-bit character. (UTF-8 is specified over U+0000 to
// U+10FFFF. Compilers on Linux typically use a 32-bit wchar_t, but
// Windows compilers, including MinGW GCC, have a 16-bit wchar_t,
// truncating the range to U+0000 to U+FFFF.)
//
template <
  typename InputIterator,
  typename OutputIterator
  >
OutputIterator utf8_to_wchar_t(
  InputIterator  begin,
  InputIterator  end,
  OutputIterator result
  ) {
  bool at_start = true;       // a BOM only counts as the very first character

  for (; begin != end; ++begin)
    {
    int      count = 0;       // the number of bytes in the UTF-8 sequence
    unsigned c     = (unsigned char)*begin;
    unsigned i     = 0x80;

    // Resynchronize after errors (which shouldn't happen):
    // skip stray continuation bytes, but never read past the end.
    while ((c & 0xC0) == 0x80)
      {
      if (++begin == end) return result;
      c = (unsigned char)*begin;
      }

    // Now we count the number of bytes in the sequence...
    for (; c & i; i >>= 1) ++count;
    // ...and strip the high-code-bits from the character value
    c &= i - 1;

    // Now we build the resulting wchar_t by
    // appending all the character bits together
    for (; count > 1; --count)
      {
      if (++begin == end) return result;  // truncated sequence: drop it
      c <<= 6;
      c |=  (unsigned char)*begin & 0x3F;
      }

    // Skip the UTF-8 BOM (EF BB BF, which decodes to U+FEFF) that Windows
    // programs add. We test the decoded value rather than the byte 0xEF,
    // because 0xEF also starts every character from U+F000 to U+FFFF.
    // (Checking here also avoids the problems that iostream iterators
    // have with multiple data accesses.)
    bool is_bom = at_start && (c == 0xFEFF);
    at_start = false;
    if (is_bom) continue;

    // And we store the result in the output container
    *result = c;
    ++result;
    }

  // The usual generic stuff
  return result;
  }

//----------------------------------------------------------------------------
int complain( const char* filename, const char* method )
  {
  cerr
    << "I could not open the file \""
    << filename
    << "\" for "
    << method
    << endl;
  return 1;
  }

//----------------------------------------------------------------------------
// This little type is to help with actual wide streams (since the STL doesn't
// have any -- see widen() and narrow() for all the disappointing details).
//
struct widechar
  {
  typedef enum { big_endian, little_endian } endianness_t;
  unsigned value;

  widechar( unsigned value = 0 ): value( value ) { }

  static endianness_t endianness()                          { return e;       }
  static void         endianness( endianness_t endianness ) { e = endianness; }

  private: static endianness_t e; 
  };

widechar::endianness_t widechar::e = widechar::big_endian;

//............................................................................
ostream& operator << ( ostream& outs, widechar wc )
  {
  if (wc.endianness() == widechar::little_endian)
    for (int i = 0; i < 4; ++i)
      {
      outs << (char)(wc.value & 0xFF);
      wc.value >>= 8;
      }

  else
    for (int i = 24; i >= 0; i -= 8)
      {
      outs << (char)((wc.value >> i) & 0xFF);
      }

  return outs;
  }

//----------------------------------------------------------------------------
int main( int argc, char** argv )
  {
  // If necessary, give the user instructions
  if (argc < 3)
    {
    cout <<
      "Convert a UTF-8 file to a wchar file.\n"
      "usage:\n  " << argv[ 0 ] << " UTF8-FILENAME WCHAR-FILENAME\n";

    return 1;
    }

  // Otherwise, convert the named UTF-8 input file to the named wchar_t output
  ifstream inf(  argv[ 1 ], ios::binary );
  ofstream outf( argv[ 2 ], ios::binary );

  if (!inf)  return complain( argv[ 1 ], "reading" );
  if (!outf) return complain( argv[ 2 ], "writing" );
  inf >> noskipws;  // We want all data (including spaces, newlines, etc).

  // This will help on Win32; the command prompt will display a little-endian
  // stream correctly, but it will display a big-endian stream with some garbage.
  widechar::endianness( widechar::little_endian );

  outf << (widechar)0x0000FEFF;  // byte order mark

  // Here I use a iostream iterator directly, but any appropriate sequence
  // container will do. You can convert std::strings or whatever you like
  // in the usual way.
  //
  utf8_to_wchar_t(
    istream_iterator <char>     (inf),
    istream_iterator <char>     (),
    ostream_iterator <widechar> (outf)
    );

  outf.close();
  inf .close();

  //..........................................................................
  // Here's an example using a wstring sequence
  //
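  // Again, iostream_iterators play havoc with streams, so we just reopen
  // the file to play safe. (With pre-C++11 libraries, open() may not reset
  // the EOF state left by the first pass, so we clear it explicitly.)
  inf.clear();
  inf.open( argv[ 1 ], ios::binary );
  inf >> noskipws;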

  // For each line of text...
  string line;
  unsigned line_number = 1;
  while (getline( inf, line ))
    {
    // ...First convert it to a wstring
    wstring wline;
    utf8_to_wchar_t(
      line.begin(),
      line.end(),
      back_insert_iterator <wstring> (wline)
      );

    // Then see if it has the Spanish leading-question mark (¿) in it
    wstring::size_type index = wline.find( (wchar_t)0xBF );
    cout << "line " << line_number << ": ";
    if (index == wstring::npos)
      cout << "the upside-down question-mark does not appear in this line.\n";
    else
      cout << "the upside-down question-mark is at index " << (index + 1) << "\n";

    ++line_number;
    }

  inf.close();

  return 0;
  }

// end utf8-to-wchar_t.cpp

This code only converts UTF-8 to wchar_t; it does not go the other way.
If you want to convert wchar_t to UTF-8, it is very much the same process (though a bit easier, since the input stream is not coded).
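
For the reverse direction, a minimal sketch might look like the following. The names append_utf8 and wstring_to_utf8 are hypothetical. It assumes each wchar_t holds one complete code point, which is true for a 32-bit wchar_t. A 16-bit wchar_t would need its surrogate pairs combined first, which this sketch does not do.

```cpp
#include <string>

// Hypothetical counterpart to utf8_to_wchar_t: encode one code point
// (U+0000 to U+10FFFF) as 1 to 4 UTF-8 bytes appended to `out`.
void append_utf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80)
        out += static_cast<char>(cp);                          // 0xxxxxxx
    else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110yyyyy
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110zzzz
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110www
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole wide string, one wchar_t (assumed to be one code point)
// at a time.
std::string wstring_to_utf8(const std::wstring& w)
{
    std::string out;
    for (std::wstring::size_type i = 0; i < w.size(); ++i)
        append_utf8(out, static_cast<unsigned long>(w[i]));
    return out;
}
```

No validation is done here: values above U+10FFFF or lone surrogates are encoded as given.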

Hope this helps.

Jan 26, 2009 at 11:26pm

firedraco (6243)

That's pretty sweet! Thanks for the code/references :D

Also: int complain( const char* filename, const char* method ) = win.

Jan 27, 2009 at 7:07am

dkaip (196)

That helps very much. I have a lot of code to read through now. Afterwards I will upload all of this for everyone who has similar needs.
Thanks.
