Hm writing unicode into a .txt-file


Hello C++ Experts!

I'm a little confused because I think I missed something when it comes to writing unicode letters or signs into a .txt-file just like you can do with ASCII.

I'm trying to do something like this:

#include <iostream>
#include <fstream>

using namespace std;

int main()
{
	fstream f;

	wchar_t deva = L'\u090F';

	f.open("example.txt", ios::out);

	cout << "Welcome!!!\n\n";

	if (f.is_open())
		cout << "Writing textfile... Please Wait...\n\n";

	// write file here
	f << deva << "\n";

	f.close();

	cout << "Textfile created... Good Bye!\n\n";

	cout << endl;

	return 0;

}

What I want to do is take the hex values of some Unicode letters and write those letters to a text file. Right now the stream writes out the decimal value of my wchar_t variable, but I want the character itself (in this case a Devanagari letter) to be written to the file.

I assume I have to encode the value with the proper unicode library (or table) but I'm not sure if there is another (easier) way to do it.

Thx in advance,
- MrBr

Duthomhas (13310)

An fstream is instantiated over char, and a value like 0x90F won't fit into a char. You need to encode it properly using one of the UTF encodings.
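To make the size problem concrete, here is a minimal sketch. The helper name utf8_length is made up for illustration and is not part of any library. It reports how many bytes UTF-8 needs for a given code point, assuming the value is at most 0x10FFFF.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes UTF-8 uses for one code point.
// Assumes code_point <= 0x10FFFF; no validation is done here.
inline std::size_t utf8_length(unsigned long code_point)
{
    if (code_point < 0x80)    return 1;  // ASCII range: fits in a single char
    if (code_point < 0x800)   return 2;
    if (code_point < 0x10000) return 3;  // U+090F (decimal 2319) lands here
    return 4;
}
```

For U+090F this gives 3: the value is 2319 in decimal, well above what one 8-bit char can hold, and its UTF-8 form is the three bytes 0xE0 0xA4 0x8F.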

I suggest you use UTF-8. Here's a little snippet of code from a library I'm writing right now, actually:

#pragma once
#ifndef DUTHOMHAS_UTF_SEQUENCE_HPP_SNIPPIT
#define DUTHOMHAS_UTF_SEQUENCE_HPP_SNIPPIT

#include <ciso646>
#include <iterator>

#include <stdint.h>

//------------------------------------------------------------------------
typedef uint32_t uchar;

//------------------------------------------------------------------------
// Some special code points
//
enum
  {
  UREPLACEMENT_CHAR = 0x00FFFD,    // Special values used when
  UMAX_BMP_CHAR     = 0x00FFFF,    // processing Unicode
  UMAX_CHAR         = 0x10FFFF     //
  };

//------------------------------------------------------------------------
// Encode a CESU-8 character sequence.
//
// CESU-8 is not a Unicode-conformant encoder because it permits the
// encoding of high-surrogate and low-surrogate code points.
//
// It is otherwise identical to UTF-8.
//
template <typename OutputByteIterator>
OutputByteIterator
encode_cesu8( OutputByteIterator iter, uchar value )
  {
  static uchar8 mask [ 4 ] = { 0x7F, 0x1F, 0x0F, 0x07 };
  static uchar8 mark [ 4 ] = { 0x00, 0xC0, 0xE0, 0xF0 };
  static uchar8 shift[ 4 ] = {    0,    6,   12,   18 };

  if (value > UMAX_CHAR) value = UREPLACEMENT_CHAR;

  int count = (value < 0x80)    ? 0  // count == bytes to write - 1
            : (value < 0x800)   ? 1
            : (value < 0x10000) ? 2
            :                     3;

  *iter++ = ((value >> shift[ count ]) & mask[ count ]) | mark[ count ];
  switch (count)
    {
    case 3: *iter++ = ((value >> 12) & 0x3F) | 0x80;
    case 2: *iter++ = ((value >>  6) & 0x3F) | 0x80;
    case 1: *iter++ = ( value        & 0x3F) | 0x80;
    }

  return iter;
  }

//------------------------------------------------------------------------
// Encode a UTF-8 character sequence.
//
template <typename OutputByteIterator>
inline
OutputByteIterator
encode_utf8( OutputByteIterator iter, uchar value )
  {
  return encode_cesu8(
    iter, is_unicode( value ) ? value : UREPLACEMENT_CHAR
    );
  }

#endif

The next step is to transform your Unicode code point (the U+090F character) into a UTF-8 encoded byte sequence.

string utf8_string( uchar value )
  {
  string result;
  encode_utf8( back_inserter( result ), value );
  return result;
  }

Now you can use it very much as you were:
f << utf8_string( deva ) << "\n";
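For anyone who just wants to try the idea without the library pieces, here is a compact, self-contained sketch. It is not the library code above: the names to_utf8 and write_code_point are invented for this example, and it does no validation of surrogates or out-of-range values. The file is opened in binary mode so the bytes are written exactly as encoded.

```cpp
#include <fstream>
#include <string>

// Encode one code point as UTF-8 (illustrative only; the caller is
// assumed to pass a valid code point, nothing is checked here).
inline std::string to_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Write one code point followed by a newline to the given file.
// Returns false if the file could not be opened or written.
inline bool write_code_point(const char* path, unsigned long cp)
{
    std::ofstream f(path, std::ios::out | std::ios::binary);
    if (!f) return false;
    f << to_utf8(cp) << '\n';
    return f.good();
}
```

Calling write_code_point("example.txt", 0x090F) should leave the bytes E0 A4 8F 0A in the file. The letter itself only shows up if the editor opens the file as UTF-8.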

Hope this helps.

[edit] Almost forgot those special code point values...


MrBr (9)

Hi Duoas,

First of all, I appreciate your help.

Unfortunately I wasn't able to compile your code snippet because I'm using Visual Studio 2008 (Express Edition). I get the following error:

fatal error C1083: Cannot open include file: 'stdint.h': No such file or directory

Otherwise the code seems helpful to me, but I couldn't test it. Any ideas on how to solve the issue?

Thx in advance!

Duthomhas (13310)

Argh. Stupid MS. Try one of these:
http://en.wikipedia.org/wiki/Stdint.h#Downloads

Either that or change line 8 to #include <windows.h> and line 11 to typedef DWORD uchar;
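A third option, sketched here under the assumption that only Visual C++ releases older than 2010 (_MSC_VER below 1600) are missing stdint.h, is to let the preprocessor pick the typedef. The unsigned __int32 type is a Microsoft-specific extension.

```cpp
// Pick a 32-bit unsigned type without requiring <stdint.h> everywhere.
// Assumption: Visual C++ before 2010 (_MSC_VER < 1600) has no <stdint.h>.
#if defined(_MSC_VER) && (_MSC_VER < 1600)
typedef unsigned __int32 uchar;   // Microsoft-specific fixed-width type
#else
#include <stdint.h>
typedef uint32_t uchar;
#endif
```

This keeps the header building on newer compilers without pulling in windows.h.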

MrBr (9)

Okay, I changed lines 8 and 11 to the code you stated above.

My code looks like this (with utf8.h being the code snippet you gave me in your first post):

#include "stdafx.h"
#include "utf8.h"

#include <iostream>
#include <fstream>
#include <string>

using namespace std;

string utf8_string(uchar value);

int main()
{
	fstream f;

	uchar deva = 0x090F;  // U+090F, DEVANAGARI LETTER E

	f.open("example.txt", ios::out);

	cout << "Welcome\n\n";

	if (f.is_open())
		cout << "Writing textfile... Please Wait...\n\n";

	// write file here
	f << utf8_string(deva) << "\n";

	f.close();

	cout << "Textfile created... Good Bye\n\n";

	cout << endl;

	return 0;

}

string utf8_string( uchar value )
{
  string result;
  encode_utf8( back_inserter( result ), value );
  return result;
}

Unfortunately I still get a compiler error like this:

error C3861: 'is_unicode': identifier not found

although the previous issue seems to be resolved.

Any idea what I'm missing?
Thx a lot!


Duthomhas (13310)

Sorry. (It isn't easy to just chop pieces of code out...)

Hopefully this will get it all:

enum
  {
  SURROGATE_MASK = 0xFFFFFC00,
  HIGH_SURROGATE =     0xD800,    VALUE_OFFSET = 0x10000,
  LOW_SURROGATE  =     0xDC00,    VALUE_MASK   = 0x3FF,
  MAX_SURROGATE  =     0xDFFF
  };

//------------------------------------------------------------------------
// Verify that a given value is a legal Unicode code point.
//
inline bool is_unicode( uchar c )
  {
  return ((c < UMAX_CHAR)
     and ((c < HIGH_SURROGATE) or (c > MAX_SURROGATE))
     and ((c & UMAX_BMP_CHAR) != 0x00FFFF)
     and ((c & UMAX_BMP_CHAR) != 0x00FFFE));
  }
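To see what that check accepts and rejects, here is a stand-alone restatement for illustration only. The name is_unicode_sketch is made up, and the enum constants are written out as literals so the block compiles on its own. It uses <= for the upper bound, which gives the same results because 0x10FFFF is already rejected by the xxFFFF test.

```cpp
// Stand-alone restatement of the check above (illustrative, not library code).
inline bool is_unicode_sketch(unsigned long c)
{
    return (c <= 0x10FFFF)                      // inside the Unicode range
        && ((c < 0xD800) || (c > 0xDFFF))       // not a surrogate code point
        && ((c & 0xFFFF) != 0xFFFF)             // not a U+xxFFFF noncharacter
        && ((c & 0xFFFF) != 0xFFFE);            // not a U+xxFFFE noncharacter
}
```

With this, 0x090F and 0x10000 pass, while 0xD800, 0xFFFE, 0x10FFFF and 0x110000 are rejected, so encode_utf8 would substitute the replacement character for them.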

MrBr (9)

All right we're getting closer :)

I get some compiler errors because of undeclared identifiers:

error C2065: 'UMAX_CHAR' : undeclared identifier
error C2065: 'UMAX_BMP_CHAR' : undeclared identifier
error C2065: 'UMAX_BMP_CHAR' : undeclared identifier

-MrBr


Duthomhas (13310)

I gave those to you already (in the first post).

Good luck!

MrBr (9)

You're right, I'm sorry for not looking into it closely enough :)

If you don't mind me bothering you again: another error occurred while compiling. I hope this can be resolved quickly. There seems to be a missing definition for uchar8.

error C2146: syntax error : missing ';' before identifier 'mask'

Thx in advance for your precious help!

helios (17607)

typedef unsigned char uchar8;

MrBr (9)

It works - thank you !

Duthomhas (13310)

:-)
