Help changing utf8 tolower

Hi.
I have a problem changing UTF8 chars like Á É Ó É Í Ü Ú Ñ to lowercase. tolower doesn't work, I think because it only changes ASCII chars. I tried comparing the char with the hexadecimal code of Á and, if they are equal, changing it to á, but it doesn't work. I know that UTF8 special chars are encoded with 2 bytes. And the first of the chars I want to change is 0xC1.

#include <iostream>
#include <sstream>
#include <string.h>

using namespace std;

int main () {
unsigned char word;
cin >> word;
switch (word)
{
case 0x81 : word = 0xA1;
case 0xC1 : word = 0xC1;
case 0x89 : word = 0xA9;
case 0x8D : word = 0xAD;
case 0x93 : word = 0xB3;
case 0x9A : word = 0xBA;
case 0x9C : word = 0xBC;
case 0x91 : word = 0xB1;
default : word = tolower(word);
}
cout << word;
system("PAUSE");
return 0;
}

tolower works, but the other cases don't. Can somebody help me, please?

Thanks

TheIdeasMan (6817)

Firstly, always use code tags - select the code, then press the <> button on the right.

I am a bit confused about what works and what doesn't - your post is a little contradictory. Here is what I think may be a solution.

Try the std::tolower from <locale> (instead of the C tolower), which takes a locale argument, and see if it works for you.

http://www.cplusplus.com/reference/std/locale/tolower/

You will need to include <locale> to use it.

The only other small thing is that using the variable name word for something that is a char is not really logical, IMO.

HTH

Cubbi (4774)

There's a lot of confusion here.

UTF8 chars like Á É Ó É Í Ü Ú Ñ

These are not "UTF8 chars", these are just characters that aren't part of the ASCII character set. UTF-8 is one of the many ways to encode those characters in a computer.

tolower doesn't work, I think because it only changes ASCII chars

Which tolower()? C++ has three:
http://cplusplus.com/reference/std/locale/tolower/
http://cplusplus.com/reference/clibrary/cctype/tolower/
http://cplusplus.com/reference/clibrary/cwctype/towlower/

The first one works whether your characters are stored as wide (wchar_t) or narrow (char) characters, the second one only works for the narrow (char) form, and the third one only works for wide characters.

I know that UTF8 special chars are encoded with 2 bytes. And the first of the chars I want to change is 0xC1.

This is the main source of confusion, I feel. 0xC1 is the value of 'Á' in ISO8859-1, which is a single-byte character set. You can use your old tolower() just fine:

#include <iostream>
#include <clocale>
#include <cctype>

int main()
{
    std::setlocale(LC_ALL, "en_US.iso88591"); // only now Á is 0xc1
    unsigned char big = 0xc1;
    unsigned char small = std::tolower(big);

    std::cout << std::hex << "character code was "
              << +big << " became " << +small << '\n';
}

demo: http://ideone.com/qYhWLE

Now, in UTF-8, the character Á is indeed two bytes, but those bytes are 0xC3 0x81. In order to tolower() that, you will first have to convert it to a wide character representation (stored in a variable of type wchar_t, with the value 0x00c1) and then use tolower() or towlower().

#include <iostream>
#include <clocale>
#include <cwctype>
#include <cstdlib>

int main()
{
    std::setlocale(LC_ALL, "en_US.utf8");

    char utf8[] = {'\xc3', '\x81'};
    wchar_t big;
    std::mbtowc(&big, utf8, sizeof utf8);
// or just skip the whole utf8 conversion
//    wchar_t big = L'Á';

    wchar_t small = std::towlower(big);

    std::wcout << "Big: " << big  << '\n'
               << "Small: " << small << '\n';
}

demo: http://ideone.com/i1Hd7f

Folea (5)

Sorry for my contradictory post. In my program, tolower works.
I think it will work with wchar_t. I have to handle those special characters in UTF-8, where they are encoded with two bytes.

Thank you for the answers, I think my problem is solved.

Folea (5)

Hi. I have to read a string and change its uppercase chars to lowercase. I read the characters from the string one by one, but how can I change only the second byte of the special chars? For example, Á = 0xC3 0x81 should become á = 0xC3 0xA1.

I want to implement a function to do the changes, and I have to use UTF-8 encoding.

#include <iostream>
#include <sstream>
#include <string>
using namespace std;

int main () {
	int contador = 1;
	string frase, aux;
	while(getline(cin, frase)){
		cout << contador << ". " << frase << endl;
		contador++;
		istringstream in(frase);
		while(in >> aux){
			cout << aux << endl;
		}
	}
	return 0;
}

How can I change the second byte when I read a special char?

Thanks

andywestken (4094)

@Folea

Are you working with Linux? Or Windows? (system("PAUSE"); suggests the latter.)

And what IDE are you using? And what o/s? And which language does your o/s use?

Andy

Folea (5)

Andy
I am using Ubuntu. The system("PAUSE") in the first post is there because I first tried this on Windows, but I have to do it on Ubuntu.
I use C++. I compile with gpp.

Cubbi (4774)

Linux supports UTF-8 quite well; you should let the I/O streams do the conversions from UTF-8 to wide for you, that's what they are for:

#include <clocale>   // setlocale
#include <cwctype>   // towlower
#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main () {
    setlocale(LC_ALL, "en_US.utf8");
    wcin.imbue(locale());
    wcout.imbue(locale());

    wstring frase; // note: wstring, not string
    while (getline(wcin, frase)) {
        wcout << "entered: " << frase << '\n';
        // each element is now a whole character, not a single UTF-8 byte
        for (size_t n = 0; n < frase.size(); ++n)
            frase[n] = towlower(frase[n]);
        wcout << "lowercased: " << frase << '\n';
    }
}

online demo: http://ideone.com/agKtmR

Folea (5)

Thanks for all the posts, I finally got it to work.

Topic archived. No new replies allowed.