Problem reading odd characters from a file

Hey there, I have some files for input that look like this:
Carlitos, Gauleses , Terra das Vagabundas, SN., 350, 12, 5.83
EMPIRE STATE, Gauleses , EMPIRE STATE, FÊNIX™, 298, 12, 7.00
bigorna, Gauleses , Aldeia d bigorna, DDT@D, 318, 12, 10.44
akemif, Romanos , Aldeia Akemif, DDT@D, 19, 12, 13.04
Black Mason, Gauleses , Kindorim, DDT@D, 424, 12, 15.03
Nyrox, Gauleses , Lyontis, DDT*, 118, 12, 16.64
Black Mason, Gauleses , Katram, DDT@D, 166, 12, 16.76
MAXIMUSBRAGA, Romanos , 1° LEGIÃO, DDT*, 362, 12, 21.54
dericio, Gauleses , dericio, SN-SE™, 375, 12, 24.08
Manmaloqueiro, Romanos , {01} Bael, DDT@D, 330, 12, 26.40
Insano, Teutões , Spartacus-01, DDT@D, 111, 12, 29.00
norangers, Romanos , Rome_2B Rome_2A, , 252, 12, 34.66
Dekitor, Teutões , [00] C3, DDT*, 450, 13, 0.00
Macacosan, Gauleses , Macacópolis, SN-SE™, 372, 13, 5.00
DigBoy, Romanos , A -???? DigBoy ????, FÊNIX², 371, 13, 12.81
HR Luana Tsuki, Gauleses , Éden, DDT*, 39, 13, 14.87
mallcon, Romanos , Bonde do 157, DDT*, 367, 13, 15.13
Biilow, Romanos , 1, FÊNIX™, 242, 13, 18.03
Paulo_Ruan, Romanos , Paulo 000001, FÊNIX™, 229, 13, 20.52


I get this data from a site and copy-paste it into a .txt file for input (I do this myself, no program involved).
When the program runs, it ends up like this in NetBeans:
0 [main] program 2336 open_stackdumpfile: Dumping stack trace to program.exe.stackdump

When I type the input file myself it works fine. The problem only occurs when I copy the data from the site, so it looks like those strange characters are causing it (note that the ????? are some drawings in the text).
Can anyone point me in any direction for what I should do to know what's going on please?
You need to use wide characters to store those strange characters. I might be able to help more if you tell me what you're trying to do with your program.
Well, I only use the names and one of the numbers between the commas to build a map<string,int>.
This data is from an online game, and it changes daily on the site, so I create a second map and compare them to know what changed from one day to the next.
What exactly is wide characters?
A normal char can only store a-z and A-Z. It can't store any characters that have symbols above them or that are in a different language. You have to use a wide character to store those.

õ cannot be stored in a char, but it can be stored in a wide char.
hm, ok. I'm using string type though. My only option here is to go for wide char?
Ehhhhhh... I don't agree with some of the info in this thread. This is actually kind of a complex topic to understand... and unfortunately there's no "quick fix" that I know of that can get this working the way you want. Using wide characters will not solve your problem (I explain why below)




Let's start with some text basics. On computers, everything is represented by a number. Text is the same: each character is represented by a "character code".

Therefore text in a file is really just a series of numbers. When your program is reading that file, it's just reading the numbers... then each number is interpreted as a character in a string.

Depending on which text "encoding" your file uses... different numbers can represent different characters. ASCII is the de-facto standard that is used pretty much everywhere. It dictates that all numbers between 0x00 and 0x7F (0 - 127) represent basic symbols (like $, %, #, etc), English characters & numerals (A-Z, a-z, 0-9 etc), and a few special codes (like the newline character, carriage return, tab, etc).



In basic files... 1 byte in the file represents 1 number (and therefore 1 character). Since a byte is 8 bits, this means each number has a range of 0x00 - 0xFF. Note however, that ASCII only assigns characters to 0x00 - 0x7F. This means 0x80 - 0xFF are not assigned a character in ASCII.

This is where it gets confusing. Different character sets assign different characters to those numbers. And some (like UTF-8) use them as the building blocks of multi-byte characters (so no longer does 1 byte = 1 character!)
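To make the "1 byte is not always 1 character" point concrete, here is a minimal sketch that assumes the text is valid UTF-8. The helper name utf8_length is made up for illustration. It counts characters by skipping UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 characters (code points) in a byte string.
// Assumes the input is valid UTF-8; no validation is attempted.
std::size_t utf8_length(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char c : bytes) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a character.
        if ((c & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For example, "FÊNIX" saved as UTF-8 is the byte string "F\xC3\x8ANIX": std::string::size() reports 6 bytes, while utf8_length reports 5 characters, because Ê takes two bytes.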

In order to read and display non-ASCII text, you need to know a few things:

1) What encoding does your text file use? Copy/pasting non-ASCII text into an editor is all fine and good... but when you save that file in the editor it is deciding on an encoding and is writing the text appropriately. You need to figure out how it's saving it. (AFAIK, most good editors will either prompt you, or will default to UTF-8). For me to help with this point: What editor are you using to make this text file?

2) In your program, you need to know what encoding is used when you output text to the user. If you are using the Windows console... this is difficult -- I'm still unsure what encoding it uses. There are functions to change the encoding, but I can never seem to get them to work. For me to help with this point: How are you outputting this text? cout? Are you on Windows?

3) You need to either change the output encoding to match your file's encoding... or you need to convert the data from the file so that it matches the output encoding.
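For point 1, one rough way to guess how an editor saved a file is to look for a byte order mark (BOM) at the start. This is only a sketch: the names Bom and detect_bom are made up here, many UTF-8 files have no BOM at all, and a UTF-32 little-endian BOM would be misreported as UTF-16 LE by this simple check.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type for the sketch.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspects the first bytes of a file (passed in as a string) for a BOM.
// The absence of a BOM does not prove the file is plain ASCII.
Bom detect_bom(const std::string& head) {
    if (head.size() >= 3 && head.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return Bom::Utf8;     // EF BB BF
    if (head.size() >= 2 && head.compare(0, 2, "\xFF\xFE") == 0)
        return Bom::Utf16LE;  // FF FE
    if (head.size() >= 2 && head.compare(0, 2, "\xFE\xFF") == 0)
        return Bom::Utf16BE;  // FE FF
    return Bom::None;
}
```

To use it, open the file with ios::binary, read the first few bytes into a std::string, and pass that string to detect_bom.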



As you can see this is, unfortunately, not very straightforward.

This is also why simply using wide characters won't work. Wide characters just allow you to have a wider range of numbers for each character (ie: up to 0xFFFF instead of up to 0xFF)... but they do not solve the issue with encoding. If you read the file as wide characters, you'll either get garbage or will get the exact same thing.
Disch is right. I apologize for posting incomplete information.
Yeah, I was doing some research on wide char and it is kind of a big topic for something I'm not really interested in knowing about. lol
I'd like to point out that this program is for my personal use. I don't need to know the exact names of the players; the odd symbols in their names can be discarded, no problem.
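If the odd symbols really can be discarded, one simple approach is to drop every byte outside the ASCII range before using the name as a map key. This is a sketch under that assumption; strip_non_ascii is a made-up helper name, and two different names could collapse to the same key once their symbols are removed.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: keeps only bytes in the ASCII range (0x00 - 0x7F).
// Works on raw bytes, so it does not need to know the file's encoding,
// as long as that encoding is ASCII-compatible (UTF-8, Latin-1, ...).
std::string strip_non_ascii(const std::string& text) {
    std::string result;
    for (unsigned char c : text) {
        if (c < 0x80)
            result += static_cast<char>(c);
    }
    return result;
}
```

With UTF-8 input, "FÊNIX™" (bytes "F\xC3\x8ANIX\xE2\x84\xA2") becomes "FNIX". This does not work for UTF-16 files, where even plain English letters are stored with an extra zero byte.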

1) What encoding does your text file use? Copy/pasting non-ASCII text into an editor is all fine and good... but when you save that file in the editor it is deciding on an encoding and is writing the text appropriately. You need to figure out how it's saving it. (AFAIK, most good editors will either prompt you, or will default to UTF-8). For me to help with this point: What editor are you using to make this text file?

I thought saving with another encoding would work, so I tried every option I had in WordPad: .rtf, .docx, .odt, .txt, .txt with MS-DOS format and .txt with Unicode format.

2) In your program, you need to know what encoding is used when you output text to the user. If you are using the Windows console... this is difficult -- I'm still unsure what encoding it uses. There are functions to change the encoding, but I can never seem to get them to work. For me to help with this point: How are you outputting this text? cout? Are you on Windows?

I'm writing my output on another file. Though I believe the program doesn't even get there before failing.

I'm writing my output on another file.


Ah! That's good! Then none of this matters. You can just use the string without caring about the encoding!

Your crash must be related to something else... and not the character encoding. Can you post your program code?
This is really odd because I wrote my input files by hand and it worked. Here is my code. My variable names are in my native language (Portuguese); feel free to point out anything odd:

#include <iostream>
#include <vector>
#include <math.h>
#include <stdlib.h> //itoa
#include <cstdlib>
#include <cstring>
#include <sstream>
#include <map>
#include <fstream>
#include <cmath> //raiz
#include <string>

using namespace std;


int main() {
    int i,mudanca;
    string nomearq,linha,linha_aux,jogador;
    ifstream read;
    ofstream write;
    map<string,int> mapa_ontem;
    map<string,int> mapa_hoje;
    map<string,int> ::iterator itx;
    map<string,int> ::iterator ity;
    cout<<"Tabela de ontem?"<<endl;
//    cin>>nomearq;
    nomearq="ontem.txt";
    
    read.open(nomearq.c_str(), ios::in);
    if(read.is_open())
    {
        cout<<"Arquivo encontrado..."<<endl;
        while(!read.eof())
        {
            getline(read,linha);
                linha_aux=linha;
                linha.erase(linha.find_first_of(","),linha.size());
                jogador=linha;
                linha=linha_aux;
                for(i=0;i<5;i++)
                    linha.erase(0,linha.find_first_of(",")+1);
                linha.erase(linha.find_first_of(","),linha.size());
                mudanca=atoi(linha.c_str());
                mapa_ontem[jogador]=mudanca;
        } 
    }
    else
        cout<<"Arquivo nao encontrado. Recomece o programa."<<endl;
    read.close();

    cout<<"Tabela de hoje?"<<endl;
//    cin>>nomearq;
    nomearq="hoje.txt";
    
    read.open(nomearq.c_str(), ios::in);
    if(read.is_open())
    {
        cout<<"Arquivo encontrado..."<<endl;
        while(!read.eof())
        {
            getline(read,linha);
                linha_aux=linha;
                linha.erase(linha.find_first_of(","),linha.size());
                jogador=linha;
                linha=linha_aux;
                for(i=0;i<5;i++)
                    linha.erase(0,linha.find_first_of(",")+1);
                linha.erase(linha.find_first_of(","),linha.size());
                mudanca=atoi(linha.c_str());
                mapa_hoje[jogador]=mudanca;
        } 
    }
    else
        cout<<"Arquivo nao encontrado. Recomece o programa."<<endl;
    read.close();
    
    
    write.open("SAIDA: jogadores novos na lista de hoje.txt",ios::out); 
    ////////////////////////////////////////////////////////////////////
    write<<"Jogadores que evoluiram ontem, mas nao hoje(INATIVO):"<<endl;
    for(itx=mapa_hoje.begin();itx!=mapa_hoje.end();itx++)
    {        
        ity=mapa_ontem.find(itx->first);
        if((ity!=mapa_ontem.end()) && (ity->second>0) && (itx->second<=0))
            write<<"-"<<ity->first<<endl;    
    }
    write<<endl<<"Jogadores que nao evoluiram ontem, mas evoluiram hoje(ATIVO):"<<endl;
    for(itx=mapa_hoje.begin();itx!=mapa_hoje.end();itx++)
    {        
        ity=mapa_ontem.find(itx->first);
        if((ity!=mapa_ontem.end()) && (ity->second<=0) && (itx->second>0))
            write<<"-"<<ity->first<<endl;    
    }
    write<<endl<<"Jogadores que estao na lista de hoje, mas nao na de ontem:"<<endl;
    for(itx=mapa_hoje.begin();itx!=mapa_hoje.end();itx++)
    {        
        ity=mapa_ontem.find(itx->first);
        if(ity==mapa_ontem.end())
                write<<"-"<<itx->first<<endl;  // itx, not ity: ity==end() here and must not be dereferenced
    }
    /////////////////////////////////////////////////////////////////////
    

    write.close();
    
}
line 78: On Windows, filenames cannot contain colons. The colon is a reserved character (it marks the volume, as in C:). Remove that colon from the filename.
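A quick way to catch this kind of name before opening the file is to scan it for the characters Windows reserves in file names. This is a minimal sketch; has_windows_reserved_char is a made-up helper, and it is meant for a bare file name, not a full path (a path legitimately contains separators and a drive colon).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if a bare file name contains a character
// that Windows reserves in file names:  < > : " / \ | ? *
bool has_windows_reserved_char(const std::string& filename) {
    return filename.find_first_of("<>:\"/\\|?*") != std::string::npos;
}
```

The name used on line 78 fails this check because of the colon after SAIDA. Checking write.is_open() right after the open call is also worthwhile, since a failed open does not report anything by itself.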

line 63 and 68 will crash if a line of 'hoje.txt' does not contain any commas (ie: if no comma is found... then find_first_of will return npos... and you cannot give npos as the first param to erase; it throws std::out_of_range). Lines 37 and 42 have the same problem with 'ontem.txt'.
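One way to avoid the npos problem is to check every search result before using it and to skip lines that do not have the expected shape. The sketch below is not the original code; parse_line is a made-up helper that extracts the same two things the program uses (the text before the first comma, and the number after the fifth comma) and reports failure instead of throwing.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: fills 'player' with the text before the first comma
// and 'value' with the integer found after the fifth comma.
// Returns false (leaving 'value' untouched) if the line has too few commas.
bool parse_line(const std::string& line, std::string& player, int& value) {
    const std::string::size_type npos = std::string::npos;

    std::string::size_type comma = line.find(',');
    if (comma == npos)
        return false;                  // empty line or no commas at all
    player = line.substr(0, comma);

    std::string::size_type start = 0;
    for (int field = 0; field < 5; ++field) {
        if (comma == npos)
            return false;              // fewer than five commas
        start = comma + 1;             // first character of the next field
        comma = line.find(',', start);
    }

    // The field may be the last one on the line, so a missing sixth comma is accepted.
    const std::string::size_type len = (comma == npos) ? npos : comma - start;
    value = std::atoi(line.substr(start, len).c_str());
    return true;
}
```

Inside the read loop, a line would then be stored only when parse_line returns true, for example: if (parse_line(linha, jogador, mudanca)) mapa_hoje[jogador] = mudanca;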


I don't see anything else that could be causing a problem.


The file you posted in your original post... is that hoje.txt or ontem.txt? And can you post the other text file?
Hoje and ontem mean today and yesterday. They keep the same names and only some of the numbers change each day. So yeah, the post at the beginning shows both hoje.txt and ontem.txt; the only difference is some of the numbers between the commas.
http://travian.ws/analyser.pl?s=br1;q=40%2C-60%2C20&csv=1
here you can see where I get them from.
// don't loop on eof, you'll end too late
        while(!read.eof())
        {
            getline(read,linha);

//loop on the reading operation
while( getline(read,linha) ){

With that I could no longer reproduce your issue.
If you still have problems, run through a debugger and post the call stack when it crashes.
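The difference between the two loops can be seen by counting iterations over the same input. This is a small sketch with made-up function names; it assumes the input ends with a newline, which is how most editors save text files.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Loops on eof(): the flag is only set after a read has already failed,
// so the body runs one extra time when the input ends with a newline.
int count_with_eof_loop(std::istream& in) {
    int iterations = 0;
    std::string line;
    while (!in.eof()) {
        std::getline(in, line);
        ++iterations;
    }
    return iterations;
}

// Loops on the read itself: the body only runs when a line was actually read.
int count_with_getline_loop(std::istream& in) {
    int iterations = 0;
    std::string line;
    while (std::getline(in, line))
        ++iterations;
    return iterations;
}
```

For the two-line input "x,1\ny,2\n" the eof loop runs 3 times and the getline loop runs 2 times. On the extra pass the failed getline leaves the string empty, so find_first_of(",") returns npos and the erase call throws, which would explain the stack dump.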
line 63 and 68 will crash if 'hoje.txt' does not contain any commas (ie: if no comma is found... then find_first_of will return npos...and you cannot give npos as the first param to erase)

This fixed the problem. So odd, because I was pretty sure there was the same number of commas in each line.
I was so sure it was the odd characters. Thanks for the attention, guys!