input/output with special characters

Jun 3, 2012 at 3:22am

Sorry for the long post. I want to be as descriptive as possible.

My original task was to pull the title, authors, and companies from an online article. I ended up using URLDownloadToFile() to download the web page as a text file. Below is a portion of that text file with the authors and their respective companies:

<h3>Author(s):</h3>
<div class="prod-detail-authors">
<br />
<a href=/servlets/product?PROD_TYP=PAPER&ACN=79436359437&AUTHOR_NAME=Jesus+Benajes&PLA_SW=YES>

    	Jesus Benajes - CMT-Universitat Politècnica de València
</a>
<br />
<a href=/servlets/product?PROD_TYP=PAPER&ACN=79432535370&AUTHOR_NAME=Ricardo+Novella&PLA_SW=YES>

    	Ricardo Novella - CMT-Universitat Politècnica de València
</a>
<br />

    	Daniela De Lima - CMT-Universitat Politècnica de València
<br />
<a href=/servlets/product?PROD_TYP=PAPER&ACN=79493574699&AUTHOR_NAME=Vincent+Dugue&PLA_SW=YES>

    	Vincent Dugue - Renault
</a>
<br />

    	Nicolas Quechon - Renault

</div>

I have written a program that gets the authors and companies. URLDownloadToFile() creates a temporary file; I grab the necessary data from it using wide characters and wide strings (i.e. wifstream instead of ifstream, wstring instead of string, etc.), then output it to a csv file. The problem arises when an author's or company's name contains an accented character. For instance, in the text included above, the university name has an è character. In the csv file, the university name ends up looking like this:

CMT-Universitat PolitÃ¨cnica de ValÃ¨ncia
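
That pattern is what UTF-8 bytes look like when they are read as a single-byte code page such as Windows-1252 (the usual "ANSI" on Western Windows): è is U+00E8, which UTF-8 stores as the two bytes C3 A8, and those two bytes are Ã and ¨ in Windows-1252. Below is a minimal C++ sketch of that misreading. The helper names to_utf8 and misread_as_single_byte are hypothetical and are not taken from the program described in this post.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8.
// Sketch only: handles code points up to U+FFFF, which covers accented Latin letters.
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Treat every byte as its own character, the way a single-byte code page does,
// and re-encode the result as UTF-8 so it can be displayed.
// The byte-to-character mapping used here is Latin-1; Windows-1252 differs
// only for bytes 0x80-0x9F, which do not occur in this example.
std::string misread_as_single_byte(const std::string& utf8_bytes) {
    std::string out;
    for (unsigned char b : utf8_bytes) {
        out += to_utf8(b);
    }
    return out;
}
```

Calling misread_as_single_byte(to_utf8(0x00E8)) gives the UTF-8 text for Ã¨, which matches the garbled university name above.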

At first, I thought it might have to do with the fact that I am using wide characters and strings, but when I use basic strings instead I have the same problem.

Then I thought it might have to do with the way URLDownloadToFile() is encoding the text file. I believe it is saved as UTF-8, so as a test I re-saved the original text file created by URLDownloadToFile() in different encodings to see if there was any difference. I also created another text file with just the line

    	Jesus Benajes - CMT-Universitat Politècnica de València

and tested the same encodings on it. When I grab the necessary data from each file and output it to a csv file, the results are as follows, whether I use wide or basic strings and streams ("large" refers to the original text file created by URLDownloadToFile() and "small" refers to the one-line text file I created):

large in UTF-8: does not print special characters correctly.
large in ANSI: does print special characters correctly.
large in Unicode: program doesn't run correctly (or else it's taking a very long time)
large in Unicode big endian: same as Unicode.

small in UTF-8: does print special characters correctly.
small in ANSI: does print special characters correctly.
small in Unicode: program doesn't run correctly (or else it's taking a very long time)
small in Unicode big endian: same as Unicode.

I am having trouble reconciling the fact that large does not work in UTF-8 but small does. What further confuses me is that ANSI would work when UTF-8 would not. I thought UTF-8 was more encompassing.

In any case, if someone could help me illuminate the problem or provide a way around it, I'd be very grateful. Thank you.

EDIT: The special characters show up incorrectly only in the csv files, not in the text files.

Last edited on Jun 3, 2012 at 3:27am

Jun 3, 2012 at 3:38am

codeFoil (39)

How are you viewing the CSV file?
Through a text editor or your own program?

Jun 3, 2012 at 3:46am

CJC0117 (97)

With Excel.

Jun 3, 2012 at 3:58am

codeFoil (39)

This seems to be a common issue with Excel.

You may have already tried this, but if not, this might help:
http://www.codewiz51.com/wiki/UnicodeCSVExcel.ashx

Jun 3, 2012 at 4:17am

CJC0117 (97)

That's for Unicode though. Is that the most encompassing (I'll admit that I know very little about all these different kinds of encodings)?

Also, forgive me for sounding lazy, but I barely understand any of the code from the link you provided. Is it possible to just change the settings from within Excel after opening the csv file?

Furthermore, do you have any idea why ANSI worked over UTF-8 ~~and why UTF-8 worked for the one-line file but not the larger file~~?

EDIT: I'm using Excel 2010.

EDIT2: Thanks for narrowing down the problem to Excel. I opened the csv file in Notepad instead, and it looks fine. I still need to figure out how to get it to show up correctly in Excel though.

EDIT3: I was able to reconcile the fact that the special characters show up correctly in the csv file created from the one-line text file but not in the csv file created from the larger text file. It has something to do with the fact that the university name is not on the first line in the larger text file, but it is on the first line of the one-line text file (obviously).

EDIT4: Nope. There's more to it than just being on the first line of the text file vs. not being on the first line. When I tried outputting the university name to another temporary one-line text file, then grabbing it again, then outputting it to a csv file, the special characters don't show up correctly. Then when I go to the one-line text file, "save as" another text file encoded in UTF-8, and create a program that grabs the single line from the text file and outputs it to a csv file, the special characters do show up correctly. What is going on? I thought the file was already encoded as UTF-8 in C++, so why does it make a difference when I save it as another text file encoded in UTF-8? This is maddening. I'll stop cluttering my post with edits now, because I really have no idea where to go from here anyway.
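
One explanation that would fit these symptoms, though nothing in this thread confirms it: Notepad's UTF-8 save option writes a three-byte byte order mark (EF BB BF) at the start of the file, and Excel uses that mark to recognise a csv file as UTF-8. A program that copies the first line of such a file would carry the mark into the csv file, while a file written by the program itself, or a name taken from further down a larger file, would not have it. Below is a minimal sketch for checking a file. The helper names starts_with_utf8_bom and file_has_utf8_bom are hypothetical.

```cpp
#include <fstream>
#include <string>

// True if the byte string begins with the UTF-8 byte order mark EF BB BF.
bool starts_with_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0;
}

// Read the first three bytes of a file in binary mode and check them.
// Returns false if the file cannot be opened or is shorter than three bytes.
bool file_has_utf8_bom(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    char head[3] = {};
    in.read(head, 3);
    return in.gcount() == 3 && starts_with_utf8_bom(std::string(head, 3));
}
```

Running file_has_utf8_bom on the csv file before and after the Notepad "save as" step would show whether the mark is the only difference between the two.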

Last edited on Jun 3, 2012 at 6:05am

Jun 3, 2012 at 6:09pm

CJC0117 (97)

I found a workable solution to the problem. Open the csv file in Notepad, go to Save As, choose the UTF-8 encoding, and save. Then open the csv file in Excel. Most characters show up correctly now.
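
The same effect can probably be had without the Notepad step by having the program write the UTF-8 byte order mark itself before any rows. This is a minimal sketch, assuming the rows are already UTF-8 encoded narrow strings. write_csv_with_bom is a hypothetical helper, and code that uses wide streams would need its own conversion to UTF-8 first.

```cpp
#include <fstream>
#include <string>

// The UTF-8 byte order mark: the three bytes EF BB BF.
const char kUtf8Bom[] = "\xEF\xBB\xBF";

// Write the byte order mark followed by the already UTF-8 encoded csv text.
// Returns false if the file could not be opened or written.
bool write_csv_with_bom(const std::string& path, const std::string& utf8_rows) {
    std::ofstream out(path, std::ios::binary);  // binary: write the bytes unchanged
    if (!out) {
        return false;
    }
    out << kUtf8Bom << utf8_rows;
    return static_cast<bool>(out);
}
```

For example, write_csv_with_bom("authors.csv", rows) would replace a plain ofstream write, where authors.csv and rows stand in for whatever file name and data the program already uses.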
