Sorry for the long post. I want to be as descriptive as possible.
My original task was to scrape the title, authors, and companies from an online article. I ended up using URLDownloadToFile() to download the page as a text file. Below is the portion of the text file that contains the authors and their respective companies:
<h3>Author(s):</h3>
<div class="prod-detail-authors">
<br />
<a href=/servlets/product?PROD_TYP=PAPER&ACN=79436359437&AUTHOR_NAME=Jesus+Benajes&PLA_SW=YES>
Jesus Benajes - CMT-Universitat Politècnica de València
</a>
<br />
<a href=/servlets/product?PROD_TYP=PAPER&ACN=79432535370&AUTHOR_NAME=Ricardo+Novella&PLA_SW=YES>
Ricardo Novella - CMT-Universitat Politècnica de València
</a>
<br />
Daniela De Lima - CMT-Universitat Politècnica de València
<br />
<a href=/servlets/product?PROD_TYP=PAPER&ACN=79493574699&AUTHOR_NAME=Vincent+Dugue&PLA_SW=YES>
Vincent Dugue - Renault
</a>
<br />
Nicolas Quechon - Renault
</div>
I have written a program that extracts the authors and companies. URLDownloadToFile() creates a temporary file; I read the necessary data from it using wide characters and wide strings (i.e. wifstream instead of ifstream, wstring instead of string, etc.), then write it to a csv file. The problem arises when an author's or company's name contains a special, accented character. For instance, in the text included above, the company/university name contains an è character. In the csv file, the university name ends up looking like this:
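A minimal sketch of the extraction step described above, for reference. It is not the actual program: the helper names splitAuthorLine and csvField are hypothetical, and it assumes each author line has the form "Name - Company" as in the snippet. It uses plain std::string on purpose, so the bytes read from the file pass through to the csv unchanged, with no encoding conversion.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: split "Author - Company" at the first " - ".
// The bytes are copied as-is, so a UTF-8 "è" stays the two bytes 0xC3 0xA8.
std::pair<std::string, std::string> splitAuthorLine(const std::string& line) {
    const std::string sep = " - ";
    const std::string::size_type pos = line.find(sep);
    if (pos == std::string::npos) {
        return {line, ""};  // no separator found: treat the whole line as the author
    }
    return {line.substr(0, pos), line.substr(pos + sep.size())};
}

// Hypothetical helper: wrap a field in double quotes for csv output,
// doubling any embedded double quote.
std::string csvField(const std::string& field) {
    std::string out = "\"";
    for (char c : field) {
        if (c == '"') {
            out += '"';
        }
        out += c;
    }
    out += '"';
    return out;
}
```

Calling splitAuthorLine on "Vincent Dugue - Renault" gives the pair ("Vincent Dugue", "Renault"); each half would then go through csvField before being written to the output stream.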
CMT-Universitat Politècnica de València
At first, I thought it might have to do with the fact that I am using wide characters and strings, but when I use basic strings instead I have the same problem.
Then I thought it might have to do with the encoding of the text file that URLDownloadToFile() produces. I believe the file is in UTF-8, so as a test I re-saved the original text file created by URLDownloadToFile() in different encodings to see if there was any difference. I also created another text file with just the line
Jesus Benajes - CMT-Universitat Politècnica de València
and tested the different encodings on it as well. When I grab the necessary data from each file and output it to a csv file, the results are as follows, whether I use wide or basic strings and streams ("large" refers to the original text file created by URLDownloadToFile() and "small" refers to the one-line text file I created):
large in UTF-8: does not print special characters correctly.
large in ANSI: does print special characters correctly.
large in Unicode: program doesn't run correctly (or else it's taking a very long time).
large in Unicode big endian: same as Unicode.
small in UTF-8: does print special characters correctly.
small in ANSI: does print special characters correctly.
small in Unicode: program doesn't run correctly (or else it's taking a very long time).
small in Unicode big endian: same as Unicode.
I am having trouble reconciling the fact that large does not work in UTF-8 but small does. What confuses me further is that ANSI works where UTF-8 does not. I thought UTF-8 could represent more characters than ANSI.
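One way to check what each file actually contains is to dump the raw bytes of a line. In UTF-8, è is stored as the two bytes C3 A8; in Windows-1252 (which is typically what "ANSI" means on a Western European Windows setup) it is the single byte E8. The sketch below is only a diagnostic aid; hexDump is a hypothetical helper name, and it assumes the line has been read into a plain std::string without any conversion.

```cpp
#include <cstdio>
#include <string>

// Hypothetical diagnostic helper: render every byte of s as two uppercase
// hex digits, separated by single spaces.
std::string hexDump(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(c));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

Running hexDump on the same line taken from the large file, the small file, and the csv output would show whether the è bytes differ between them, or whether the bytes are identical and only the program displaying the csv interprets them differently.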
In any case, if someone could help me shed light on the problem or suggest a way around it, I'd be very grateful. Thank you.
EDIT: The special characters show up incorrectly only in the csv files, not in the text files.