Hello !
I'm struggling to re-encode a string that is an HTML document.
Everything works, except non-standard characters (French accented characters, but also some punctuation!) that aren't expressed through HTML escape sequences.
I think it's because their numbers are encoded in hexadecimal and libcurl interprets them as decimal numbers, but sometimes a character even gets split into two chars!
I have tried changing the character encoding in VS and re-encoding the string, but it didn't work.
I am on Windows 10 x64 using Visual Studio 2019 community.
A few examples (-> means "becomes"):
Crédits -> Cr├®dits
badges… Retrouvez -> badgesÔǪ┬áRetrouvez
The page in question is encoded in UTF-8. You need to first read the entire contents as a binary buffer then use a UTF-8 decoder to get back a std::wstring (or std::u32string if wstring is not wide enough for you).
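A minimal sketch of that approach, assuming the whole response has already been collected as raw bytes in a std::string (std::wstring_convert is deprecated since C++17 but still ships with VS2019):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Decode raw UTF-8 bytes into a UTF-16 std::wstring (wchar_t is 16 bits on Windows).
std::wstring decodeUtf8(const std::string& rawBytes) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(rawBytes); // throws std::range_error on invalid UTF-8
}
```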
It's better, but still not good: it appears that non-ASCII characters are shifted by roughly 15, and I couldn't find any logic to it.
Example:
(é -> Ú ; è -> Þ ; à -> Ó)
On top of that, I'm worried: I wanted my crawler to be really fast (with only one traversal of the string), and this operation makes it significantly slower.
Are you on Windows? That's just how they're being displayed on the console. The actual data in memory is correct.
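For example, a quick way to convince yourself of that is to dump the raw bytes to a file and open it in an editor that understands UTF-8 (the variable name here is just an assumption):

```cpp
#include <fstream>
#include <string>

// Write the downloaded bytes untouched to disk; if the file looks right in an
// editor, the data in memory is fine and only the console display is off.
void dumpBody(const std::string& body) {
    std::ofstream out("response.html", std::ios::binary);
    out.write(body.data(), static_cast<std::streamsize>(body.size()));
}
```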
> I wanted my crawler to be really fast (with only one traversal of the string)
As long as the logic you're trying to run on the data doesn't require random access (i.e. seeking back and forth), you could in theory design a state machine capable of decoding the UTF-8, parsing the HTML, and searching for interesting strings in the document, all in one pass of the raw binary data coming from the network. The problem is that it wouldn't be faster than doing multiple passes over the data, it would just use less memory, since you wouldn't need to hold the entire document in memory at once. It would only be faster if, for example, the data you need is within the first 10% of the document, and once you have that you can abort the download of the remaining 90%.
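For what it's worth, a minimal sketch of what such a single-pass setup could look like: an incremental UTF-8 decoder carried as state between libcurl write callbacks, with the HTML/keyword logic left as a placeholder (invalid byte sequences are simply skipped; all names are made up):

```cpp
#include <cstddef>

// State carried between callbacks: the code point being assembled and how
// many continuation bytes are still expected.
struct Utf8Stream {
    char32_t cp = 0;
    int pending = 0;

    // Feed one byte; calls sink(codePoint) each time a full character is decoded.
    template <class Sink>
    void feed(unsigned char b, Sink sink) {
        if (pending == 0) {
            if      (b < 0x80)           { sink(char32_t(b)); }          // ASCII
            else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; pending = 1; } // 2-byte sequence
            else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; pending = 2; } // 3-byte sequence
            else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; pending = 3; } // 4-byte sequence
            // invalid lead bytes are silently skipped in this sketch
        } else {
            cp = (cp << 6) | (b & 0x3F);
            if (--pending == 0) sink(cp);
        }
    }
};

// libcurl write callback: decode each chunk as it arrives, in one pass,
// without buffering the whole document.
static size_t onChunk(char* ptr, size_t size, size_t nmemb, void* userdata) {
    auto* state = static_cast<Utf8Stream*>(userdata);
    for (size_t i = 0; i < size * nmemb; ++i) {
        state->feed(static_cast<unsigned char>(ptr[i]), [](char32_t c) {
            (void)c; // HTML parsing / keyword matching would go here
        });
    }
    return size * nmemb;
}
```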
It's possible that decoding the UTF-8 is slower than just processing the raw binary (it shouldn't be, but it's possible that the codecvt implementation is inefficient), but if you need to process the character values and not the byte values then it doesn't make any difference, because you need to decode the UTF-8 one way or another. At best you could try a different decoder.
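On Windows specifically, one alternative would be the Win32 converter; a sketch, again assuming the whole response is already sitting in a std::string:

```cpp
#include <windows.h>
#include <string>

// Decode a UTF-8 buffer to UTF-16 with MultiByteToWideChar instead of codecvt.
std::wstring utf8ToWide(const std::string& in) {
    if (in.empty()) return {};
    int needed = MultiByteToWideChar(CP_UTF8, 0, in.data(), (int)in.size(), nullptr, 0);
    std::wstring out(needed, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, in.data(), (int)in.size(), &out[0], needed);
    return out;
}
```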
This is really weird.
I followed these instructions: https://stackoverflow.com/a/1875622
After I ticked the option, I built my program with this in main:
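(Something along these lines; the exact snippet may have differed:)

```cpp
#include <cstdio>
#include <fcntl.h>
#include <io.h>
#include <iostream>
#include <string>

int main() {
    // Switch stdout to UTF-16 mode; only wide output (wcout/wprintf) may be used afterwards.
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wstring s = L"Crédits… Retrouvez";
    std::wcout << s << L'\n';
}
```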
OK, I have successfully exported to a file and the characters are correct!
One last thing: is there any decoder faster than codecvt?
That's precisely what I wanted to do, a state machine, and I'm convinced it would be useful for holding hundreds of visited URLs. It would significantly reduce the number of comparisons, wouldn't it?
A bot that searches for websites containing enough keywords from a keyword list and then creates a graphviz map. At the moment I'm trying to resolve the encoding problem in a separate project.
I'm going mad over another issue: I tried to build my string as a wstring from the beginning to improve my code, but my wstring won't print?!
The above code won't work. If the website is returning the content encoded as UTF-8 then you have no choice but to run a decoder, if you want to get the character data out. Merely casting the pointer to the type you need doesn't do anything.
Also, since sizeof(wchar_t) > sizeof(char), line 12 will cause an out-of-bounds access when std::wstring::append() attempts to read size * nmemb characters, when in fact only size * nmemb bytes are available.
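For reference, a sketch of a write callback that just collects the raw bytes; decoding to a wstring happens afterwards, once the transfer is complete (names are illustrative):

```cpp
#include <curl/curl.h>
#include <string>

// Append the incoming chunk to a std::string: bytes, not wide characters.
static size_t writeBytes(char* ptr, size_t size, size_t nmemb, void* userdata) {
    auto* body = static_cast<std::string*>(userdata);
    body->append(ptr, size * nmemb);
    return size * nmemb;
}

// Usage:
//   std::string body;
//   curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeBytes);
//   curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
//   // after curl_easy_perform(): decode `body` from UTF-8 to std::wstring
```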
> A bot that searches for websites containing enough keywords from a keyword list and then creates a graphviz map. At the moment I'm trying to resolve the encoding problem in a separate project.
So how does this problem statement relate to this:
> That's precisely what I wanted to do, a state machine, and I'm convinced it would be useful for holding hundreds of visited URLs.
?
Once you've extracted the useful data from a request and stored its relationships to the data you already had, why would you need to keep the content around? Like, imagine something like this:
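For illustration only, a hypothetical per-page record (every name here is made up); once a page has been reduced to something like this, the HTML itself can be thrown away:

```cpp
#include <string>
#include <vector>

// Everything the crawler needs to keep about a visited page: the URL,
// which keywords matched, and the outgoing links (the edges of the graph).
struct PageNode {
    std::string url;
    std::vector<std::string> matchedKeywords;
    std::vector<std::string> outgoingLinks;
};
```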
> On top of that, I'm worried: I wanted my crawler to be really fast (with only one traversal of the string),
> and this operation makes it significantly slower.
You need to make it 'right' before you even begin to think about making it 'fast'.
The elephant in the room is your network speed.
You can easily afford small performance sacrifices for the sake of clean, easy to read code.