Hi everybody. I want to create a "spider" program that gets all the links from an HTML page. Can I use libcurl to do this? If so, can anyone help me? I already know how to download the whole source code of a page; I just want the links.
Use libcURL (or its C++ wrapper cURLpp) to grab the HTML and then use the Boost.Regex library to extract the hyperlinks. You can do that in less than 100 lines of code. Or you can use libxml2 to parse the HTML and extract the links that way.
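If it helps, here is a rough sketch of that first approach: libcurl downloads the page into a string, then a regex pulls out the href values. I'm using std::regex for brevity (Boost.Regex has essentially the same interface), and the URL and the pattern are just placeholders, not something from your code:

```cpp
#include <curl/curl.h>
#include <iostream>
#include <regex>
#include <string>

// libcurl write callback: append the received bytes to a std::string.
static size_t write_cb(char* ptr, size_t size, size_t nmemb, void* userdata)
{
    std::string* body = static_cast<std::string*>(userdata);
    body->append(ptr, size * nmemb);
    return size * nmemb;
}

int main()
{
    std::string html;

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (curl)
    {
        curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/"); // placeholder URL
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
        curl_easy_perform(curl);
        curl_easy_cleanup(curl);
    }
    curl_global_cleanup();

    // Very rough pattern: grab the value of every href="..." attribute.
    // A regex won't handle every corner of real-world HTML, but it's
    // fine for a first spider.
    std::regex href_re("href\\s*=\\s*\"([^\"]*)\"", std::regex::icase);
    for (std::sregex_iterator it(html.begin(), html.end(), href_re), end;
         it != end; ++it)
    {
        std::cout << (*it)[1] << '\n';
    }
}
```

Link against libcurl (e.g. `-lcurl`) and you have the skeleton of the spider; from there you would push the extracted URLs onto a queue and fetch them in turn.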
Let your program iterate over the characters and wait for a '<' character. When you see one, set a flag, and at the next character check whether it's an 'a' or not. If not, discard it and clear the flag. If it is, check whether the next character is whitespace. If not, discard it and clear the flag. If it is, wait for the "href" sequence. If you hit a '>' first, clear the flag again. If you find the href sequence, check whether the next characters are whitespace or '='. Once you get the '=', set another flag... and so on. If you know how HTML files are built, you should be able to do that much. A simplified sketch of this idea follows.
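Something like this hand-rolled scanner, roughly in the spirit described above (simplified: it only handles double-quoted href values and ignores a lot of what real-world HTML allows):

```cpp
#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Walk the text, look for "<a" followed by whitespace, then find
// href="..." inside that tag and keep the quoted value.
std::vector<std::string> extract_links(const std::string& html)
{
    std::vector<std::string> links;
    for (std::size_t i = 0; i + 2 < html.size(); ++i)
    {
        // Step 1: '<' followed by 'a'/'A' and whitespace opens an anchor tag.
        if (html[i] != '<' ||
            std::tolower(static_cast<unsigned char>(html[i + 1])) != 'a' ||
            !std::isspace(static_cast<unsigned char>(html[i + 2])))
            continue;

        // Step 2: find the closing '>' of this tag and cut the tag out.
        std::size_t tag_end = html.find('>', i);
        if (tag_end == std::string::npos)
            break;
        std::string tag = html.substr(i, tag_end - i);

        // Step 3: look for href= inside the tag and take the quoted value.
        std::size_t h = tag.find("href");
        if (h != std::string::npos)
        {
            std::size_t eq = tag.find('=', h);
            std::size_t q1 = tag.find('"', eq);
            std::size_t q2 = (q1 == std::string::npos)
                                 ? std::string::npos
                                 : tag.find('"', q1 + 1);
            if (q2 != std::string::npos)
                links.push_back(tag.substr(q1 + 1, q2 - q1 - 1));
        }
        i = tag_end; // skip the rest of this tag
    }
    return links;
}

int main()
{
    std::string sample = "<p>see <a href=\"http://example.com\">here</a></p>";
    for (const auto& link : extract_links(sample))
        std::cout << link << '\n';
}
```

It's not a full state machine with explicit flags like the description above, but it follows the same idea: recognize the start of an `<a ...>` tag, then hunt for href and its value before the closing '>'.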