Getting the links from an HTML page

Hi everybody. I want to write a "spider" program that gets all the links from an HTML page. Can I use libcurl for this? If so, can anyone help me? I already know how to download the full source code of a page, but I only want the links.


Can anyone help me, please? Thanks.
Use libcURL (or cURLpp) to grab the HTML and then use the Boost Regex library to extract the hyperlinks. You can do that in less than 100 lines of code. Or you can use libxml2 to parse the HTML and extract the links that way.
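For the fetching part, a minimal sketch with libcurl might look like this (the URL and the buffer struct are just placeholders; error handling is kept to a minimum):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

/* Grows a heap buffer as libcurl delivers chunks of the page. */
struct buffer {
    char *data;
    size_t size;
};

static size_t write_cb(void *contents, size_t size, size_t nmemb, void *userp)
{
    size_t total = size * nmemb;
    struct buffer *buf = userp;
    char *tmp = realloc(buf->data, buf->size + total + 1);
    if (!tmp)
        return 0;                       /* tell libcurl we failed */
    buf->data = tmp;
    memcpy(buf->data + buf->size, contents, total);
    buf->size += total;
    buf->data[buf->size] = '\0';
    return total;
}

int main(void)
{
    struct buffer page = { NULL, 0 };
    CURL *curl;

    curl_global_init(CURL_GLOBAL_DEFAULT);
    curl = curl_easy_init();
    if (curl) {
        /* placeholder URL */
        curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &page);
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
        if (curl_easy_perform(curl) == CURLE_OK)
            printf("fetched %zu bytes\n", page.size);
        curl_easy_cleanup(curl);
    }
    curl_global_cleanup();
    free(page.data);
    return 0;
}

Compile with something like gcc spider.c -lcurl, then feed page.data to whatever link extractor you end up with.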
Well... but the Boost library is for C++; I'm using C.

Thanks.
I found out that I can extract the links from a page using libxml2, but I don't know how to use that library. Can anybody help me?
Anybody?
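If you do want to go the libxml2 route, here is a rough sketch that pulls the href attributes out of HTML you have already downloaded (the function name and its arguments are only illustrative):

#include <stdio.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

/* html: page source already fetched (e.g. with libcurl)
   len:  its length
   base_url: only used by libxml2 for error reporting */
static void print_links(const char *html, int len, const char *base_url)
{
    htmlDocPtr doc = htmlReadMemory(html, len, base_url, NULL,
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (!doc)
        return;

    /* Select every href attribute of every <a> element. */
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr res =
        xmlXPathEvalExpression((const xmlChar *)"//a/@href", ctx);

    if (res && res->nodesetval) {
        for (int i = 0; i < res->nodesetval->nodeNr; i++) {
            xmlChar *href = xmlNodeGetContent(res->nodesetval->nodeTab[i]);
            printf("%s\n", href);
            xmlFree(href);
        }
    }

    xmlXPathFreeObject(res);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
}

Link against libxml2 (pkg-config --cflags --libs libxml-2.0).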
You don't need a specific library for that though...


Yeah, I know, but I don't know how to do it (reading the HTML tags). Can you help me, please?

Thanks.
Let your program iterate over the characters and wait for a '<' character. When you see one, set a flag and check whether the next character is an 'a' or not. If not, discard it and clear the flag. If it is, check whether the character after that is whitespace. If not, discard it and clear the flag. If it is, wait for the "href" sequence. If a '>' turns up first, clear the flag again. Once you have the href sequence, check whether the following characters are whitespace or '='. When you get the '=', set another flag... and so on. If you know how HTML files are built, you should be able to do that much.
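A rough sketch of that idea in C, assuming quoted href values and reasonably well-formed HTML (the function name is made up for the example, and plenty of corner cases are ignored):

#include <stdio.h>
#include <ctype.h>
#include <string.h>

/* Naive scanner: finds <a ...> tags and prints their href values. */
static void extract_links(const char *html)
{
    const char *p = html;

    while ((p = strchr(p, '<')) != NULL) {
        p++;                                      /* character after '<' */
        if (tolower((unsigned char)*p) != 'a' ||
            !isspace((unsigned char)p[1]))
            continue;                             /* not an "<a ..." tag */

        const char *end = strchr(p, '>');         /* end of this tag */
        if (!end)
            break;

        const char *attr = strstr(p, "href");     /* href inside the tag? */
        if (attr && attr < end) {
            attr = strchr(attr, '=');
            if (attr && attr < end) {
                attr++;
                while (attr < end && (isspace((unsigned char)*attr) ||
                                      *attr == '"' || *attr == '\''))
                    attr++;                       /* skip spaces and quote */
                const char *q = attr;
                while (q < end && *q != '"' && *q != '\'' &&
                       !isspace((unsigned char)*q))
                    q++;                          /* value ends here */
                printf("%.*s\n", (int)(q - attr), attr);
            }
        }
        p = end;                                  /* continue after the tag */
    }
}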
Use the POSIX regex library!
POSIX regex? You mean GNU regex? POSIX is, AFAIK, just the standard.

http://www.gnu.org/s/libc/manual/html_node/Regular-Expressions.html
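Whichever implementation you use, the <regex.h> interface is the same. A minimal sketch that prints every quoted href value (the pattern and function name are just for illustration):

#include <stdio.h>
#include <regex.h>

/* Prints every href="..." value found in html, case-insensitively. */
static void print_hrefs(const char *html)
{
    regex_t re;
    regmatch_t m[2];          /* m[0] = whole match, m[1] = capture group */

    if (regcomp(&re, "href[[:space:]]*=[[:space:]]*\"([^\"]*)\"",
                REG_EXTENDED | REG_ICASE) != 0)
        return;

    const char *p = html;
    while (regexec(&re, p, 2, m, 0) == 0) {
        printf("%.*s\n", (int)(m[1].rm_eo - m[1].rm_so), p + m[1].rm_so);
        p += m[0].rm_eo;      /* continue searching after this match */
    }

    regfree(&re);
}

On glibc these functions are built in, so no extra linker flag is needed.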