C++ Web Extraction

Jan 6, 2011 at 12:55am
Hello Everyone,

I was wondering how to extract data from websites and use it in my program. For example, how can I have my program read the HTML code and save it as a text file? I know I can do this manually in Internet Explorer under Page > View Source.

Any help would be greatly appreciated.

Thanks, Arthur
Jan 6, 2011 at 1:41am
Try the cURL library. It will make an HTTP request to the web page, and when the response comes back you can do your own web extraction.
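
If you just want to see it work, a minimal sketch with the libcurl easy API (assuming libcurl is installed and you link with -lcurl; the URL is only a placeholder) looks something like this:

#include <curl/curl.h>

int main()
{
    CURL* curl = curl_easy_init();
    if (curl)
    {
        curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/");
        // with no write callback set, libcurl prints the response body to stdout
        curl_easy_perform(curl);
        curl_easy_cleanup(curl);
    }
    return 0;
}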
Jan 6, 2011 at 2:52am
I think you're right, but I don't have a clue how to do that. I looked up cURL and found it very confusing; could you point me in the right direction?
Last edited on Jan 6, 2011 at 2:53am
Jan 6, 2011 at 3:01am
I believe cURL comes in both binary and library form.

Binary means that once you install it, you can use it directly right away.

E.g.
your Linux prompt> curl <command-line arguments, including the URL>

The webpage contents will be printed to the console. You can pipe that output to another program (bash, PHP, C, C++, Perl, etc.) to process and extract what you need.
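
For example, a tiny C++ filter (purely hypothetical, just to illustrate the piping idea) that keeps only the lines containing a <title> tag:

#include <iostream>
#include <string>

int main()
{
    std::string line;
    // read whatever curl printed to the console, line by line
    while (std::getline(std::cin, line))
    {
        if (line.find("<title>") != std::string::npos)
            std::cout << line << '\n';   // keep only lines with a <title> tag
    }
    return 0;
}

your linux prompt> curl <URL> | ./extract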

Library means you cannot run it stand-alone. You link the library into your own program and call the API it provides. The webpage contents then "arrive" inside your program and you do your processing there.
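
A rough sketch of that library approach (assuming libcurl; the callback name and URL are just placeholders), collecting the response body into a std::string:

#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl calls this each time a chunk of the response body arrives
static size_t write_cb(char* data, size_t size, size_t nmemb, void* userp)
{
    static_cast<std::string*>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main()
{
    CURL* curl = curl_easy_init();
    std::string page;
    if (curl)
    {
        curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &page);
        if (curl_easy_perform(curl) == CURLE_OK)
            std::cout << page;           // the HTML has "arrived" in your program
        curl_easy_cleanup(curl);
    }
    return 0;
}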

Most people choose the easier approach, which is to use the binary directly. The contents are then processed by another scripting or compiled language.

Some people use curl to manipulate HTTP request headers, so it also makes a good *spoofing* tool :P
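
For instance, a sketch of overriding the User-Agent and adding an extra request header with libcurl (the header values are made up for illustration):

#include <curl/curl.h>

int main()
{
    CURL* curl = curl_easy_init();
    if (curl)
    {
        curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/");
        // pretend to be a regular browser instead of libcurl
        curl_easy_setopt(curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1)");
        // add or override arbitrary request headers
        struct curl_slist* headers = NULL;
        headers = curl_slist_append(headers, "Referer: http://www.example.com/");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
        curl_easy_perform(curl);
        curl_slist_free_all(headers);
        curl_easy_cleanup(curl);
    }
    return 0;
}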
Jan 6, 2011 at 3:05am
Sounds insanely cool, but I think it's beyond my skill level. Isn't there a way to do it with the Boost libraries?
Jan 6, 2011 at 3:08am
I am not sure about the Boost libraries, but cURL is well established in the webpage-extraction arena.

If it's beyond your skill level, you can use the binary directly instead. Are you familiar with Unix commands like ls, cd, pwd, cp, mv, tar, gzip, etc.? If you are, then curl is no different: all you have to do is learn which command-line options it supports and you are ready to go.

If you have not used Unix commands before, then using the Boost libraries is much harder, since a library is not a ready-to-run binary.
Jan 6, 2011 at 3:11am
Not sure; I'm new to Unix and the Boost libraries, so both seem extremely hard. I don't know which way would be better to go. All I need is for my program to download the source code of a website and save it as a text file. From there I can use file I/O to get what I need. Does that help at all?
Jan 6, 2011 at 3:15am
All I need is for my program to download the source code of a website and save it as a text file.


I believe you mean the HTML markup tags and JavaScript code of the website instead.

Well then, what I described is exactly what you want.

Step 1
Download and install curl

Step 2
curl <command-line option> > a.txt

Then a.txt will contain the contents of the webpage you specified to curl. You only need to learn the options curl supports, which are numerous and very comprehensive.
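
And if later you want your own C++ program to do the same thing instead of the command line, a minimal libcurl sketch (again assuming libcurl is installed and linked with -lcurl) that saves the page straight to a.txt could look like this:

#include <curl/curl.h>
#include <cstdio>

int main()
{
    CURL* curl = curl_easy_init();
    if (curl)
    {
        std::FILE* out = std::fopen("a.txt", "w");
        if (out)
        {
            curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/");
            // libcurl's default write function is fwrite, so a FILE* works as WRITEDATA
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);
            curl_easy_perform(curl);
            std::fclose(out);
        }
        curl_easy_cleanup(curl);
    }
    return 0;
}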
Jan 6, 2011 at 3:18am
Yeah, I want the <dir>text whatever</dir> stuff of the website. I will download it and look into the commands. Thanks for the help!