[C/C++] How To Get The HTML Of a Web Page

Feb 18, 2011 at 3:39am
Hello, is it possible to get the HTML of a web page?

Let's say we had this web page:

<html>
<body>
<b>Bold Text</b>
</body>
</html>


Is there a function that I can call that will return a string of "<html>
<body>
<b>Bold Text</b>
</body>
</html>"

?

Or anything like it.

This is very important so any help would be greatly appreciated!
Feb 18, 2011 at 4:16am
Not in the standard library, you'd have to get something else for that.

I've heard of libcurl and some others; I'd just try googling, which is probably the fastest way.
Feb 18, 2011 at 5:27am
Hi,
I hope libhtml will help you with this. Ref.: http://libhtml.sourceforge.net/
Feb 18, 2011 at 7:31am
libhtml is for (or rather, would have been for...) parsing HTML.
To retrieve HTML documents (or anything else), libcurl is the right solution.
Feb 18, 2011 at 9:09am
libcurl seems like it is what I'm looking for, but I don't want to learn a whole new library just for this. Can someone tell me the functions used to get the HTML from a web page? Or a link to an example, please. Thank you :D
Feb 18, 2011 at 9:41am
Feb 18, 2011 at 10:43am
Thanks, but I can't compile this:

#include <stdio.h>
#include <curl/curl.h>
int main ()
{
  CURL *curl;
  CURLcode res;
  curl = curl_easy_init ();
  if (curl)
  {
    curl_easy_setopt (curl, CURLOPT_URL, "http://example.com");
    res = curl_easy_perform (curl);
    curl_easy_cleanup (curl);
  }
  return 0;
}


Errors:

obj\Release\main.o||In function `main':|
C:\Users\-LeetGamer-\Desktop\Programming\Working\cURL\main.cpp|7|undefined reference to `__imp__curl_easy_init'|
C:\Users\-LeetGamer-\Desktop\Programming\Working\cURL\main.cpp|10|undefined reference to `__imp__curl_easy_setopt'|
C:\Users\-LeetGamer-\Desktop\Programming\Working\cURL\main.cpp|11|undefined reference to `__imp__curl_easy_perform'|
C:\Users\-LeetGamer-\Desktop\Programming\Working\cURL\main.cpp|12|undefined reference to `__imp__curl_easy_cleanup'|
||=== Build finished: 4 errors, 0 warnings ===| 


:S
Feb 18, 2011 at 10:46am
Apparently you forgot to link the library.
Feb 18, 2011 at 11:24am
I can't find a .a or .lib file in the curl download :S

Sorry, but I haven't needed to link a library in a long time (almost a year), so I kind of forgot how xD
Last edited on Feb 18, 2011 at 11:37am
Feb 18, 2011 at 11:42am
There are over a hundred different packages for download, so which one is "the" download?
In any case, you need the one under "Win32 - Generic": Win32 2000/XP 7.21.4 libcurl.
Or a direct link: http://www.gknw.net/mirror/curl/win32/curl-7.21.4-devel-mingw32.zip
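Once you've unpacked the archive, linking from the command line would look roughly like this (the C:/curl paths below are just placeholders for wherever you extracted the devel package; in Code::Blocks you'd add the same include/lib directories and library name in the project's build options):

```shell
# Placeholder paths: point -I and -L at the unpacked devel package's
# include and lib directories, then link against libcurl itself.
gcc main.c -o main.exe -I C:/curl/include -L C:/curl/lib -lcurl
```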
Feb 18, 2011 at 11:59am
Okay I got it working, and I tried the examples that I thought did what I want, but they don't. Sorry for being such a "noob" :P Do you know which example will return the HTML for a web page, or perform file I/O to "download" the .html file?

Thanks for all your help :D
Feb 18, 2011 at 12:18pm
The third example should do ("get HTTP with headers separate") after throwing out everything related to the header (including modifying the last curl_easy_setopt call as mentioned in the comment).
Feb 18, 2011 at 2:58pm
When I tried it on google I got:

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.

</BODY></HTML>


And not the full HTML of Google's home page (google.com).

Here is what I ran:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <curl/curl.h>
#include <curl/types.h>
#include <curl/easy.h>

static size_t write_data(void *ptr, size_t size, size_t nmemb, void *stream)
{
  size_t written = fwrite(ptr, size, nmemb, (FILE *)stream);
  return written;
}

int main(void)
{
  CURL *curl_handle;
  static const char *headerfilename = "head.txt";
  FILE *headerfile;
  static const char *bodyfilename = "body.txt";
  FILE *bodyfile;

  curl_global_init(CURL_GLOBAL_ALL);

  /* init the curl session */
  curl_handle = curl_easy_init();

  /* set URL to get */
  curl_easy_setopt(curl_handle, CURLOPT_URL, "http://google.com");

  /* no progress meter please */
  curl_easy_setopt(curl_handle, CURLOPT_NOPROGRESS, 1L);

  /* send all data to this function  */
  curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, write_data);

  /* open the files */
  headerfile = fopen(headerfilename,"w");
  if (headerfile == NULL) {
    curl_easy_cleanup(curl_handle);
    return -1;
  }
  bodyfile = fopen(bodyfilename,"w");
  if (bodyfile == NULL) {
    curl_easy_cleanup(curl_handle);
    return -1;
  }

  /* we want the headers to this file handle */
  curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, headerfile);

  /*
   * Notice here that if you want the actual data sent anywhere else but
   * stdout, you should consider using the CURLOPT_WRITEDATA option.  */

  /* get it! */
  curl_easy_perform(curl_handle);

  /* close the files */
  fclose(headerfile);
  fclose(bodyfile);

  /* cleanup curl stuff */
  curl_easy_cleanup(curl_handle);

  return 0;
}


=/
Feb 18, 2011 at 3:30pm
You can either specify the correct URL (www.google.com) or tell libcurl to follow redirects automatically:
curl_easy_setopt(curl_handle, CURLOPT_FOLLOWLOCATION, 1L);
Feb 18, 2011 at 4:57pm
It works :D thank you so much.
Topic archived. No new replies allowed.