[C/C++] How To Get The HTML Of a Web Page

Feb 18, 2011 at 3:39am
Hello, is it possible to get the HTML of a web page?

Let's say we had this web page:

<html>
<body>
<b>Bold Text</b>
</body>
</html>


Is there a function that I can call that will return a string of "<html>
<body>
<b>Bold Text</b>
</body>
</html>"

?

Or anything like it.

This is very important so any help would be greatly appreciated!
Feb 18, 2011 at 4:16am
Not in the standard library, you'd have to get something else for that.

I've heard of libcurl and some others; I'd just try googling, which is probably the fastest way.
Feb 18, 2011 at 5:27am
Hi,
I hope libhtml will help you with this. Ref.: http://libhtml.sourceforge.net/
Feb 18, 2011 at 7:31am
libhtml is for (or rather, would have been for...) parsing HTML.
To retrieve HTML documents (or anything else), libcurl is the right solution.
Feb 18, 2011 at 9:09am
libcurl seems like it is what I'm looking for, but I don't want to learn a whole new library just for this. Can someone tell me the functions used to get the HTML from a web page? Or a link to an example, please. Thank you :D
Feb 18, 2011 at 9:41am
Feb 18, 2011 at 10:43am
Thanks, but I can't compile this:

#include <stdio.h>
#include <curl/curl.h>
int main ()
{
  CURL *curl;
  CURLcode res;
  curl = curl_easy_init ();
  if (curl)
  {
    curl_easy_setopt (curl, CURLOPT_URL, "http://example.com");
    res = curl_easy_perform (curl);
    curl_easy_cleanup (curl);
  }
  return 0;
}


Errors:

obj\Release\main.o||In function `main':|
C:\Users\-LeetGamer-\Desktop\Programming\Working\cURL\main.cpp|7|undefined reference to `__imp__curl_easy_init'|
C:\Users\-LeetGamer-\Desktop\Programming\Working\cURL\main.cpp|10|undefined reference to `__imp__curl_easy_setopt'|
C:\Users\-LeetGamer-\Desktop\Programming\Working\cURL\main.cpp|11|undefined reference to `__imp__curl_easy_perform'|
C:\Users\-LeetGamer-\Desktop\Programming\Working\cURL\main.cpp|12|undefined reference to `__imp__curl_easy_cleanup'|
||=== Build finished: 4 errors, 0 warnings ===| 


:S
Feb 18, 2011 at 10:46am
Apparently you forgot to link the library.
Feb 18, 2011 at 11:24am
I can't find a .a or .lib file in the curl download :S

Sorry, but I haven't needed to link a library in a long time (almost a year), so I kind of forgot how xD
Last edited on Feb 18, 2011 at 11:37am
Feb 18, 2011 at 11:42am
There are over a hundred different packages for download, so which one is "the" download?
In any case, you need the one under "Win32 - Generic": Win32 2000/XP 7.21.4 libcurl.
Or a direct link: http://www.gknw.net/mirror/curl/win32/curl-7.21.4-devel-mingw32.zip
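Once you've unpacked the archive, linking from the command line would look roughly like this (the C:/curl paths below are just placeholders for wherever you extracted the devel package; in Code::Blocks you'd add the same include/lib directories and library name in the project's build options):

```shell
# Placeholder paths: point -I and -L at the unpacked devel package's
# include and lib directories, then link against libcurl itself.
gcc main.c -o main.exe -I C:/curl/include -L C:/curl/lib -lcurl
```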
Feb 18, 2011 at 11:59am
Okay I got it working, and I tried the examples that I thought did what I want, but they don't. Sorry for being such a "noob" :P Do you know which example will return the HTML for a web page, or perform file I/O to "download" the .html file?

Thanks for all your help :D
Feb 18, 2011 at 12:18pm
The third example should do ("get HTTP with headers separate") after throwing out everything related to the header (including modifying the last curl_easy_setopt call as mentioned in the comment).
Feb 18, 2011 at 2:58pm
When I tried it on google I got:

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.

</BODY></HTML>


And not the full HTML of Google's home page (google.com).

Here is what I ran:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <curl/curl.h>
#include <curl/types.h>
#include <curl/easy.h>

static size_t write_data(void *ptr, size_t size, size_t nmemb, void *stream)
{
  size_t written = fwrite(ptr, size, nmemb, (FILE *)stream);
  return written;
}

int main(void)
{
  CURL *curl_handle;
  static const char *headerfilename = "head.txt";
  FILE *headerfile;
  static const char *bodyfilename = "body.txt";
  FILE *bodyfile;

  curl_global_init(CURL_GLOBAL_ALL);

  /* init the curl session */
  curl_handle = curl_easy_init();

  /* set URL to get */
  curl_easy_setopt(curl_handle, CURLOPT_URL, "http://google.com");

  /* no progress meter please */
  curl_easy_setopt(curl_handle, CURLOPT_NOPROGRESS, 1L);

  /* send all data to this function  */
  curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, write_data);

  /* open the files */
  headerfile = fopen(headerfilename,"w");
  if (headerfile == NULL) {
    curl_easy_cleanup(curl_handle);
    return -1;
  }
  bodyfile = fopen(bodyfilename,"w");
  if (bodyfile == NULL) {
    curl_easy_cleanup(curl_handle);
    return -1;
  }

  /* we want the headers to this file handle */
  curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, headerfile);

  /*
   * Notice here that if you want the actual data sent anywhere else but
   * stdout, you should consider using the CURLOPT_WRITEDATA option.  */

  /* get it! */
  curl_easy_perform(curl_handle);

  /* close the files */
  fclose(headerfile);
  fclose(bodyfile);

  /* cleanup curl stuff */
  curl_easy_cleanup(curl_handle);

  return 0;
}


=/
Feb 18, 2011 at 3:30pm
You can either specify the correct URL (www.google.com) or tell libcurl to follow redirects automatically:
curl_easy_setopt(curl_handle, CURLOPT_FOLLOWLOCATION, 1L);
Feb 18, 2011 at 4:57pm
It works :D thank you so much.
Topic archived. No new replies allowed.