[C/C++] How To Get The HTML Of a Web Page

Hello, is it possible to get the HTML of a web page?

Let's say we had this web page:

<html>
<body>
<b>Bold Text</b>
</body>
</html>


Is there a function that I can call that will return a string of "<html>
<body>
<b>Bold Text</b>
</body>
</html>"

?

Or anything like it.

This is very important so any help would be greatly appreciated!
Not in the standard library; you'd have to use an external library for that.

I've heard of libcurl and some others, I'd just try googling, probably the fastest way.
Hi,
I hope libhtml will help you in this. ref. : http://libhtml.sourceforge.net/
libhtml is for (or rather, would have been for...) parsing HTML.
To retrieve HTML documents (or anything else), libcurl is the right solution.
libcurl seems like it is what I'm looking for, but I don't want to go learn a new library for this. Can someone tell me the function(s) that are used in getting the HTML from a web page? Or a link to an example please. Thank you :D
Thanks, but I can't compile this:

#include <stdio.h>
#include <curl/curl.h>

int main()
{
  CURL *curl;
  CURLcode res;

  curl = curl_easy_init();
  if (curl)
  {
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com");
    res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  return 0;
}


Errors:

obj\Release\main.o||In function `main':|
C:\Users\-LeetGamer-\Desktop\Programming\Working\cURL\main.cpp|7|undefined reference to `__imp__curl_easy_init'|
C:\Users\-LeetGamer-\Desktop\Programming\Working\cURL\main.cpp|10|undefined reference to `__imp__curl_easy_setopt'|
C:\Users\-LeetGamer-\Desktop\Programming\Working\cURL\main.cpp|11|undefined reference to `__imp__curl_easy_perform'|
C:\Users\-LeetGamer-\Desktop\Programming\Working\cURL\main.cpp|12|undefined reference to `__imp__curl_easy_cleanup'|
||=== Build finished: 4 errors, 0 warnings ===| 


:S
Apparently you forgot to link the library.
I can't find a .a or .lib file in the curl download :S

Sorry but I haven't needed to link a library in a long time (almost a year) so I kinda forget how xD
There are over a hundred different packages for download, so which one is "the" download?
In any case, you need the one under "Win32 - Generic": Win32 2000/XP, 7.21.4, libcurl.
Or a direct link: http://www.gknw.net/mirror/curl/win32/curl-7.21.4-devel-mingw32.zip
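Once the devel package is unzipped, the missing step is passing the import library to the linker. A sketch of a MinGW command line (the C:\curl paths are placeholders for wherever you extracted it):

```
gcc main.c -I C:\curl\include -L C:\curl\lib -lcurl -o main.exe
```

In Code::Blocks the equivalent is adding libcurl under Project -> Build options -> Linker settings. The undefined `__imp__curl_easy_*` references go away once `-lcurl` (or the `.a` file itself) is on the link line.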
Okay I got it working, and I tried the examples that I thought did what I want, but they don't. Sorry for being such a "noob" :P Do you know which example will return the HTML for a web page, or perform file I/O to "download" the .html file?

Thanks for all your help :D
The third example should do ("get HTTP with headers separate") after throwing out everything related to the header (including modifying the last curl_easy_setopt call as mentioned in the comment).
When I tried it on google I got:

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.

</BODY></HTML>


And not the full HTML of google's home page (google.com)

Here is what I ran:

#include <stdio.h>
#include <stdlib.h>

#include <curl/curl.h>

static size_t write_data(void *ptr, size_t size, size_t nmemb, void *stream)
{
  size_t written = fwrite(ptr, size, nmemb, (FILE *)stream);
  return written;
}

int main(void)
{
  CURL *curl_handle;
  static const char *headerfilename = "head.txt";
  FILE *headerfile;
  static const char *bodyfilename = "body.txt";
  FILE *bodyfile;

  curl_global_init(CURL_GLOBAL_ALL);

  /* init the curl session */
  curl_handle = curl_easy_init();

  /* set URL to get */
  curl_easy_setopt(curl_handle, CURLOPT_URL, "http://google.com");

  /* no progress meter please */
  curl_easy_setopt(curl_handle, CURLOPT_NOPROGRESS, 1L);

  /* send all data to this function */
  curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, write_data);

  /* open the files */
  headerfile = fopen(headerfilename, "w");
  if (headerfile == NULL) {
    curl_easy_cleanup(curl_handle);
    return -1;
  }
  bodyfile = fopen(bodyfilename, "w");
  if (bodyfile == NULL) {
    fclose(headerfile);
    curl_easy_cleanup(curl_handle);
    return -1;
  }

  /* we want the headers to this file handle */
  curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, headerfile);

  /*
   * Notice here that if you want the actual data sent anywhere else but
   * stdout, you should consider using the CURLOPT_WRITEDATA option.  */

  /* get it! */
  curl_easy_perform(curl_handle);

  /* close the files */
  fclose(headerfile);
  fclose(bodyfile);

  /* cleanup curl stuff */
  curl_easy_cleanup(curl_handle);

  return 0;
}


=/
You can either specify the correct URL (www.google.com) or tell libcurl to follow redirects automatically:
curl_easy_setopt(curl_handle, CURLOPT_FOLLOWLOCATION, 1L);
It works :D thank you so much.