WinSock - getting source code from website

Dec 9, 2009 at 10:10pm
So today was my first time using Winsock, and I'm trying to make a program to display the source code of a webpage, but it's not working. Here's my code:
#include <winsock2.h>
#include <windows.h>
#include <iostream>
#pragma comment(lib,"ws2_32.lib")

using namespace std;

int main (){
    WSADATA wsaData;

    if (WSAStartup(MAKEWORD(2,2), &wsaData) != 0) {
        cout << "WSAStartup failed.\n";
        system("pause");
        return 1;
    }

    SOCKET Socket=socket(AF_INET,SOCK_STREAM,IPPROTO_TCP);

    struct hostent *host;
    host = gethostbyname("www.google.com");

    SOCKADDR_IN SockAddr;
    SockAddr.sin_port=htons(8888);
    SockAddr.sin_family=AF_INET;
    SockAddr.sin_addr.s_addr = *((unsigned long*)host->h_addr);

    connect(Socket,(SOCKADDR*)(&SockAddr),sizeof(SockAddr));

    char buffer[1000];
    int nDataLength = recv(Socket,buffer,1000,0);
    cout << buffer;

    closesocket(Socket);
    WSACleanup();

    system("pause");
    return 0;
}


What's the problem?
Dec 9, 2009 at 11:24pm
Port 80, not 8888.
Dec 9, 2009 at 11:37pm
Oh.. Right. Thanks, lol.

Ok, now it connects. Do you know how to get the source code for the web page? I couldn't find any examples in C++...
Last edited on Dec 9, 2009 at 11:38pm
Dec 10, 2009 at 2:40am
It might be sending more than just source code at first. Namely, the header and whatnot. I haven't done work on webpages for a while, however, so I might be way off.
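
To illustrate the point about the header arriving first: an HTTP response starts with a status line and header fields, then a blank line, then the body. Below is a minimal sketch of separating the two once the raw response is in a string. The function name `split_http_response` is hypothetical and not part of Winsock.

```cpp
#include <string>
#include <utility>

// Splits a raw HTTP response at the first blank line (CRLF CRLF).
// Returns {head, body}. If the blank line has not arrived yet, head holds
// everything received so far and body is empty.
// NOTE: split_http_response is a hypothetical helper name for illustration.
std::pair<std::string, std::string> split_http_response(const std::string& raw) {
    const std::string separator = "\r\n\r\n";
    const std::string::size_type pos = raw.find(separator);
    if (pos == std::string::npos) {
        return {raw, std::string()};
    }
    return {raw.substr(0, pos), raw.substr(pos + separator.size())};
}
```

The `second` element of the result is the part that would be the page source, assuming the whole response has already been read.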
Dec 11, 2009 at 4:35pm
Here's a working solution!
#include <winsock2.h>
#include <windows.h>
#include <iostream>
#pragma comment(lib,"ws2_32.lib")
using namespace std;

int main (){
    WSADATA wsaData;

    if (WSAStartup(MAKEWORD(2,2), &wsaData) != 0) {
        cout << "WSAStartup failed.\n";
        system("pause");
        return 1;
    }

    SOCKET Socket=socket(AF_INET,SOCK_STREAM,IPPROTO_TCP);

    struct hostent *host;
    host = gethostbyname("www.google.com");//change this to the host!

    SOCKADDR_IN SockAddr;
    SockAddr.sin_port=htons(80);
    SockAddr.sin_family=AF_INET;
    SockAddr.sin_addr.s_addr = *((unsigned long*)host->h_addr);

    connect(Socket,(SOCKADDR*)(&SockAddr),sizeof(SockAddr));
    // The path between GET and HTTP is left empty. If you want a specific address
    // within the host, put it there (the site booby-traps index.htm(l), so I used nothing).
    send(Socket,"GET  HTTP/1.0\r\n\r\n", strlen( "GET  HTTP/1.0\r\n\r\n" ),0);
    char buffer[100000];

    int nDataLength = recv(Socket,buffer,100000,0);
    cout << buffer;

    closesocket(Socket);
    WSACleanup();

    system("pause");
    return 0;
}

It goes to the Google site. It seems to go into an endless loop of redirects!
Last edited on Dec 11, 2009 at 5:56pm
Dec 12, 2009 at 12:45am
Thanks for the reply. I'm getting some source code now at least, but it's not the Google source code. Here's the output:

HTTP/1.0 404 Not Found
Date: Sat, 12 Dec 2009 00:43:43 GMT
Content-Type: text/html; charset=UTF-8
Server: gws
Content-Length: 1357
X-XSS-Protection: 0



<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>404 Not Found</title>
<style><!--
body {font-family: arial,sans-serif}
div.nav {margin-top: 1ex}
div.nav A {font-size: 10pt; font-family: arial,sans-serif}
span.nav {font-size: 10pt; font-family: arial,sans-serif; font-weight: bold}
div.nav A,span.big {font-size: 12pt; color: #0000cc}
div.nav A {font-size: 10pt; color: black}
A.l:link {color: #6f6f6f}
A.u:link {color: green}
//--></style>
<script><!--
var rc=404;
//-->
</script>
</head>
<body text=#000000 bgcolor=#ffffff>
<table border=0 cellpadding=2 cellspacing=0 width=100%><tr><td rowspan=3 width=1
% nowrap>
<b><font face=times color=#0039b6 size=10>G</font><font face=times color=#c41200
 size=10>o</font><font face=times color=#f3c518 size=10>o</font><font face=times
 color=#0039b6 size=10>g</font><font face=times color=#30a72f size=10>l</font><f
ont face=times color=#c41200 size=10>e</font>&nbsp;&nbsp;</b>
<td>&nbsp;</td></tr>
<tr><td bgcolor="#3366cc"><font face=arial,sans-serif color="#ffffff"><b>Error</
b></td></tr>
<tr><td>&nbsp;</td></tr></table>
<blockquote>
<H1>Not Found</H1>
The requested URL <code>/1.1</code> was not found on this server.

<p>
</blockquote>
<table width=100% cellpadding=0 cellspacing=0><tr


So it's not connecting to the website, and it seems to be cut off, since it ends in <tr and the tag isn't closed (it's not a problem with the array size). Do you know what the problem is?
Dec 12, 2009 at 2:08am
Why do you want the source code?
Dec 12, 2009 at 3:25am
First of all, just because I am curious and I like to learn this stuff and get better at it. Secondly, I have a few programs that I would like to incorporate this into, where I need to connect to a website. In this case, not getting the source code is a problem because it means I'm not connecting to the right website.
Dec 12, 2009 at 1:13pm
The apparent cutoff is because of fragmentation. You need to keep reading.

These matters have been discussed in a number of threads. This one was my last attempt to discuss it.
http://www.cplusplus.com/forum/general/16659/
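
A minimal sketch of what "keep reading" could look like. A single `recv` call returns only whatever bytes have arrived so far, so the usual pattern is to call it repeatedly and append each piece until it returns 0 (the server closed the connection) or a negative value (an error). To keep the sketch independent of a live socket, the receive call is passed in as a callable; the name `read_until_closed` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Calls receive(buf, len) repeatedly and appends exactly the bytes each call
// reports, until it returns 0 (connection closed) or a negative value (error).
// NOTE: read_until_closed is a hypothetical helper name for illustration.
template <typename ReceiveFn>
std::string read_until_closed(ReceiveFn receive) {
    std::string response;
    char chunk[4096];
    for (;;) {
        const int received = receive(chunk, static_cast<int>(sizeof(chunk)));
        if (received <= 0) {
            break;  // 0: peer closed the connection; negative: receive failed
        }
        // Append only the bytes just received; the buffer is not null-terminated.
        response.append(chunk, static_cast<std::size_t>(received));
    }
    return response;
}
```

With the Winsock code from this thread, the call might look like `read_until_closed([&](char* buf, int len) { return recv(Socket, buf, len, 0); })`. Collecting into a `std::string` also avoids `cout << buffer` on a raw buffer, which reads past the received data because `recv` does not add a terminating null.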
Dec 13, 2009 at 12:31pm
You don't need Winsock
Just use Inet or COM (1 (URLDtoF) to 6 lines of code)
Last edited on Dec 13, 2009 at 12:36pm
Dec 13, 2009 at 11:55pm
Don't do that. It's non-portable and ties you into using Internet Explorer technology.
Dec 14, 2009 at 6:01pm
closed account (S6k9GNh0)
george135, please give more of a third-party view of your suggestions. Don't tell someone to do something that might do structural harm to their program. If I didn't know any better though, I'd think you were a Microsoft Windows representative advertising their crappy software for them.
Last edited on Dec 14, 2009 at 6:02pm
Dec 14, 2009 at 9:14pm
i don't want to take sides with anyone, but this is still the "Windows Programming" forum... with emphasis on Windows, i guess ^^
Dec 14, 2009 at 11:22pm
An HTTP request usually has headers associated with it. I suggest you use something like HttpFox to have a look at the headers a browser sends, then replicate those headers in your request.
Dec 15, 2009 at 4:36pm
He has the header, it's in his post.
Dec 19, 2009 at 4:30am
i'm really tired right now, but i think someone already mentioned about continuing to read data, since the part you posted was cut off, that will solve part of your problem...
now, for the other part. some google searches about http GET requests should solve the rest. looking at your code, you seem to omit the URI. the first line of a GET request works like this:
GET <URI> [HTTP version] <crlf>
since your GET request omits the URI, i'm assuming you just want the root directory... it's been a while since i did socket programming, but if i recall correctly, you still have to put "/" in as the URI if you just want the root directory of the main URL you're connecting to.
so, overall, the first line of your request would look something like:
GET / HTTP/1.1\r\n
hopefully that helps fix your problem!
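
The request format described above can be sketched as a small helper that builds the text to pass to `send`. This assumes a plain HTTP/1.1 GET; the function name `build_get_request` is hypothetical. HTTP/1.1 also requires a `Host` header, and `Connection: close` asks the server to close the socket after the response, which gives a read loop a clear end.

```cpp
#include <string>

// Builds a minimal HTTP/1.1 GET request for the given host and path.
// An empty path is treated as "/", since the request line needs a URI.
// NOTE: build_get_request is a hypothetical helper name for illustration.
std::string build_get_request(const std::string& host, const std::string& path = "/") {
    std::string request = "GET ";
    request += path.empty() ? "/" : path;
    request += " HTTP/1.1\r\n";
    request += "Host: " + host + "\r\n";
    request += "Connection: close\r\n";
    request += "\r\n";  // blank line ends the header section
    return request;
}
```

The result could then be sent with something like `send(Socket, request.c_str(), (int)request.size(), 0)`.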
Dec 20, 2009 at 5:09pm
Hi everyone and thanks for the replies. I actually took a break from this for a while, but now I'm back and I've been reading your replies. First of all, Mal Reynolds suggested doing
GET / HTTP/1.1\r\n
with the slash in between the GET and HTTP and now I get the correct header,
HTTP/1.1 200 OK
Date: Sun, 20 Dec 2009 17:07:12 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=7001ba1594cb1416:TM=1261328832:LM=1261328832:S=r79A_PQs0OqdK
t5M; expires=Tue, 20-Dec-2011 17:07:12 GMT; path=/; domain=.google.com
Set-Cookie: NID=30=d8IsEDuvj07cFctzKyUq5ry-O9_HfZGJ9tNl3sx_hoHvFjg8dh5K0b_Uf4UX6
ShIcTN9_mciC-01VFgjHaJ-pVhe7oM0zty2V0HNQKCE-cmqxz3KvfJBXVpVC_ez0-4L; expires=Mon
, 21-Jun-2010 17:07:12 GMT; path=/; domain=.google.com; HttpOnly
Server: gws
X-XSS-Protection: 0
Transfer-Encoding: chunked

But I don't get any body content.
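
One thing worth noting in that header is `Transfer-Encoding: chunked`: any body that does arrive will be framed as a series of chunks, each preceded by its size in hexadecimal on its own line, and ended by a chunk of size 0. Whether this is why no body showed up here is not certain. The sketch below decodes such a body once it has been fully read; the name `decode_chunked` is hypothetical, and chunk extensions and trailers are simply ignored.

```cpp
#include <cstddef>
#include <string>

// Decodes a fully received "Transfer-Encoding: chunked" body.
// Stops and returns what it has so far on truncated or malformed input.
// NOTE: decode_chunked is a hypothetical helper name for illustration.
std::string decode_chunked(const std::string& body) {
    std::string decoded;
    std::size_t pos = 0;
    for (;;) {
        const std::size_t line_end = body.find("\r\n", pos);
        if (line_end == std::string::npos) break;  // size line incomplete

        // Parse the hexadecimal chunk size; anything after it (for example
        // a ";extension") is ignored.
        std::size_t chunk_size = 0;
        std::size_t digits = 0;
        for (std::size_t i = pos; i < line_end; ++i) {
            const char c = body[i];
            std::size_t value = 0;
            if (c >= '0' && c <= '9') value = static_cast<std::size_t>(c - '0');
            else if (c >= 'a' && c <= 'f') value = static_cast<std::size_t>(c - 'a' + 10);
            else if (c >= 'A' && c <= 'F') value = static_cast<std::size_t>(c - 'A' + 10);
            else break;
            chunk_size = chunk_size * 16 + value;
            ++digits;
        }
        if (digits == 0) break;      // malformed size line
        if (chunk_size == 0) break;  // last chunk: body is complete

        const std::size_t data_start = line_end + 2;
        if (data_start + chunk_size > body.size()) break;  // truncated chunk
        decoded.append(body, data_start, chunk_size);
        pos = data_start + chunk_size + 2;  // skip the CRLF after the data
    }
    return decoded;
}
```

For example, the chunked body `5\r\nHello\r\n0\r\n\r\n` decodes to `Hello`.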

Here's the code I'm using again (edited after later posts, too):
#include <winsock2.h>
#include <windows.h>
#include <cstring>
#include <iostream>
#pragma comment(lib,"ws2_32.lib")

using namespace std;

int main (){
    WSADATA wsaData;

    if (WSAStartup(MAKEWORD(2,2), &wsaData) != 0) {
        cout << "WSAStartup failed.\n";
        system("pause");
        return 1;
    }

    SOCKET Socket=socket(AF_INET,SOCK_STREAM,IPPROTO_TCP);

    struct hostent *host;
    host = gethostbyname("www.cplusplus.com");
    if (host == NULL) {
        // Name lookup failed; dereferencing host below would crash.
        cout << "Could not resolve host.\n";
        WSACleanup();
        system("pause");
        return 1;
    }

    SOCKADDR_IN SockAddr;
    SockAddr.sin_port=htons(80);
    SockAddr.sin_family=AF_INET;
    SockAddr.sin_addr.s_addr = *((unsigned long*)host->h_addr);

    cout << "Connecting...\n";
    if(connect(Socket,(SOCKADDR*)(&SockAddr),sizeof(SockAddr)) != 0){
        cout << "Could not connect";
        system("pause");
        return 1;
    }
    cout << "Connected.\n";

    // HTTP/1.1 requires a Host header. "Connection: close" asks the server to
    // close the socket after the response, which ends the recv loop below.
    const char *request = "GET / HTTP/1.1\r\nHost: www.cplusplus.com\r\nConnection: close\r\n\r\n";
    send(Socket, request, (int)strlen(request), 0);
    char buffer[10000];

    // recv returns the number of bytes read, 0 once the server closes the
    // connection, or SOCKET_ERROR on failure.
    int nDataLength;
    while ((nDataLength = recv(Socket,buffer,10000,0)) > 0){
        int i = 0;
        // Look only at the bytes just received; stop at the first control
        // character other than CR or LF.
        while (i < nDataLength && (buffer[i] >= 32 || buffer[i] == '\n' || buffer[i] == '\r')) {
            cout << buffer[i];
            i += 1;
        }
    }

    closesocket(Socket);
    WSACleanup();

    system("pause");
    return 0;
}

Last edited on Dec 26, 2009 at 12:36am
Dec 20, 2009 at 5:14pm
You're still not reading in a loop.
Dec 20, 2009 at 9:52pm
I'm confused on how to do this. What condition should terminate the loop?
while (something)
      recv(Socket,buffer,1000,0);
Dec 20, 2009 at 10:05pm
This is what I did and it seems to work. Is this what you were talking about?
while (nDataLength != 0){
    nDataLength = recv(Socket,buffer,10000,0);
    cout << buffer;
}