Retrieving values from a web page

Hi, I want to retrieve stock index values from a web page, but when I run the code it fails and displays nothing. Can someone tell me why?

#include <string.h>
#include <winsock2.h>
#include <windows.h>
#include <iostream>
#include <vector>
#include <locale>
#include <sstream>
using namespace std;
#pragma comment(lib,"ws2_32.lib")


string website_HTML;
locale local;
void get_Website(string url );
char buffer[10000];
int i = 0 ;


 //****************************************************

int main( void ){

    get_Website("https://www.boursorama.com/bourse/actions/cotations/" );

    cout<<website_HTML;

    //cout<<"\n\nPress ANY key to close.\n\n";
    //cin.ignore(); cin.get(); 


 return 0;
}

 //****************************************************

void get_Website(string url ){
    WSADATA wsaData;
    SOCKET Socket;
    SOCKADDR_IN SockAddr;
    int lineCount=0;
    int rowCount=0;
    struct hostent *host;
    string get_http;


    get_http = "GET / HTTP/1.1\r\nHost: " + url + "\r\nConnection: close\r\n\r\n";

    if (WSAStartup(MAKEWORD(2,2), &wsaData) != 0){
        cout << "WSAStartup failed.\n";
        system("pause");
        //return 1;
    }

    Socket=socket(AF_INET,SOCK_STREAM,IPPROTO_TCP);
    host = gethostbyname(url.c_str());

    SockAddr.sin_port=htons(80);
    SockAddr.sin_family=AF_INET;
    SockAddr.sin_addr.s_addr = *((unsigned long*)host->h_addr);

    if(connect(Socket,(SOCKADDR*)(&SockAddr),sizeof(SockAddr)) != 0){
        cout << "Could not connect";
        system("pause");
        //return 1;
    }
    send(Socket,get_http.c_str(), strlen(get_http.c_str()),0 );

    int nDataLength;
    while ((nDataLength = recv(Socket,buffer,10000,0)) > 0){        
        int i = 0;
        while (buffer[i] >= 32 || buffer[i] == '\n' || buffer[i] == '\r'){

            website_HTML+=buffer[i];
            i += 1;
        }               
    }

    closesocket(Socket);
    WSACleanup();

}
> while (buffer[i] >= 32 || buffer[i] == '\n' || buffer[i] == '\r')
You need to make use of nDataLength as well.
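For instance (a sketch reusing the same names): append exactly the number of bytes recv() reported, rather than scanning the buffer for printable characters. recv() doesn't null-terminate, and the tail of the buffer is stale data from earlier reads:

int nDataLength;
while ((nDataLength = recv(Socket, buffer, sizeof(buffer), 0)) > 0) {
    website_HTML.append(buffer, nDataLength);   // copy only the bytes received
}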

> send(Socket,get_http.c_str(), strlen(get_http.c_str()),0 );
Are you sending the right thing?
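In the code above it isn't: the GET line always asks for /, and both gethostbyname() and the Host header are given the full URL. The request line should carry the path, while gethostbyname() and the Host header want the bare host name. A sketch (names illustrative):

string host_name = "www.boursorama.com";
string path = "/bourse/actions/cotations/";
string get_http = "GET " + path + " HTTP/1.1\r\n"
                  "Host: " + host_name + "\r\n"
                  "Connection: close\r\n\r\n";
// then gethostbyname(host_name.c_str()), not gethostbyname(url.c_str())

Note the site is served over https, so a plain request on port 80 will at best get you a redirect.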

https://www.wireshark.org/
The absolutely indispensable tool for network programming.
Or press F12 if you're a Firefox / Chrome user.
The point of these things is to find out what your browser actually sends over the wire when you type in that address.

When you know that, you can have a go at making your code do the same.

Are you sending the whole thing? Check that your return result is the same as your string length.
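Something like this (a sketch, reusing the names above); send() may transmit fewer bytes than asked, so compare and loop:

const char* data = get_http.c_str();
int total = (int)get_http.size();
int sent = 0;
while (sent < total) {
    int n = send(Socket, data + sent, total - sent, 0);
    if (n == SOCKET_ERROR) { cout << "send failed\n"; break; }
    sent += n;   // keep going until the whole request has gone out
}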


I can retrieve the source of the home page, but not the quotation pages. The browser's network console shows the request as:

{"GET":{"scheme": "https", "host": "www.boursorama.com", "filename":"/trade/actions/quotes/page-2", "remote":{"Address": "0.0.0.0:443"}}}

I don't know how to adapt my request in the program.

#include <winsock2.h>
#include <windows.h>
#include <iostream>
#include <cstring>	// for strlen
#pragma comment(lib,"ws2_32.lib")

using namespace std;

int main (){
	WSADATA wsaData;

	if (WSAStartup(MAKEWORD(2,2), &wsaData) != 0) {
		cout << "WSAStartup failed.\n";
		system("pause");
		return 1;
	}

	SOCKET Socket=socket(AF_INET,SOCK_STREAM,IPPROTO_TCP);
	//www.boursorama.com

	struct hostent *host;
	host = gethostbyname("www.boursorama.com");

	SOCKADDR_IN SockAddr;
	SockAddr.sin_port=htons(80);
	SockAddr.sin_family=AF_INET;
	SockAddr.sin_addr.s_addr = *((unsigned long*)host->h_addr);

	cout << "Connecting...\n";
	if(connect(Socket,(SOCKADDR*)(&SockAddr),sizeof(SockAddr)) != 0){
		cout << "Could not connect";
		system("pause");
		return 1;
	}
	cout << "Connected.\n";
/*
GET <resource> HTTP/1.1\r\n
Host: <web_server_name>\r\n
Connection: close\r\n
\r\n
*/
	send(Socket,"GET /bourse/actions/cotations/page-2 HTTP/1.1\r\n\nHost: www.boursorama.com\r\nConnection: close\r\n\r\n", strlen("GET /bourse/actions/cotations/page-2 HTTP/1.1\r\n\nHost: www.boursorama.com\r\nConnection: close\r\n\r\n"),0);
	char buffer[10000];

	int nDataLength;
	while ((nDataLength = recv(Socket, buffer, sizeof(buffer) - 1, 0)) > 0) {
		buffer[nDataLength] = '\0';	// recv() does not null-terminate
		cout << buffer;
	}

	closesocket(Socket);
	WSACleanup();

	system("pause");
	return 0;
}

It shows me "HTTP/1.1 400 Bad Request" and "503 Service Unavailable".
This isn't the 1990s, where you could simply fetch a static page and easily extract the information you want.

Also, scraping a commercial site like https://www.boursorama.com/, where you have to register and log in to get all the goodies is almost certainly against their terms of service. If they have behaviour analysis running, they might notice your requests don't match what a normal human browser would do (like requesting multiple pages with zero delay).

Like I said, use the debug console of your browser to figure out what all the transactions really look like under the hood. It ISN'T a simple GET request.

Low-level web programming is a PITA.
If you absolutely must use C++, at least use a decent library like https://curl.haxx.se/libcurl/ (there are C++ wrappers for it, if that's your bag).
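For what it's worth, a minimal libcurl sketch (a sketch only, error handling trimmed; the library deals with https and redirects for you):

#include <curl/curl.h>
#include <iostream>
#include <string>

// Append each chunk of the response body to a std::string.
static size_t write_cb(char* ptr, size_t size, size_t nmemb, void* userdata) {
    static_cast<std::string*>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    std::string body;
    CURL* curl = curl_easy_init();
    if (!curl) return 1;
    curl_easy_setopt(curl, CURLOPT_URL, "https://www.boursorama.com/bourse/actions/cotations/");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);  // follow redirects
    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    if (res != CURLE_OK) { std::cerr << curl_easy_strerror(res) << "\n"; return 1; }
    std::cout << body;
    return 0;
}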

Even then, it's still a PITA.
When I need to scrape things, I use Python with the "Beautiful Soup" package.

But if your site actively uses browser-side JavaScript to fetch and decode data, then you might need to use https://www.selenium.dev/ to let the browser do all the heavy lifting before you can extract the results.

I can do it in Python, but I need to write the "Label" column and the "Last" column to a file, like this:

ARTMARKET.COM,7.380

What should I do, please?

from bs4 import BeautifulSoup
import requests

PAGE_NAME = ['page-1','page-2','page-3','page-4','page-5','page-6','page-7','page-8','page-9','page-10','page-11']
for name in PAGE_NAME:
    url = 'https://www.boursorama.com/bourse/actions/cotations/{}'.format(name)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for row in soup.find_all('tr'):
        for col in row.find_all('td')[0:2]:   # first two columns: Label and Last
            print(col.text)

from bs4 import BeautifulSoup
import requests

PAGE_NAME = ['page-1','page-2','page-3','page-4','page-5','page-6','page-7','page-8','page-9','page-10','page-11']
with open("output.txt", "a") as f:
    for name in PAGE_NAME:
        url = 'https://www.boursorama.com/bourse/actions/cotations/{}'.format(name)
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        for row in soup.find_all('tr'):
            for col in row.find_all('td')[0:2]:   # Label and Last columns
                print(col.text, file=f)
It's okay, I've got what I need. Thanks for giving me the idea to use Python; I hadn't thought of it, and it's true that it's faster. I parse the data in C++ afterwards. Thanks again.
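For the record, that C++ parsing step can be as simple as this (a sketch, assuming one "Label,Last" pair per line of output.txt, in the ARTMARKET.COM,7.380 format shown earlier):

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::ifstream in("output.txt");
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string label, last;
        // split each line on the comma into the two fields
        if (std::getline(ss, label, ',') && std::getline(ss, last, ',')) {
            std::cout << "label=" << label << "  last=" << last << "\n";
        }
    }
    return 0;
}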

from bs4 import BeautifulSoup
import requests
import sys
sys.stdout = open('output.txt', 'a')

data = []
PAGE_NAME = ['page-1','page-2','page-3','page-4','page-5','page-6','page-7','page-8','page-9','page-10','page-11']
for name in PAGE_NAME:
    url = 'https://www.boursorama.com/bourse/actions/cotations/{}'.format(name)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for row in soup.find_all('tr'):
        cols = row.find_all('td')[0:2]            # Label and Last columns
        cols = [ele.text.strip() for ele in cols]
        cols = [ele for ele in cols if ele]       # get rid of empty values
        if cols:
            data.append(cols)
            print(','.join(cols))                 # one "Label,Last" line per row