Parsing HTML with C++

Dec 6, 2009 at 3:06am
Hey everyone,

I'm having a really hard time trying to parse HTML in C++.

Basically, all I want to do is read an HTML page, parse it, and write the contents of the page out to a tab-delimited file.

Right now I have the piece of code below, but I'm not sure how to read in the HTML page. Can someone point me in the right direction? Thanks!

int main(int argc, char* argv){

	CoInitialize(NULL);
	//ifstream mHTMLCode ("test.html");
	//OLECHAR szHTML[] = OLESTR("<HTML><BODY>Hello World!</BODY></HTML>");
	Navigate("http://digg.com/technology", NULL, NULL, NULL, NULL);
	BOOL SetDesignMode(BOOL bMode);
	MSHTML::IHTMLDocument2Ptr pDoc;
	
	HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, 
									NULL, CLSCTX_INPROC_SERVER,
									IID_IHTMLDocument2, (LPVOID *) &pDoc);
	
	HRESULT GetDocumentHTML(CString& szHTML, BOOL a_bClearDirtyFlag = FALSE);
	
	SAFEARRAY *psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
	VARIANT *param;
	bstr_t bsData = (LPCTSTR)szHTML;
	hr =  SafeArrayAccessData(psa, (LPVOID*)&param);
	param->vt = VT_BSTR;
	param->bstrVal = (BSTR)bsData;

	hr = pDoc->write(psa);
	hr = pDoc->close();
	SafeArrayDestroy(psa);

	CoUninitialize();

	return 1;
}
Dec 6, 2009 at 5:06am
This looks more like "Windows Programming" than "General C++" to me. Please target your posts to the appropriate forum.
Dec 6, 2009 at 11:26am
As soon as the HTML content is loaded into a std::string or char[], I think the problem is no longer related to Windows programming. Then it's a matter of syntax analysis.

Do I understand correctly that the task is to convert
<html><title>my Title
</title><body>Body text</body></html>


to
<html>
    <title>
        my Title
    </title>
    <body>
        Body text
    </body>
</html>

?
Dec 6, 2009 at 12:55pm
I'm doing a little project with HTML files at the moment, and the way I'm reading the files is to read line by line and append each line to a string object.

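while(getline(readFile, fileContentTemp)) { // test the read itself, not eof(), so a failed read is never appended
	fileContent.append(fileContentTemp);
	fileContent.append("\n"); // getline strips the newline, so put it back
}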
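For anyone who wants the line-by-line approach as something they can compile on its own, here is a minimal sketch. The function names readAllLines and readFileToString are made up for this example, and it assumes the page has already been saved to a local file. It uses only the standard library.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Read every line from `in` and append it to one string.
// std::getline strips the '\n', so it is added back after each line.
std::string readAllLines(std::istream& in) {
    std::string content;
    std::string line;
    while (std::getline(in, line)) {  // stops cleanly at end of input
        content += line;
        content += '\n';
    }
    return content;
}

// Hypothetical convenience wrapper: returns an empty string
// if the file cannot be opened.
std::string readFileToString(const std::string& path) {
    std::ifstream file(path);
    if (!file) {
        return std::string();
    }
    return readAllLines(file);
}
```

Calling readFileToString("test.html") would then give you the whole page in one std::string. Taking a std::istream in readAllLines means the same code also works on a std::istringstream, which is handy for trying it out without a file.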
Dec 18, 2009 at 3:17am
I've made changes to my code:

#include <stdio.h>
#include <windows.h>
#include <wininet.h>
#include <string>
#include <comdef.h>
#include <mshtml.h> 

#import <mshtml.tlb> no_auto_exclude 

#pragma comment(lib, "wininet.lib")

#include <iostream>
#include <fstream>

using namespace std;

int main(int argc, char* argv[]){
	CoInitialize(NULL);

	ofstream dbfile ("output.db");
	string sLI;
	string m_strURL;
	HINTERNET hOpen, hFile; 

	MSHTML::IHTMLDocument2Ptr pDoc;
	HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, 
				IID_IHTMLDocument2, (void**)&pDoc);

	SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
	VARIANT *param;
	
	hOpen = InternetOpen("UN/1.0", INTERNET_OPEN_TYPE_PRECONFIG, NULL, NULL, 0);

	hFile = InternetOpenUrl(hOpen, "http://online.wsj.com/public/page/news-global-world.html", NULL, 0, 0, 0);

	if(hFile){
		CHAR buffer[10*1024];
		DWORD dwRead;

		while(InternetReadFile(hFile, buffer, 1024, &dwRead)){
			if(dwRead == 0)
				break;

			buffer[dwRead] = 0;

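			bstr_t bsData = (LPCTSTR)buffer;
			hr = SafeArrayAccessData(psa, (LPVOID*)&param);
			VariantClear(param);             // release the previous chunk's string, if any
			param->vt = VT_BSTR;
			param->bstrVal = bsData.copy();  // give the array its own copy; SafeArrayDestroy frees it
			hr = SafeArrayUnaccessData(psa); // balance SafeArrayAccessData before using the array

			cout << buffer;   // no endl: a chunk boundary is not a line boundary
			dbfile << buffer;
			
			hr = pDoc->write(psa);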

		} //end while loop
		
		hr = pDoc->close();
		InternetCloseHandle(hFile);
		SafeArrayDestroy(psa);
	}
	
	InternetCloseHandle(hOpen);
	dbfile.close();
	
	CoUninitialize();
	return 0; // 0 signals success; a non-zero exit code reports failure
}


But I still can't figure out how to access the DOM elements and print the text content to a file. For example, I want to parse the HTML and print out the text between <li>some content</li>, <div>some more content</div>, <td>yep some more content</td>, or <h1>you guessed it...some more content</h1>.
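One way to get at the text between tags, once the page is in a std::string, is a plain string scan. The sketch below is not the MSHTML DOM approach from the code above. It is a naive, portable stand-in that assumes lowercase tag names, no nesting of the same tag inside itself, and no comments or scripts in the way. The names stripTags, extractText, and toTabDelimited are made up for this example.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Drop anything that looks like <...> and keep only the text around it.
std::string stripTags(const std::string& fragment) {
    std::string text;
    bool insideTag = false;
    for (char c : fragment) {
        if (c == '<') insideTag = true;
        else if (c == '>') insideTag = false;
        else if (!insideTag) text += c;
    }
    return text;
}

// Collect the text of every <tag ...>...</tag> pair found in html.
std::vector<std::string> extractText(const std::string& html,
                                     const std::string& tag) {
    std::vector<std::string> results;
    const std::string open = "<" + tag;
    const std::string close = "</" + tag + ">";
    std::size_t pos = 0;
    while ((pos = html.find(open, pos)) != std::string::npos) {
        const std::size_t afterName = pos + open.size();
        if (afterName >= html.size()) break;
        // Make sure "<li" is not just the start of "<link".
        const char next = html[afterName];
        if (next != '>' && next != ' ' && next != '\t' &&
            next != '\n' && next != '\r') {
            pos = afterName;
            continue;
        }
        std::size_t contentStart = html.find('>', afterName);
        if (contentStart == std::string::npos) break;
        ++contentStart;  // first character after the opening tag
        const std::size_t contentEnd = html.find(close, contentStart);
        if (contentEnd == std::string::npos) break;
        results.push_back(
            stripTags(html.substr(contentStart, contentEnd - contentStart)));
        pos = contentEnd + close.size();
    }
    return results;
}

// One output row per match: tag name, a tab, then the text.
std::string toTabDelimited(const std::string& html,
                           const std::vector<std::string>& tags) {
    std::ostringstream out;
    for (const std::string& tag : tags) {
        for (const std::string& text : extractText(html, tag)) {
            out << tag << '\t' << text << '\n';
        }
    }
    return out.str();
}
```

Calling toTabDelimited(page, {"li", "div", "td", "h1"}) and writing the result to an ofstream would give a tab-delimited file with one row per element. Real-world pages break the assumptions above quickly (uppercase tags, nested divs, unclosed <li>), so treat this as a starting point and switch to a real HTML parser if the input is not under your control.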