C++ HTML Reader and Parser

I'm having some issues with an HTML reading and parsing program.
I'm trying to read a sales listing HTML page and extract the item name, date of listing and price.
So far, I am able to read the item name, but reading the other two fields is proving to be an issue.
Also, reading with a second getline call (getline(in,ss) in addition to getline(in,s)) produces different output.
while(getline(in, s))
{
  namePH1 = s.find("/marketplace/view/");
  if (namePH1!=-1)                 
  {
    namePH2 = s.find("</a>");
    namePH3 = namePH1+26;
    string nameTmp = s.substr(namePH3,namePH2-namePH3);  // searching for the Name positions 
    list[current].name = nameTmp;
    current++;
  }
}

The above code works fine for getting the name.
However, when I edit the code to search for the Date of listing, I am unable to get a good result.
while(getline(in, s))
{
  namePH1 = s.find("/marketplace/view/");
  if (namePH1!=-1)                 
  {
    namePH2 = s.find("</a>");
    namePH3 = namePH1+26;
    string nameTmp = s.substr(namePH3,namePH2-namePH3);  // searching for the Name positions 
    list[current].name = nameTmp;

    datePH1 = s.find("date\">");
    datePH2 = s.find("in <")-14;
    datePH3 = datePH1+6;
    string dateTmp = s.substr(datePH3,datePH2-datePH3);
    list[current].date = dateTmp;

    current++;
  }
}

The printout becomes rubbish, like so:

Samsung Galaxy Note N7000-		<a href="/marketplace/view/165162">Samsung Galaxy Note N7000</a>						</h3>

This is the actual HTML snippet, if it's any help:
<a href="/marketplace/view/165162">Samsung Galaxy Note N7000</a>						</h3>
						<p class="byline">
							<span class="item-list-date">03 May 2012														in <a href="/marketplace/list/mobile">Mobile Phones &amp; Accessories</a>																					by <a href="/marketplace/seller/542707">fajrltd</a>														</span>

It's messy, but that's the raw html that I'm trying to extract the data from.
As you can see, I am able to grab the Name successfully, but my attempt to read out the date of listing is just grabbing everything else.

Also, if I edit my code so that I call a new getline(), e.g.
getline(in,ss);
datePH1 = ss.find("date\">");
datePH2 = ss.find("in <")-14;
datePH3 = datePH1+6;
string dateTmp = ss.substr(datePH3,datePH2-datePH3);
list[current].date = dateTmp;

The output changes to

Samsung Galaxy Note N7000-	<p class="byline">-


I know it's really long, but is anyone out there able to give me some tips or point me in the right direction?
Help anyone?