I'm having some issues with an HTML reading and parsing program.
I'm trying to read a sales listing HTML page and pull out the item name, date of listing and price.
So far, I am able to read the item name, but reading the other two pieces of information is proving to be an issue.
Also, using a second getline(in,ss) instead of reusing the line from getline(in,s) produces different output.
|
while (getline(in, s))
{
    namePH1 = s.find("/marketplace/view/");
    if (namePH1 != string::npos)
    {
        namePH2 = s.find("</a>");
        namePH3 = namePH1 + 26;
        string nameTmp = s.substr(namePH3, namePH2 - namePH3); // searching for the Name positions
        list[current].name = nameTmp;
        current++;
    }
}
|
The above code works fine for getting the name.
However, when I edit the code to search for the Date of listing, I am unable to get a good result.
|
while (getline(in, s))
{
    namePH1 = s.find("/marketplace/view/");
    if (namePH1 != string::npos)
    {
        namePH2 = s.find("</a>");
        namePH3 = namePH1 + 26;
        string nameTmp = s.substr(namePH3, namePH2 - namePH3); // searching for the Name positions
        list[current].name = nameTmp;
        datePH1 = s.find("date\">");
        datePH2 = s.find("in <") - 14;
        datePH3 = datePH1 + 6;
        string dateTmp = s.substr(datePH3, datePH2 - datePH3);
        list[current].date = dateTmp;
        current++;
    }
}
|
The printout becomes rubbish, like so:
Samsung Galaxy Note N7000- <a href="/marketplace/view/165162">Samsung Galaxy Note N7000</a> </h3>
|
This is the actual html snippet if it's any help,
|
<a href="/marketplace/view/165162">Samsung Galaxy Note N7000</a> </h3>
<p class="byline">
<span class="item-list-date">03 May 2012 in <a href="/marketplace/list/mobile">Mobile Phones & Accessories</a> by <a href="/marketplace/seller/542707">fajrltd</a> </span>
|
It's messy, but that's the raw HTML that I'm trying to extract the data from.
As you can see, I am able to grab the name successfully, but my attempt to read the date of listing just grabs the rest of the name line instead.
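Judging by the snippet above, the date sits two lines below the name, so on the name line `s.find("date\">")` would return `std::string::npos`, and the arithmetic on that value gives `substr` meaningless positions. A minimal sketch of guarding against this, assuming the file is laid out like the snippet; `extractBetween` is a hypothetical helper, not part of the original program:

```cpp
#include <string>

// Hypothetical helper: copies the text that lies between startMarker and
// endMarker into out. Returns false when either marker is missing from the
// line, instead of doing arithmetic on std::string::npos.
bool extractBetween(const std::string& line,
                    const std::string& startMarker,
                    const std::string& endMarker,
                    std::string& out)
{
    std::string::size_type start = line.find(startMarker);
    if (start == std::string::npos)
        return false;                 // start marker is not on this line
    start += startMarker.size();      // skip past the marker itself

    // Search for the end marker only after the start marker.
    std::string::size_type end = line.find(endMarker, start);
    if (end == std::string::npos)
        return false;                 // end marker is not on this line

    out = line.substr(start, end - start);
    return true;
}
```

Called with the markers `"date\">"` and `" in <"`, this should report failure on the name line and only fill `out` on the line that actually contains the date span.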
Also, if I edit my code so that it calls a new getline() for the date, e.g.
|
getline(in,ss);
datePH1 = ss.find("date\">");
datePH2 = ss.find("in <")-14;
datePH3 = datePH1+6;
string dateTmp = ss.substr(datePH3,datePH2-datePH3);
list[current].date = dateTmp;
|
The output changes to
Samsung Galaxy Note N7000- <p class="byline">-
|
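This second output is consistent with the extra getline() reading the line directly after the name, which in the snippet is `<p class="byline">`, not the date line. One way around counting lines is to test every line for each marker separately. The sketch below assumes each listing's name line comes before its date line, as in the snippet; `Listing` and `parseListings` are hypothetical names, not taken from the original program:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type; the post only shows the .name and .date members.
struct Listing {
    std::string name;
    std::string date;
};

// Reads the stream line by line. A line containing the view link starts a
// new listing; a later line containing the date span fills in its date.
std::vector<Listing> parseListings(std::istream& in)
{
    const std::string::size_type npos = std::string::npos;
    std::vector<Listing> listings;
    std::string line;

    while (std::getline(in, line)) {
        std::string::size_type link = line.find("/marketplace/view/");
        std::string::size_type date = line.find("item-list-date\">");

        if (link != npos) {
            // The name is the text between the end of the opening tag and </a>.
            std::string::size_type begin = line.find("\">", link);
            std::string::size_type end = line.find("</a>", link);
            if (begin != npos && end != npos && begin + 2 <= end) {
                Listing item;
                item.name = line.substr(begin + 2, end - (begin + 2));
                listings.push_back(item);
            }
        } else if (date != npos && !listings.empty()) {
            // The date is the text between the span's opening tag and " in <".
            std::string::size_type begin = date + std::string("item-list-date\">").size();
            std::string::size_type end = line.find(" in <", begin);
            if (end != npos)
                listings.back().date = line.substr(begin, end - begin);
        }
    }
    return listings;
}
```

Because `parseListings` takes a `std::istream&`, it can be tried on a `std::istringstream` holding the three-line snippet before pointing it at the real file stream.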
I know this is really long, but is anyone out there able to give me some tips or point me in the right direction?