How to parse an html file?


Apr 29, 2010 at 4:26pm

Hi, I'm trying to write a program that extracts values from a specific HTML file, but I'm having trouble finding a (relatively) easy way to pull certain values out of a simple table in it. I already ran the file through HTML Tidy to produce clean XHTML, so the file is standards compliant. The table has multiple rows, and each row contains cells showing one person's class marks. I'd like to extract the marks for a specific person, i.e. just John Doe's marks. How would I go about doing this?

I basically want to be able to selectively extract data from an individual row, not the entire table.

Here's an example of a single row from the table:

      <tr>
        <td><font size="2">John Doe</font></td>

        <td>
          <center>
            <b>75</b>
          </center>
        </td>

        <td>
          <center>
            <font size="2">0</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">0</font>
          </center>
        </td>

        <td>
          <center>
            <font color="#0000FF" size="2"><b>70</b></font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">70</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">85</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">57</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">58</font>
          </center>
        </td>

        <td>
          <center>
            <font color="#0000FF" size="2"><b>57</b></font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">102</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">95</font>
          </center>
        </td>
      </tr>

Apr 29, 2010 at 5:01pm

vulee (21)

Since it is already standards compliant, you could use an XML parser to get what you need.

Apr 29, 2010 at 5:37pm

magnificence7 (188)

If the values in your HTML file are always surrounded by the same strings of characters, like <font size="2"> before the value and </font> after the value, you might use the search functions of the std::string class.

I did that once. There was an HTML file that contained the four characters "pid_" in the file paths of certain images on a web page, so I was able to find the images by searching for pid_.
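A minimal sketch of that std::string search idea, applied to the table row from the first post. Because the sample row shows that not every cell uses <font size="2"> (some use only <b>, others add a color attribute), this version looks for the <td> ... </td> pairs instead and strips whatever tags are inside. The function names stripTags and rowCellsFor are made up for illustration, and the code assumes the name appears only inside the row you want and that tables are not nested.

```cpp
#include <string>
#include <vector>

// Removes everything between '<' and '>' and trims surrounding whitespace.
// Hypothetical helper; it does not decode entities such as &amp;.
std::string stripTags(const std::string& s) {
    std::string out;
    bool inTag = false;
    for (char c : s) {
        if (c == '<') inTag = true;
        else if (c == '>') inTag = false;
        else if (!inTag) out += c;
    }
    const char* ws = " \t\r\n";
    std::size_t first = out.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    std::size_t last = out.find_last_not_of(ws);
    return out.substr(first, last - first + 1);
}

// Returns the text of every <td> cell in the first <tr> that contains `name`.
// Returns an empty vector if the name or its enclosing row is not found.
std::vector<std::string> rowCellsFor(const std::string& html,
                                     const std::string& name) {
    std::vector<std::string> cells;
    std::size_t namePos = html.find(name);
    if (namePos == std::string::npos) return cells;

    // The row is bounded by the last "<tr" before the name
    // and the first "</tr>" after it.
    std::size_t rowStart = html.rfind("<tr", namePos);
    std::size_t rowEnd = html.find("</tr>", namePos);
    if (rowStart == std::string::npos || rowEnd == std::string::npos) return cells;

    std::size_t pos = rowStart;
    while (true) {
        std::size_t open = html.find("<td", pos);
        if (open == std::string::npos || open >= rowEnd) break;
        std::size_t contentStart = html.find('>', open);   // end of the <td ...> tag
        std::size_t close = html.find("</td>", open);
        if (contentStart == std::string::npos || close == std::string::npos ||
            contentStart > close || close > rowEnd) break;
        cells.push_back(
            stripTags(html.substr(contentStart + 1, close - contentStart - 1)));
        pos = close + 5;  // continue after "</td>"
    }
    return cells;
}
```

With the file contents read into a std::string, rowCellsFor(html, "John Doe") should return the name in element 0 and the marks as strings in the remaining elements; std::stoi can convert them to numbers if needed. This is plain text searching, not real parsing, so it will break if the page layout changes.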

Apr 29, 2010 at 8:06pm

GetOutOfBox (7)

OK, I found a library online called CMarkup, which at first glance seems really simple to use, but it's licensed for non-commercial users only (which I am). In the future, though, I may want to implement something like this in a proprietary software project. Is there a (relatively simple) way to implement XML (not HTML or XHTML, just XML) parsing/writing in C++ without the use of a library, or is that a huge task?

Example of what I'd like to do:

Sample Pseudo-XML file:

<bill>
<customername>John Doe</customername>
<cost>99.99</cost>
<methodofpayment>Visa</methodofpayment>
</bill>

Say I wanted to extract the value of the "customername" element. How would I go about doing this without the use of a library?

From what I understand, it's best for software to use XML rather than plain text files when saving information to the hard drive. Am I correct in this assumption?
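For the "customername" question above, a minimal library-free sketch could look like the following. It only handles the simple case in the sample file: an element with no attributes, no nested element of the same name, and no entities, comments or CDATA sections. Handling all of XML correctly is a much larger job. The function name elementText is made up for illustration.

```cpp
#include <string>

// Copies the text between <tag> and </tag> into `out`.
// Returns false if either the opening or the closing tag is missing.
// Assumes the opening tag has no attributes and is not self-closing.
bool elementText(const std::string& xml, const std::string& tag,
                 std::string& out) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";

    std::size_t start = xml.find(open);
    if (start == std::string::npos) return false;
    start += open.size();  // first character after the opening tag

    std::size_t end = xml.find(close, start);
    if (end == std::string::npos) return false;

    out = xml.substr(start, end - start);
    return true;
}
```

Given the sample bill in a std::string called xml, calling elementText(xml, "customername", name) should fill name with John Doe and return true, while asking for an element that is not in the file returns false.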
