How to parse an html file?

Hi, I'm trying to write a program that extracts values from a specific HTML file, but I'm having trouble figuring out a (relatively) easy way to extract certain values from a simple table in it. I already ran the file through HTML Tidy to produce clean XHTML, so the file is standards compliant. The file has a table with multiple rows, each row containing cells that display class marks for one person. I'd like to extract the marks for a specific user, i.e. just John Doe's marks. How would I go about doing this?

I basically want to be able to selectively extract data from an individual row, not the entire table.

Here's an example of a single row from the table:

      <tr>
        <td><font size="2">John Doe</font></td>

        <td>
          <center>
            <b>75</b>
          </center>
        </td>

        <td>
          <center>
            <font size="2">0</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">0</font>
          </center>
        </td>

        <td>
          <center>
            <font color="#0000FF" size="2"><b>70</b></font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">70</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">85</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">57</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">58</font>
          </center>
        </td>

        <td>
          <center>
            <font color="#0000FF" size="2"><b>57</b></font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">102</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">95</font>
          </center>
        </td>
      </tr>
Since it is already standards compliant, you could use an XML parser to get what you need.
Alternatively, if the values in your HTML file are always surrounded by the same strings, like <font size="2"> before the value and </font> after it, you could use the search functions of the std::string class.

I did that once. There was an HTML file that contained the four characters "pid_" in the file paths of certain images on a web page, so I was able to find the images by searching for pid_.
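A minimal sketch of the std::string search idea, applied to the row posted above. The function names `stripTags` and `marksFor` are made up for illustration, and the code assumes the name appears only once in the file, that each mark sits in its own <td> cell, and that cells are not nested. It is not a real HTML parser and will break on markup that does not follow that pattern.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Drops everything between '<' and '>' and trims surrounding whitespace.
std::string stripTags(const std::string& s) {
    std::string text;
    bool inTag = false;
    for (char c : s) {
        if (c == '<') inTag = true;
        else if (c == '>') inTag = false;
        else if (!inTag) text += c;
    }
    const char* ws = " \t\r\n";
    std::size_t first = text.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    std::size_t last = text.find_last_not_of(ws);
    return text.substr(first, last - first + 1);
}

// Returns the text of every cell that follows `name` in the same table row.
// Returns an empty vector if the name is not found.
std::vector<std::string> marksFor(const std::string& html, const std::string& name) {
    std::vector<std::string> marks;
    std::size_t pos = html.find(name);
    if (pos == std::string::npos) return marks;

    // The row ends at the first </tr> after the name.
    std::size_t rowEnd = html.find("</tr>", pos);
    if (rowEnd == std::string::npos) rowEnd = html.size();

    // Skip the remainder of the cell that holds the name itself.
    pos = html.find("</td>", pos);
    while (pos != std::string::npos && pos < rowEnd) {
        std::size_t open = html.find("<td", pos);
        if (open == std::string::npos || open >= rowEnd) break;
        std::size_t close = html.find("</td>", open);
        if (close == std::string::npos || close > rowEnd) break;
        marks.push_back(stripTags(html.substr(open, close - open)));
        pos = close + 5;  // 5 == length of "</td>"
    }
    return marks;
}
```

To use it, read the whole file into a std::string (for example with std::ifstream and std::getline in a loop), then call `marksFor(contents, "John Doe")` and loop over the returned strings.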
OK, I found a library online called CMarkup, which at first glance seems really simple to use, but it's licensed for non-commercial users only (which I am). In the future, though, I may want to do something like this in a proprietary software project. Is there a (relatively simple) way to implement XML (not HTML or XHTML, just XML) parsing/writing in C++ without the use of a library, or is that a huge task?

An example of what I'd like to do:

Sample Pseudo-XML file:

<bill>
<customername>John Doe</customername>
<cost>99.99</cost>
<methodofpayment>Visa</methodofpayment>
</bill>


Say I wanted to extract the value in the "customername" element, how would I go about doing this without the use of a library?

From what I understand, it's best for software to use XML rather than plain text files when saving information to the hard drive. Am I correct in this assumption?
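For the "customername" question, here is a minimal sketch that uses nothing but std::string. The function name `elementText` is hypothetical. It assumes the element has no attributes, is not nested inside an element of the same name, and contains plain text with no entities, comments, or CDATA sections. Handling all of those correctly is what makes a general XML parser a much bigger task.

```cpp
#include <cstddef>
#include <string>

// Finds the first <tag>...</tag> pair in `xml` and copies the text between
// them into `out`. Returns false if either the opening or closing tag is
// missing. Only handles tags written exactly as <tag>, with no attributes.
bool elementText(const std::string& xml, const std::string& tag, std::string& out) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";

    std::size_t start = xml.find(open);
    if (start == std::string::npos) return false;
    start += open.size();  // move past the opening tag

    std::size_t end = xml.find(close, start);
    if (end == std::string::npos) return false;

    out = xml.substr(start, end - start);
    return true;
}
```

With the sample bill file loaded into a string, `elementText(contents, "customername", name)` would leave "John Doe" in `name`. Writing is the easier direction: the reverse is just concatenating "<" + tag + ">" + value + "</" + tag + ">", as long as the value contains no characters such as < or & that would need escaping.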