How to parse an html file?

Apr 29, 2010 at 4:26pm
Hi, I'm trying to write a program that extracts values from a specific HTML file, but I'm having trouble finding a (relatively) easy way to extract certain values from a simple table in it. I already ran the file through HTML Tidy to produce clean XHTML, so the file is standards compliant. The file has a table with multiple rows, each row containing cells displaying class marks for one person. I'd like to be able to extract the marks for a specific user, i.e. just John Doe's marks. How would I go about doing this?

I basically want to be able to selectively extract data from an individual row, not the entire table.

Here's an example of a single row from the table:

      <tr>
        <td><font size="2">John Doe</font></td>

        <td>
          <center>
            <b>75</b>
          </center>
        </td>

        <td>
          <center>
            <font size="2">0</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">0</font>
          </center>
        </td>

        <td>
          <center>
            <font color="#0000FF" size="2"><b>70</b></font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">70</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">85</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">57</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">58</font>
          </center>
        </td>

        <td>
          <center>
            <font color="#0000FF" size="2"><b>57</b></font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">102</font>
          </center>
        </td>

        <td>
          <center>
            <font size="2">95</font>
          </center>
        </td>
      </tr>
Apr 29, 2010 at 5:01pm
Since it is already standards compliant, you could use an XML parser to get what you need.
Apr 29, 2010 at 5:37pm
Maybe, if your HTML file only contains certain strings of characters around those cells, like <font size="2"> before the value and </font> after the value, you could use the search functions of the std::string class.

I did that once. There was an HTML file that contained the four characters "pid_" in the file paths of certain images on a web page, so I was able to find the images by searching for pid_.
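
A minimal sketch of that std::string approach, applied to the row posted above. It assumes the whole file has already been read into a string, that the name appears in only one row, and that no attribute value contains a '<' or '>' character. The helper names strip_tags and row_cells are made up for illustration, and nothing here decodes entities such as &amp;.

```cpp
#include <string>
#include <vector>

// Drop everything between '<' and '>' and trim surrounding whitespace.
std::string strip_tags(const std::string& s) {
    std::string out;
    bool in_tag = false;
    for (char c : s) {
        if (c == '<') in_tag = true;
        else if (c == '>') in_tag = false;
        else if (!in_tag) out += c;
    }
    const char* ws = " \t\r\n";
    std::string::size_type first = out.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    std::string::size_type last = out.find_last_not_of(ws);
    return out.substr(first, last - first + 1);
}

// Return the text of each <td> cell in the row that contains `name`.
// An empty vector means the name (or its enclosing row) was not found.
std::vector<std::string> row_cells(const std::string& html, const std::string& name) {
    const std::string::size_type npos = std::string::npos;
    std::vector<std::string> cells;

    std::string::size_type name_pos = html.find(name);
    if (name_pos == npos) return cells;

    // Search backwards for the row start and forwards for the row end.
    std::string::size_type row_begin = html.rfind("<tr", name_pos);
    std::string::size_type row_end = html.find("</tr>", name_pos);
    if (row_begin == npos || row_end == npos) return cells;
    const std::string row = html.substr(row_begin, row_end - row_begin);

    // Walk the row one <td>...</td> pair at a time.
    std::string::size_type pos = 0;
    while ((pos = row.find("<td", pos)) != npos) {
        std::string::size_type open_end = row.find('>', pos);
        if (open_end == npos) break;
        std::string::size_type close = row.find("</td>", open_end);
        if (close == npos) break;
        cells.push_back(strip_tags(row.substr(open_end + 1, close - open_end - 1)));
        pos = close + 5; // skip past "</td>"
    }
    return cells;
}
```

Calling row_cells(html, "John Doe") on the posted row would give the name in element 0 and the marks in the elements after it. This only holds up while the markup stays as regular as the Tidy output; for anything less predictable, a real parser is the safer route.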
Apr 29, 2010 at 8:06pm
OK, I found a library online called CMarkup, which at first glance looks really simple to use, but it's licensed for non-commercial users only. That covers me for now, but in the future I may want to do something like this in a proprietary software project. Is there a (relatively simple) way to implement XML (not HTML or XHTML, just XML) parsing/writing in C++ without the use of a library, or is that a huge task?

example of what I'd like to do:

Sample Pseudo-XML file:

<bill>
<customername>John Doe</customername>
<cost>99.99</cost>
<methodofpayment>Visa</methodofpayment>
</bill>


Say I wanted to extract the value of the "customername" element. How would I go about doing this without the use of a library?

From what I understand, it's best for software to use XML rather than plain text files for saving information to the hard drive. Am I correct in this assumption?
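
For a flat file like the sample above, the lookup itself can be sketched with nothing but std::string. This is not a real XML parser: it assumes the element has no attributes, is not nested inside another element of the same name, and contains no entities or CDATA sections. The helper name element_text is made up for illustration.

```cpp
#include <string>

// Copy the text between the first <tag> and the next </tag> into `out`.
// Returns false (leaving `out` untouched) if either tag is missing.
bool element_text(const std::string& xml, const std::string& tag, std::string& out) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";

    std::string::size_type begin = xml.find(open);
    if (begin == std::string::npos) return false;
    begin += open.size(); // move past the opening tag

    std::string::size_type end = xml.find(close, begin);
    if (end == std::string::npos) return false;

    out = xml.substr(begin, end - begin);
    return true;
}
```

With the bill file loaded into a string, element_text(xml, "customername", value) would leave "John Doe" in value. Attributes, nesting, comments and character escaping are where the real work of XML parsing lies, and none of that is handled here.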