Parsing Strategy for HTML Files


Feb 22, 2018 at 6:21pm

Hi all!
Sorry for my poor English.
I have an HTML file like this:

<tr class="from" id="n1" >

 			<td>
		           String Zero
			</td>
 			<td>
  			 String One
			</td>
			<td>
                         String Two 
 			</td>
			<td>
			 String Three
			</td>
			<td>
		         String Four
			</td>

</tr>

<tr class="from" id="n2" >

 			<td>
		           String Zero
			</td>
 			<td>
  			 String One
			</td>
			<td>
                         String Two 
 			</td>
			<td>
			 String Three
			</td>
			<td>
		         String Four
			</td>

</tr>
			
<tr class="from" id="n3" >

 			<td>
		           String Zero
			</td>
 			<td>
  			 String One
			</td>
			<td>
                         String Two 
 			</td>
			<td>
			 String Three
			</td>
			<td>
		         String Four
			</td>

</tr>

And so on.
For every table row, I need to extract only String Two and String Three.
Is it better to use regex, libxml++, or some other library for this task?
Can someone give me some ideas on how to do this?
Thanks!

Feb 22, 2018 at 6:37pm

jonnin (11494)

I have found it easier to do it yourself if the data is in a VERY simple format. When the format becomes nested or complicated, you should use a library.
This looks simple enough to hit with regex or even just a find/substring grouping, something like: find "<td>" a few times to skip the cells you don't need, extract String Two, find the next "<td>", extract String Three, find "</tr>", repeat...
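
A rough sketch of that find/substring idea, in case it helps. It assumes the file has already been read into a std::string, that every cell is written exactly as <td>...</td> with no attributes, and that nothing is nested. The function names (trim, cells_of_row, extract_two_and_three) are made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Strip leading and trailing whitespace (spaces, tabs, newlines).
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Collect the trimmed text of every <td>...</td> cell found in one row.
std::vector<std::string> cells_of_row(const std::string& row) {
    const std::string open = "<td>";
    const std::string close = "</td>";
    std::vector<std::string> cells;
    std::size_t pos = 0;
    while ((pos = row.find(open, pos)) != std::string::npos) {
        const std::size_t start = pos + open.size();
        const std::size_t end = row.find(close, start);
        if (end == std::string::npos) break;  // unterminated cell: give up on this row
        cells.push_back(trim(row.substr(start, end - start)));
        pos = end + close.size();
    }
    return cells;
}

// Walk the document row by row and keep cells 2 and 3 (zero-based) of each row.
std::vector<std::pair<std::string, std::string>>
extract_two_and_three(const std::string& html) {
    const std::string row_close = "</tr>";
    std::vector<std::pair<std::string, std::string>> out;
    std::size_t pos = 0;
    while ((pos = html.find("<tr", pos)) != std::string::npos) {
        const std::size_t end = html.find(row_close, pos);
        if (end == std::string::npos) break;  // unterminated row: stop
        const std::vector<std::string> cells = cells_of_row(html.substr(pos, end - pos));
        if (cells.size() > 3) out.emplace_back(cells[2], cells[3]);  // skip short rows
        pos = end + row_close.size();
    }
    return out;
}
```

Call extract_two_and_three on the file contents and loop over the returned pairs; first is String Two and second is String Three for each row. Rows with fewer than four cells are silently skipped, which may or may not be what you want.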

Last edited on Feb 22, 2018 at 6:38pm

Feb 22, 2018 at 7:34pm

Ganado (6838)

About regex, please read: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Yes, you can use libxml(++/2/whatever) if you think the task warrants a library. Otherwise, like jonnin said, if the HTML is simple enough, just find <td> and extract the characters after it until you reach </td>.
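
For markup as flat as the sample, a regex can be made to work, with the caveat from the link above that it falls apart on nested or irregular HTML. Here is a minimal sketch using std::regex. It assumes the whole file is already in a std::string and that cells are plain <td>...</td> with no attributes; regex_two_and_three is a name invented for this example. Note that std::regex can be slow on large inputs.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Return (cell 2, cell 3), zero-based, for every <tr>...</tr> row in html.
std::vector<std::pair<std::string, std::string>>
regex_two_and_three(const std::string& html) {
    // [\s\S] stands in for "any character", because '.' does not match
    // newlines in the default ECMAScript grammar. The *? quantifiers are lazy,
    // so each match stops at the first closing tag instead of the last one.
    static const std::regex row_re(R"(<tr[^>]*>([\s\S]*?)</tr>)");
    // The \s* on both sides keeps surrounding whitespace out of the capture.
    static const std::regex cell_re(R"(<td>\s*([\s\S]*?)\s*</td>)");

    std::vector<std::pair<std::string, std::string>> out;
    const std::sregex_iterator end;
    for (std::sregex_iterator r(html.begin(), html.end(), row_re); r != end; ++r) {
        // Copy the row body so the inner iterator refers to a live string.
        const std::string body = (*r)[1].str();
        std::vector<std::string> cells;
        for (std::sregex_iterator c(body.begin(), body.end(), cell_re); c != end; ++c) {
            cells.push_back((*c)[1].str());
        }
        if (cells.size() > 3) out.emplace_back(cells[2], cells[3]);  // skip short rows
    }
    return out;
}
```

If the real files have attributes on the cells, comments, or tables inside tables, that is the point where a proper parser such as libxml2 is the safer choice.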

Last edited on Feb 22, 2018 at 7:35pm
