How to extract variables from between html tags?

Suppose something like this:

<span class="ooookiig">1</span>

<span class="krakaka">enchanted</span><span class="gagaga">chocolate bar</span>

<span class="ooookiig">2</span>

<span class="krakaka">very remarkable</span><span class="gagaga">flavored cookies</span>

<span class="ooookiig">3</span>

<span class="krakaka">fascinating</span><span class="gagaga">strawberries</span>

(...)

<span class="ooookiig">254</span>

<span class="krakaka">amazing</span><span class="gagaga">pineapples</span>

(...)


And it goes on.

I need to extract the data between the <span> tags and put it in a two-dimensional array of [3][300], with "1", "enchanted" and "chocolate bar" in the first column, for example. How do I do that? Thank you!
What piece of code do I need in order to extract the words between the tags? Once I have that, I'll just loop it.
Basically, "ooookiig", "krakaka" and "gagaga" are not important at all.

I just want to get the number between <span class="ooookiig"> and </span>, and then get the word between <span class="krakaka"> and </span>, and so on. I just don't know how it's done.
If you want to keep it simple, at the cost of something less powerful and possibly buggy, then use
std::string::find() (http://www.cplusplus.com/reference/string/string/find/)
and
std::string::substr() (http://www.cplusplus.com/reference/string/string/substr/).

If you want to learn something new, powerful and cool then look into boost::regex.
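For the find()/substr() route, here is a minimal sketch. The helper name extract_between and its parameters are made up for illustration, and it assumes the opening tag appears in the text exactly as you pass it (same attribute, same quoting) and that spans are not nested.

```cpp
#include <string>

// Hypothetical helper: copies the text between `open` and the next `close`
// into `out`, searching from `pos`. On success, `pos` is moved past `close`
// so the next call continues from there. Returns false if either marker
// is missing.
bool extract_between(const std::string& text,
                     const std::string& open,
                     const std::string& close,
                     std::size_t& pos,
                     std::string& out)
{
    std::size_t start = text.find(open, pos);
    if (start == std::string::npos) return false;
    start += open.size();                      // skip over the opening tag

    std::size_t end = text.find(close, start);
    if (end == std::string::npos) return false;

    out = text.substr(start, end - start);     // text between the two markers
    pos = end + close.size();                  // resume after the closing tag
    return true;
}
```

Calling it with "<span class=\"ooookiig\">" and "</span>" gives you the number; calling it again with the krakaka tag, reusing the same pos, gives you the next word.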
But how do I get only what's between the > and the </span> ?
Find the '>' that ends the opening tag and read all the text after it until you reach "</span>". You can implement this in lots of ways.
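One possible way to do it, as a rough sketch rather than the only answer: look for "<span", jump to the next '>', and take everything up to "</span>", ignoring the class names entirely. The function names span_contents and group_in_threes are invented here, and the code assumes the spans are not nested and always come in groups of three (number, adjective, item) as in the sample.

```cpp
#include <array>
#include <string>
#include <vector>

// Collects the text of every <span ...>...</span> in document order.
// Assumption: spans are not nested and contain no other tags.
std::vector<std::string> span_contents(const std::string& html)
{
    const std::string open = "<span";
    const std::string close = "</span>";
    std::vector<std::string> result;
    std::size_t pos = 0;

    while ((pos = html.find(open, pos)) != std::string::npos) {
        std::size_t gt = html.find('>', pos);       // end of the opening tag
        if (gt == std::string::npos) break;
        std::size_t end = html.find(close, gt + 1); // start of the closing tag
        if (end == std::string::npos) break;
        result.push_back(html.substr(gt + 1, end - gt - 1));
        pos = end + close.size();
    }
    return result;
}

// Groups consecutive values into records of three.
// Leftover values that do not fill a whole record are dropped.
std::vector<std::array<std::string, 3>>
group_in_threes(const std::vector<std::string>& cells)
{
    std::vector<std::array<std::string, 3>> rows;
    for (std::size_t i = 0; i + 2 < cells.size(); i += 3) {
        std::array<std::string, 3> row = {{cells[i], cells[i + 1], cells[i + 2]}};
        rows.push_back(row);
    }
    return rows;
}
```

A vector of three-element records is used instead of a fixed [3][300] array so the code does not overflow if the page has more than 300 entries; rows[0][2] would be "chocolate bar" for the sample input.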
If you really need to use C++ to do this (and there are *much* better languages to do this sort of processing in) then either use an XML parser such as Xerces or use the Boost Regex Library.

If this is a class assignment where your teacher is trying to show you, intentionally or not, how not to use C++ (the fixed array is usually a giveaway), you are SOL. Bite the bullet and use the string functions R0mai suggests.
"and there are *much* better languages to do this sort of processing in"
Why do you say this? Text processing can be done easily in every language. C++ is the best language for most things...
There are also XML parsers written with boost::spirit that you can find online. You could start with one of those as a basis for your code.
Bazzy, you don't get out much, do you? I go crazy when I have to do serious text processing in C++ without third-party libraries. There are lots of other languages that excel at text processing natively. (I would reach for Python, but there are plenty of others.) Few would say that C++ excels at text processing, though with help from third-party libraries it is getting there.

Here is what I mean: I know I can write a DFA regex parser in C++ that will be fast as all get out. But when I want to do text processing, I want to forget about low-level details and write code that deals with the domain of parsing text, not the domain of text parsers.
@PanGalactic
There's nothing wrong with using an external library. BTW, C++0x is coming, with regular expressions in the standard library: http://en.wikipedia.org/wiki/C%2B%2B0x#Regular_expressions
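For what it's worth, the C++0x regex proposal became the <regex> header in C++11, so with a C++11 compiler the extraction could look roughly like the sketch below. The function name span_contents_regex and the pattern are my own choices, and the pattern assumes the spans contain plain text only (no nested tags).

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns the text of every <span ...>...</span> using std::regex (C++11).
// Assumption: span bodies contain no '<', i.e. no nested markup.
std::vector<std::string> span_contents_regex(const std::string& html)
{
    // [^>]* skips the attributes, ([^<]*) captures the text inside the span.
    static const std::regex span_re("<span[^>]*>([^<]*)</span>");

    std::vector<std::string> result;
    for (std::sregex_iterator it(html.begin(), html.end(), span_re), end;
         it != end; ++it) {
        result.push_back((*it)[1].str());   // capture group 1 is the span text
    }
    return result;
}
```

This is still pattern matching on markup, so for anything less regular than the sample an actual XML/HTML parser remains the safer choice.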