C++ Regex -Reading HTML Tags

Hi guys,

I'm trying to read matched html tags

ex:

<html> == </html>
<b> == </b>
etc.

I tried doing this with strings and substrings, but it got really messy and was not accurate half the time. I'm new to this, so I would like some feedback on how to solve my problem. Ty

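#include <cstdlib>
#include <iostream>
#include <regex>
#include <string>

using namespace std;

int main() {

  string htmlParagraph = "<html>This is a test for html tags. </html>" ;

  // Opening tag anchored at the start, closing tag anchored at the end
  regex Html("^<html>");
  regex HtmlTag("</html>$");

  // regex_search looks for the pattern anywhere in the string; regex_match
  // would require the whole string to match. The string is the first argument.
  if(regex_search(htmlParagraph, Html) && regex_search(htmlParagraph, HtmlTag)){
    cout << "This is an html tag. " << endl;
  }

  system("pause"); // Windows only: keeps the console window open
}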
closed account (2UD8vCM9)
If you don't know what vectors are, I'm not sure this will make sense.

I've never used regex, so I'm just going to show you how I would approach it.

I created a function to find all of the text between given tags and return a string vector with all of the strings found between tags. Don't worry too much about how the function works for now if you don't understand it, but focus more on the usage of the function in the main().

Let me know if you need more clarification or if this worked out for you.

#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

vector<string> FindTextBetween(string StringBeingSearched, string OpeningStringToFind, string ClosingStringToFind);

int main()
{
	vector<string> StoreTheFoundStrings;

	cout << "Test 1." << endl;
	StoreTheFoundStrings = FindTextBetween("<html>This is the text between the html tags.</html>", "<html>", "</html>");
	if (StoreTheFoundStrings.size()>0) //If the size is >0 that means that it found at least 1 phrase between the tags
										//If size is 2, it found 2 phrases between opening/closing tags etc
	{
		for (int i=0; i<StoreTheFoundStrings.size(); i++) //We'll loop through each element in the vector
		{
			cout << "Found string between tags:" << StoreTheFoundStrings[i] << endl; //Then we'll print each element the vector found
		}
	}

	//In the first example, only one string was found, so the for loop didn't really make a difference. However, we could have something like this

	cout << endl << "Test 2. 3 Strings encapsulated in <html> tags." << endl;

	StoreTheFoundStrings = FindTextBetween("<html>This is the text between the html tags.</html> <html>This is another string between html tags.</html> <html>And another</html>", "<html>", "</html>");
	if (StoreTheFoundStrings.size()>0) //If the size is >0 that means that it found at least 1 phrase between the tags
										//If size is 2, it found 2 phrases between opening/closing tags etc
	{
		for (int i=0; i<StoreTheFoundStrings.size(); i++) //We'll loop through each element in the vector
		{
			cout << "Found string between tags:" << StoreTheFoundStrings[i] << endl; //Then we'll print each element the vector found
		}
	}

	cout << endl << "Test 3. Passing strings instead of manually typing in fields." << endl;
	//Also don't forget we can pass strings instead of actually typing in the phrases
	string OpeningTags = "<b>";
	string ClosingTags = "</b>";
	string StringContainingBoldTag = "<b>This is text between the bold tag.</b>";
	StoreTheFoundStrings = FindTextBetween(StringContainingBoldTag, OpeningTags, ClosingTags);
	if (StoreTheFoundStrings.size()>0) //If the size is >0 that means that it found at least 1 phrase between the tags
										//If size is 2, it found 2 phrases between opening/closing tags etc
	{
		for (int i=0; i<StoreTheFoundStrings.size(); i++) //We'll loop through each element in the vector
		{
			cout << "Found string between tags:" << StoreTheFoundStrings[i] << endl; //Then we'll print each element the vector found
		}
	}

	system("pause");
	return 0;
}

vector<string> FindTextBetween(string StringBeingSearched, string OpeningStringToFind, string ClosingStringToFind)
{
	vector<string> VectorStringToReturn;
	string StringToSearch = StringBeingSearched;
	while (true)
	{
		//size_t (not unsigned) so that the comparison with string::npos also works on 64-bit builds
		size_t PosOfFirstString = StringToSearch.find(OpeningStringToFind);
		if (PosOfFirstString == string::npos) //If Opening string contents are not found in the string being searched
		{
			return VectorStringToReturn;
		}
		//Only look for the closing string after the end of the opening string
		size_t PosOfSecondString = StringToSearch.find(ClosingStringToFind, PosOfFirstString+OpeningStringToFind.size());
		if (PosOfSecondString == string::npos) //If closing string contents are not found in the string being searched
		{
			return VectorStringToReturn;
		}
		VectorStringToReturn.resize(VectorStringToReturn.size()+1);
		//The text starts right after the opening string and ends right before the closing string
		VectorStringToReturn[VectorStringToReturn.size()-1] = StringToSearch.substr(PosOfFirstString+OpeningStringToFind.size(),PosOfSecondString-PosOfFirstString-OpeningStringToFind.size());
		//Keep searching in whatever follows the closing string
		StringToSearch = StringToSearch.substr(PosOfSecondString+ClosingStringToFind.size());
	}
}
Assuming that we have the simplest html fragments of this form:
"<html>This is a test for html tags. </html>"
without nested tags etc.

#include <iostream>
#include <string>
#include <regex>

int main()
{
    const std::regex re( R"(\s*<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>\s*)", std::regex::icase ) ;
    // std::regex::icase - ignore case
    // \s* - zero or more whitespace characters
    // \1 - backreference to first sub-match (subexpression within parentheses)
    // .*? - non-greedy match of zero or more characters (the ? specifies non-greedy match)

    const std::string lines[] = 
    { "<name>venros</name>", "<posts>47</posts>", "  <topic>C++ Regex -Reading HTML Tags</topic>  " } ;
    
    for( const std::string& str : lines )
    {
        std::smatch results ;
        if( std::regex_match( str, results, re ) )
        {
            std::cout << "tag: '" << results[1] // first sub-match - for ([A-Z][A-Z0-9]*)
                      << "'\tvalue: '" << results[2] // second sub-match - for (.*?)
                      << "'\n" ;
        }
    }
}

http://coliru.stacked-crooked.com/a/676ac5875a6783eb
Ty Pind, I'll look it over and get back to you tomorrow (a lot to process).

JLBorges, as usual, ty for the help, always there to overwhelm me haha.

What is your code meant to do? I'm assuming it's supposed to process:
const std::string lines[] = 
    { "<name>venros</name>", "<posts>47</posts>", "  <topic>C++ Regex -Reading HTML Tags</topic>  " } ;


but nothing actually couts.

ty
> Im assuming its supposed to process : ...

Yes. Process each string in the array.


> but nothing actually couts.

Are you using GCC 4.8 (or older)?
I'm using vs 2012.

EDIT:

Pind, I ran your code a couple of times and it works well. Sadly, this is exactly what I tried doing with strings and substrings, and it simply won't work if I have multiple tags such as

<html><b><a> etc.

I do like how you took the position of <html> and added its size to get the starting position. I'm confused, however, as to why you resized your vector ahead of time:

VectorStringToReturn.resize(VectorStringToReturn.size()+1);

VectorStringToReturn[VectorStringToReturn.size()-1] = StringToSearch.substr(PosOfFirstString+OpeningStringToFind.size(),PosOfSecondString-ClosingStringToFind.size()+1-PosOfFirstString);
> I'm using vs 2012.

It works as expected with Visual Studio 2013. http://rextester.com/KCPCJ59677

The same code should also work with Visual Studio 2012.
closed account (2UD8vCM9)
@Venros I resized the vector at that point because once those two ifs have been checked, both tags are definitely there, so we can go ahead and grow the vector by one and then fill in the new last element with the string that we've found.

Not sure if that made sense, I never was good with technical words.
Sadly it is not working with my VS. I will bookmark this for future use; I'll get G++ and see from there.

ty