Help with Regex

Hello all, I have a bit of a problem getting my regular expressions to work. I know that the expressions I'm using are correct because I've tested them in various tools and I get the desired results.

However, every time I use these expressions in C++, I'm not getting the desired result. The expressions always return false. I'm not sure if I'm escaping properly, so I'm posting these expressions here so a new set of eyes can look at them and maybe help me with this dilemma.

Unescaped expressions:
/\/\/.*?\/?\*.+?(?=\n|\r|$)|\/\*[\s\S]*?\/\/[\s\S]*?\*\//g
/\/\/.+?(?=\n|\r|$)|\/\*[\s\S]+?\*\//g 


Additional information:
Visual Studio 2010 Express
TR1 extensions for performing regex operations.

Any help is appreciated. Thanks!

Edit:
By the way, these expressions are to find comments inside a string.
I cannot read your expression off the top of my head, but if that's its unescaped form, you need to replace all \ with \\.

That looks about it in this case.

The C++ escape chars are \' \" \? \\ \a \b \f \n \r \t \v

You haven't got any " chars, so no worries there.
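
To illustrate the doubling rule, here is a minimal sketch. It assumes a C++11 compiler with std::regex and raw string literals, which is newer than the Visual Studio 2010 / TR1 setup in the question. The names escaped_pattern, raw_pattern and contains_block_comment are made up for the example. It writes one pattern both ways, so the two strings should compare equal.

```cpp
#include <regex>
#include <string>

// The same pattern written twice: once with every backslash doubled,
// once as a C++11 raw string literal where no doubling is needed.
const std::string escaped_pattern = "/\\*[\\s\\S]+?\\*/";
const std::string raw_pattern     = R"(/\*[\s\S]+?\*/)";

// Hypothetical helper: true if `text` contains a /* ... */ comment.
bool contains_block_comment(const std::string& text)
{
    static const std::regex rx(raw_pattern);  // ECMAScript is the default grammar
    return std::regex_search(text, rx);
}
```

If the two strings compare equal, the escaping itself is not the problem and the cause has to be somewhere else.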

Andy

Do you also use Python?
Yeah, I replaced all the backslashes and figured maybe there was something else I was missing.

These are the escaped expressions.
"/\\/\\/.*?\\/?\\*.+?(?=\\n|\\r|$)|\\/\\*[\\s\\S]*?\\/\\/[\\s\\S]*?\\*\\//g"
"/\\/\\/.+?(?=\\n|\\r|$)|\\/\\*[\\s\\S]+?\\*\\//g"


Unfortunately, I don't use Python. I do want to learn it though. It's my project for next summer.
Well, if it was working before and all the \s are escaped, I can't see anything else that needs escaping.

Well, I might as well ask the other obvious question: are you telling the regex instance to use the right regex dialect?
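
A sketch of naming the dialect explicitly, again assuming C++11 std::regex rather than TR1 (the helper matches_somewhere and the two pattern constants are hypothetical). ECMAScript is the default grammar, so passing the flag mainly documents intent. One thing worth checking at the same time: std::regex has no /pattern/flags literal syntax, so a leading / and a trailing /g copied from a JavaScript-style tester are treated as ordinary characters that the input must contain.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: compile `pattern` with the ECMAScript grammar
// (named explicitly here, although it is also the default) and search `text`.
bool matches_somewhere(const std::string& pattern, const std::string& text)
{
    const std::regex rx(pattern, std::regex_constants::ECMAScript);
    return std::regex_search(text, rx);
}

// With JavaScript-style delimiters: the '/' at each end and the 'g' are literals.
const std::string with_delimiters    = "/\\/\\/.+?(?=\\n|\\r|$)/g";
// The same pattern without them.
const std::string without_delimiters = "\\/\\/.+?(?=\\n|\\r|$)";
```

With the delimiters left in, this pattern demands a literal / straight after the lookahead, so it cannot match an ordinary // comment. There is also no g flag to translate: regex_replace already replaces every match unless format_first_only is given.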

Andy

P.S. If you knew Python, you could have tried running regex with both the escaped and the unescaped strings (the latter using a raw string).
Can you post some examples of what your pattern is supposed to match but which fail?
@andywestken

I believe so. Here is the portion of relevant code.

	cmatch res_match;

	regex rx("/\\/\\/.*?\\/?\\*.+?(?=\\n|\\r|$)|\\/\\*[\\s\\S]*?\\/\\/[\\s\\S]*?\\*\\//g");
	// Unescaped: /\/\/.*?\/?\*.+?(?=\n|\r|$)|\/\*[\s\S]*?\/\/[\s\S]*?\*\//g

	while (regex_search((this->Code).begin(), (this->Code).end(), rx))
	{
		regex_search((this->Code).c_str(), res_match, rx);
		
		this->AppPreserve.AddCommentToList(res_match[0]);

		this->Code = regex_replace((this->Code), rx, replacement, regex_constants::format_first_only);
	}


@Galik

Sure.

// this is an end-of-line comment
/* this is an inline comment */

// some psuedo-code
int main()
{
	cout << "hello";

return 0;
}

================================================

THIS_IS_A_REPLACEMENT
THIS_IS_A_REPLACEMENT

THIS_IS_A_REPLACEMENT
int main()
{
	cout << "hello";

return 0;
}


I'm not as concerned with getting the first expression to work as I am with the second expression. The first expression is to resolve an ambiguity that might exist in nested comments.

By the way, I'm using this article as reference:
http://www.johndcook.com/cpp_regex.html
This (C++-escaped) regex string works for me, but yours does not [1]:

"(/\\*([^*]|[\\r\\n]|(\\*+([^*/]|[\\r\\n])))*\\*+/)|(//.*)"

Your processing loop can also be simplified: you're calling regex_search once to see if you get a match, and then calling it again to get the result.

	while (regex_search((this->Code).c_str(), res_match, rx))
	{
		this->AppPreserve.AddCommentToList(res_match[0]);

		this->Code = regex_replace((this->Code), rx, replacement, regex_constants::format_first_only);
	}


should carry on working as before.

Input:

// this is an end-of-line comment
/* this is an inline comment
 */

// some psuedo-code
int main()
{
	cout << "hello";

return 0;
}


Output:

THIS_IS_A_REPLACEMENT
THIS_IS_A_REPLACEMENT

THIS_IS_A_REPLACEMENT
int main()
{
	cout << "hello";

return 0;
}


With the following preserved comments

1 - // this is an end-of-line comment
2 - /* this is an inline comment
     */
3 - // some psuedo-code


Andy

[1] http://ostermiller.org/findcomment.html
Thanks Andy. The code you have written is a lot cleaner and more efficient.

I will abandon my previous expression and use the one you gave me. :)

Thanks man!
I prefer the lazy version '//.*$|/\*[^\0]*?\*/'
edit: not tested properly
The expression is good, however it fails when we have something like this:
someString = "An example comment: /* example */";
zxckjzxlck
// The comment around this code has been commented out.
// /*
some_code();
// */ 


I think it's almost impossible to write an expression that will be correct 100% of the time. The approach I'm using now is to find all end-of-line comments first and then find any remaining comments that the first expression might have missed, which at this point should only be inline comments (using the expression Andy gave me).

"(?://.*)|(/\\*(?:.|[\\n\\r])*?\\*/)/g"    [1]
"(/\\*([^*]|[\\r\\n]|(\\*+([^*/]|[\\r\\n])))*\\*+/)|(//.*)"   [2]


Source:
[1] http://stackoverflow.com/questions/1657066/java-regular-expression-finding-comments-in-code
[2] http://ostermiller.org/findcomment.html
:-) More a case of unwritten!

The code can be made a bit more efficient. The regex_replace is currently doing a second search from the beginning of the string; it's meant for doing search and replace at the same time. And when you loop, you are searching from the beginning again.

If you walk the first string and build the new string as a separate variable you can avoid both of these restarts.
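
The single-pass idea could look roughly like this. It is a sketch under assumptions: C++11 std::regex instead of TR1, a free function replace_comments standing in for the member code shown earlier, and a std::vector in place of AppPreserve.AddCommentToList. The replacement is appended as literal text, so $-style format sequences are not expanded the way regex_replace would expand them.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: copies `code` into the returned string, substituting
// `replacement` for each comment and appending the comment text to `found`.
// The input is scanned once, front to back.
std::string replace_comments(const std::string& code,
                             const std::string& replacement,
                             std::vector<std::string>& found)
{
    // Expression from earlier in the thread (ostermiller.org/findcomment.html).
    static const std::regex rx(
        "(/\\*([^*]|[\\r\\n]|(\\*+([^*/]|[\\r\\n])))*\\*+/)|(//.*)");

    std::string result;
    std::string::const_iterator copied_up_to = code.begin();

    for (std::sregex_iterator it(code.begin(), code.end(), rx), end; it != end; ++it)
    {
        const std::smatch& m = *it;
        result.append(copied_up_to, m[0].first);  // text before the comment
        result += replacement;                    // literal text, no $-expansion
        found.push_back(m.str());                 // preserve the comment itself
        copied_up_to = m[0].second;               // resume after the comment
    }
    result.append(copied_up_to, code.end());      // tail after the last comment
    return result;
}
```

Each iterator step continues from the end of the previous match, so neither the search nor the replacement restarts from the beginning of the string.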

My understanding is that a general solution for the comment problem is impossible. But it hasn't stopped people from trying!

When I needed a comment stripper, I wrote it in Python.

Andy

P.S. ne555 - have you read http://ostermiller.org/findcomment.html yet?
¿what do you mean?
For a full explanation of why the longer regex string is needed. Ostermiller presents a clear explanation of his string.
@Andy

You are definitely right. I've seen significant improvement with modifying the structure of the code. About three times as fast in some of the source files I'm parsing.

I'm not sure how familiar you are with PHP, but there is a nice function token_get_all() [1] in PHP that tokenizes the code. Then you can just do a search like if ( $token[0] === T_COMMENT ) and find all the comments or strings.

I'm going to have to browse the PHP code to find out how they do this.

[1] http://php.net/manual/en/function.token-get-all.php
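
A tokenizer can tell comments from string contents because it tracks which construct it is currently inside. The sketch below applies that idea by hand in C++. It is a hypothetical illustration, not PHP's implementation, and find_comments is an invented name. It ignores raw string literals, line continuations and trigraphs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: collect // and /* */ comments while skipping over
// string and character literals, so comment markers inside quotes are ignored.
std::vector<std::string> find_comments(const std::string& code)
{
    std::vector<std::string> comments;
    const std::size_t n = code.size();
    std::size_t i = 0;

    while (i < n)
    {
        const char c = code[i];
        if (c == '"' || c == '\'')
        {
            // Skip a literal, honouring backslash escapes inside it.
            const char quote = c;
            ++i;
            while (i < n && code[i] != quote && code[i] != '\n')
                i += (code[i] == '\\' && i + 1 < n) ? 2 : 1;
            if (i < n) ++i;  // step past the closing quote (or the newline)
        }
        else if (c == '/' && i + 1 < n && code[i + 1] == '/')
        {
            // End-of-line comment: runs up to, not including, the line break.
            std::size_t end = code.find_first_of("\r\n", i);
            if (end == std::string::npos) end = n;
            comments.push_back(code.substr(i, end - i));
            i = end;
        }
        else if (c == '/' && i + 1 < n && code[i + 1] == '*')
        {
            // Block comment: runs through the closing */ (or to end of input).
            const std::size_t close = code.find("*/", i + 2);
            const std::size_t end = (close == std::string::npos) ? n : close + 2;
            comments.push_back(code.substr(i, end - i));
            i = end;
        }
        else
        {
            ++i;
        }
    }
    return comments;
}
```

On the failing example from earlier in the thread, the /* example */ inside the string literal is skipped, and // /* and // */ are each reported as end-of-line comments.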
Sorry, I don't get it. It seems that it goes to all that trouble just because its package does not support the lazy operator.
¿Aren't the two expressions equivalent?
ne555 - I've just tried your expression and it works for my simple test case.

Sorry about my earlier comment; but as you said your expression was "not properly tested", so I chose to go with the other, longer version as people seemed to trust it. And at the time I was looking at the overall problem (the repeated searches, etc) and wanted to minimize unknowns.

Now I know the code is OK, I can try untested expressions without having to worry whether it's the expression or the code that's wrong.