Preprocessing information for C++ to HTML conversions

moorecm (1932)

I was reading some documentation on the GNU C preprocessor and found this little gem:

// comment
/* comment */

/\
/ comment!!!
/\
*
comment!!!
*\
/

Is the interpretation of backslash-newline capable of splitting commenting sequences on other implementations?

I was working on a script to convert C++ source to HTML and I wanted to make sure it would color the syntax according to how it would be compiled... Also, does anyone else know any other scenarios to be mindful of for this sort of project?

R0mai (730)

Is the interpretation of backslash-newline capable of splitting commenting sequences on other implementations?

Backslash-newline is capable of splitting almost anything. Example:


#include <iostream>

int main() {
i\
n\
t\
 a
=
6\
6\
6;
s\
t\
d\
:\
:\
c\
o\
u\
t\
<\
<\
a
<\
<\
s\
t\
d\
:\
:\
e\
n\
d\
l
;
}

So I guess

i\
n\
t\

should be colored as a keyword.
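
To color it that way, the converter would have to splice the lines before tokenizing, as the compiler does in an early translation phase. A minimal sketch, assuming plain "\n" line endings and a backslash sitting directly before the newline; splice_lines is a made-up name, not part of any library:

```cpp
#include <cstddef>
#include <string>

// Remove every backslash that is immediately followed by a newline,
// joining the physical lines into one logical line.
// Assumption: "\n" line endings, no whitespace between '\' and the newline.
std::string splice_lines(const std::string& src) {
    std::string out;
    out.reserve(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;  // skip the newline as well as the backslash
            continue;
        }
        out += src[i];
    }
    return out;
}
```

With this, splice_lines("i\\\nn\\\nt\\\n a") gives "int a". The catch is that the spliced text no longer has the original line breaks, so a highlighter that wants to show the source as typed would also need to remember where each removed backslash-newline was.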

moorecm (1932)

I've pretty much abandoned attempting to color anything that is split with lines ending in a backslash. Aside from that, the script should tag preprocessor directives, comments, string literals, character literals, identifiers, numbers, and symbols.

My approach comes from the cpp_to_html program located in the [boost] Spirit Application Repository. To summarize how the script works, it defines a number of lexical elements in regex and then tokenizes a string-representation of the source file accordingly. The tokens are then wrapped in span tags using a simple regex substitution.

The script seems to work pretty well, but it still needs to be tested.

I set up a small test form where C++ code can be pasted; it runs the script on the input and displays the results. It's located here:
http://chadmoore.us/test/post.html

All comments are greatly appreciated. I would like to know what "breaks" the script, what colors you prefer, and any other possible improvements that you can come up with. Once the script becomes stable, I will gladly share it with anyone interested. (Maybe I'll post it in here--it's somewhat C++ related)

Bazzy (6281)

#include is not highlighted
In some situations multiline comments don't work:

;/*   multi

line

*/

moorecm (1932)

Thanks Bazzy!

I have a bug with strings in comments vs. comments in strings. I just realized that when I fixed one I broke the other. It should be ok, now.

The bugs that I am aware of are:

1. ~~string s = "/*//*/"; tags a comment inside of the string literal.~~

2. ~~;/* multi-line comment */ includes the semicolon in the comment.~~

3. ~~#define // first line: a preprocessor directive on the first line of the input is not tagged.~~

4. ~~/* newline then: #define */ tags the preprocessor directive, even inside the block comment.~~
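
The strings-versus-comments bugs come down to one rule: whichever construct opens first wins, and nothing inside it is examined until it closes. A single left-to-right pass gives that behavior naturally. Here is a rough C++ sketch of the idea (not the Perl script itself); Kind, Token and scan are names made up for illustration, and it ignores character literals, raw strings, preprocessor directives and backslash-newline:

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Kind { Code, Comment, String };

struct Token {
    Kind kind;
    std::string text;
};

// Split src into runs of plain code, comments and string literals.
// Whatever opens first is consumed up to its own terminator, so a "/*"
// inside a string or a '"' inside a comment is never looked at.
std::vector<Token> scan(const std::string& src) {
    std::vector<Token> tokens;
    std::string code;  // pending run of ordinary code
    auto flush = [&] {
        if (!code.empty()) {
            tokens.push_back({Kind::Code, code});
            code.clear();
        }
    };
    const std::size_t n = src.size();
    std::size_t i = 0;
    while (i < n) {
        std::size_t end = i;
        Kind kind = Kind::Code;
        if (src.compare(i, 2, "/*") == 0) {
            const std::size_t close = src.find("*/", i + 2);
            end = (close == std::string::npos) ? n : close + 2;
            kind = Kind::Comment;
        } else if (src.compare(i, 2, "//") == 0) {
            const std::size_t nl = src.find('\n', i);
            end = (nl == std::string::npos) ? n : nl;
            kind = Kind::Comment;
        } else if (src[i] == '"') {
            end = i + 1;
            while (end < n && src[end] != '"') {
                // a backslash escapes the next character, including '"'
                end += (src[end] == '\\' && end + 1 < n) ? 2 : 1;
            }
            if (end < n) ++end;  // include the closing quote
            kind = Kind::String;
        }
        if (kind == Kind::Code) {
            code += src[i++];
        } else {
            flush();
            tokens.push_back({kind, src.substr(i, end - i)});
            i = end;
        }
    }
    flush();
    return tokens;
}
```

Fed the first two cases from the list in one go, string s = "/*//*/";/* multi */, it yields four tokens: the code before the literal, the whole literal as a string, the semicolon as code, and the comment.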

moorecm (1932)

Everything found previously has been resolved.

Bazzy (6281)

strings with comments still have problems:

"this is a /*str*/ing"
"another // string "

moorecm (1932)

Ok, thanks again! I'll update it when I get a chance.

I think what I'm going to do is type up a test case source file to paste into it for doing regression testing as it evolves.

moorecm (1932)

Ok. I think I've got it. :P This is turning out to be a little more difficult than anticipated!

For preprocessor directives, I am using a pattern that expects the beginning of a line followed by zero or more whitespace characters, the '#', zero or more whitespace characters, and what qualifies as an identifier. (Identifiers are made up of underscores and alphanumeric characters but cannot begin with a number.)
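
In C++ terms, the per-line check looks roughly like the sketch below. It only illustrates the pattern just described and is not the script's code; directive_name is a hypothetical helper, and it treats only spaces and tabs as whitespace.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the directive name ("define", "include", ...) if line has the form
//   [whitespace] '#' [whitespace] identifier ...
// and an empty string otherwise.
std::string directive_name(const std::string& line) {
    auto is_space = [](char c) { return c == ' ' || c == '\t'; };
    auto is_ident_start = [](char c) {
        return c == '_' || std::isalpha(static_cast<unsigned char>(c)) != 0;
    };
    auto is_ident_char = [](char c) {
        return c == '_' || std::isalnum(static_cast<unsigned char>(c)) != 0;
    };

    std::size_t i = 0;
    while (i < line.size() && is_space(line[i])) ++i;   // leading whitespace
    if (i == line.size() || line[i] != '#') return "";  // must see '#'
    ++i;
    while (i < line.size() && is_space(line[i])) ++i;   // whitespace after '#'
    if (i == line.size() || !is_ident_start(line[i])) return "";

    const std::size_t begin = i;
    while (i < line.size() && is_ident_char(line[i])) ++i;
    return line.substr(begin, i - begin);
}
```

Here directive_name("  #  define X 1") returns "define", while a lone "#" or a line such as "int a; #define" returns an empty string.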

I have added a special case to tag '#' characters that are not matched by any of the defined patterns as preprocessor directives. I noticed that some of the boost source files (I used a number of them for my testing) contained things like this:

#
# /*
#  * comment
#  */
#

Since the preprocessor effectively removes them (and they do compile), I wanted to tag them as such. The comment takes precedence in my script, so it should appear the same as it does above.

Similarly, I am also tagging any unmatched backslashes.

Does this sound logical?

Denis (350)

Please use Doxygen.
http://www.stack.nl/~dimitri/doxygen/

It's software that generates documentation from source code.

Duthomhas (13290)

Wow, you sure are opinionated today.

If moorecm wanted Doxygen, he'd use it. Or any of the other gadzillion "javadoc" type programs out there.

Do you think you are the code police or something? Just because people don't do things your way does not necessarily mean they are doing it the wrong way.

Don't hijack threads to tell the OP he needs to get a life, or a real computer, or to scrap his project to fit your ideas -- what do you know about what he wants anyway?

moorecm (1932)

Just for reference, here is a quick overview of how the script works. The first step is to define regular expressions that represent the lexical elements of interest. Then, Perl's split command takes those expressions as the delimiter parameter and tokenizes the source file. It then loops through each of the tokens and processes them according to which expression they match. The processing is basically just wrapping the tokens in HTML span tags of a given class and escaping special characters for display in HTML.
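
The wrapping step is the simple part. A sketch of it in C++, assuming the class name has already been chosen for the token; escape_html, wrap_span and the class names are illustrative only and not taken from the script:

```cpp
#include <string>

// Replace the characters that are special in HTML text and attribute values.
std::string escape_html(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;";  break;
            case '<': out += "&lt;";   break;
            case '>': out += "&gt;";   break;
            case '"': out += "&quot;"; break;
            default:  out += c;        break;
        }
    }
    return out;
}

// Wrap one token in a span of the given CSS class, escaping its text.
std::string wrap_span(const std::string& css_class, const std::string& token) {
    return "<span class=\"" + css_class + "\">" + escape_html(token) + "</span>";
}
```

For instance, wrap_span("include", "#include <iostream>") produces <span class="include">#include &lt;iostream&gt;</span>. Escaping has to happen per token, after tokenizing, so the patterns never see the entities.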

Here are my current regular expressions, for reference:


    # define the lexical elements in regex

    my $identifier   = '\b[_[:alpha:]]\w*\b';
                      # identifiers are enclosed by word boundaries, contain
                      # alphanumeric text plus the underscore, but cannot
                      # begin with a number

    my $relative     = '".*?"';
    my $absolute     = '<.*?>';
    my $include      = '(?:^|(?<=\n))\s*\#[ \t]*include[ \t]*?(?:'.$relative.'|'.$absolute.')';
                      # include directives begin at a newline and then may contain zero or
                      # more whitespace characters, '#' sign, zero or more whitespace
                      # characters, the text 'include', and then a path/filename surrounded by
                      # either angle brackets or double quotes
                      #
                      # the separation of include directives from other preprocessor directives
                      # is to support an additional requirement to hyperlink header files

    my $directive    = '(?:^|(?<=\n))\s*\#[ \t]*'.$identifier;
                      # preprocessor directives begin at a newline and then
                      # may contain zero or more whitespace characters,
                      # a '#' sign, zero or more whitespace characters,
                      # and finally an identifier

    my $comment      = '(?:/\*.*?\*/|//.*?(?=(?:\n|$)))';
                      # comments either begin with "/*" and end at the first
                      # "*/" or begin with "//" and end at the first newline

    my $string       = '[lL]?".*?(?:(?<!\\\)|[\\\][\\\])"';
                      # string literals may begin with an L, for wide
                      # characters, and then a " and continue until the first
                      # " that is not preceded by a backslash (unless that backslash is itself escaped, as in \\)

    my $literal      = '[lL]?\'.*?(?:(?<!\\\)|[\\\][\\\])\'';
                      # character literals are specified with the same pattern
                      # as strings except for using single quotation marks,
                      # although syntactically it should contain at least one
                      # character

    my $keyword      = '\b(?:and_eq|and|asm|auto|bitand|bitor|bool|break|case'.
                       '|catch|char|class|compl|const_cast|const|continue'.
                       '|default|delete|do|double|dynamic_cast|else|enum'.
                       '|explicit|export|extern|false|float|for|friend'.
                       '|goto|if|inline|int|long|mutable|namespace|new'.
                       '|not_eq|not|operator|or_eq|or|private|protected'.
                       '|public|register|reinterpret_cast|return|short'.
                       '|signed|sizeof|static|static_cast|struct|switch'.
                       '|template|this|throw|true|try|typedef|typeid'.
                       '|typename|union|unsigned|using|virtual|void'.
                       '|volatile|wchar_t|while|xor_eq|xor)\b';
                      # keywords are enclosed by word boundaries and must match
                      # one of the listed alternatives

    my $number       = '\b\d[xX]?[\daAbBcCdDeEfF]*[lLdDfFuU]?\b';
                      # numbers are enclosed by word boundaries, may begin with
                      # 0x or 0X, contain numeric digits or hexadecimal characters,
                      # and may be followed by a type suffix

    my $symbol       = '[~!%\^&\*\(\)\+={\[}\]:/;,<\.>\?\|\-]+';
                      # symbols must match one or more of the listed
                      # alternatives

    my $leftover     = '[\#\\\]+';
                      # 'leftovers' refer to remaining '#' signs or backslashes
                      # that would be removed by the preprocessor.  they must
                      # match one or more of the listed alternatives

    # the following pattern is used to split the input string into tokens
    my $delim        = '(?:'.$include.'|'.$directive.'|'.$comment.
                         '|'.$string.   '|'.$literal.
                         '|'.$number.   '|'.$symbol.
                         '|'.'\n+'.     '|'.'\b'    .')';

    #...

    # tokenize
    my @tokens = split( /(                       # this grouping causes split
                                                 # to also return delimiters

                           $delim                # match by delimiter pattern

                         )/sox, $input );

    #...
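
For anyone who wants to try the same approach outside Perl, the split-and-keep-delimiters step can be approximated with std::regex. This is a sketch only: split_keep_delims is a hypothetical helper, the pattern in the note below covers just comments and plain string literals, and unlike Perl's split it drops empty pieces.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Return the input cut into pieces: each match of delim, plus the text
// between consecutive matches, in source order. Empty pieces are skipped.
std::vector<std::string> split_keep_delims(const std::string& input,
                                           const std::regex& delim) {
    std::vector<std::string> pieces;
    std::size_t pos = 0;  // end of the previous match
    auto it = std::sregex_iterator(input.begin(), input.end(), delim);
    const auto end = std::sregex_iterator();
    for (; it != end; ++it) {
        const std::smatch& m = *it;
        const std::size_t start = static_cast<std::size_t>(m.position(0));
        const std::size_t len = static_cast<std::size_t>(m.length(0));
        if (start > pos) pieces.push_back(input.substr(pos, start - pos));
        if (len > 0) pieces.push_back(m.str(0));
        pos = start + len;
    }
    if (pos < input.size()) pieces.push_back(input.substr(pos));
    return pieces;
}
```

With std::regex delim(R"re(/\*[\s\S]*?\*/|//[^\n]*|"(?:\\.|[^"\\])*")re"), the input int a; /* c */ "s // t" comes back as four pieces: the code before the comment, the comment, a single space, and the string literal with its // left alone.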

moorecm (1932)

Just an update, I've been toying around with Apache's mod_rewrite module to clean up the somewhat ridiculous URLs on the site and pretty much determined that I'm not happy with the results. It's difficult to get everything just so and I've been brainstorming for other options. I am currently running some tests with Doxygen and will probably end up going that route after all.
