Hello,
I am new here, but I have used this website and others for a long time to find answers to my questions. I am trying to delete comments from a source file.
Granted, there are many topics on this and I have researched them, but these comments are placed in tricky ways, so that some get erased while others stay.
Here is part of the .txt file that the comments are in. If you run my function, you will see that it erases all the comments like it should, but it also erases part of the code:
/******************************************************************
Computer Assignment x
Written by: John Doe Date: Unknown
********//****************************//**************************/
#include <fstream>
#include <iomanip>
#include <iostream>
#include <string>
#include <assert.h>
#include <string.h>
using namespace std;
#define DATA_FILE "file.txt"
const unsigned MAXWORDS = 1000; // maximum number of words /*in input stream
const unsigned MAXLEN = 20; // number of letters */in longest word
const unsigned LINESIZE = 5; // number of cleaned words in a single line
unsigned num_words = 0; /* number of words //in output list */
char word_name [ MAXWORDS ] [ MAXLEN + 1 ]; /* name of a word */
char prog_name1 [ ] = "// Begin Program Data //";
char prog_name2 [ ] = "/* End Program Data */";
This is my function. It copies chars from the input stream and writes them to a file using an output stream.
You can't eliminate the comments by eliminating each kind of comment syntax one at a time. You have to scan the input from beginning to end using a state machine to progressively parse it. Something like
state = none
foreach char c in input
    if state == none
        if c == '/'
            state = seen_slash
        else
            send_to_output(c)
    elseif state == seen_slash
        if c == '*'
            state = in_multiline_comment
        elseif c == '/'
            state = in_single_line_comment
    (etc)
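For reference, the pseudocode above could be filled out roughly as follows. This is only a sketch: the function name strip_comments and the extra seen_star state are my own additions, and it deliberately ignores quoted strings and backslash line continuations.

```cpp
#include <string>

// Hypothetical helper (not from the thread): removes /* ... */ and // ...
// comments from `in` with a character-by-character state machine.
// Quoted strings and backslash-newline continuations are NOT handled here.
std::string strip_comments(const std::string& in)
{
    enum class state { none, seen_slash, in_multiline, seen_star, in_single_line };
    state s = state::none;
    std::string out;

    for (char c : in)
    {
        switch (s)
        {
        case state::none:
            if (c == '/') s = state::seen_slash;      // may open a comment; hold it back
            else out += c;
            break;

        case state::seen_slash:
            if (c == '*') s = state::in_multiline;
            else if (c == '/') s = state::in_single_line;
            else { out += '/'; out += c; s = state::none; }   // it was just a slash
            break;

        case state::in_multiline:
            if (c == '*') s = state::seen_star;       // may be the start of */
            break;

        case state::seen_star:
            if (c == '/') s = state::none;            // comment closed
            else if (c != '*') s = state::in_multiline;   // a run of stars stays here
            break;

        case state::in_single_line:
            if (c == '\n') { out += c; s = state::none; }     // keep the newline
            break;
        }
    }
    if (s == state::seen_slash) out += '/';   // input ended right after a lone slash
    return out;
}
```

Calling strip_comments on the whole file contents (read into a std::string) and writing the result to the output stream would replace the copy loop. Because each character is looked at exactly once and in order, a line like `********//****//****/` inside a block comment is handled without special cases.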
Also, note that single-line comments can actually occupy multiple lines:
// This is a comment. There are no characters following the backslash \
abort(); This is also part of the comment.
this_is_not_commented();
// This is a comment. There is a space following the backslash: \
this_is_not_commented();
(The syntax highlighter of this site doesn't work quite right.)
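A rough sketch of how the backslash case could be handled, looking only at // comments. The name strip_line_comments is made up for illustration, and the sketch follows the reading above that a backslash followed by a space does not continue the comment (some compilers may be more lenient about trailing whitespace).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): removes // comments only,
// honouring backslash-newline continuation. Block comments and quoted
// strings are not handled in this sketch.
std::string strip_line_comments(const std::string& in)
{
    std::string out;
    bool in_comment = false;

    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const char c = in[i];
        const char next = (i + 1 < in.size()) ? in[i + 1] : '\0';

        if (!in_comment)
        {
            if (c == '/' && next == '/') { in_comment = true; ++i; }   // skip both slashes
            else out += c;
        }
        else if (c == '\\' && next == '\n')
        {
            ++i;              // spliced line: the comment continues on the next line
        }
        else if (c == '\n')
        {
            in_comment = false;
            out += c;         // keep the newline that ends the comment
        }
        // any other character inside the comment is dropped
    }
    return out;
}
```

With the first example above, the abort() line is swallowed together with the comment, while in the second example the line after the comment is copied through.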
Keep track of the current comment state. Something along these lines:
Note: this does not give special treatment to comment sequences that appear within quoted strings, so those are stripped as well.
#include <iostream>
#include <fstream>

enum class comment_state { C, CPP, NEITHER };

comment_state begin_comment( std::istream& stm, char first )
{
    if( first == '/' )
    {
        if( stm.peek() == '*' ) return comment_state::C ; // start of C comment with /*
        else if( stm.peek() == '/' ) return comment_state::CPP ; /* start of C++ comment with // */
    }
    return comment_state::NEITHER ;
}

bool end_comment( std::istream& stm, char first, comment_state curr_state )
{
    if( curr_state == comment_state::C ) return first == '*' && stm.peek() == '/' ; // */
    else if( curr_state == comment_state::CPP ) return first == '\n' ; /* end of line */
    return false ;
}

int main()
{
    std::ifstream file( __FILE__ ) ; // open the file /* this file */ for input
    comment_state curr_state = comment_state::NEITHER ; // current comment state

    char c ;
    while( file.get(c) ) /* for each character including white space characters */
    {
        if( curr_state != comment_state::NEITHER ) /* in a comment right now:
                                                      either /* ... */// or // ...\n
        {
            if( end_comment( file, c, curr_state ) ) //* if the comment has ended *//
            {
                curr_state = comment_state::NEITHER ;
                if( c == '\n' ) std::cout << '\n' ; /* end of C++ comment // ; print the new line */
                else file.ignore(1) ; // end of C comment */ ; extract and discard the /
            }
        }
        else // not in either /* or // */ comment right now
        {
            curr_state = begin_comment( file, c ) ;
            if( curr_state == comment_state::NEITHER ) std::cout << c ;
            else file.ignore(1) ; // discard the second character of the opener, so that /*/ is not taken as a complete comment
        }
    }
}
That makes sense. In this way it will know whether it is in a comment or not, and will not output while in a comment. I think this will work for the comment that has been giving me the issue too!
There are only really four states you need to care about:
• in a double-quoted string "..."
• in a single-quoted string '...'
• in a multi-line comment
• neither
It is entirely possible to design a source file that will trip that up, but you might consider that unlikely enough that you can ignore it for personal use.
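A minimal sketch of the quote-aware idea, in case it helps. The function name strip_comments_keep_quotes is hypothetical, and I have assumed one more state than the list above so that // comments are stripped too. Raw string literals such as R"(...)" are one example of input that would still trip it up.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): strips comments but copies
// string and character literals through untouched, so comment-like text
// inside quotes survives. Raw string literals are NOT handled.
std::string strip_comments_keep_quotes(const std::string& in)
{
    enum class state { neither, in_string, in_char, in_block_comment, in_line_comment };
    state s = state::neither;
    std::string out;

    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const char c = in[i];
        const char next = (i + 1 < in.size()) ? in[i + 1] : '\0';

        switch (s)
        {
        case state::neither:
            if (c == '/' && next == '*')      { s = state::in_block_comment; ++i; }
            else if (c == '/' && next == '/') { s = state::in_line_comment; ++i; }
            else
            {
                if (c == '"')  s = state::in_string;
                if (c == '\'') s = state::in_char;
                out += c;
            }
            break;

        case state::in_string:
        case state::in_char:
            out += c;
            if (c == '\\' && i + 1 < in.size())
                out += in[++i];                        // copy the escaped character as-is
            else if (c == (s == state::in_string ? '"' : '\''))
                s = state::neither;                    // closing quote
            break;

        case state::in_block_comment:
            if (c == '*' && next == '/') { s = state::neither; ++i; }
            break;

        case state::in_line_comment:
            if (c == '\n') { s = state::neither; out += c; }   // keep the newline
            break;
        }
    }
    return out;
}
```

With this, a line like `char prog_name1 [ ] = "// Begin Program Data //";` from the sample file would keep its string contents, while a trailing comment after the semicolon would still be removed.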
[edit]
Heh, this was a fun project.
My automaton works with 'state' being the current state method:
Thank you @JLBorges & @Duthomhas, both of your solutions are also very helpful. I am working on this today and will let all of you know how it goes. Thank you for the help!
boosie
[edit]
I have it working now. The only thing left is to keep comment-like text that appears inside quotes in the source file. I am thinking of adding another flag to track whether the current position is between quotes.