need a way to find and either replace or delete text from a text file

to start with, sorry about the length of the title, i rewrote the title 3 times and that was the shortest one i felt was accurate enough to use.

now, to get down to it, like i said in the title, i am looking for a (hopefully simple) way for a program to find a block of text, and either replace it with some pre-specified text, or just delete the part of the block i don't want.

here is an example of the HTML code i want to edit:

<BLOCKQUOTE>Binds when equipped<BR />
Back<BR />
33 Armor<BR />
Requires Level 45<BR />
Item Level 50<BR />
 <BR />
</BLOCKQUOTE></DIV>

and the result i am trying to break it down to is just the div tag at the end, everything else in that block of code is stuff i want to delete.

i tried doing this manually, and worked for an hour or more, and barely scratched the surface of the number of these changes that i need to do.

i am hoping to find a way to make a program that looks for the text <BLOCKQUOTE>, and when an entry of this code is found, it records the line number where the text was found.

then, i want it to search from the line where it found that first text for the following text </BLOCKQUOTE>. the only difference in the two pieces of text is, the first is a beginning tag, and the second is an end tag.

so, the second text searched for has a "/" whereas the first does not.

another possibility for the second search could have it search for the text </BLOCKQUOTE></DIV>. either way should work, i think.

anyway, once it finds where the first entries of both pieces of code are, the code searched for, and everything in between the two, should be deleted, except for that </DIV> tag.

now, the problem i am having is, i am a bit out of practice in C++. and i can't quite seem to recall what classes would work for this.

i'd list what i have tried, but it's a rather long list, and i need to finish writing this post sometime soon.

firedraco (6248)

In this case I would actually suggest using an editor like Notepad++ that has regular expression support and use that to do a find/replace.

anseloth (12)

well, i use Notepad++, but i didn't think it could do a "find everything that is between each pair of entries of item A and item B".

see, the only parts of the code that don't change are the parts i am doing the searches for, everything in between is unique to each entry pretty much every time.

Abramus (285)

Learn regular expressions.

I haven't used Notepad++ myself, but if it supports regular expressions then you should be able to automate your task easily. For example, assuming Perl-like regular expressions are supported, the following expression will select the whole text starting from <BLOCKQUOTE> and ending at the nearest </BLOCKQUOTE>:

<BLOCKQUOTE>[^\x0]*?</BLOCKQUOTE>
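
To illustrate what "nearest" means here, the following is a minimal sketch that assumes C++11's std::regex instead of Notepad++ or Boost. The function names are hypothetical, and [\s\S] is swapped in for [^\x0] as the "any character, including newlines" class. The lazy quantifier *? stops at the first closing tag it can, while the greedy * runs on to the last one in the text.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes each <BLOCKQUOTE>...</BLOCKQUOTE> span.
// The lazy quantifier *? ends every match at the nearest closing tag.
std::string strip_blockquotes(const std::string& html)
{
    static const std::regex block(R"(<BLOCKQUOTE>[\s\S]*?</BLOCKQUOTE>)");
    return std::regex_replace(html, block, "");
}

// Same pattern with a greedy quantifier, for contrast: a single match
// runs from the first opening tag to the LAST closing tag in the text.
std::string strip_blockquotes_greedy(const std::string& html)
{
    static const std::regex block(R"(<BLOCKQUOTE>[\s\S]*</BLOCKQUOTE>)");
    return std::regex_replace(html, block, "");
}
```

With two blocks in the input, the lazy version removes each block separately and keeps the text between them; the greedy version also swallows everything between the two blocks, which is almost certainly not what is wanted here.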

anseloth (12)

well, in that case, is there a site you would suggest for learning regular expressions?

i'll try googling it, but i never have much luck finding a good site with search engines, all i seem to find are the poorly written ones. i only found this site thanks to my old C++ professor.

in any case, i'll still give the search a shot, but if you can tell me about any good sites for learning regular expressions, that would be a BIG help!

thanks for any and all help.

Abramus (285)

As I said, I'm not sure which regular expression syntax is supported by Notepad++. In any case you could start with Wikipedia for learning basic concepts:

http://en.wikipedia.org/wiki/Regular_expression

The following site describes boost::regex library. It contains information about Perl, POSIX Basic, and POSIX Extended regular expression syntaxes:

http://www.boost.org/doc/libs/1_35_0/libs/regex/doc/html/index.html

anseloth (12)

well, it's official, i can't get Notepad++'s regular expression find and replace feature to do the kind of search i need, unless i can figure out which language's version of regular expressions it uses.

only, i can't seem to find anything that says what version it uses.

so, this keeps me about at square 1. except, i am hoping to find a way to do this editing of a text file in an automatic way, so that all i need to do is click the C++ program's icon, and the editing of the text file is done for me, instead of editing it manually. the current version of the file has already been edited some, and still has 5082 lines.

so, if anyone can suggest a way i can use C++ programming to edit this mess, that would be VERY helpful.

Galik (2254)

Regular expressions are definitely a great way to solve this problem. They do, however, take some time to master.

But taking that time is well worth the effort.

You might want to look at the Boost regex library:

http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/index.html

Galik (2254)

Here is a brief example:

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <iterator>

#include <boost/regex.hpp>

int main(int argc, char *argv[])
{
	/*
	 * Regular expression to find the blockquote elements.
	 * The non-greedy .*? stops at the nearest closing tag.
	 */
	boost::regex find_exp("<BLOCKQUOTE>.*?</BLOCKQUOTE>");

	/*
	 * Input and output files.
	 */
	std::ifstream ifs("input.html", std::ios::binary);
	std::ofstream ofs("output.html", std::ios::binary);

	if(!ifs || !ofs)
	{
		return 1;
	}

	/*
	 * Copy the whole input file into a std::string.
	 * readsome() may return 0 before the end of the file,
	 * so stream the file buffer instead.
	 */
	std::ostringstream oss;
	oss << ifs.rdbuf();
	std::string input = oss.str();

	/*
	 * Apply the search and replace, sending the result
	 * to the output file.
	 */
	std::ostream_iterator<char, char> ofsi(ofs);

	boost::regex_replace(ofsi
		, input.begin()
		, input.end()
		, find_exp
		, ""
		, boost::match_default | boost::format_all);

	return 0;
}

herbert1910 (48)

if you have only one <BLOCKQUOTE> or </BLOCKQUOTE> per line, you can skip stuff before or after your special tag, and skip or keep the lines without one of your special tags.

thisiskept
thisgetskept<BLOCKQUOTE>thisgetsskipped
thiscontinuestobeskipped
thisisskipped</BLOCKQUOTE>thisiskept
thisiskept

this outputs to the screen:

#include <cstddef>
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
	const string open_tag = "<BLOCKQUOTE>";
	const string close_tag = "</BLOCKQUOTE>";

	ifstream infile("test.txt");
	if (!infile) {
		cerr << "could not open test.txt" << endl;
		return 1;
	}

	string str;
	bool skip = false;

	// test getline itself, so a failed read at the end of the file
	// is not processed as an extra empty line
	while (getline(infile, str)) {
		size_t found = str.find(open_tag);
		if (found != string::npos) {
			// keep the stuff before <BLOCKQUOTE>
			// output now since we skip it later
			cout << str.substr(0, found) << endl;
			// begin skipping stuff
			skip = true;
			continue;
		}

		found = str.find(close_tag);
		if (found != string::npos) {
			// keep the stuff after </BLOCKQUOTE>
			str = str.substr(found + close_tag.size());
			// stop skipping
			skip = false;
		}

		// for regular lines and </BLOCKQUOTE> lines
		if (!skip) cout << str << endl;
	}

	return 0;
}

anyone know how to do this in a stream instead of line by line?

i don't know if we can use size_t (found) like that. my compiler may be hiding warnings and auto-casting stuff.

searching for more than one special tag per line is tricky, especially if the end-tag is first on the line.
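
One way around the one-tag-per-line restriction, sketched here under the assumption that the whole file has already been read into a single std::string, is to search the buffer instead of individual lines. The helper name erase_between is hypothetical. As for size_t: std::string::find returns std::string::size_type, which is std::size_t for std::string, so comparing the result against npos is fine.

```cpp
#include <string>

// Hypothetical helper: erases every span that runs from open_tag through
// close_tag (both tags included) and returns the result. Text after the
// closing tag, such as a trailing </DIV>, is left alone.
std::string erase_between(std::string text,
                          const std::string& open_tag,
                          const std::string& close_tag)
{
    std::string::size_type start = text.find(open_tag);
    while (start != std::string::npos)
    {
        // Look for the closing tag only after the opening tag.
        std::string::size_type end = text.find(close_tag, start + open_tag.size());
        if (end == std::string::npos)
            break; // unmatched opening tag: leave the rest untouched

        text.erase(start, end + close_tag.size() - start);

        // Continue searching from the position where the block used to be.
        start = text.find(open_tag, start);
    }
    return text;
}
```

Because the search always starts at an opening tag, an end tag that appears first on a line is simply treated as ordinary text, and several blocks on one line are handled one after another.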

Galik (2254)

anyone know how to do this in a stream instead of line by line?

Here is a stream version:

#include <string>
#include <iostream>
#include <fstream>
#include <sstream>

/**
 * Read text from the input stream until the
 * beginning of an HTML tag.
 *
 * @return a std::string containing the
 * text that was read.
 */
std::string read_to_tag_start(std::istream& is)
{
	int ch;
	std::ostringstream oss;
	while(is.good() && is.peek() != '<')
	{
		ch = is.get();
		if(is.good()) { oss.put(ch); }
	}
	return oss.str();
}

/**
 * Read an HTML tag from the input stream.
 *
 * @return a std::string containing the
 * HTML tag that was read.
 */
std::string read_to_tag_end(std::istream& is)
{
	int ch;
	std::ostringstream oss;
	while(is.good() && is.peek() != '>')
	{
		ch = is.get();
		if(is.good()) { oss.put(ch); }
	}

	if(is.good() && is.peek() == '>')
	{
		ch = is.get();
		if(is.good()) { oss.put(ch); }
	}
	return oss.str();
}

int main(int argc, char *argv[])
{
	std::string tag;
	std::ifstream ifs("input.html");
	std::ofstream ofs("output.html");

	/*
	 * Keep going until an error.
	 */
	while(ifs.good())
	{
		/*
		 * Copy text between tags to the output file.
		 */
		ofs << read_to_tag_start(ifs);

		/*
		 * Copy tag to output file.
		 */
		tag = read_to_tag_end(ifs);
		ofs << tag;

		/*
		 * Do we need to skip?
		 */
		if(tag == "<BLOCKQUOTE>")
		{
			/*
			 * Skip all text and tags until end tag
			 * (or until the input runs out, so that a
			 * missing end tag cannot loop forever).
			 */
			while(ifs.good() && tag != "</BLOCKQUOTE>")
			{
				read_to_tag_start(ifs);
				tag = read_to_tag_end(ifs);
			}
			ofs << tag; // output the end tag
		}
	}

	return 0;
}


herbert1910 (48)

Galik, that is neat! thank you.

your stuff works even when there is more than one tag on a line.

i like the '>' check in line 41. very thorough.
