Trying to strip <Tags> off an HTML document

Oct 5, 2015 at 6:50pm

I am trying to read in an HTML file and strip out the tags: each '<', the matching '>', and everything in between them. Below is some code I've been working on for a while, and I just can't seem to get it right. I believe it's both a logic and a syntax issue. Here's a better view of what I'm trying to do.

Example: Input from file: <html> Hello World! </html>

Output to screen: Hello World!

As you can see, the tags were stripped, but I just can't get the code below to do it.

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main()
{
	int i = 0;
	ifstream inFile;
	string name;
	inFile.open("input.txt");
	int counter = 0;

	name[0] = 0;

	char c;

	while (!inFile.eof())
	{
		inFile.get(c);
		
		name = name + c;

		if (name[i] == '<')
		{
			while (name[i] != '>')
			{
				i++;
			}
		}

		if (name[i] != '<')
		{
			i++;
		}
	
	}

	cout << name;

	system("pause");
	return 0;

}

Oct 5, 2015 at 8:02pm

Norm Gunderson (112)

You're not far off, but a few things are holding you back:

Currently, you are appending c to name every time, even when c is an unwanted character. Instead, append c to name only if it is a valid character (i.e. use an else statement after your if (name[i] == '<')).

If you do the above, you will never need to query the contents of name, so you won't need the i variable anymore. Instead, just query the value of c and, in your inner while loop, keep pulling chars off inFile until c is '>'.

Other than some bomb-proofing in the while loop (to make sure the file doesn't have a missing >), that should just about do it.

Oct 5, 2015 at 8:44pm

Outlaw782 (100)

I appreciate the feedback, I will try this after I get off of work. Have a good day!

Oct 5, 2015 at 8:56pm

Beju (20)

Maybe you are also interested in this version:

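#include <iostream>
#include <fstream>
#include <regex>
#include <iterator>
#include <string>
#include <vector>

int main(int argc, char * argv[])
{
  if (argc < 2)
  {
    std::cerr << "Pass the HTML file name as the first argument." << std::endl;
    return 1;
  }

  std::ifstream f(argv[1]);
  // istreambuf_iterator reads every character; istream_iterator<char>
  // would skip whitespace and print "HelloWorld!".
  std::istreambuf_iterator<char> begin(f), end;
  std::vector<char> html(begin, end);

  std::regex tags("<[^<]*>");
  std::string output;

  std::regex_replace(std::back_inserter(output), html.begin(), html.end(), tags, "");

  std::cout << output << std::endl;
  return 0;
}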

Build with --std=c++11 :)
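
The same idea can be wrapped in a function that works on a std::string already in memory. This is only a sketch: the name stripTagsRegex is hypothetical, and the pattern <[^>]*> is a swapped-in variation on the <[^<]*> used in the program above. With <[^<]*>, a bare '>' in the text can be swallowed as part of a tag match, while <[^>]*> stops at the first '>' after each '<'.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the thread): removes every <...> run
// from `html` and returns what is left.
std::string stripTagsRegex(const std::string& html)
{
    // '<', then any characters except '>', then the closing '>'.
    static const std::regex tag("<[^>]*>");
    return std::regex_replace(html, tag, "");
}
```

Like the program above, this needs a C++11 compiler.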

Oct 5, 2015 at 9:34pm

Outlaw782 (100)

Here is an update on my program. The code still seems to be stuck in an infinite loop, though.

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main()
{
	int i = 0;
	ifstream inFile;
	string name;
	string temp;
	inFile.open("input.txt");
	int counter = 0;

	char c;

	while (!inFile.eof())
	{
		inFile.get(c);

		

		if (c == '<')
		{
			while (c != '>')
			{
				temp = temp + c; // using a string named temp to hold the garbage symbols that we dont want

			}
		}
		
		else if (c != '<')
		{
			name = name + c;
			cout << name;
		}
	}

	

	system("pause");
	return 0;

}

Last edited on Oct 5, 2015 at 9:34pm

Oct 5, 2015 at 11:39pm

Outlaw782 (100)

I was able to get the program to work. What I was missing was a call to istream::ignore:

#include <cstdlib>
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main()
{
	ifstream inFile;
	string name;

	inFile.open("input.txt");

	char c;

	// Testing get() directly avoids processing the last character twice,
	// which while (!inFile.eof()) can do when the final read fails.
	while (inFile.get(c))
	{
		if (c == '<')
		{
			// Skip the rest of the tag: up to 256 characters, through '>'.
			inFile.ignore(256, '>');
		}
		else
		{
			name = name + c;
		}
	}

	cout << name << endl;

	system("pause");
	return 0;
}
