Reading a file until a specific string is found.

closed account (Sw07fSEw)
I'm trying to write a program that can efficiently parse through an XML file. The XML files I'm trying to read are 5 to 20 million characters long, but they're all on ONE line. The whole file itself is only one line, just a really long one. However, 90% of the data on the line are just compressed PDFs, which I don't care about reading. All the data I actually want to read is in the first 10% of the line. What I want to do is parse through the XML until a specific tag is found, in this case, <scanneddocuments> and then disregard everything after that.

I currently have it set up so the program grabs the whole line using getline() and creates a string of 5 million characters, which is not very efficient. This takes a noticeable amount of time, about 10 to 15 seconds. I want to read all the characters up until a specific tag (string) is found, but I'm not sure how to use a string as a delimiter. I know getline() and related functions use a single character as a delimiter, but the main problem is that the data is all on one line. A sample XML file is below:

 
  <?xml version="1.0" encoding="UTF-8"?>readthis<scanneddocuments><document>


Is there a way I can avoid reading the whole line and just create a string of characters until the tag <scanneddocuments> is found?
How about:

1. Determine the length N of the tag.
2. Read the first N-1 characters into a string "line".
3. If there are fewer than N-1 characters, bail out.
4. While (read one more char c):
4a. append c to line
4b. compare the tag with the last N characters of line (could use string::compare, or string::find with a start position of line.size() - N)
4c. if they match, break from the loop
I would read the file tag by tag and add it to an output buffer until you find the tag you want.
Here is a simple example - maybe you can adjust it to your needs. It has hardly any error handling - so use with caution.
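#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

using namespace std;

// Reads any text before the next tag into `text` and the tag itself
// (including '<' and '>') into `tag`. Returns false if no complete tag is left.
bool ReadTag (ifstream &src, string &text, string &tag);

int main ()
{
  const string search_tag = "<scanneddocuments>";
  string wanted;

  ifstream src ("data.xml");
  if (!src)
  {
    cerr << "\aERROR opening file." << "\n\n";
    exit (EXIT_FAILURE);
  }
  string text, tag;
  while (ReadTag (src, text, tag))
  {
    // Keep the text in front of the tag separate from the tag, so the
    // comparison below still matches when text precedes the wanted tag.
    wanted += text + tag;
    if (tag == search_tag)
      break;
  }
  cout << "Output: " << wanted << "\n\n";
  // TODO save the output wherever you want
  return 0;
}

bool ReadTag (ifstream &src, string &text, string &tag)
{
  char ch;
  bool intag = false;
  text.clear ();
  tag.clear ();
  while (src.get (ch))
  {
    if (ch == '<')
      intag = true;   // everything from here up to '>' belongs to the tag

    if (intag)
      tag += ch;
    else
      text += ch;

    if (intag && ch == '>')
      return true;    // a complete tag has been read
  }
  return false;       // end of file before a complete tag
}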
This isn't well tested.
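#include <fstream>
#include <iostream>
#include <string>

int main()
{
	std::string s, out;
	const std::string tag = "scanneddocuments";
	std::ifstream ifs("test.xml");

	// Skip to the next '<', then read the tag name up to the closing '>'.
	while (std::getline(ifs, s, '<') && std::getline(ifs, s, '>'))
	{
		out += '<' + s + '>';
		if (s == tag) break;	// stop once the wanted tag has been copied

		// Copy the text between this tag and the next one.
		std::getline(ifs, s, '<');
		out += s;
		if (ifs.eof()) break;	// no further '<' was found, nothing to put back
		ifs.putback('<');	// restore the delimiter for the loop condition
	}

	std::cout << out << '\n';
}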