I'm trying to write a program that can efficiently parse an XML file. The XML files I'm reading are 5 to 20 million characters long, but each file is a single, very long line. About 90% of that line is compressed PDF data, which I don't need to read; everything I actually want is in the first 10%. What I want to do is parse the XML until a specific tag is found (in this case, <scanneddocuments>) and then disregard everything after it.
I currently have it set up so the program grabs the whole line using getline() and creates a string of 5 million characters, which is not very efficient: it takes a noticeable 10 to 15 seconds. I want to read characters only up to a specific tag (a string), but I'm not sure how to use a string as a delimiter. I know getline() and related functions take a single character as the delimiter, and because the data is all on one line, splitting on newlines doesn't help. A sample xml file is below...
1. Determine the length N of the tag.
2. Read the first N-1 characters into a string "line".
3. If fewer than N-1 characters are available, bail out.
4. While (read one more char c):
4a. append c to line
4b. compare the tag with the last N characters of line (could use string::find starting at position line.size() - N, or string::compare)
4c. if they match, break from the loop
I would read the file tag by tag and add each piece to an output buffer until you find the tag you want.
Here is a simple example - maybe you can adjust it to your needs. It has hardly any error handling - so use with caution.