Extracting text between 2 tags ??

Sep 18, 2009 at 8:38pm

Hi guys,
I have a question about extracting the text between two tags in a file. For example, I have a text file in which tags are defined like this:

<name>ABCD</name>
<Address>kjdksahfj</Address>
<text>jwefndjdnfa
eafsdljkfdn
basjlkfd
</text>

and so on

Two things I need to know:
1) I want to extract only the text inside the <name></name> tags. How can I do it in C++?

2) Also, is there a way to make the code dynamic, so that instead of <name></name> it could be anything, like <myname>ABCD</myname>?

Kindly help me out.

Regards,
Herat

Last edited on Sep 18, 2009 at 8:39pm

Sep 18, 2009 at 8:46pm

Bazzy (6281)

You can get an XML parsing library.
If you need something really simple follow this pseudocode:

find '<'
read until '>'
if what you read == "name"
    read until '<'    // this is the candidate text
    read until '>'
    if what you read == '/' + "name"    // assume "name" to be an object of std::string
        you got your text (what you read before the last '<')
    else
        repeat from line 4 (the first "read until '<'")
else
    repeat from line 1 ("find '<'")

Last edited on Sep 18, 2009 at 8:46pm

Sep 18, 2009 at 11:53pm

oleg (1)

Try this code. First open an ifstream in binary mode (not text mode). Then call
findBlock(stream, yourStartTag, yourEndTag, startingOffsets, nextOffsets); with startingOffsets initialized to (0,0). After this call, nextOffsets should contain the starting and ending offsets of your data. Next, call readLine. If the call succeeds, the returned buffer will be non-empty and contain your text, including EOL characters.

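#include <fstream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// The post does not define the next four types. These definitions are assumptions inferred from how the code uses them.
typedef std::ifstream tINPStream; //the post says to open an ifstream
typedef std::pair<tINPStream::pos_type, bool> tCharLocation; //(position of the character, end-of-line flag)
typedef std::vector<std::string> tSTDStringContainer;
typedef std::vector<char> tCharContainer;
typedef std::shared_ptr<tCharContainer> tCharBufferRef; //some shared_ptr; std::shared_ptr (C++11) is a stand-in
typedef std::pair<tINPStream::pos_type, tINPStream::pos_type> tDataBlockOffsets;
typedef std::pair<std::string, std::string> tINPTag;
// skipEOL() and readDataUntilEOL(), which are called below, are also not included in the post.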
//A set of characters treated as white space: every character code from 1 to last.
struct WhiteSpace
{
    typedef std::string range_type;
    typedef std::string::const_iterator const_iterator;
    typedef std::string::iterator iterator;
    std::string whiteSpace;
    WhiteSpace(size_t last)
    {
        for (size_t i=1; i<=last; ++i)
            whiteSpace.push_back(static_cast<char>(i));
    }
    const std::string& operator()() const
    {
        return whiteSpace;
    }
    const_iterator begin() const
    {
        return whiteSpace.begin();
    }
    iterator begin()
    {
        return whiteSpace.begin();
    }
    const_iterator end() const
    {
        return whiteSpace.end();
    }
    iterator end()
    {
        return whiteSpace.end();
    }
};

static const WhiteSpace g_WhiteSpace(32); //control characters 1..31 plus the blank (32)
static const WhiteSpace g_WhiteSpaceNoBlank(31); //control characters 1..31 only

inline long skipWhiteSpace(tINPStream& stream)
{
    long retVal=0;
    tINPStream::char_type ch;
    while (!stream.eof())
    {
        ch = stream.peek(); //inspect the next char in this stream. Don't remove it from the stream if it's not a white space
        if (g_WhiteSpace().find_first_of(ch) != std::string::npos)
        {
            stream.ignore();
            retVal++;
            continue;
        }
        else
            break;
    }
    return retVal;
}

inline tCharLocation locateCharacter(tINPStream& stream, char ch, tDataBlockOffsets& offset)
{
    tCharLocation retVal(-1, false);
    stream.clear(); //clear all bad flags
    stream.seekg(offset.first, std::ios_base::beg); //seek to starting search location
    char currentChar=' ';
    while(!stream.eof() && offset.first < offset.second)
    {
        offset.first = stream.tellg();
        if (skipEOL(stream))
        {
            offset.first = stream.tellg();
            retVal.second = true;
            return retVal;
        }

        currentChar = stream.peek();
        if (ch == currentChar) //is this the char we're looking for?
        {
            retVal.first = stream.tellg();
            break;
        }
        else
            stream.ignore();
    }
    return retVal;
}

inline tINPStream::pos_type locateToken(tINPStream& stream, const std::string& tag, const tDataBlockOffsets& offset, const WhiteSpace& whiteSpace=g_WhiteSpace)
{
    tINPStream::pos_type retVal(-1);
    tCharContainer* p = new tCharContainer; //create a new container
    p->resize(tag.size()); //reserve enough space to read the tag
    tCharBufferRef iobuffer(p); //wrap it in a shared_ptr
    stream.clear(); //clear all bad flags
    stream.seekg(offset.first, std::ios_base::beg); //seek to starting search location
    tDataBlockOffsets nextOffset(offset);
    while(!stream.eof() && nextOffset.first < nextOffset.second)
    {
        tCharLocation charValue = locateCharacter(stream, tag[0], nextOffset);
        if (charValue.first != -1) //is this the char we're looking for?
        {
            tSTDStringContainer tokens;
            tINPStream::pos_type bytesRead = readDataUntilEOL(stream, *iobuffer, tokens, whiteSpace);
            if (!tokens.empty() && tokens[0] == tag) //is this the token we're looking for?
            {
                retVal = stream.tellg() - bytesRead; //position of the first character of the sought string
                return retVal;
            }
            else
            {
                for (size_t i=0; i<bytesRead; ++i)
                    stream.unget();
                stream.ignore(); //discard the character that triggered this read
                nextOffset.first = stream.tellg();
            }
        }
    }
    return retVal;
}
inline bool findBlock(tINPStream& stream, const std::string& tagStart, const std::string& tagEnd, const tDataBlockOffsets& startingOffsets, tDataBlockOffsets& nextOffsets, const WhiteSpace& whiteSpace=g_WhiteSpace)
{
    tCharBufferRef iobuffer;
    tDataBlockOffsets nextLocation(startingOffsets);
    stream.clear(); //clear all error flags
    stream.seekg(startingOffsets.first); //starting point of the search
    nextOffsets.first = locateToken(stream, tagStart, nextLocation); //offset to the beginning of the start tag
    if (nextOffsets.first == -1)
        return false;

    nextLocation.first = stream.tellg(); //update search start
    nextOffsets.second = locateToken(stream, tagEnd, nextLocation); //offset to the beginning of the end tag
    if (nextOffsets.second == -1)
        return false;

    //nextOffsets contains two offsets. The first one is the beginning of the first token. The second is the beginning of the second token.
    //We have to skip only the first token, since the length of the second token does not contribute to the offset.
    //tokenStartOffset is not declared anywhere in the post; it presumably holds the offset just past tagStart.
    nextOffsets.first = (std::streamoff)tokenStartOffset.second + skipWhiteSpace(stream); //skip eol characters if any
    return true;
}

inline tCharBufferRef readLine(tINPStream& stream, const tDataBlockOffsets& offsets)
{
    tCharContainer* p = new tCharContainer; //create a new container
    p->resize(1); //reserve space to initiate a read
    tCharBufferRef iobuffer(p); //wrap it in a shared_ptr
    stream.seekg(offsets.first); //seek to the desired file offset
    tINPStream::pos_type filePosition = stream.tellg(); //current file position
    tINPStream::char_type ch;

    while (!stream.eof() && filePosition < offsets.second) //do not read beyond the specified range or eof.
    {
        if (skipEOL(stream)) //is it an eol?
            return iobuffer; //found eol. Return.

        stream.read(&ch, sizeof(tINPStream::char_type));
        iobuffer->push_back(ch); //accumulate non-skippable characters
        filePosition = stream.tellg(); //update file position
    }
    return iobuffer;
}

Sep 19, 2009 at 2:32am

jsmith (5804)

Wow, that's quite an impressive amount of code.

I _think_ this can be done in less than 5 lines of code using boost::regex.
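
To give an idea of what the regex approach might look like, here is a rough sketch. Note that it uses std::regex from C++11 rather than boost::regex, so it needs no third-party library (std::regex was not yet part of the standard when this thread was written). The function name extractWithRegex is made up, and the tag name is assumed to contain only letters and digits, so that it can be pasted into the pattern without escaping.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the thread): returns the text enclosed by the
// first <tag>...</tag> pair in `input`, or an empty string if there is none.
// Assumes `tag` contains no regex metacharacters.
std::string extractWithRegex(const std::string& input, const std::string& tag)
{
    // [\s\S]*? matches any characters, newlines included, as few as possible,
    // so the match stops at the first closing tag.
    const std::regex pattern("<" + tag + ">([\\s\\S]*?)</" + tag + ">");
    std::smatch match;
    if (std::regex_search(input, match, pattern))
        return match[1].str();         // first capture group: the enclosed text
    return std::string();
}
```

With the sample file from the question loaded into a string, extractWithRegex(contents, "name") would be expected to return ABCD, and passing "text" would return the multi-line block.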

Sep 19, 2009 at 9:52pm

herat007 (14)

Thanks for the replies, guys. I have done it using the STL and basic file operations. I didn't want to make it too complex, and in my project I cannot use third-party libraries like Boost.
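
The actual solution was never posted, so the following is not the poster's code. It is only a guess at how an approach based on the standard library and basic file operations could look: read the whole file into a std::string, then use std::string::find to locate the opening and closing tags. The names readFile and extractTag are made up for illustration.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: reads an entire file into a string.
// Returns an empty string if the file cannot be opened.
std::string readFile(const std::string& path)
{
    std::ifstream file(path.c_str(), std::ios::binary);
    std::ostringstream contents;
    contents << file.rdbuf();
    return contents.str();
}

// Hypothetical helper: returns the text enclosed by the first <tag>...</tag>
// pair in `content`, or an empty string if either tag is missing.
std::string extractTag(const std::string& content, const std::string& tag)
{
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";

    const std::string::size_type start = content.find(open);
    if (start == std::string::npos)
        return std::string();          // opening tag not found

    const std::string::size_type textBegin = start + open.size();
    const std::string::size_type end = content.find(close, textBegin);
    if (end == std::string::npos)
        return std::string();          // closing tag not found

    return content.substr(textBegin, end - textBegin);
}
```

A call such as extractTag(readFile("input.txt"), "name") would then return the text between <name> and </name>, where input.txt stands for whatever file holds the tags. Because the tag name is an ordinary string argument, the same call works for <myname> or any other tag.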
