I'm looking to write a program in C/C++ to traverse a FASTA file formatted like:
>ID and header information
SEQUENCE1
>ID and header information
SEQUENCE2
and so on
in order to find all unique sequences (i.e. check whether each sequence is a subset of any other sequence) and write the unique sequences (and all headers) to an output file.
My approach was:
Prep: Copy all sequences to an array/list at the beginning (is there a more efficient way to do this?)
Grab a header, append it to the output file, and compare the sequence for that header to everything in the list/array. If it is unique, write it under the header; if not, move on.
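To make the comparison step concrete, here is a minimal sketch of what I have in mind. It assumes that "subset" means "occurs as a contiguous substring of another sequence" and that, among exact duplicates, the first occurrence counts as the unique one. Both are my assumptions, and the function name isUnique is just a placeholder.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true if sequences[i] is not contained in any other sequence.
// Assumption: "contained" means contiguous substring (std::string::find).
// Assumption: for exact duplicates, only the first occurrence is unique.
bool isUnique(const std::vector<std::string>& sequences, std::size_t i) {
    const std::string& s = sequences[i];
    for (std::size_t j = 0; j < sequences.size(); ++j) {
        if (j == i) continue;
        const std::string& t = sequences[j];
        if (t.find(s) == std::string::npos) continue;  // s does not occur in t
        // s occurs in t, so t is either strictly longer or identical to s.
        if (t.size() > s.size() || j < i) return false;
    }
    return true;
}
```

This compares every sequence against every other one, so it is quadratic in the number of sequences, which may be too slow for large files.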
However, I'm a little unsure how to approach reading the lines in properly. I need to read the top line for the header, and then "return?" to the next line to read the sequence. Sometimes the sequence spans more than two lines, so would I use > (from the example above) as a delimiter? If I use C++, I imagine I'd use iostreams to do the reading.
If anybody could give me a nudge in the right direction as to how I would want to read the information I need to manipulate/how to carry out the comparison, it'd be greatly appreciated.