Reading enzyme acronym and recognition sequence from file

Mar 8, 2017 at 12:30am
I'm having trouble reading the acronym and recognition sequences from a file.

Let's say I have these lines in my file:
BsaJI/C'CNNGG//
BsaWI/W'CCGGW//
BsaXI/ACNNNNNCTCCNNNNNNNNNN'/NNN'NNNNNNNGGAGNNNNNGT//
BsaXI/GGAGNNNNNGTNNNNNNNNNNNN'/NNN'NNNNNNNNNACNNNNNCTCC//
BsbI/CAACACNNNNNNNNNNNNNNNNNNNNN'/NN'NNNNNNNNNNNNNNNNNNNGTGTTG//
Bsc4I/CCNNNNN'NNGG//
BscAI/GCATCNNNN'NN/'NNNNNNGATGC//
BscGI/CCCGT/ACGGG//
Bse1I/ACTGGN'/NC'CAGT//


For example, BsaXI is the acronym.
The sequences are:
ACNNNNNCTCCNNNNNNNNNN'
NNN'NNNNNNNGGAGNNNNNGT
GGAGNNNNNGTNNNNNNNNNNNN'
NNN'NNNNNNNNNACNNNNNCTCC

Here's what I'm attempting to do: read each line up to the first '/' character; that part is the acronym. Then, from the same line, take each following substring up to the next '/' as a recognition sequence for that acronym. I keep parsing through the string and insert each sequence into my data structure.
ifstream theFile;
theFile.open(db_filename);
if (theFile.is_open()) {
    string aFileLine;
    while (getline(theFile, aFileLine)) {
        // Sets the enzyme acronym
        string anEnzymeAcronym = aFileLine.substr(0, aFileLine.find('/'));
        string aRegoSequence;
        // Keeps track of the starting position of the recognition sequence
        int tracker = 0;
        int wholeStringLength = aFileLine.length();
        string test = "";
        // While the tracker is not up to the 2nd last '/'
        while (tracker != wholeStringLength - 2) {
            string a = aFileLine.substr(tracker, aFileLine.find('/'));
            // Updates tracker position to read the next recognition sequence
            tracker += a.length() + 2;
            aRegoSequence = aFileLine.substr(tracker, aFileLine.find('/'));
            // Creates a SequenceMap object with the recognition sequence and enzyme acronym
            SequenceMap aSequenceMap(aRegoSequence, anEnzymeAcronym);
            // Inserts the sequence and acronym into the tree
            a_tree.insert(aSequenceMap);
        }
    }
}
else {
    cout << "No file exists!\n";
}
Last edited on Mar 8, 2017 at 2:30am
Mar 8, 2017 at 2:22am
Since you haven't posted your whole code, I can only suggest how to do it.

Prepare a vector to store the recognition sequences.

1. Read each line with std::getline().
2. Load the line you have read into a std::istringstream.
3. Call std::getline() on the std::istringstream with delimiter '/' to get the enzyme name.
4. If the enzyme name is the acronym you need:
    4.1. Keep calling std::getline() with delimiter '/' to get the recognition sequences, and push each one into the vector.


If you follow these instructions closely you will never fail. Be sure to know what a std::istringstream is.
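
The steps above can be sketched roughly as follows. This is a minimal sketch, not the poster's actual code: `EnzymeEntry`, `parse_enzyme_line` and `parse_enzymes` are hypothetical names introduced for illustration, and it assumes that empty fields (such as the one produced by the trailing "//") should simply be skipped.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One parsed line: the enzyme acronym plus its recognition sequences.
// (EnzymeEntry and the function names below are hypothetical.)
struct EnzymeEntry {
    std::string acronym;
    std::vector<std::string> sequences;
};

// Splits a single line such as "BscGI/CCCGT/ACGGG//".
EnzymeEntry parse_enzyme_line(const std::string& line)
{
    EnzymeEntry entry;
    std::istringstream fields(line);

    // Step 3: the first '/'-delimited field is the acronym.
    std::getline(fields, entry.acronym, '/');

    // Step 4.1: every further non-empty field is a recognition sequence.
    // The trailing "//" yields an empty field, which is skipped (assumption).
    std::string sequence;
    while (std::getline(fields, sequence, '/')) {
        if (!sequence.empty()) entry.sequences.push_back(sequence);
    }
    return entry;
}

// Steps 1 and 2: read the input line by line and parse each non-empty line.
std::vector<EnzymeEntry> parse_enzymes(std::istream& in)
{
    std::vector<EnzymeEntry> entries;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back(); // tolerate CRLF files
        if (line.empty()) continue;                                // ignore blank lines
        entries.push_back(parse_enzyme_line(line));
    }
    return entries;
}
```

Calling `parse_enzymes(theFile)` on an open `std::ifstream` would return one entry per line; each sequence in an entry could then be inserted into a tree together with `entry.acronym` instead of being kept in the vector.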
Last edited on Mar 8, 2017 at 2:23am
Mar 8, 2017 at 2:33am
I was planning to use std::getline with '/' as the delimiter, but I don't know how to skip the enzyme acronym, because if I do something like

getline(theFile, aRegoSequence, '/');

the first read stops at the first '/' and gives me the enzyme acronym rather than a sequence. So I opted to make a counter, parse through the string, and keep track of the position. I don't think storing the sequences in a vector is necessary, since I create a SequenceMap from the sequence and the acronym and use my own insert function to place it into my tree.
Last edited on Mar 8, 2017 at 2:36am
Mar 8, 2017 at 2:37am
If you follow these instructions closely you will never fail.

What I'm asking is: do you know std::istringstream? When you read a whole line with std::getline(), pass the string to a std::istringstream and let it do the job.

getline(theFile, aRegoSequence, '/');

I never told you to do that. I want you to read a whole line and pass it to a std::istringstream; then you can use std::getline() on that stream with delimiter '/'.
Last edited on Mar 8, 2017 at 2:45am
Mar 8, 2017 at 3:28am
For your domain, it would be worthwhile to learn to use the regular expressions library.
http://en.cppreference.com/w/cpp/regex
It would come in very handy, over and over again.

Here's an example of using regular expressions to do this particular task:

#include <iostream>
#include <regex>
#include <string>
#include <map>
#include <set>
#include <sstream>

// rec_seq_map_type maps an acronym (key) to a set of all its recognition sequences
using rec_seq_map_type = std::map< std::string, std::set<std::string> > ;

rec_seq_map_type get_rec_seq_from( std::istream& stm )
{
    rec_seq_map_type map ;

    std::string line ;
    while( std::getline( stm, line ) )
    {
        // parse the line into / delimited tokens
        // a token is a sequence of one or more characters other than /
        const std::regex re( "[^/]+" ) ;
        std::sregex_iterator iter( line.begin(), line.end(), re ), end ;

        if( iter != end ) // if there is at least one match
        {
            // the first token, the key
            auto& set = map[ iter->str() ] ; // reference to the set associated with this key

            // the remaining tokens are the recognition sequences; insert them into the set
            for( ++iter ; iter != end ; ++iter ) set.insert( iter->str() ) ;
        }
    }

    return map ;
}

int main()
{
    std::istringstream file(
                "BsaJI/C'CNNGG//\n"
                "BsaWI/W'CCGGW//\n"
                "BsaXI/ACNNNNNCTCCNNNNNNNNNN'/NNN'NNNNNNNGGAGNNNNNGT//\n"
                "BsaXI/GGAGNNNNNGTNNNNNNNNNNNN'/NNN'NNNNNNNNNACNNNNNCTCC//\n"
                "BsbI/CAACACNNNNNNNNNNNNNNNNNNNNN'/NN'NNNNNNNNNNNNNNNNNNNGTGTTG//\n"
                "Bsc4I/CCNNNNN'NNGG//\n"
                "BscAI/GCATCNNNN'NN/'NNNNNNGATGC//\n"
                "BscGI/CCCGT/ACGGG//\n"
                "Bse1I/ACTGGN'/NC'CAGT//\n" ) ;

    const auto rec_seq_map = get_rec_seq_from(file) ;

    for( const auto& pair : rec_seq_map )
    {
        std::cout << "recognition sequences for acronym: " << pair.first << '\n' ;
        for( const auto& str : pair.second ) std::cout << str << '\n' ;
        std::cout << '\n' ;
    }
}

http://coliru.stacked-crooked.com/a/270d027939e06c39
Mar 8, 2017 at 5:57am
I attempted this another way, using an integer to keep track of how far I've parsed, but I'm getting an out-of-bounds error.
ifstream theFile;
theFile.open(db_filename);
if (theFile.is_open()) {
    string aFileLine, randomLine;
    // Skip the first 10 lines of the file
    for (int i = 0; i < 10; i++) {
        getline(theFile, randomLine, '\n');
    }
    while (getline(theFile, aFileLine)) {
        // Sets the enzyme acronym
        string anEnzymeAcronym = aFileLine.substr(0, aFileLine.find('/'));
        string aRegoSequence;
        // Keeps track of the starting position of the recognition sequence
        int tracker = anEnzymeAcronym.length() + 1;
        // While the tracker is not up to the 2nd last '/'
        while (tracker != aFileLine.length() - 1) {
            string remainingString = aFileLine.substr(tracker);
            aRegoSequence = aFileLine.substr(tracker, remainingString.find('/'));
            // Updates tracker position to read the next recognition sequence
            tracker += aRegoSequence.length() + 1;
            // Creates a SequenceMap object with the recognition sequence and enzyme acronym
            SequenceMap aSequenceMap(aRegoSequence, anEnzymeAcronym);
            // Inserts the sequence and acronym into the tree
            a_tree.insert(aSequenceMap);
        }
    }
}
else {
    cout << "No file exists!\n";
}


I'm not sure why this doesn't work, because I think my logic is correct. I tried using the same logic in cpp shell and it works perfectly.

#include <iostream>
#include <string>
using namespace std;

int main()
{
  string wholeSequence = "AnSi/XDSACSD'CSADASDCm/SDASDASD'SDASXCZXm//";
  string ac = wholeSequence.substr(0, wholeSequence.find('/'));
  int tracker = ac.length() +1;
  cout << "Tracker length: " << tracker << endl;
  while( tracker != wholeSequence.length() -1){
    string remainingString = wholeSequence.substr(tracker);
    string en = wholeSequence.substr(tracker, remainingString.find('/'));
    tracker += en.length()+1;
    cout << en << endl;
    cout << "Tracker length: " << tracker << endl;
  }
  cout << "Whole Sequence Length: " << wholeSequence.length() << endl;
  cout << "end" << endl;
  
}
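
One possible explanation, which the thread does not confirm, is that the real file contains a line the test string does not cover. The condition `tracker != length() - 1` is only ever met when the line ends in "//". For an empty line, the acronym is empty, `tracker` becomes 1, `length() - 1` wraps around because `length()` is unsigned, and `substr(1)` on an empty string throws std::out_of_range. A line that does not end in "//" overshoots in the same way. A sketch of position tracking that stops on `start < line.length()` instead is shown below; `split_fields` is a hypothetical helper name.

```cpp
#include <string>
#include <vector>

// Splits `line` on '/' by tracking a position and returns the non-empty
// fields. The first field is the acronym, the rest are recognition sequences.
// (split_fields is a hypothetical helper, not from the original post.)
std::vector<std::string> split_fields(const std::string& line)
{
    std::vector<std::string> fields;
    std::string::size_type start = 0; // unsigned, same type as line.length()

    while (start < line.length()) {
        // Search from `start` directly instead of copying the remainder.
        std::string::size_type slash = line.find('/', start);
        if (slash == std::string::npos) slash = line.length(); // no more '/'

        // Skip empty fields such as the one between the trailing "//".
        if (slash > start) fields.push_back(line.substr(start, slash - start));
        start = slash + 1; // continue just past the delimiter
    }
    return fields;
}
```

With this loop an empty line returns no fields at all, so the caller can check `fields.size() < 2` and skip the line before building any SequenceMap.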


This is my SequenceMap class constructor:
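#include <string>
#include <vector>

class SequenceMap {
public:
    // Stores one recognition sequence together with its first enzyme acronym.
    SequenceMap(const std::string &a_rec_seq, const std::string &an_enz_acro) {
        recognition_sequence_ = a_rec_seq;
        enzyme_acronyms_.push_back(an_enz_acro);
    }
    // (other members are not shown in the post)

private:
    std::string recognition_sequence_;
    std::vector<std::string> enzyme_acronyms_;
};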
Last edited on Mar 8, 2017 at 6:04am