storing character/string input from ifstream object

Hi all,

Complete newbie here, I am a statistician trying to write a program to process a very large data file. I'd be very grateful for advice from any of you proper programmers! Full code is shown at the end of the post.

I am trying to get some text input (gene names and snp names) from a couple of different files and store the information in some data structures (strings, I think) so that I can later compare input from another very large input file to these (still to be coded!). I think strings would be handy so I can use strcmp(). My problem is handling the input from my ifstream object. At first I tried the >> operator but that doesn't seem to work with my ifstream object (see commented out code & error message). I'm not sure why this is, as this page http://www.cplusplus.com/reference/iostream/ifstream/ seems to suggest it should. I then decided to use getline and access the characters I needed. Getline works to a point - I seem to be able to put the whole input line into a string (see first bit of code using inFile1), but I have problems when I am trying to break it up and store it in different structures (second bit using inFile2). I think I'm using the indices wrongly. I expected the cout statements to show a nice list of gene names (first loop) and snp names (second loop). Instead the first cout statement produces:

N
S
G
0
0
0
0
0
2
1
5
7
7
8
\342
\217

repeated over and over again. 'ENSG00000215778' is the name of the last gene in the file (note dropped 'E'). The second produces

\342
\217








\342
\217

repeated many times. I've tried changing the indexing around but I can't get anything more sensible.

Any help with either improving this code or trying a different approach would be very gratefully received.

Thanks,
Jen

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main (int argc, char * const argv[]) {

const int MAXLENGTH = 70;
const int MAXCHARS = 67;
const int NGENES = 1165;
const int SNAMELENGTH = 13;
const int GNAMELENGTH = 16;
const int NSNPS = 15010;
ifstream inFile1;
ifstream inFile2;
int i=0; //counts lines in infiles
int j=0;
int k=0;
char line[MAXCHARS]; //holds whole line of input to be split into 3 vars
char gene2snp[MAXLENGTH] = "gene2snps_sorted.txt";
char genelist[MAXLENGTH] = "genelist.txt";
string genes[NGENES]; //array of strings holding sorted gene names
string g2ssnp[NSNPS][SNAMELENGTH]; //array of strings holding sorted snp names
string g2sgene[NSNPS][GNAMELENGTH]; //array of strings holding sorted gene names associated with g2ssnp[]

//read in list of gene names to genes[]
inFile1.open(genelist, ios::in);

if (inFile1.fail()) //check for successful open
{
std::cout<<"\nThe file was not successfully opened. Please check it exists." << endl;
exit(1);
}

while (!inFile1.eof()) //check for end of file
{
// inFile1 >> genes; // this commented-out code doesn't work
inFile1.getline(line, GNAMELENGTH,'\n');
genes[i] = line;
i++;
}

inFile1.close();

cout << genes[0] << '\n' << genes[10];

i=0;

//open gene2snp list for reading in
inFile2.open(gene2snp, ios::in);

if (inFile2.fail()) //check for successful open
{
std::cout<<"\nThe file was not successfully opened. Please check it exists." << endl;
exit(1);
}

while (!inFile2.eof()) //check for end of file
{
// inFile2 >> g2sgene >> g2ssnp; doesn't work - error message 'error: no match for 'operator>>' in 'inFile2' >> g2sgene'
inFile2.getline(line, GNAMELENGTH,'\n');
{
//put chars 0-15 and 17-28 into g2sgene and g2ssnp
for(i=0;i<15;i++)
{
g2sgene[i][j]=line[i];
cout << g2sgene[i][j]<<'\n';
}

for(i=17;i<27;i++)
{
g2ssnp[k][i-17]=line[i];
cout << g2ssnp[k][i-17] << '\n';
}
j++; k++;
}
}
inFile2.close();
}

I don't completely understand your code (I didn't look at it very closely), but I have two questions:
What do the input files look like?
What should the output look like?
(Examples please)

Generally, you can use get() to read input from a file (I prefer this to getline() because I feel I have more control over what's going on, although that's probably just me):
ifstream infile;
infile.open(/*arguments*/);
...//check whether opening succeeded
char temp;
while (infile.get(temp)) //loop ends as soon as no further character can be read
{
...//do whatever you want to do with the character
}
Thanks for that Scipio, I could try get() and deal with each character at a time. Will post if I get it working.

A bit more about my data.

For the first input (inFile1) the input file is a list (about a thousand) of 15-character gene names:

ENSG00000000078
ENSG00000000189
ENSG00000000224

etc. I don't want to output this to a file, I want to store the gene names in an array of strings: string genes[NGENES]

For the second input, the file has gene names with many repeats, plus another 'column' holding names of 'snps' (a snp is a tiny piece of information from inside of a gene). So this file actually lists which snps are within which genes:

ENSG00000000078 rs10008
ENSG00000000078 rs129788
ENSG00000000189 rs200898

So, snps rs10008 and rs129788 are in gene ENSG...78. From this input, I want to put all the gene names into an array of strings (g2sgene) and the snp names into another (g2ssnp). I will eventually use these arrays to look up values from another input file (e.g. will get rs10008 from an input file, find it in g2ssnp, get corresponding value from g2sgene). So if all of these are strings I can just use strcmp() when looking them up.

Hope that makes sense, and I appreciate the time you've spent on this.
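
A minimal sketch of the lookup described above, assuming the two columns have already been read into parallel containers. std::string values can be compared directly with ==, so strcmp() (which works on C-style char arrays) is not needed. The helper names buildSnpToGene and geneForSnp are hypothetical, std::vector stands in for the fixed-size arrays, and std::unordered_map requires a C++11 or later compiler.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: pairs g2ssnp[i] with g2sgene[i] in a hash table keyed by snp name.
// Assumes the two vectors are parallel (one entry per row of the gene-to-snp file).
std::unordered_map<std::string, std::string> buildSnpToGene(
        const std::vector<std::string>& g2ssnp,
        const std::vector<std::string>& g2sgene) {
    std::unordered_map<std::string, std::string> snpToGene;
    for (std::size_t i = 0; i < g2ssnp.size() && i < g2sgene.size(); ++i) {
        snpToGene[g2ssnp[i]] = g2sgene[i];  // a repeated snp keeps the last gene seen
    }
    return snpToGene;
}

// Hypothetical helper: returns the gene for a snp, or an empty string if the snp is unknown.
std::string geneForSnp(const std::unordered_map<std::string, std::string>& snpToGene,
                       const std::string& snp) {
    std::unordered_map<std::string, std::string>::const_iterator it = snpToGene.find(snp);
    return it == snpToGene.end() ? std::string() : it->second;
}
```

With the table built once, each snp read from the large file can be looked up with a single call such as geneForSnp(table, "rs10008"), rather than scanning the whole array for every input line.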
OK, so you have a list of genes (ENSG...), all listed in FileA, and to each of those codes belong one or several codes (rs...), listed in FileB. Right?

If the number of snp names is fixed, you could use a two-dimensional array: string information[NUMBER_OF_GENES][2] (I assumed it would be two because of your example). Of course, you could also do this when the number of snp names is not fixed but bounded by a maximum. This would be the easiest solution, I think.

If the number of snp names isn't fixed or bounded by a maximum, you could use vectors or something like that. We'll figure that out when it's necessary.

Let me know whether you can figure the input out.
Thanks for your help Scipio. My problem turned out to be not the structure I was storing into, but the way I was extracting from the stream. I got the input in as a string by using getline() instead of ifstream.getline(). I'm still having problems storing a substring into a string array though. Here is some much-reduced code that is hopefully clearer. The output (below the code) only displays the first gene name, first snp name, then half of the second snp name. Could anyone let me know why all of the file is not being read in - or is it being read in, but not displayed? Am I incrementing i in the right place (have tried others with no success)?

Many thanks to all who look

#include <iostream>
#include <string>
#include <fstream>

using namespace std;

int main (int argc, char * const argv[]) {

const int INPUTROWS = 6;
const int FNAMELENGTH = 17;
int i=0;
ifstream infile;
char filename[FNAMELENGTH] = "/input_test.txt";
string line;
string gnames[INPUTROWS];
string snames[INPUTROWS];

infile.open(filename, ios::in);

if (infile.fail()) //check for successful open
{
std::cout<<"\nThe file was not successfully opened. Please check it exists." << endl;
exit(1);
}

while (!infile.eof()) //check for end of file
{
getline(infile, line);
{
gnames[i]=line.substr(0, 15);
snames[i]=line.substr(17, 12);
cout<< gnames[i] << " " << snames[i] << '\n';
i++;
}
}

infile.close();

return 0;
}


Output:

[Session started at 2009-01-15 15:01:05 +0000.]
ENSG00000073945 rs10008
ENSG

The Debugger has exited with status 0.
Okay, got it working using the >> operator which I tried 2 days ago - didn't work then, does work now, and no idea why!! Nevermind, I'm just glad it's functioning.

Thanks for your help Scipio.
Instead of using >> you should use getline();

http://www.cplusplus.com/forum/articles/6046/
Zaita - thanks for your reply. I've read that posting and I understand it, but think my problem is not using getline to get the initial string (I eventually 'found' the function getline() on Wed night and got it working in a simple case), it's how to process the input after. I tried assigning substrings of the input line to 2 different arrays of strings and it didn't work. Possibly I'm not understanding how getline() 'loops' through the lines in a file, so maybe my index 'i' is not being used correctly.

All of this code (and the incorrect output) is in my posting from Jan 15 @ 3:06pm, when I was asking how to make getline work this way. I'm happy to admit that >> isn't the best way, but could you possibly look at this post and tell me why getline() & the subsequent string assignments aren't working? Otherwise I may have to leave >> in there just to carry on and get this working. I'd be grateful for any corrections or just to be pointed in the right direction as I'd rather learn the 'right' way to do this.

Many thanks for your time,
Jen
Hi Jen,
Ironically, my job is to stop statisticians writing code :P I'm a software dev/scientific programmer.

Anyways, I have written a small piece of code to illustrate a better method of memory allocation, data storage, string splitting and file reading.

#include <iostream>
#include <string>
#include <fstream>
#include <vector>

using namespace std;

struct dataLine {
  string gNames;
  string sNames;
};

int main (int argc, char * const argv[]) {

  vector<dataLine*> dataList;

  ifstream infile("input.txt");
  if (!infile) {
    std::cout<<"\nThe file was not successfully opened. Please check it exists." << endl;
    return 1;
  }

  string line = "";
  while (getline(infile, line)) {
    int spaceLocation = line.find_first_of(' ');
    dataLine *newLine = new dataLine(); // Allocate memory for new line
    newLine->gNames = line.substr(0, spaceLocation);
    newLine->sNames = line.substr(spaceLocation+1, line.length()-spaceLocation);

    dataList.push_back(newLine); // Add the new line to our vector
    cout << "Read: " << newLine->gNames << " - " << newLine->sNames << endl;
  }
  infile.close();

  // Free the memory to prevent a leak
  for (unsigned i = 0; i < dataList.size(); ++i)
    delete dataList[i];

  return 0;
}


The file I used as input was:
ENSG00000000078 rs10008
ENSG00000000078 rs129788
ENSG00000000189 rs200898


And the output was:
Read: ENSG00000000078 - rs10008
Read: ENSG00000000078 - rs129788
Read: ENSG00000000189 - rs200898


May I ask what organisation you work for?
Hmm, I think stopping me from writing code would be a great idea! Sadly we're academics (I'm doing a PhD in statistical genetics in London, UK) and we don't have programmers working with us. I'd gladly hand it over if we did! However I am auditing some programming courses and working hard to learn things the right way so hopefully in the next 3-4 years I'll become a decent programmer.

Thank you very much for the above code. I'll have to take it away to digest it & try it out. I do have my full working code now but it is slow (not good with 83m line input file!), so I'll probably take advantage of your advice and your code.

Best wishes,
Jen

ps if you have any advice for practical books for scientific programmers I'd love to hear it. I am considering 'C++ for Mathematicians', by Edward Scheinerman.
closed account (z05DSL3A)
I'm not condoning the downloading of copyrighted materials, but the following thread has a link to download some books, one of which is 'C++ for Mathematicians' by Edward Scheinerman, in case you want to scan a few chapters to see if you like it.

http://www.cplusplus.com/forum/lounge/6775/

Thanks Grey Wolf, I will investigate.