How to input different data types into array?

Help out the biologist please:

I have a tab-delimited data set that looks like this (there is no actual header row; it is shown just for clarity):

Library Position Locus Gene Annotation
L3L1 12787 rcc_002305 abcD Does something or other
L4L2 197787 rcc_002904 efgH Does something else

I think? I need to create an array so that I can compare multiple lines at once (I tried fiddling with while..getline and chunking it into substrings but this ultimately only allows me to compare adjacent lines, and I will need to compare up to 8 at once in some cases).

I've tried fin<<str library<<int position<<string locus etc. but then I can't get the whole annotation line (spaces included) in as a single string; this also? only allows comparison of adjacent lines.

The comparisons are things like: output to file 1 if position is exactly the same in multiple lines, but library is different, output to file 2 if locus is the same but position is different.

Thanks in advance for your help -- I'm frustrated with the compiler barfing things like "incompatible types in assignment of 'const char[12] to char [4]"
Including the <string> header gives you access to std::getline, which has an argument for setting the delimiter.

Simply rewrite this to use std::getline.
#include <fstream>
#include <string>

void readFile(std::ifstream &fin, /* Other values here? Maybe a struct or a few arrays? */)
{
std::string lib, locus, gene, annot;
int pos;

// read in data
fin >> lib >> pos >> locus >> gene;
std::getline (fin, annot); // This should work since getline starts at the current input position
// and reads up to its delimiter (default newline).
}

int main()
{
std::ifstream fin("filename.ext");
readFile(fin, /* other arguments */);
// code here
}
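One caveat with the snippet above: `>>` stops at the whitespace after the gene name, so the `getline` call picks up the annotation with the separating tab still attached at the front. Since the file is tab-delimited, an alternative is to split each line on '\t' using the delimiter argument. A minimal sketch, assuming exactly five tab-separated fields per line; `Record` and `parseRecord` are made-up names for illustration, not anything from the original code:

```cpp
#include <sstream>
#include <string>

// Hypothetical record type; field names follow the column labels in the question.
struct Record {
    std::string lib;
    int pos;
    std::string locus;
    std::string gene;
    std::string annot;
};

// Split one tab-delimited line into its five fields.
// Returns false if a field is missing or the position does not start with a number.
bool parseRecord(const std::string& line, Record& rec)
{
    std::istringstream fields(line);
    std::string posText;

    if (!std::getline(fields, rec.lib, '\t')) return false;
    if (!std::getline(fields, posText, '\t')) return false;
    if (!std::getline(fields, rec.locus, '\t')) return false;
    if (!std::getline(fields, rec.gene, '\t')) return false;

    rec.annot.clear();
    std::getline(fields, rec.annot);   // rest of the line, spaces included

    // Convert the position text to an int.
    std::istringstream posStream(posText);
    posStream >> rec.pos;
    return !posStream.fail();
}
```

Each line obtained with `std::getline(fin, line)` can then be passed to `parseRecord`.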
wolfgang,

Thanks for the quick reply!

Your code will work great for obtaining the data from the line, but my issue is that I will need to compare the lines in groups: (numbered below for clarity)

1..L3L1 12787 rcc_002305 abcD Does something or other
2..L4L2 197787 rcc_002904 efgH Does something else
3..L3L2 198827 rcc_002953 ijkL Function goes here
4..L3L3 198827 rcc_002953 ijkL Function goes here
5..L4L1 207145 rcc_002977 mnoP Function goes here
6..L4L3 207277 rcc_002999 qrsT Function goes here
7..L4L3 207145 rcc_002977 mnoP Function goes here

I can use a while (getline) loop nested inside another to compare the next line (i.e. I would catch lines 3 and 4, but not 5 and 7).

So I thought to populate an array and compare the row in the array (i.e. while array[3][n] = "rcc_002977", check position) but I can't get different data types into an array easily.

Any ideas?

Thanks again

sicilicide
Multiple arrays or a structure that you define and make an array of them.

#include <iostream>
#include <string>

struct bioData_t
{
	std::string lib, locus, gene, annot;
	int pos;
};
// Then make an array of them inside main()

int main()
{
	bioData_t study[10]; // An array of your new datatype.
	// Access members via:
	// study[INDEX].memberName
}
// Or multiple arrays, which if you imagine them as towers
// Line up across the rows at the same index.

std::string lib[10], locus[10], gene[10], annot[10];
int pos[10];

// Then to get a single line of data, you go across at the same index.
// This method may be more difficult since if you need to sort them,
// You need to move all values at the same index on all arrays at the same time.
// Which can be messy.
// I recommend the struct for ease of use and clarity as
// each array member should be its own line of data. 
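If the rows live in a std::vector of the struct instead of a fixed array, rows that share a position can be collected even when they are not adjacent (lines 5 and 7 in the sample above). A rough sketch using std::map; `groupByPosition` is a hypothetical helper, and holding the rows in a vector is an assumption:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Same fields as bioData_t above, repeated so this sketch compiles on its own.
struct bioData_t {
    std::string lib, locus, gene, annot;
    int pos;
};

// Collect the indices of all rows that share a position, wherever they
// appear in the input. The key is the position; the value lists row indices.
std::map<int, std::vector<std::size_t> > groupByPosition(const std::vector<bioData_t>& rows)
{
    std::map<int, std::vector<std::size_t> > groups;
    for (std::size_t i = 0; i < rows.size(); ++i)
        groups[rows[i].pos].push_back(i);
    return groups;
}
```

Any group holding more than one index is a set of lines to compare against each other, for example to check whether their libraries differ.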
I can use the separate arrays method (i think) since my data is sorted to start with, but having trouble populating the individual arrays. This is what I have:

//count number of lines in input file (mutationlist.dat)
int size = 0;
string line;
while ( !fin1.eof() )
{
getline(fin1, line);
size++;
}

//reset input file to beginning
fin1.seekg(0, ios_base::beg);

//Make arrays for each data type:
std::string lib[size], locus[size], gene[size], annot[size];
int pos[size];

// read in data and populate arrays
for (int i=1; i<size; ++i)
{
fin1 >> lib[i] >> pos[i] >> locus [i]>> gene[i];
std::getline (fin1, annot[i]); //get rest of first line

Except now I get errors saying I'm trying to do invalid conversions from string to char...


sicilicide
Slightly illegal operation you have going on. Standard C++ does not allow an array whose size is only known at run time (some compilers accept it as an extension); for that you need dynamic allocation through pointers, or a vector.

So to keep plain arrays you'll need to give them a fixed, constant size and track how many elements are actually in use. (I mean make them a constant 25 or something, while the count may be 18.) The arrays need to be that constant 25, but since you're only using 18 you keep track of that for operations.

I'm having trouble seeing the error's location. Do you have more code and the actual line number where the error is?
Here's the whole works:

#include <fstream>
#include <string>
#include <cstdlib>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>

using namespace std;

int main()
{

//declare stream objects
ifstream fin1;//("mutationlist.dat", (ios::in));
ofstream fout1;//("common.out", (ios::out));
ofstream fout2;//("hits.out", (ios::out));
fin1.open("mutationlist.dat");
fout1.open("common.out");
fout2.open ("hits.out");

//count number of lines in input file (mutationlist.dat)
int size = 0;
string line;
while ( !fin1.eof() )
{
getline(fin1, line);
size++;
}

//reset input file to beginning
fin1.seekg(0, ios_base::beg);

//Make arrays for each data type:
std::string lib[size], locus[size], gene[size], annot[size];
int pos[size];

// read in data and populate arrays
for (int i=1; i<size; ++i)
{
fin1 >> lib[i] >> pos[i] >> locus [i]>> gene[i];
std::getline (fin1, annot[i]); //get rest of first line
}

//output to common.out for identical loci
for (int i=1; i<size; ++i)
{
if (pos[i]==pos[i+1])
{
***ERROR HERE*** fout1 << lib[i] << '\t'<< pos[i] << '\t'<< locus[i]< '\t'<< gene[i] << '\t'<< annot[i] << endl;
fout1 << lib[i+1] << '\t'<< pos[i+1] << '\t'<< locus[i+1] << '\t'<< gene[i+1] << '\t'<< annot[i+1] << endl;
}
}

//output to hits.out for mutations in same gene
for (int i=1; i<size; ++i)
{
for (int j=1; j<size; ++j)
{
if ((locus[i]==locus[j]) && (pos[i] != pos[j]))
fout2 << lib[i] << '\t'<< pos[i] << '\t'<< locus[i] << '\t'<< gene[i] << '\t'<< annot[i] << endl;
fout2 << lib[j] << '\t'<< pos[j] << '\t'<< locus[j] << '\t'<< gene[j] << '\t'<< annot[j] << endl;
//but array contains at least one of Library1,2 or 3,4
}

}
return 0;
}


I've marked the error line (line 49); the compiler returns the following:
[Error] no match for 'operator<<' in "\011 << gene[i]"

There's also a whole bunch of other stuff that looks like template<class_CharT, class Traits, class_Alloc, and links to lines in basic_string.h.

I don't understand why this isn't returned for the other arrays (pos, locus, etc.)

Thanks again for your help -- much closer to the answer thanks to your ideas.

Best,
sicilicide
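You're missing a '<' on the one right before it. Look closely after locus[i]. With a single '<' the compiler reads it as a less-than comparison, and since << binds tighter than <, the right-hand side becomes '\t' << gene[i], which is the expression the error names (\011 is the tab character).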
*facepalm*

Thanks.
Array indexing starts at 0, not 1. Using the code tag for your code will preserve the formatting and make your posts much easier to take in.

Consider the following. I didn't change any of the logic you used for deciding which lines to output to your common and hits files, but it did look to me like you probably aren't outputting quite what you want to. Comments in the code address this.

#include <vector>
#include <fstream>
#include <string>
#include <iostream>

struct Line
{
 	std::string lib ;
	int pos ;
	std::string loc ;
	std::string gene ;
	std::string annot ;
};

// extract a line from an input stream
std::istream& read_line(std::istream&is, Line& l)
{
	is >> l.lib >> l.pos >> l.loc >> l.gene ;
	return std::getline(is, l.annot) ;
}

// insert a line into an output stream
std::ostream& write_line(std::ostream&os, const Line& l)
{
	return os << l.lib << '\t' << l.pos<< '\t' << l.loc 
		      << '\t' << l.gene << '\t' << l.annot << '\n' ;
}


int main()
{
	std::vector<Line> lines;

	// fill the vector with info from our input file
	std::ifstream in("mutationlist.dat") ;
	Line curLine ;
	while ( read_line(in, curLine) )
		lines.push_back(curLine) ;

	// output lines with identical loci
	// logic looks questionable here.
	// could end up with multiple
	// copies of the same line.
	std::ofstream common("common.out") ;
	// i + 1 < size() avoids unsigned wrap-around when the vector is empty
	for (unsigned i=0 ;  i + 1 < lines.size() ; ++i)
	{
		if (lines[i].pos == lines[i+1].pos)
		{
			write_line(common, lines[i]) ;
			write_line(common, lines[i+1]) ;
		}
	}

	// output mutations in the same gene
	// logic looks questionable here as well.
	// compares every line to every other line
	// twice and to itself once.
	std::ofstream hits("hits.out") ;
	for (unsigned i=0; i < lines.size(); ++i)
	{
		for (unsigned j=0; j <lines.size(); ++j)
		{
			const Line& a = lines[i] ;
			const Line& b = lines[j] ;

			// following if in original code was
			// missing brackets around the body.
			// didn't look intentional, so fixed
			// here.
			if(a.loc == b.loc && a.pos != b.pos)
			{
				write_line(hits, a) ;
				write_line(hits, b) ;
			}

		}
	}
}
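To avoid the double comparisons and self-comparisons mentioned in the comments, the inner loop can start at i + 1 so each pair of lines is examined once. A sketch under the assumption that collecting index pairs first and writing them out afterwards is acceptable; `sameLocusPairs` is a hypothetical helper:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Same layout as the Line struct above, repeated so the sketch stands alone.
struct Line {
    std::string lib;
    int pos;
    std::string loc;
    std::string gene;
    std::string annot;
};

// Return index pairs (i, j) with i < j whose lines share a locus but differ
// in position. Starting j at i + 1 visits each unordered pair exactly once
// and never compares a line with itself.
std::vector<std::pair<std::size_t, std::size_t> > sameLocusPairs(const std::vector<Line>& lines)
{
    std::vector<std::pair<std::size_t, std::size_t> > pairs;
    for (std::size_t i = 0; i < lines.size(); ++i)
        for (std::size_t j = i + 1; j < lines.size(); ++j)
            if (lines[i].loc == lines[j].loc && lines[i].pos != lines[j].pos)
                pairs.push_back(std::make_pair(i, j));
    return pairs;
}
```

A line that belongs to several pairs would still be written more than once if every pair is output verbatim, so some bookkeeping of already-written indices may be needed.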
cire,

Thanks for your comments. I was indeed looking at the same line twice, and getting duplicate output copies and self-comparisons.

I caught my fencepost error last night when a sample input file didn't parse properly, so I made a number of changes to the output so that I access array [i-1] or [j-1] instead. I was also missing the last line of my file so I run the loops now to i<=size.

Not sure what you mean by "code tags" or I would show you what I mean.

Best,

sicilicide
Ok, making some progress with this but stuck again. My "common" output is working OK, but to eliminate inclusion of these lines in the second output file, I added another array called "marker", and it doesn't seem to be working properly:
#include <fstream>
#include <string>
#include <cstdlib>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>

using namespace std;
ifstream fin1;
ofstream fout1;
ofstream fout2;
int main()
{

//declare stream objects

fin1.open("mutationlist.dat", ios::in);
fout1.open("common.out", ios::out);
fout2.open ("hits.out", ios::out);

//count number of lines in input file (mutationlist.dat)
int size = 0;
string line;
while ( !fin1.eof() )
	{
	getline(fin1, line);
	size++;
	}
 
//reset input file to beginning
fin1.clear();
fin1.seekg(0,std::ios::beg);

//Make arrays for each data type:
std::string lib[size], locus[size], gene[size], annot[size];
int pos[size], ilib[size], marker[size]; //marker = 0 (unassigned), 1 (assigned)

// read in data and populate arrays
for (int i=1; i<=size; ++i)
	{
         fin1 >> lib[i-1] >> pos[i-1] >> locus [i-1]>> gene[i-1]; //get first four data columns
         std::getline (fin1, annot[i-1]); //get rest of first line
         string library = lib[i-1].substr (3,1);
         ilib[i-1] = atoi (library.c_str()); //obtain integer value of library
         marker[i-1]=0; //assign unmarked value for all lines
         if (ilib[i-1] < 3) //assign real value to library
            {ilib[i-1] = 8;}
         else
             {ilib[i-1] = 13;} 
         
	}

//output to common.out for identical loci
for (int i=1; i<=size; ++i)
	{
		for (int j=i; j<size; ++j)
			{
				if ((pos[i-1] == pos[j]) && (ilib[i-1] != ilib[j])) //same mutation, different libraries, unmarked
					{
					marker[i-1]=1;
					marker[j]=1;
					fout1 << lib[i-1] << '\t'<< pos[i-1] << '\t'<< locus[i-1] << '\t'<< gene[i-1] << '\t'<< annot[i-1] << marker[i-1]<<endl;
					fout1 << lib[j] << '\t'<< pos[j] << '\t'<< locus[j] << '\t'<< gene[j] << '\t'<< annot[j] << marker[j]<<endl;
					}					
			}
	}
	
//output to hits.out for mutations in same gene
for (int i=1; i<(size+1); ++i)
	{
		for (int j=i; j<size; ++j)
			{
               if ((locus[i-1]==locus[j]) && (pos[i-1] != pos[j]) && (ilib[i-1] != ilib[j]) && (marker[i-1] = 0) && (marker[j]=0)) //same locus different mutations, different libraries, unmarked
					{
						fout2 << lib[i-1] << '\t'<< pos[i-1] << '\t'<< locus[i-1] << '\t'<< gene[i-1] << '\t'<< annot[i-1] << endl;
						fout2 << lib[j-1] << '\t'<< pos[j-1] << '\t'<< locus[j-1] << '\t'<< gene[j-1] << '\t'<< annot[j-1] << endl;
					}					
			}
	}
	
return 0;				
}

I'm using this input file:
L4L1 3071 rcc00003 recF Line1 DNA replication should go to common
L3L3 3071 rcc00003 recF Line2 DNA replication should go to common
L4L1 3265 rcc00003 recF Line3 DNA replication should go to hits
L4L3 3266 rcc00003 recF Line4 DNA replication should go to hits
L3L2 3904 rcc00004 abcD Line5 DNA replication should be unmatched
L3L1 3904 rcc00004 abcD Line6 DNA replication should be unmatched
L3L1 3920 rcc00005 efgH Line7 DNA replication should go to hits twice
L3L3 6685 rcc00005 efgH Line8 nitrilotriacetate should go to hits twice
L2L4 6938 rcc00005 efgH Line9 nitrilotriacetate should go to hits twice

but the problem seems to be with the marker array.

I put some test output lines in to see where things are (these are omitted above). The first time around I successfully change the markers for input lines 1 and 2 from 0 to 1 for that iteration (the marker[i-1]=1 and marker[j]=1 assignments, code lines 58 & 59). But the next time such a line is examined, its marker value is 0 again, so I skip output to hits.out when I shouldn't.

Can you spot my error please?

Thanks
sicilicide
I see you figured out the code tags.

I'm surprised you're sticking with the design where input and output are scattered throughout the code and you use a bunch of different arrays. It really detracts from the readability.

Use operator== for equality comparisons.

if ((locus[i-1]==locus[j]) && (pos[i-1] != pos[j]) && (ilib[i-1] != ilib[j]) && (marker[i-1] = 0) && (marker[j]=0))

// marker[i-1]=0 sets marker[i-1] to 0.
// marker[j]=0 sets marker[j] to 0. 


If your compiler's warning level is set high enough it should generate a warning for that.
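To make the difference concrete, here is a small sketch (`bothUnmarked` is a made-up helper, not part of the posted program). It only compares, and because the array parameter is declared const, an accidental `=` inside it would be rejected by the compiler instead of silently resetting the markers:

```cpp
// Hypothetical helper: true only when both marker values are still 0.
bool bothUnmarked(const int marker[], int i, int j)
{
    // == compares and leaves the array untouched. Writing marker[i] = 0
    // instead would store 0, and that stored 0 converts to false, so the
    // condition could never be true and the marker would be wiped.
    return marker[i] == 0 && marker[j] == 0;
}
```

In the posted condition this would replace the last two terms, e.g. `bothUnmarked(marker, i-1, j)`.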


Thanks, cire.

You might have supposed I'd have caught that, with the equality comparison done correctly earlier in the line. Sigh.

Not sure what you mean by input & output "scattered throughout the code" -- seems to me like I do all the input together, and then output into each of the files.

As far as the different array method, wolfgang suggested a single array method (above) but this way seemed simpler to me.

I'm sure a pro could really pick apart my code and I suppose this sounds like heresy but I'm happy with "it works" and not sure I have the time for the effort it would take to get me to "elegant".

Thanks to you both for all your help -- sure beats manually annotating and sorting 12k SNPs.

Sicilicide
Well, if you look at the code I posted earlier with the Line struct, I think you'll see what I mean.

There aren't so many chunks of identical code scattered through the program (and so errors involving those chunks will only occur in one place in the code -- much easier to fix.) Accessing one array is much more convenient (and less prone to error) than using many that, although logically connected, don't have that connection enforced via a language construct. It also makes the logic of the code in main easier to follow because the sheer amount of code is reduced substantially.

It's a fairly short program, so these design decisions probably aren't terribly important, but you might be surprised how often you end up reusing these types of code snippets.
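As one example of that reuse, and of the sorting concern raised earlier for parallel arrays: with a single vector of structs, sorting moves every field of a record together. A sketch assuming the same `Line` layout as above; `byPosition` and `sortByPosition` are illustrative names:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Same layout as the Line struct posted earlier, repeated so this compiles alone.
struct Line {
    std::string lib;
    int pos;
    std::string loc;
    std::string gene;
    std::string annot;
};

// Order two lines by position, breaking ties by library name.
bool byPosition(const Line& a, const Line& b)
{
    if (a.pos != b.pos) return a.pos < b.pos;
    return a.lib < b.lib;
}

// Sort whole records at once; all fields of a line stay together.
void sortByPosition(std::vector<Line>& lines)
{
    std::sort(lines.begin(), lines.end(), byPosition);
}
```

With parallel arrays the same operation would need every array permuted in step, which is where index mix-ups tend to creep in.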
cire,

I can see that your code is shorter & cleaner.

The truth is I picked up my first C++ book a week ago and I didn't understand your code, which is not to say I doubted it. I thought it would be more dangerous to cut and paste code whose function I didn't understand, than to write code whose function I could mentally follow, however unesthetic. Obviously doing so opens the door to errors in comparison of array[12] with array[13], but I figured if I was careful I could avoid this, and I think I have.

I appreciate you taking the time to help.

sicilicide