I am trying to create an array for calculating allele frequencies from a text file with nucleotide data for multiple individuals from several different populations. Basically, what I want to do is count the number of times each allele (A, T, C, or G) occurs at each locus (position) for each population, so that I can then calculate the frequency of each allele in each population.
Here is an example of a truncated version of the text file I want to read into the array:
>1-1_Sample 1
GCCCATGGCT
>2-1_Sample 1
GAGTGTATGT
>3-1_Sample 1
TGTTCTATCT
>1-1_Sample 2
GCTTAGCCAT
>2-1_Sample 2
TGTAGTCAGT
>3-1_Sample 2
GGGAACCAAG
>1-1_Sample 3
TGGAAGCGGT
>2-1_Sample 3
CGGGAGGAGA
>3-1_Sample 3
CTTCAGTTTT
Here the first letter after the ">" denotes an individual in the population, so here each population has 3 individuals. The sample number denotes which population the individual belongs to. So there are three populations here.
The string of letters after each header line is the nucleotide sequence. I want to read in the first letter from each sequence (G, G, and T for the first population) and then place them in an array that will calculate a total number of each allele (A, T, C, or G) in each position. So for example, population 1 has 2Gs and 1T in the first position, population 2 has 2Gs and 1T in the first position, and the 3rd population has 1T and 2 Cs in the first position.
How can I make an array to contain this data? I want to change the nucleotide data to numerical data, where A=0, C=1, G=2, and T=3, and then I need to total the number of 0s (As) in the first position for each population. I understand that this should be a 3D array with columns for each site, rows for each of the 4 nucleotides, and "stacked" for each population. I am not at all sure how to do this. I'm just starting out in C++, and it seems a bit difficult to read the file into a 3D array AND count the number of each nucleotide for each cell.
Please please please help me!
I need intellectual help getting over this obstacle :/
Yes, I've seen that post. My question is not how to make an array, but how to read in the nucleotide data I discussed above in an additive fashion. I'm having a hard time connecting the dots between setting up an array where you just read a text file directly into it for storage purposes, and being able to count the nucleotides into the correct cell of the array. I didn't bother writing out all of the code I used to create an array of the proper size, etc., because that's not my question. I want someone to help me understand a way to add nucleotides to either the A, T, C, or G cell of the array for each population. I've spent a great deal of time searching posts on this forum, and none of them address this issue. That's why I made this post.
Also, I'm not a student, but someone interested in learning more about C++ for research-related reasons. I have no one in my working group I can ask, so I turned to the internet. I do know how to google.
You need to associate each letter with an index of an array:
// Needs <cstring> and <iostream>; assumes "using namespace std;"
const char *phrase = "this is a phrase that does not mean anything";
size_t length = strlen(phrase); // size_t is an unsigned integral type
int letterCount[26];
for (int i = 0; i < 26; i++)
    letterCount[i] = 0;
for (size_t i = 0; i < length; i++)
    if (phrase[i] >= 'a' && phrase[i] <= 'z') // skip spaces and anything that is not a lowercase letter
        letterCount[ phrase[i] - 'a' ]++; // 'a' maps to index 0, 'b' to 1, and so on
for (int i = 0; i < 26; i++)
    cout << (char)(i + 'a') << " count = " << letterCount[i] << endl;
OK, good. I was teasing about google, but you'd be surprised how many people want you to do it for them, especially kids... Mine are the worst.
I was interested in your project, so I put this together while waiting for you to reply. I don't claim it is the best way, and there are probably many better ones. It seems to work with your sample data, but I don't think it will work with a larger file without some edits. I'm not sure, as I haven't tried it.
Try it with the sample file, try to understand what it does and ask if you have questions.
The comments are left over from my testing.
Usage: Command <FileName.txt>
I noticed that if you put in an improper file name, it runs anyway, but with zero data. I should probably fix that.
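One possible way to handle the bad file name is to check whether the stream actually opened before reading from it. This is only a sketch under my own assumptions: `openInput` is a made-up helper name, and it would need adapting to however the program opens its file.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Try to open the file at 'path' for reading.
// Returns true on success; prints a message to stderr and returns false otherwise.
bool openInput(const std::string& path, std::ifstream& in) {
    in.open(path);
    if (!in.is_open()) {
        std::cerr << "Could not open '" << path << "'\n";
        return false;
    }
    return true;
}
```

In `main`, something like `if (argc < 2 || !openInput(argv[1], in)) return 1;` would then stop the program instead of letting it run with zero data.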
You have private msgs turned off, but if you want to send me a msg, I'd like to know who you work for and what this project is about. It's not necessary; I'm just curious whether this is work for a DNA project or a personal interest.