Basics
So I don't know how much you know about digital audio. I'm going to assume it's not a lot... so let's start with some basic terms.
Multi-purpose digital audio is typically stored in a format known as "PCM". I'll get into the details of it later, but basically you have a series of "samples" that join together to form an audible wave. Much like how "pixels" join together to form a picture.
The "sample rate" term that is thrown around determines how many samples are used to represent 1 second worth of audio. So a 44100 Hz sample rate dictates that if you have 44100 samples, then you have 1 second worth of audio. For another image analogy... you can think of this as being similar to "megapixels".
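To put some numbers on that... here's a rough sketch of the arithmetic. The function names are made up for this example, and it assumes a single channel (mono), so one sample covers one instant in time.

```cpp
#include <cassert>
#include <cstddef>

// Number of samples needed to hold `seconds` of mono audio.
// (Hypothetical helper, not part of any audio API.)
std::size_t samplesForDuration(unsigned sampleRate, double seconds)
{
    assert(seconds >= 0.0);
    // Round to the nearest whole sample.
    return static_cast<std::size_t>(sampleRate * seconds + 0.5);
}

// How long (in seconds) a buffer of `sampleCount` mono samples plays for.
double durationOfSamples(unsigned sampleRate, std::size_t sampleCount)
{
    assert(sampleRate > 0);
    return static_cast<double>(sampleCount) / sampleRate;
}
```

So at 44100 Hz, samplesForDuration(44100, 0.5) comes out to 22050 samples for half a second.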
The "bits per sample" (bps) term refers to the size of each sample. Higher = better quality. For yet another image analogy, bps can be compared to bits per pixel (i.e., 24-bit "true" color vs. 8-bit color, which has only 256 possible colors).
The key difference between the two... is that unlike image data which is typically 2 dimensional (in both X and Y axis, to form a rectangle)... PCM data is 1 dimensional (the only dimension is time).
PCM
So what exactly does each sample represent? You might think that the sample data contains things like tone and volume... but you'd be wrong! Or at least you would if that's what you're thinking.
Samples are just evenly spaced 'snapshots' (or... "samples"... the name is very apt) of a sound wave taken at different points in time. Each sample is just a numerical value, which has an upper bound and a lower bound, usually determined by the bps. For example, a 16-bit signed sample has a range of -32768 (low) to 32767 (high). A single sample would just be a number somewhere in that range.
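If you want to see where those bounds come from... here's a small sketch. It assumes signed (two's complement) samples, and the struct and function names are just made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Lowest and highest value a signed sample can hold.
struct SampleRange
{
    std::int64_t low;
    std::int64_t high;
};

// Hypothetical helper: range of a signed sample with the given bit depth.
SampleRange signedSampleRange(unsigned bitsPerSample)
{
    assert(bitsPerSample >= 1 && bitsPerSample <= 32);
    // Half of the 2^bps possible values are negative, the rest are zero or positive.
    const std::int64_t half = std::int64_t{1} << (bitsPerSample - 1);
    return { -half, half - 1 };
}
```

For 16 bits this gives the -32768 to 32767 range mentioned above.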
For example... let's say we have the below series of samples:
5, 7, 8, 8, 9, 9, 8, 8, 7, 6, 4, 3, 2, 1, 1, 0, 0, 1, 1, 2, 3
If you take each of these samples and plot them on a grid... where the sample value is the Y axis and the X axis represents time (or the position of the sample)... you can just 'connect the dots' to map out a sound wave. In this case... it's a very crude imitation of a sine wave:
    9 |     **
    8 |   **  **
    7 |  *      *
    6 |          *
    5 | *
    4 |           *
    3 |            *        *
    2 |             *      *
    1 |              **  **
    0 |                **
This sound wave is ultimately fed to your speakers, and basically is the pattern in which they will vibrate to produce the sound you hear.
"Taller" waves are louder. "Wider" waves are lower pitch. (This is a very simplistic and basic way to look at it, but it'll suffice for your purposes.)
So how do you generate this audio?
htirwin shows how to generate a basic sine wave... so that's a good code example to start with. But let's try to explain what's going on behind it.
The most fundamental type of sound wave is a sine wave. There are reasons for this, but I only vaguely understand them myself. A sine wave has no edges (it is perfectly round) and therefore is the 'softest' and 'least complex' sound wave possible. So that's typically where people start. You'll find that 'sharper' shaped waves tend to sound 'rough', whereas rounded ones tend to sound 'smooth'.
The "pitch" of a generated tone is measured in Hertz... or "times per second". The above illustrated pseudo-sine wave would be one cycle of that sound wave. You can repeat the pattern again and again:
    9 |     **                   **
    8 |   **  **               **  **
    7 |  *      *             *      *
    6 |          *                    *
    5 | *                    *
    4 |           *                    *
    3 |            *        *           *        *
    2 |             *      *             *      *
    1 |              **  **               **  **
    0 |                **                   **
The number of times you complete the full pattern in one second determines the "frequency" or "pitch" of the generated tone. If you complete this pattern 440 times in 1 second... then that produces a 440 Hz tone (concert A).
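Here's a minimal sketch of generating such a tone. It is not htirwin's code... just one way to do it, assuming 16-bit signed mono samples. The function name and the `amplitude` parameter (a 0.0 to 1.0 volume factor) are my own inventions for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: fill a buffer with a sine tone.
// frequencyHz = cycles completed per second (the pitch)
// sampleRate  = samples per second
// amplitude   = fraction of full 16-bit volume, 0.0 to 1.0
std::vector<std::int16_t> makeSine(double frequencyHz, unsigned sampleRate,
                                   std::size_t sampleCount, double amplitude = 0.5)
{
    assert(sampleRate > 0 && amplitude >= 0.0 && amplitude <= 1.0);
    const double twoPi = 6.283185307179586;
    std::vector<std::int16_t> out(sampleCount);
    for (std::size_t i = 0; i < sampleCount; ++i)
    {
        // Time (in seconds) at which sample i is played.
        const double t = static_cast<double>(i) / sampleRate;
        // sin() gives -1.0 to 1.0; scale it up to the 16-bit range.
        const double value = std::sin(twoPi * frequencyHz * t);
        out[i] = static_cast<std::int16_t>(std::lround(value * amplitude * 32767.0));
    }
    return out;
}
```

Calling makeSine(440.0, 44100, 44100) would give you 1 second of concert A at half volume.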
Remember that samples are a function of time... so if you have a sequence that is the right width to play at 440 Hz but you do not loop it... then it will only play for a fraction of a second before stopping.
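To give a sense of scale: one cycle of a 440 Hz tone lasts 1/440 of a second, or roughly 2.3 milliseconds. Looping is just a matter of repeating the one-cycle pattern until the buffer is as long as you want. A rough sketch (the function name is made up, and plain ints are used so it works with the small example values from earlier):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: repeat a single cycle until `sampleCount` samples are filled.
std::vector<int> loopCycle(const std::vector<int>& cycle, std::size_t sampleCount)
{
    assert(!cycle.empty());
    std::vector<int> out(sampleCount);
    for (std::size_t i = 0; i < sampleCount; ++i)
    {
        // Wrap back to the start of the cycle whenever we run off its end.
        out[i] = cycle[i % cycle.size()];
    }
    return out;
}
```

Feeding it the 21-sample sequence from the PCM section and asking for 42 samples gives you the two-cycle wave plotted above.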
In that same vein... you can't "hold" a note simply by outputting the same sample over and over. By doing that, you'll "flatline" your output and will not have motion. No motion = speakers stay still = no audio.
Mixing multiple sounds together
Broadly speaking, there are 2 ways to do this: In software and in hardware.
Hardware mixing is pretty easy. You just open two audio buffers (using whatever audio API you're using) and give each one its own independent sound wave. The sound card (hardware) will do the work of combining them together so that the user hears both of them at the same time.
Software mixing is when you combine the sound waves together into one complex wave before sending it to the sound card. Naive software mixing is incredibly simple. To do it, you just add each wave's sample together. Really. That's it:
    // combine wave1 and wave2 into a single 'outputwave'
    // (numSamples = number of samples in each wave; all three buffers share that length)
    for( int i = 0; i < numSamples; ++i )
    {
        outputwave[i] = wave1[i] + wave2[i];
    }
Now of course there are some caveats.
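- both sound waves must have the same sample rate
- you have to be careful not to exceed the min/max allowed values for any given sample. (Two loud 16-bit waves can easily add up to more than 32767, so the sum has to be clamped or scaled down to fit.)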
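One way to deal with that second caveat is to do the addition in a wider type and then clamp the result back into range. This is only a sketch under some assumptions: 16-bit signed samples, both waves already at the same sample rate and the same length, and a function name I made up. Clamping is just one option... scaling each wave down before adding is another.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: add two 16-bit waves sample by sample,
// clamping each sum so it stays within the 16-bit range.
std::vector<std::int16_t> mixClamped(const std::vector<std::int16_t>& wave1,
                                     const std::vector<std::int16_t>& wave2)
{
    assert(wave1.size() == wave2.size());
    std::vector<std::int16_t> out(wave1.size());
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        // Add as int so the sum cannot overflow a 16-bit value.
        const int sum = static_cast<int>(wave1[i]) + static_cast<int>(wave2[i]);
        // Clamp to -32768..32767 before storing.
        out[i] = static_cast<std::int16_t>(std::max(-32768, std::min(32767, sum)));
    }
    return out;
}
```

Keep in mind that clamped peaks get flattened, which changes the shape of the wave, so you generally want to keep the input volumes low enough that clamping rarely kicks in.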
But whether you want to do software or hardware mixing depends on what you're doing.
Streaming audio over time
-- I'll get into this more tomorrow if there's still interest in the topic. But right now I'm tired and exhausted. And I'm not even sure if this info is going to be used or not.
But whatever. Let me know if you're still interested and I can keep going.