I'm a tad unsure of your exact goal here. The reason I'm unsure is that "Vision Computing" (more commonly called computer vision) and OpenCV itself belong to the branch of AI concerned with recognizing image content: identifying objects in a scene, finding the edges of an object, or even taking several images of an object from different angles and forming a 3D model of it.
This quote, however, suggests something quite different:
> from iostream, to opening different file types (mp4, jpg, etc), manipulating pixel and binary data, writing them, etc.
These are topics in the general domain of C++ (iostream), or file formats (mp4, jpg), or simple image processing such as convolution (manipulating pixels), or file I/O (writing them, which comes back to iostream).
The reason I'm uncertain is that these subjects belong to beginning and intermediate programming and basic computer science, while computer vision is a much, much higher level of computer science in the domain of AI.
It sounds like you're asking how a beginner can approach computer vision. First, you must learn the basic computer science and programming behind points like image manipulation and the use of iostream. In fact, without those basics, even leaning on OpenCV would be impossible.
What I think you are ultimately asking is how to learn, after those programming basics, the science of AI image recognition.
I would suggest, at that point, to start with character recognition of a simple set, and how a simple neural net can be trained to recognize the digits 0 to 9 to about 94% accuracy. This one learning step can be taken early in programming, without really understanding file formats like jpg, mp4 or image processing (like brightness and contrast). All that must be understood is the basic nature of a black and white or grayscale image, where each pixel in the image is a number representing how "white" it is (0 is black, 127 is 50% gray, and 255 is all white, for example). From there you can witness what happens when a neural net is fed images of the digits 0 to 9, carefully framed in a fixed image size, for recognition TRAINING.
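If it helps to see that in code, here is a minimal C++ sketch of such a grayscale image (the names `GrayImage` and `inputAt` are my own, not from any particular library):

```cpp
#include <vector>

// A grayscale image is just a grid of bytes: 0 = black,
// 127 = roughly 50% gray, 255 = all white.
struct GrayImage {
    int width  = 0;
    int height = 0;
    std::vector<unsigned char> pixels;  // width * height values, row by row

    // Neural nets usually work with values in [0, 1], so a common
    // first step is simply to divide each byte by 255.
    float inputAt(int x, int y) const {
        return pixels[y * width + x] / 255.0f;
    }
};
```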
To be clear, at the risk of running long, you first need to understand neurons and the connections between neurons. An artificial neuron is implemented as a piece of data representing an input (that number from 0 to 255 indicating how white a pixel is, for example), and it will "fire" an output based on a function (one that 'interprets' the input). That function can take so many forms that the statement is necessarily vague; leave it as an "algebraic unknown" in your mind for the moment. Picture the neuron as a circle. Assume an image (for character recognition) that is 28 x 28 pixels. This image will be swapped repeatedly with various images of digits in a loop (perhaps scanned from handwritten samples). Assume each character is centered and scaled to fill this 28 x 28 pixel image.
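To make that "algebraic unknown" slightly less mysterious, here is a sketch of a single artificial neuron using the sigmoid, one standard choice of firing function from the research literature (the naming is mine, and bias terms are omitted for simplicity):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// The sigmoid squashes any weighted sum into the range (0, 1):
// large negative sums stay "off", large positive sums "fire".
float sigmoid(float x) {
    return 1.0f / (1.0f + std::exp(-x));
}

// One artificial neuron: sum the weighted inputs, then pass the
// sum through the firing function to produce the output.
float fire(const std::vector<float>& inputs,
           const std::vector<float>& weights) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        sum += inputs[i] * weights[i];
    return sigmoid(sum);
}
```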
Fashion 784 neurons in a column (one per pixel: 28 x 28 = 784), with one pixel attached to each neuron. The neuron's input is the "whiteness" value of that pixel. This is an AI 'retina', receiving the image.
Fashion a second column of neurons to the right of the input 'retina'. The count should exceed the input count, but is not critical: 1200 will do, but it could be 2000, or anything in between. Perhaps the system will eventually be more accurate with more neurons, but not by much. Let's say 1200 for now.
Now, with programming techniques taken from intermediate skills in C++, or C#, or Java, CONNECT the output of the retina neurons to this second column (usually called a layer, more specifically named a hidden layer - a poor choice for the name, but that's what it's called). The connections follow a simple pattern. Take the first (top) neuron of the retina, and connect one "line" or "wire" from that neuron to each of the 1200 neurons in the second layer. Move to the second 'retina' neuron and repeat.
In this way each 'retina' neuron will connect to 1200 'hidden layer' neurons, and each 'hidden layer' neuron will receive 784 'retina' outputs. For each of these connections you will fashion what is called a 'weight'. This 'weight' is like a volume control. If the value is 100%, the entire output from the connected 'retina' neuron is received. If the value is 0%, that 'retina' neuron is ignored. These settings will be used later; they are the "magic" of training a neural net.
Now, fashion an output layer. For this goal, recognizing the digits 0 to 9, you'll need only 10 neurons. Each represents a digit: the first one is 0, the second is 1, and so on.
Connect the 'hidden layer' neurons as before. Each of the 1200 'hidden layer' neurons will have a 'wire' connected to each of the output neurons, each with their own 'weight' (another volume control). Each 'output' neuron will have a connection coming from each of the 1200 'hidden layer' neurons.
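Put together, all of those wires collapse into two simple tables of numbers. A sketch, using the sizes from above (the variable names are mine):

```cpp
#include <vector>

constexpr int kRetina = 28 * 28;  // 784 input neurons
constexpr int kHidden = 1200;     // hidden layer neurons
constexpr int kDigits = 10;       // one output neuron per digit

using Matrix = std::vector<std::vector<float>>;

// weights[n][i] is the "volume control" on the wire from neuron i
// in the previous layer to neuron n in the next layer.
Matrix hiddenWeights(kHidden, std::vector<float>(kRetina));  // 784 wires into each of 1200 neurons
Matrix outputWeights(kDigits, std::vector<float>(kHidden));  // 1200 wires into each of 10 neurons
```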
This is a simple neural net construction. Now, the 'magic' (if that's how one views the subject) begins. All of the weights are initially randomized around 50%. That is, all weights between the 'retina' layer and the 'hidden layer' will be set to random values centered on 50%, varying by no more than 15% either way. These figures are not critical (it could be 20%). Do the same for the weights between the 'hidden layer' and the output layer.
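In code, that initialization might look like this (note: most modern texts initialize weights in a small range around zero instead; this sketch follows the 50%-centered scheme described above):

```cpp
#include <random>
#include <vector>

// Set every weight to a random value centered on 0.5 (50%),
// varying by no more than 0.15 either way.
void randomizeWeights(std::vector<std::vector<float>>& weights) {
    std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<float> dist(0.35f, 0.65f);
    for (auto& row : weights)
        for (auto& w : row)
            w = dist(gen);
}
```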
This is an untrained network.
Now, feed in one image. The output neurons will "fire" based on that vague function I mentioned (there are some standard choices you'll find in the research literature), turning on (or staying off) in meaningless ways; they'll be wrong. Feed in a 4, and the net may think it could be a 5 or a 7, maybe a 2. It's untrained.
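For the curious, here is a sketch of that feed-forward step, again assuming sigmoid firing and the weight matrices from before (the function names are mine):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Push one layer's outputs through a weight matrix to produce
// the next layer's outputs.
std::vector<float> feedForward(const std::vector<float>& in, const Matrix& w) {
    std::vector<float> out(w.size());
    for (std::size_t n = 0; n < w.size(); ++n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < in.size(); ++i)
            sum += in[i] * w[n][i];
        out[n] = sigmoid(sum);
    }
    return out;
}

// The net's "guess" is whichever of the 10 output neurons fires hardest.
int classify(const std::vector<float>& pixels,
             const Matrix& hiddenWeights, const Matrix& outputWeights) {
    std::vector<float> hidden = feedForward(pixels, hiddenWeights);
    std::vector<float> output = feedForward(hidden, outputWeights);
    return static_cast<int>(
        std::max_element(output.begin(), output.end()) - output.begin());
}
```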
You then correct the output with a technique called backpropagation (more study, a bit of math, a process too involved to detail here). It is basically a way to work backwards from the output layer to the hidden layer, figuring out which 'weights' are "too high" or "too low" to give the correct result. This "training" step makes a small correction, turning off the wrong answers and turning on the right answer in the output layer.
This is repeated for every character. Typical training sessions run this training step some 60,000 times, making small corrections to the weights each time.
Eventually, the neural net "learns" (it is adjusted by back propagation) to recognize the characters.
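Backpropagation genuinely deserves that separate study, but purely as a preview, here is a bare-bones sketch of one textbook form of it (sigmoid firing, mean-squared error, no bias terms; a standard formulation, not the only one), with the same feedForward helper as above:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

std::vector<float> feedForward(const std::vector<float>& in, const Matrix& w) {
    std::vector<float> out(w.size());
    for (std::size_t n = 0; n < w.size(); ++n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < in.size(); ++i)
            sum += in[i] * w[n][i];
        out[n] = sigmoid(sum);
    }
    return out;
}

// One training step for one image. target holds 10 values: 1.0 for
// the correct digit, 0.0 for the other nine.
void trainOne(const std::vector<float>& pixels,
              const std::vector<float>& target,
              Matrix& hiddenWeights, Matrix& outputWeights,
              float learningRate = 0.1f) {
    std::vector<float> hidden = feedForward(pixels, hiddenWeights);
    std::vector<float> output = feedForward(hidden, outputWeights);

    // Output layer: how wrong each neuron was, scaled by the
    // sigmoid's slope, out * (1 - out).
    std::vector<float> dOut(output.size());
    for (std::size_t k = 0; k < output.size(); ++k)
        dOut[k] = (output[k] - target[k]) * output[k] * (1.0f - output[k]);

    // Hidden layer: each hidden neuron's share of the blame is the
    // weighted sum of the output errors it contributed to.
    std::vector<float> dHid(hidden.size());
    for (std::size_t j = 0; j < hidden.size(); ++j) {
        float blame = 0.0f;
        for (std::size_t k = 0; k < output.size(); ++k)
            blame += dOut[k] * outputWeights[k][j];
        dHid[j] = blame * hidden[j] * (1.0f - hidden[j]);
    }

    // Nudge every weight a little: "too high" weights come down,
    // "too low" weights go up. Repeat over ~60,000 samples.
    for (std::size_t k = 0; k < output.size(); ++k)
        for (std::size_t j = 0; j < hidden.size(); ++j)
            outputWeights[k][j] -= learningRate * dOut[k] * hidden[j];
    for (std::size_t j = 0; j < hidden.size(); ++j)
        for (std::size_t i = 0; i < pixels.size(); ++i)
            hiddenWeights[j][i] -= learningRate * dHid[j] * pixels[i];
}
```

Why those particular formulas work is the "bit of math" mentioned above (mostly the chain rule from calculus).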
This is the fundamental beginning of image recognition. It is too simple to distinguish one person's face from another's, but it is enough to recognize handwritten digits at about 94% accuracy or better.
Your own study thereafter would be to learn how to improve the neuron functions and the training methods, so as to improve accuracy.
So, what makes this work?
Have you ever taken aluminum foil, placed it over a coin, and then rubbed the foil until it looks just like the coin?
Imagine the neural net is a stiff net cloth (sprayed with glue). The 'concept' (in this case the digits 0 to 9) is the coin. The net cloth is the aluminum foil.
Training is, then, pushing on this net repeatedly, over every detail of the 'concept', until the 'shape' of the net fits the concept.
That's what adjusting the weights does. It makes the AI net conform to the concept, so it 'resembles' the concept.
Once you get that part, you're ready to dive into more advanced network configuration and more advanced training methods (deep learning, for example).
Then, you'd learn how to connect multiple nets, each trained for specific purposes, into larger collections performing increasingly complex tasks.
You're in for a wild ride, too. It is surprising how well simple nets perform, and then just how difficult some concepts can be to represent.
Then, coming back to the "computer science and programming" subject, you'd need to learn how to code AI processing on the GPU for high performance.
Best of luck.