So you're trying to programmatically create a video with music that can then be uploaded to a place like YouTube, right? +1, good question, seems like fun. And by fun I mean living hell if you're trying to figure out the video format from scratch. But luckily you probably don't have to.
I don't have much experience with this, but I hope your question gets more attention.
I would look at the Matroska organization on GitHub; they have libraries that can parse/write .mkv files.
https://github.com/Matroska-Org
That's only a guess of a suggestion, though, because I have not used their parsing utilities. I skimmed through the code and some parts of their interface seem very complicated, even for a non-beginner.
I have used ffmpeg to roughly stitch still animation frames together to make a video.
I believe I started by searching the question on StackOverflow and got a result probably similar to this question:
https://stackoverflow.com/questions/24961127/ffmpeg-create-video-from-images
so I would check that link out.
The results aren't always pretty: unless you use the right settings, the video tends to look atrocious when re-encoded on YouTube. But I probably was just using bad settings, so experiment with different options until you get what you like. The other problem is that storing those images, especially in the 30+ fps range, can really add up in disk space for high-def videos. It's also slow to work with, since you have to write each image to a file before calling ffmpeg. At least that's the way I was doing it.
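For reference, here is a rough sketch of that stitching step, written in Python just to keep it short. The function names, the frame pattern and the output name are all made up for illustration, and the settings (libx264 with yuv420p) are a common starting point rather than the "right" ones. It also assumes your ffmpeg build includes libx264.

```python
import subprocess

def build_stitch_command(pattern, output, fps=30):
    """Build the ffmpeg argument list that turns numbered images into a video.

    pattern is a printf-style file pattern such as "frames/frame_%04d.png"
    (hypothetical naming; use whatever your frames are actually called).
    """
    return [
        "ffmpeg",
        "-framerate", str(fps),  # how many input images make one second of video
        "-i", pattern,           # numbered image sequence
        "-c:v", "libx264",       # H.264 encoder (assumes ffmpeg was built with it)
        "-pix_fmt", "yuv420p",   # pixel format most players can decode
        output,
    ]

def stitch(pattern, output, fps=30):
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports failure."""
    subprocess.run(build_stitch_command(pattern, output, fps), check=True)
```

Calling stitch("frames/frame_%04d.png", "out.mp4", fps=30) would then read frame_0001.png, frame_0002.png and so on from the frames folder. Quality options are deliberately left out here, since that's the part you'd need to experiment with.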
Professional video-creating software like Adobe Premiere uses multi-core rendering and GPU acceleration; I have no idea whether that is possible with ffmpeg. Might be.
And then there are more commands in ffmpeg to mux in an audio channel/subtitles, I'm almost positive (but I haven't used those commands).
https://superuser.com/questions/277642/how-to-merge-audio-and-video-file-in-ffmpeg
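Going by that link, the mux step would look roughly like the sketch below. I haven't run it myself, so treat it as a starting point. The function name and file names are made up; the flags copy the video stream untouched, encode the audio as AAC, and stop at the shorter of the two inputs.

```python
import subprocess

def build_mux_command(video, audio, output):
    """Build the ffmpeg argument list that combines a video file and an audio file."""
    return [
        "ffmpeg",
        "-i", video,      # first input: the silent video
        "-i", audio,      # second input: the music
        "-c:v", "copy",   # keep the video stream as-is, no re-encode
        "-c:a", "aac",    # encode the audio to AAC
        "-shortest",      # end the output when the shorter input ends
        output,
    ]

def mux(video, audio, output):
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports failure."""
    subprocess.run(build_mux_command(video, audio, output), check=True)
```

So mux("out.mp4", "song.mp3", "final.mp4") would produce the file you'd actually upload.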
@mbozzi
<audio frames to image data> | <image data to video> |
So does that mean you had to write each image frame to file in order to stitch it together with ffmpeg? Is that the standard way of doing this? Or did you do it in some memory-only way similar to how Windows Media Player has its visualizer?
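To make the question concrete, the memory-only version I have in mind is something like the sketch below: ffmpeg can read raw frames from a pipe instead of from image files. This is only my guess at how it could be done, not a claim about how you did it. The names are made up, and the frames are assumed to be tightly packed 8-bit RGB with even width and height.

```python
import subprocess

def build_pipe_command(width, height, fps, output):
    """Build the ffmpeg argument list that reads raw RGB frames from stdin."""
    return [
        "ffmpeg",
        "-f", "rawvideo",            # input has no container, just pixel data
        "-pix_fmt", "rgb24",         # 3 bytes per pixel: R, G, B
        "-s", f"{width}x{height}",   # frame size must be given for raw input
        "-framerate", str(fps),
        "-i", "-",                   # "-" means read from stdin
        "-c:v", "libx264",           # assumes ffmpeg was built with libx264
        "-pix_fmt", "yuv420p",
        output,
    ]

def solid_frame(width, height, rgb):
    """Return one frame filled with a single colour (stand-in for real image data)."""
    return bytes(rgb) * (width * height)

def encode_frames(frames, width, height, fps, output):
    """Feed an iterable of raw frames (bytes) to ffmpeg without touching the disk."""
    proc = subprocess.Popen(build_pipe_command(width, height, fps, output),
                            stdin=subprocess.PIPE)
    for frame in frames:
        proc.stdin.write(frame)
    proc.stdin.close()               # signals end of input to ffmpeg
    if proc.wait() != 0:
        raise RuntimeError("ffmpeg exited with an error")
```

For example, encode_frames((solid_frame(640, 360, (255, 0, 0)) for _ in range(60)), 640, 360, 30, "red.mp4") should give two seconds of solid red, with no intermediate image files.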