Sorry, I messed up the condition.
UTF-8 decoding works as a state machine.
    while (!input.empty()) {
        byte b = input.pop();
        if (b < 0x80) {
            output.push(b); // b is just an ASCII character.
            continue;       // A one-byte subsequence is complete; go on to the next one.
        }
        // b is non-ASCII, so at least one more byte must follow it.
        if (input.empty())
            throw InvalidInput(); // This is the check that seq2 fails.
        byte second_byte_in_multibyte_sequence = input.pop();
        // (Further decoding logic omitted.)
    }
A valid UTF-8 sequence is a concatenation of subsequences, one per encoded code point. Decoding happens one subsequence at a time, so you can always decode a UTF-8 sequence starting from the middle, as long as you begin at the first byte of a subsequence. If that first byte is ASCII (below 0x80), the subsequence has length 1. If it is non-ASCII, the length is strictly greater than 1 and is given by the number of leading 1 bits in that first byte: 110xxxxx means 2 bytes, 1110xxxx means 3, and 11110xxx means 4. Every byte after the first has the form 10xxxxxx.
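So {0xC2, 0xA2} and {0xC2, 0xA2, 0xC2, 0xA2} are valid, but {0xC2}, {0xA2}, {0xA2, 0xC2, 0xA2}, and {0xC2, 0xA2, 0xA2} are not: {0xC2} announces a second byte that never arrives, and in the other three a byte of the form 10xxxxxx (0xA2) sits where a subsequence would have to start.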
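
To make the rule concrete, here is a minimal sketch of a structural check in C++. The names subsequence_length and is_well_formed are made up for illustration, and the check is deliberately incomplete: it only looks at the lead-byte/continuation-byte pattern described above and does not reject overlong encodings, surrogates, or code points above U+10FFFF, which a real validator would also need to do.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: total length of the subsequence that starts with
// `lead`, or 0 if `lead` cannot start a subsequence.
std::size_t subsequence_length(std::uint8_t lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx (continuation byte) or 11111xxx
}

// Hypothetical helper: structural check only (see the caveats above).
bool is_well_formed(const std::vector<std::uint8_t>& input) {
    std::size_t i = 0;
    while (i < input.size()) {
        const std::size_t len = subsequence_length(input[i]);
        // Reject a byte that cannot start a subsequence, or a subsequence
        // that would run past the end of the input.
        if (len == 0 || len > input.size() - i) return false;
        // Every byte after the first must look like 10xxxxxx.
        for (std::size_t k = 1; k < len; ++k) {
            if ((input[i + k] & 0xC0) != 0x80) return false;
        }
        i += len;  // Jump to the start of the next subsequence.
    }
    return true;
}
```

With this sketch, is_well_formed({0xC2, 0xA2}) and is_well_formed({0xC2, 0xA2, 0xC2, 0xA2}) return true, while the four invalid examples above all return false.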