Probability that random byte array is a valid UTF-8 string?

Question

What is the probability that $n$-byte random byte array is a valid UTF-8 string?

It doesn't care if it's NFC or NFD.

If you expect answers, I think you should make your problem somewhat more well-posed. For example: Is NFC the same as NSA and is Edward Snowden involved? What is a byte again in bits? What do you mean by valid? — mathse, Apr 12 '14 at 10:16
Do you only allow 4 byte sequences and the legal code point end at U+10FFFF? i.e are you talking about the UTF-8 specified in RFC 3629? — achille hui, Apr 12 '14 at 10:24
I am sorry about the misunderstanding about NFC and NFD. achille hui gave me the answer what I really wanted. Since unicode normalizing requires some kind of database, it will become needlessly complicated (and it will spin out from a mathematics question) if the probability target is stricted into one of them. — Izumi Kawashima, Apr 14 '14 at 10:58

score 8 · Accepted Answer · edited Jul 06 '22 at 17:54

I will ignore the part about NFC and NFD in the question. Browsing over the standards, none of these normalized forms are closed under string concatenation nor there are simple criteria to determine whether something is in the normalized forms without some sort of database lookup.

I will stick to what is considered to be legal UTF-8 as specified in RFC 3629. In particular, we will ignore the possibility that the data stream begins with a BOM (byte order marker).

According to RFC 3629, UTF-8 encodes each of the $1,\!112,\!064$ valid code points in the Unicode code space using one to four 8-bit bytes. They have following bit patterns

$$ \begin{array}{|c:c:l|} \hline\\ k&\alpha_k&\text{bit patterns}\\ \hline\\ 1&128& \text{0xxxxxxx}\\ 2&30\times 64 & \text{110xxxxx}\; \text{10xxxxxx}\\ 3&15\times 64^2& \text{1110xxxx}\; \text{10xxxxxx}\; \text{10xxxxxx}\\ 4&4\times 64^3 & \text{11110xxx}\; \text{10xxxxxx}\; \text{10xxxxxx}\; \text{10xxxxxx}\\ \hline \end{array} $$ Not all the bit patterns corresponds to valid unicode code points. The column $\alpha_k$ above is the number of valid code points which uses $k$ bytes for encoding.

For $n \in \mathbb{Z}_{+}$, let $N_n$ and $P_n = \frac{N_n}{256^n}$ be the number and probability of valid UTF-8 string composed of $n$ bytes. Let $$ A(t) = \frac{\alpha_1}{256} t + \frac{\alpha_2}{256^2} t^2 + \frac{\alpha_3}{256^3} t^3 + \frac{\alpha_4}{256^4} t^4 = \frac{t}{2}+\frac{15}{512} t^2+ \frac{15}{4096}t^3+\frac{1}{4096}t^4. $$ By building up UTF-8 string from the one to four bytes UTF-8 representations of valid code points, we find:

$$1 + \sum_{n=1}^\infty N_n t^n = 1 + A(256t) + A(256t)^2 + A(256 t)^3 + \cdots = \frac{1}{1-A(256t)}$$

This leads to $$1 + \sum_{n=1}^\infty P_n t^n = \frac{1}{1-A\left(t\right)} = \frac{1}{1-\frac{t}{2}-\frac{15}{512} t^2- \frac{15}{4096}t^3-\frac{1}{4096}t^4} \tag{*1} $$ and hence the probabilities $P_n$ satisfy following recurrence relation:

$$P_n - \frac12 P_{n-1} - \frac{15}{512} P_{n-2} - \frac{15}{4096} P_{n-3} - \frac{1}{4096} P_{n-4} = \begin{cases}0,& n > 0\\1,& n = 0\end{cases}\tag{*2}$$

One can then use this recurrence relation to compute $P_n$ for any $n$ explicitly. Please note that for simplicity of presentation, we have set $P_0 = 1$ and $P_k = 0$ for negative $k$ in $(*2)$.

To deduce the asymptotic behavior of $P_n$. Let $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ be the four roots of the equation $A(t) = 1$. Numerically, they are $$\begin{cases} \lambda_1 &= 1.770796029915762\\ \lambda_2 &= -16.29524359356811\\ \lambda_3, \lambda_4 &= -0.23777621817382 \pm 11.91183775521532 i \end{cases}$$ Notice $$\frac{1}{1-A(t)} = \frac{1}{\prod_{i=1}^4\left( 1 - \lambda_i^{-1} t\right)} = \sum_{i=1}^4 \frac{1}{\lambda_i A'(\lambda_i)\left(1 - \lambda_i^{-1} t\right)} = \sum_{i=1}^4 \frac{1}{\lambda_i A'(\lambda_i)}\sum_{n=0}^\infty \frac{t^n}{\lambda_i^n} $$ This leads to $$P_n = \sum_{i=1}^4 \frac{1}{\lambda_i A'(\lambda_i)}\lambda_i^{-n}$$

For large $n$, above expression will be dominated by the contribution from $\lambda_1$, the smallest root in magnitude. As a result, we get following approximation:

$$P_n \sim \frac{1}{\lambda_1A'(\lambda_1)} \lambda_1^{-n} \sim 0.87739479563671 \times 0.56471777839234^n\tag{*3} $$

Following is a table of the exact and approximated values of $P_n$ for $1 \le n \le 10$. As one can see, the approximation $(*3)$ is pretty good, the relative error falls below $10^{-4}$ starting at $n = 3$.

$$ \begin{array}{|r:cl:l|} \hline\\ n & P_n &&P_n (\text{approx})\\ \hline\\ 1 & \frac{1}{2} &= 0.5 & 0.49548043976496\\ 2 & \frac{143}{512} &= 0.279296875 & 0.27980661318093\\ 3 & \frac{647}{4096} &= 0.157958984375 & 0.15801176897502\\ 4 & \frac{23393}{262144} &= 0.089237213134766 & 0.089232055135417\\ 5 & \frac{52839}{1048576} &= 0.05039119720459 & 0.050390927937455\\ 6 & \frac{3819383}{134217728} &= 0.028456620872021 & 0.028456652875968\\ 7 & \frac{17255005}{1073741824} &= 0.016069975681603 & 0.016069977792599\\ 8 & \frac{623629417}{68719476736} &= 0.0090750024100998 & 0.0090750021578506\\ 9 & \frac{176087305}{34359738368} &= 0.0051248150703032 & 0.0051248150574871\\ 10 &\frac{101826182527}{35184372088832}&= 0.0028940741721897 & 0.0028940741739357\\ \hline \end{array} $$

Finally, there is one more point I would like to comment.

The exact value of $P_2$ is $\frac{143}{512}$. Out of the $2^{16}$ random two-byte byte-sequences, $2^{14}$ of them are valid UTF-8 strings because they composed of purely US-ASCII characters. If one subtract these from $P_2$, we find the probability of a random two-byte byte-sequence which is not pure ASCII being valid UTF-8 is

$$\frac{\frac{143}{512}-\frac14}{1 - \frac14} = \frac{5}{128} \sim 3.90625\%$$

Reproducing what has been quoted on the wiki page of UTF-8.

Probability that random byte array is a valid UTF-8 string?

1 Answers1