I am trying to understand the Wikipedia page concerning the Naive Bayes classifier used in Machine Learning here.
Supposing we have a vector $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$. If we wish to know the probability that this vector belongs to some class $C$, then we are interested in the conditional probability $p(C|x) = \frac {p(x|C)p(C)}{p(X)}$ (RHS by Bayes Rule). The author writes the numerator is equivalent to the joint probability model:
$p(C,x) = p(x_1,\ldots,x_n,C) = p(x_1|x_2,\ldots,x_n,C) p(x_2|x_3, \ldots,x_n,C)\ldots p(x_{n-1}|x_n,C)p(x_n|C_k)p(C_k)$
(by iteration of the chain rule for probability).
Next the "naive" conditional independence assumption comes into play, such that all the features in $x$, (it's components each treated as random variables) are independent. Under this assumption $p(x_i|x_{i+1},\ldots,x_n,C) = p(x_i|C_k)$.
This last statement is where I am confused. I understand that if two random events $A,B$ are independent then $p(A\cap B)= p(A)p(B)$, and intuitively if we have three events $A,B,C$, then the probability that $A$ occurs given $B,C$ where $A$ and $B$ are independent should only depend on $A,C$, but I don't get how this is arrived at mathematically. In this scenario... $p(A|B,C) = \frac{p(A \cap B \cap C)}{p(B \cap C)}$, and in general
$P (A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$
so that $p(A \cap B \cap C) = -(P(A)+P(B)+P(C)-P(A)P(B)-p(A \cap C) - p(B \cap C) + P(A \cup B \cup C))$
Any insights appreciated.