What is the equivalent of a z-statistic for a textual variable containing discrete values?

Question

I have a variable in my data that contains discrete values with no canonical order, e.g. Apple, Orange, Pear.

These values appear with a certain frequency in my base sample. I have a subset of my sample which contains the same variable, and I would like to provide a measure of the similarity of the Fruit variable between the subset and the overall sample.

For continuous variables I use the z-stat and Kolmogorov-Smirnov, and I am looking for something equivalent for my Fruit variable.

I have considered ordering the values in the original sample by their frequency of occurrence and faking a CDF and using K-S, but that feels like a hack. Well, it would be a hack...

I could also invent something that takes a weighted difference of the populations, but I would rather use a conventional statistic if such a thing exists.

In my searches I have come across the Bhattacharyya distance and the Kullback-Leibler divergence. I think one of these might do the trick, but I am going to have a hard time explaining what they mean to my less statistically savvy audience. I am hoping I am missing something more obvious. — Simon, Sep 12 '22 at 18:21
The chi-squared test works with categorical data. Will that suit your purposes? — Dan, Sep 12 '22 at 22:55
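
To make the chi-squared suggestion concrete, here is a minimal sketch of a goodness-of-fit calculation that treats the base sample frequencies as the expected distribution. The fruit frequencies and subset counts are made up for illustration and do not come from the question. It also prints a one-sample proportion z-score per category, which may be the closest thing to a z-stat for this kind of variable.

```python
from math import sqrt

# Hypothetical numbers, for illustration only.
base_probs = {"Apple": 0.5, "Orange": 0.3, "Pear": 0.2}   # frequencies in the base sample
subset_counts = {"Apple": 20, "Orange": 30, "Pear": 50}   # counts in the subset

n = sum(subset_counts.values())

chi2 = 0.0
z_scores = {}
for fruit, p in base_probs.items():
    observed = subset_counts.get(fruit, 0)
    expected = n * p
    # Pearson chi-squared contribution of this category
    chi2 += (observed - expected) ** 2 / expected
    # One-sample proportion z-score for this category taken on its own
    z_scores[fruit] = (observed - expected) / sqrt(n * p * (1 - p))

dof = len(base_probs) - 1
print(f"chi-squared = {chi2:.2f} on {dof} degrees of freedom")
for fruit, z in z_scores.items():
    print(f"{fruit}: z = {z:+.2f}")
```

The statistic is compared against a chi-squared distribution with (number of categories - 1) degrees of freedom; if SciPy is available, `scipy.stats.chi2.sf(chi2, dof)` should give the p-value. The approximation is usually considered unreliable when expected counts are small.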

score 2 · Answer 1 · answered Sep 12 '22 at 22:51

For categorical data, you can use the multinomial test with the null hypothesis parameters set to the base sample frequencies.

The resulting p-value tells you how "unusual" your subset is, under the assumption that it was drawn from the base sample.
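
As a rough sketch of how an exact multinomial test can be computed: enumerate every possible set of category counts for a subset of the same size, and add up the probabilities of all outcomes that are no more likely than the one observed. The function names, fruit frequencies and subset counts below are my own illustrative assumptions, not a library API or data from the question.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of seeing exactly `counts` given category probabilities `probs`."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    return coef * prod(p ** c for p, c in zip(probs, counts))

def compositions(n, k):
    """Yield every way to split n observations across k categories."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def exact_multinomial_test(observed, probs):
    """p-value: total probability of all outcomes no more likely than `observed`."""
    p_obs = multinomial_pmf(observed, probs)
    n, k = sum(observed), len(observed)
    tol = 1e-12  # relative tolerance so floating-point ties are counted
    return sum(
        p
        for p in (multinomial_pmf(c, probs) for c in compositions(n, k))
        if p <= p_obs * (1 + tol)
    )

# Hypothetical numbers, for illustration only.
base_probs = [0.5, 0.3, 0.2]    # Apple, Orange, Pear frequencies in the base sample
subset_counts = [2, 3, 5]       # Apple, Orange, Pear counts in the subset

p_value = exact_multinomial_test(subset_counts, base_probs)
print(f"exact multinomial p-value = {p_value:.4f}")
```

Brute-force enumeration is only practical for small subsets with a handful of categories. For anything larger, a Monte Carlo version of the same idea or the chi-squared approximation is the usual fallback.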

Annika

