2

My question:

I have a list of numbers. These numbers belong to two accumulations; each accumulation contains an unknown number of values scattered around its own average, which I don't know.

How can I find a threshold between those two accumulations, so that I can say for every number whether it belongs to accumulation $1$ or $2$?

Calculating the average of the two values forming the biggest jump would not work; it would be too imprecise.

Almost no two numbers are the same, so the raw data do not form a visibly bimodal distribution.

In the end a computer will do the calculation, so the procedure may be lengthy.

The data come from a human pressing a button for a long or a short time. The computer should detect whether a long or a short press was meant, independently of the absolute duration of the press.

Thanks for your advice.

fecavy
  • 41
  • This sounds like it falls into the hypothesis test framework: what's the probability that a given value was generated from one distribution vs. another one? To set up such a test you will need to say something about the distributions themselves. You don't necessarily know everything; maybe you know they are normally distributed but you have only some prior distribution for the mean and variance. But you'll need to know or at least assume something. – Ian Mar 08 '17 at 14:19
  • I would start by drawing a histogram of the data before thinking about the abstract question. Maybe the picture will show you a natural division. Check out bimodal distribution: https://en.wikipedia.org/wiki/Multimodal_distribution – Ethan Bolker Mar 08 '17 at 14:20
  • Yeah, that's a good idea: calculating the minimum between the two maxima. The problem, I think, is that almost no values are the same. Maybe it would make sense to reduce the "resolution" of the numbers by grouping them. – fecavy Mar 08 '17 at 14:32
  • If you have more than just a few numbers then building a histogram is just what you want to do to "reduce the resolution". Then the histogram might show bimodality. In general, modes are only good for grouped or imprecise data, otherwise random exact coincidences destroy the meaning you're looking for. – Ethan Bolker Mar 08 '17 at 14:53
  • There are so very many ways to do this. Without any more specifics of the data or the tools you have available it's quite difficult to answer. – mathreadler Mar 08 '17 at 14:55
  • I have now provided some more information. – fecavy Mar 08 '17 at 15:07

2 Answers

0

I already have an idea: maybe I could "group" the numbers, reducing their "resolution", and then calculate the threshold of the resulting bimodal distribution. But this "resolution" has to be right: if it's too small, the result would be too imprecise; if it's too high, the result could be totally wrong. I'm interested in your ideas :)
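A minimal sketch of this grouping idea, to make the role of the "resolution" concrete. Everything here is an assumption for illustration: the function name `histogram_threshold`, the equal-width bins, the rule that one peak lies in the lower half of the bins and the other in the upper half, and the sample press durations (made-up values in milliseconds).

```python
def histogram_threshold(values, bins):
    """Group values into equal-width bins and return the centre of the
    emptiest bin between the two peaks.

    Assumption: one peak lies in the lower half of the bins and the
    other in the upper half. This is not guaranteed for real data.
    """
    lo, hi = min(values), max(values)
    if hi == lo or bins < 3:
        raise ValueError("need spread-out values and at least 3 bins")
    width = (hi - lo) / bins

    # Count how many values fall into each bin; the maximum goes in the last bin.
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1

    # Tallest bin in each half of the range.
    half = bins // 2
    left_peak = max(range(half), key=counts.__getitem__)
    right_peak = max(range(half, bins), key=counts.__getitem__)

    between = range(left_peak + 1, right_peak)
    if len(between) == 0:
        raise ValueError("no bin between the peaks; try more bins")

    # Several bins may tie for emptiest; take the middle entry of those.
    lowest = min(counts[i] for i in between)
    valley = [i for i in between if counts[i] == lowest]
    middle = valley[len(valley) // 2]
    return lo + (middle + 0.5) * width


# Hypothetical press durations in milliseconds.
presses = [110, 120, 130, 125, 115, 400, 520, 610, 450, 520]
threshold = histogram_threshold(presses, bins=5)
labels = ["long" if p > threshold else "short" for p in presses]
```

The result depends on `bins`, which is exactly the "resolution" problem described above: with too few bins the two peaks end up adjacent and the function gives up, and with too many bins almost every bin holds zero or one value, so the peaks are no longer meaningful.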

fecavy
  • 41
  • Please edit this idea into your question. It's not really an answer. And I don't think it's a particularly good idea. You should be able to make up some data where it gives you an answer that makes no sense. And it won't help if there is no good way to form your two groups exactly. – Ethan Bolker Mar 08 '17 at 14:27
  • I edited it now. – fecavy Mar 08 '17 at 14:41
  • Edit the question, not the answer. – mathreadler Mar 08 '17 at 14:57
0

The general concept you need is called discriminant analysis, pioneered by R. A. Fisher about 80 years ago. You can read about Fisher's original discriminant analysis in the Wikipedia article. But your particular problem is the simplest possible case of discriminating between only two groups, so something like my simplified procedure suggested below might work.

In order for perfect discrimination to be possible, the maximum value for 'short' pulses must be less than the minimum value for 'long' ones. Human subjects may initially have a variety of definitions of 'short' and 'long', so without some instruction, discrimination may not be possible.

You could start each subject's session with a sequence of five or so responses prompted to be 'long' intermixed with five prompted to be 'short'. Then you could see if further familiarization with the procedure is necessary.

A vastly simplified version of Fisher's discriminant analysis would be to take the point halfway between the means of short and long presses, $\bar X_s$ and $\bar X_\ell,$ respectively, and see if that completely separates short from long. Because short pulses may have a smaller standard deviation (SD) $S_s$ than long ones $S_\ell,$ it may work better to see if the value $\bar X_s + cd$ is a suitable value for separation, where $d = \bar X_\ell - \bar X_s$ and $c = \frac{S_s}{S_s + S_\ell}.$
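As a rough sketch of the two candidate thresholds, assuming the prompted calibration presses are already labelled 'short' or 'long': the function names and the sample durations (in milliseconds) are hypothetical, and $c$ is taken as $S_s/(S_s + S_\ell)$ with the sample standard deviations.

```python
from statistics import mean, stdev


def midpoint_threshold(short, long_):
    """Point halfway between the two sample means."""
    return (mean(short) + mean(long_)) / 2


def weighted_threshold(short, long_):
    """Mean of the short presses plus c*d, where d is the gap between
    the means and c = S_s / (S_s + S_l), so the cut sits closer to the
    group with the smaller spread."""
    mean_s, mean_l = mean(short), mean(long_)
    sd_s, sd_l = stdev(short), stdev(long_)  # each needs >= 2 values
    d = mean_l - mean_s
    total = sd_s + sd_l
    c = sd_s / total if total > 0 else 0.5  # fall back to the midpoint
    return mean_s + c * d


def separates(threshold, short, long_):
    """True if every short press is below and every long press above."""
    return max(short) < threshold < min(long_)


# Hypothetical calibration presses in milliseconds (five of each type).
short_ms = [110, 120, 130, 125, 115]
long_ms = [400, 520, 610, 450, 520]

mid = midpoint_threshold(short_ms, long_ms)
cut = weighted_threshold(short_ms, long_ms)
ok = separates(cut, short_ms, long_ms)
```

With these made-up values the midpoint is 310 ms, while the weighted cut is about 154 ms, much closer to the tightly grouped short presses. If `separates` returns `False` for a subject, that suggests more familiarization is needed before the presses can be told apart.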

However, you have historical data values, $Y_s$'s and $Y_\ell$'s, of short and long pulse lengths, respectively. So, the 'familiarization' period might be shortened by demonstrating an ideal short pulse with a tone of length $\bar Y_s$ and an ideal long one with a tone of length $\bar Y_\ell.$ Then give the subject the opportunity to show a couple of pulses of both lengths. If he/she succeeds, the computer might say "You've got it." And if not, "I can't quite tell the difference, let's try a few more." before launching into the above-mentioned session with five of each type.

Because you haven't said much about the setting in which the long and short pulses are used, my exact suggestions may not be feasible. But the ideas are proven and sound, so I'm sure you can think of a way to modify them to fit your particular needs.

BruceET
  • 51,500