Sorry, I don't know the math words.

I have a data set that looks like the graph below. I am counting things, and the graph shows how many of each thing there are. For example, I have a LOT of thing #0, and not many of thing #300.

In my application, things that are ubiquitous are assumed to exist; I don't need to include them in my reports because they are just noise. As a real-world example, if people listed all the things in the room in which they sit, most would list tables, chairs, computers, etc. Only the most contrarian person would list bacteria and air.

My complete graph has a much longer tail than you see here - 10x more data. The average for my data set is very low, something like 30. The standard deviation is about 300.

Is it mathematically sensible for me to choose a number of standard deviations and remove the data that is "to the left" of the cut-off? In my case, say I choose 4 standard deviations. 30 + 4*300 = 1230. So I'd cut off everything with a count greater than 1230.
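To make the arithmetic concrete, here is a minimal Python sketch of the rule I am proposing. The `counts` dictionary and the function names `cutoff` and `filter_ubiquitous` are made up for illustration; only the "mean plus k standard deviations" rule comes from the description above.

```python
import statistics

def cutoff(mean, std, k=4.0):
    """Cut-off value: mean plus k standard deviations."""
    return mean + k * std

def filter_ubiquitous(counts, k=4.0):
    """Keep only the things whose count does not exceed the cut-off."""
    values = list(counts.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    limit = cutoff(mean, std, k)
    return {thing: n for thing, n in counts.items() if n <= limit}

# The numbers from the question: 30 + 4*300
print(cutoff(30, 300, 4))

# Hypothetical data: one very common thing and many rare ones.
counts = {"air": 9000, "table": 40, "chair": 35}
counts.update({f"rare_{i}": 2 for i in range(200)})
kept = filter_ubiquitous(counts, k=4)
print("air" in kept, len(kept))
```

Whether the mean and standard deviation are the right statistics for data shaped like mine is exactly what I am asking; the sketch only shows the mechanics.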

Am I using this right?

EDIT

I am continuously modifying this data set, so I want to recalculate the average and standard deviation regularly.
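One way to recalculate without rescanning every value is an online update. The sketch below uses Welford's algorithm, which is an assumption about a suitable method rather than anything already in my application; the class name `RunningStats` is also made up. It only handles newly added values, so changing or removing an existing count would need extra handling.

```python
import math

class RunningStats:
    """Online mean and standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        """Fold one new value into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        """Population standard deviation of the values seen so far."""
        return math.sqrt(self._m2 / self.n) if self.n else 0.0

    def cutoff(self, k=4.0):
        """Mean plus k standard deviations, using the current statistics."""
        return self.mean + k * self.std

# Hypothetical usage: feed values as they arrive, then read the cut-off.
stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(value)
print(stats.mean, stats.std, stats.cutoff(4))
```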

[Image: graph of the count for each thing]

  • Do you want to do this just one time, for a single set of data, or do you want a mathematical formulation that will be able to handle multiple instances of similarly shaped distributions? Because if it's the former, I would just look at your graph there and see that there are two obvious cut-offs, one at around 6400 and another around 1900. – JonathanZ Jun 01 '22 at 14:08
  • And I hope you'll get responses from people with better stats chops than me, but my inexpert opinion is that standard deviation is a useful measure for quantities that cluster around a central value, and that's not your data. – JonathanZ Jun 01 '22 at 14:12
  • @JonathanZsupportsMonicaC to answer your question, I will be modifying the dataset frequently, and thus need to recalculate frequently. – Tony Ennis Jun 01 '22 at 14:13
  • ... and that's why I asked the question; I was wondering if I had to have a bell-curve sort of distribution... And not only that, the data I want is what is usually considered the noise! – Tony Ennis Jun 01 '22 at 14:14
  • Yeah, "bell curve" = "cluster around a central value". If you want terms to search on, you have a "long tail distribution", the standard example of which is a "power law", and you're looking to compute a reasonable "cut-off value". – JonathanZ Jun 01 '22 at 14:19
  • Chebyshev's inequality provides some justification for neglecting the possibility of data falling outside $\mu \pm K \sigma$ for large $K$ for any underlying distribution with finite variance. However the bound you get is only $1/K^2$, which is much worse than with something like the normal distribution. Nonetheless, any distribution with finite variance has a 99%+ probability of falling within 10 standard deviations of its mean. – Ian Jun 01 '22 at 14:19
  • (Cont.) Still, this is pessimistic; ideally you would be able to get a better estimate by making some assumptions about your distribution (e.g. fit it to a power law or a mixture of power laws). – Ian Jun 01 '22 at 14:20
  • Also, you say "continuously modifying"? So you might have 10,000 pieces of data, and then you add 500, and want to know what your new cut off value should be? A factor here is if you're content to recalculate from the beginning for your new data set, or if you feel you should be able to reuse the initial results and not have to rescan those 10,000 values. The latter are called "online algorithms", and there are nice ones for std. dev., but std. dev. probably isn't what you want. – JonathanZ Jun 01 '22 at 14:30
  • BTW, there is a stats.stackexchange.com that might be a better fit for this question. – JonathanZ Jun 01 '22 at 14:32
  • @JonathanZsupportsMonicaC Regarding modifying: yes, I am adding new data. I would only use the cut-off when displaying raw data - our front end 'turns black' from too many lines with all the actual data. For machine learning, I want all the data. The silicon monster will decide what's noise and what isn't. That is, the data isn't deleted, just ignored in some cases. – Tony Ennis Jun 01 '22 at 14:38
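The Chebyshev bound mentioned in the comments can be checked numerically. The following is a rough Python sketch, not a recommendation of a method: the sample is synthetic (a Pareto draw with an arbitrarily chosen shape of 1.5, standing in for long-tailed data), and the function names are hypothetical. It compares the observed fraction of values more than $K$ standard deviations from the mean with the $1/K^2$ bound.

```python
import random
import statistics

def chebyshev_bound(k):
    """Chebyshev upper bound on the fraction of values k or more std devs from the mean."""
    return 1.0 / (k * k)

def fraction_outside(values, k):
    """Observed fraction of values more than k population std devs from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    outside = sum(1 for v in values if abs(v - mean) > k * std)
    return outside / len(values)

# Synthetic long-tailed sample; the seed and shape parameter are arbitrary.
rng = random.Random(0)
sample = [rng.paretovariate(1.5) for _ in range(10_000)]

for k in (2, 4, 10):
    print(k, fraction_outside(sample, k), chebyshev_bound(k))
```

The observed fraction never exceeds the bound, but it can sit far below it, which is the sense in which the bound is pessimistic.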

0 Answers