Finding a set of data points from scratch, to satisfy conditions of mean and standard deviation

Question

A question asked me to find a set of data points (numbers) with mean $50$ and standard deviation $8.75$; any number of data points is allowed.

My best attempt was guess and check, using $50$ together with one value above and one value below it (the difference above and below being the same). The standard deviation gets very close to $8.75$, but apparently it can't be exactly $8.75$ with just three data points. Is this true?

The ACTUAL question provided me a list of data points with known mean $48$ and known standard deviation $8.75$ (It was around $9$ data values), then it asked me to find a set of data points with the same standard deviation but mean $50$.

I will not provide the data values because my question is: can we come up with a list from scratch in an algebraic manner?

If not, then at least, what restrictions can we infer from the given information? For example (I have no idea whether this is true), must the new data set have the same cardinality as the old data set, or something like that, in order for the standard deviation to be the same and the mean different?

NB: I already know of the correct way of how to solve it, by shifting all the data points up by two to make the mean from $48$ to $50$, while retaining the same standard deviation.

Some of the 'Related' links in the margin may give an idea how to do this, but none are exact 'Duplicates.' // My main numerical answer uses $n = 15$ observations, but my method would also work with $n = 3.$ (Also with $n = 2,$ but, of course, not with $n = 1.$) — BruceET, Jun 07 '21 at 20:18
@SeanRoberson very interesting. Why is that so? If I add in any random constant (eg 10000), it will NOT change the variance? Seems hard to believe — user71207, Jun 08 '21 at 00:53
Intuitively, you're keeping all the data together so the spread doesn't change. Scaling everything by a constant, however, will affect variance. — Sean Roberson, Jun 08 '21 at 01:10
but we are adding a number that is a significant outlier, so won't that make the spread bigger? — user71207, Jun 08 '21 at 01:33
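
A minimal sketch in R of the distinction discussed in these comments, using made-up values (not the data from the original exercise; they are chosen to have mean 48, and their SD is not 8.75). Adding the same constant to every point shifts the mean but leaves the SD alone, whereas appending one outlying value to the list is a different operation that does inflate the SD.

```r
# Hypothetical data, NOT the values from the original exercise
x <- c(35, 40, 44, 47, 48, 50, 53, 57, 58)

y <- x + 2                 # add the same constant to every data point
mean(x); mean(y)           # the mean moves up by 2
sd(x);   sd(y)             # the SD is the same for both

w <- c(x, 10000)           # appending one outlier is a different operation
sd(w)                      # much larger than sd(x)
```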

BruceET · Answer 1 · 2021-06-07T20:26:03.347

Suppose that the data are $X_1, X_2, \dots, X_n,$ with sample mean $\bar X$ and sample standard deviation $S_X.$

Then let $$Z_i = \frac{X_i - \bar X}{S_X},$$ so that $Z_1, Z_2, \dots, Z_n$ have sample mean $\bar Z = 0$ and sample SD $S_Z = 1.$ [Sometimes this is called standardizing.]

Finally, let $$Y_i = 8.75\,Z_i + 50,$$ so that $\bar Y = 50$ and $S_Y = 8.75.$ [Sometimes this is called re-scaling.]
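
A short check of why the re-scaling $Y_i = 8.75\,Z_i + 50$ gives these values, using only $\bar Z = 0$ and $S_Z = 1.$ It assumes the sample SD uses the $n-1$ divisor (as R's `sd` does) and that $S_X > 0,$ which requires $n \ge 2$ and data that are not all equal:

$$\bar Y = \frac{1}{n}\sum_{i=1}^n (8.75\,Z_i + 50) = 8.75\,\bar Z + 50 = 50,$$

$$S_Y^2 = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar Y)^2 = \frac{8.75^2}{n-1}\sum_{i=1}^n (Z_i - \bar Z)^2 = 8.75^2\,S_Z^2 = 8.75^2.$$

The added constant cancels in $Y_i - \bar Y,$ which is why shifting changes the mean but not the SD.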

Example, using R as a calculator:

set.seed(2021)                   # for reproducibility
x = sort(round(rexp(15),2));  x  # exponential data
 [1] 0.01 0.02 0.13 0.19 0.27 0.28 0.38 0.40
 [9] 0.63 0.64 0.70 0.76 1.19 1.22 1.50
z = (x - mean(x))/sd(x)
mean(z);  sd(z)
[1] 4.44812e-17  # essentially 0
[1] 1
y = 8.75*z + 50
mean(y);  sd(y)
[1] 50
[1] 8.75
y
 [1] 39.56985 39.76135 41.86780 43.01678
 [5] 44.54875 44.74024 46.65520 47.03820
 [9] 51.44260 51.63410 52.78307 53.93205
[13] 62.16638 62.74087 68.10275

Notes: (1) Some software will do some tests and other procedures, either based on summarized data or from the original data. (Minitab is one of them.) Some software requires the data. (R usually does.) So if you are using R and given only $n = 15, \bar Y = 50, S_Y = 8.75,$ and the fact that data came from a normal population, you can simulate a sample $X_i$ of size $n$ from any normal distribution, standardize, and re-scale as above to get appropriate data $Y$ as input to the desired procedure (t test, t interval, etc.).
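
A minimal sketch of Note (1) in R, assuming only the summaries $n = 15,$ $\bar Y = 50,$ $S_Y = 8.75$ are known. The seed and the null value `mu = 45` are arbitrary choices made for illustration; they do not come from the original problem.

```r
set.seed(1)                   # arbitrary seed, for reproducibility
n <- 15
x <- rnorm(n)                 # any normal sample will do as a starting point
z <- (x - mean(x)) / sd(x)    # standardize: mean 0, SD 1
y <- 8.75 * z + 50            # re-scale: mean 50, SD 8.75
c(mean(y), sd(y))             # matches the given summaries

t.test(y, mu = 45)            # one-sample t test; mu = 45 is hypothetical
t.test(y)$conf.int            # 95% t confidence interval for the mean
```

Because the t statistic and t interval depend on the data only through $n,$ $\bar Y,$ and $S_Y,$ any starting sample gives the same test result.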

(2) A potential problem with using summarized data $Y_i$ (for a procedure requiring normal data) is that there is no way to check whether the original data were approximately normal.

(3) You asked about the case $n = 3,$ explicitly. Here is my method. Just for variety, I started with a normal sample this time, but the method is the same.

set.seed(607)
x = sort(round(rnorm(3),2)); x
[1] -1.00  0.25  0.98
z = (x - mean(x))/sd(x)
mean(z);  sd(z)
[1] -9.251859e-18
[1] 1
y = 8.75*z + 50
mean(y);  sd(y)
[1] 50
[1] 8.75
y
[1] 40.59155 51.51467 57.89378

But if you round the $Y_i$ to two places, the sample mean and SD are no longer exact.

y2 = round(y,2);  y2
[1] 40.59 51.51 57.89
mean(y2);  sd(y2)
[1] 49.99667     # happens to round to 50.00
[1] 8.748722     # and to 8.75
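
For $n = 3$ the rounding problem can be avoided with a symmetric set, which also answers the guess-and-check question directly. The values $50 - d,\ 50,\ 50 + d$ have sample variance $(d^2 + 0 + d^2)/2 = d^2,$ so $d = 8.75$ gives sample SD exactly $8.75.$ This sketch assumes the $n - 1$ divisor that R's `sd` uses; with the population (divisor $n$) formula one would instead need $d = 8.75\sqrt{3/2} \approx 10.72,$ which may be why guess and check only came close.

```r
d  <- 8.75
y3 <- c(50 - d, 50, 50 + d)   # 41.25, 50.00, 58.75
mean(y3)                      # 50
sd(y3)                        # 8.75 (sample SD, n - 1 divisor)
```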

Thanks, the rescaling point is what I was looking for. But I don't seem to get much information when I search "rescaling" on Google — user71207, Jun 08 '21 at 01:34
Except for textbook problems, real-world data analysis uses actual data, sometimes transformed to help them match assumptions of analytic procedures. Also, various kinds of 'standardization' are often used. // But the exact task in your question does not arise often. Also 're-scaling' is more a plain English description than technical terminology. For these reasons, I'm not surprised you had difficulty getting the answer by googling. // Maybe best to think of this problem more as a way to get intuition about linear transformation of data than a useful procedure for data analysis. — BruceET, Jun 08 '21 at 05:01
