2

Suppose we are given a sequence of positive numbers $0<a_1<a_2< \cdots < a_n$.

Step 1. Choose an integer $m$ where $m \in \{1,2,\cdots,n\}$. After choosing $m$, we divide our numbers into two subgroups $\{a_1,\cdots,a_m\}$ and $\{a_{m+1},\cdots,a_n\}$.

Step 2. Given two numbers $a', a'' \in \mathbb{R}$, we can calculate the sum of squares in the two subgroups $$SS = \sum_{i=1}^{m} (a_i-a')^2+\sum_{j=m+1}^n (a_j-a'')^2.$$

My question is how to choose $m, a', a''$ to minimize the $SS$.

The $a'$ and $a''$ are easy to solve. In fact, $a'$ should be the mean of the first subgroup, i.e. $\frac{a_1+\cdots + a_m}{m}$ and $a''$ should be the mean of the second subgroup, i.e. $\frac{a_{m+1}+\cdots + a_n}{n-m}$.

To solve $m$, I guess that we can firstly compute the mean $\bar{a}$ of the whole group, i.e. $\frac{a_1+\cdots+a_n}{n}$ and then $m$ should be the largest integer such that $a_m<\bar{a}$. But I don't know how to prove it.

afzd
  • 103
  • To the close-voter: I would understand voting to close as "belongs on Math.SE", but I don't quite understand what about this question "needs details or clarity". – Stephan Kolassa Jun 17 '20 at 06:27
  • Please do not cross post. It is against SE policy & wastes a lot of people's time. I'll migrate this to [math.SE] where the threads can be merged. –  Jun 17 '20 at 12:50

1 Answers1

1

The easiest way would likely be to calculate your target sum from step 2 for each $m\in\{1, \dots, n\}$, giving you $m$ sums $S_1, \dots, S_m$. Then just pick the $m$ giving you the smallest such $S_m$.

You can either do this straightforwardly or use an online algorithm to minimize the amount of calculation if this is an issue to you. Online estimation of variance with limited memory gives you an online way of calculating $\sum_{i=1}^m(a_i-a')^2$ for $m=1, \dots, n$ and should be easily modified for the second sum.

Unfortunately, it is not the case that $S_m$ is convex as a function of $m$ (which would allow you to do a faster search), e.g. for $$a=(0.01,0.09,0.48,0.49,0.55,0.6,0.86,0.95,0.97,1).$$

S

R code:

aa <- c(0.01, 0.09, 0.48, 0.49, 0.55, 0.6, 0.86, 0.95, 0.97, 1)
S <- sapply(seq_along(aa),function(ii)
    sum((aa[1:ii]-mean(aa[1:ii]))^2)+sum((aa[-(1:ii)]-mean(aa[-(1:ii)]))^2))
plot(S,type="o",xlab="m",pch=19)

In addition, your hypothesis (calculate $\bar{a}$, pick the largest $m$ such that $a_m<\bar{a}$) does not work, either: use

$$ a = (0.06, 0.26, 0.38, 0.51, 0.61, 0.76, 0.81, 0.94, 0.96, 0.98). $$

Then

$$ \bar{a} = 0.627\quad\text{and}\quad\max\{m|a_m<\bar{a}\}=5, $$

but (rounding to the third digit)

$$ S=(0.559, 0.371, 0.252, 0.214, 0.224, 0.334, 0.441, 0.622, 0.777, 0.916) $$

and

$$ S_4=\min S.$$

(Incidentally, note that in this case $S_m$ is convex, so your hypothesis does not even work in the convex case.)

  • I am very sorry. There is a typo in the question. I am looking for the smallest sum instead of the largest because in Random Forest when growing a decision tree, I need to minimize such a sum. – afzd Jun 17 '20 at 11:03
  • I edited my answer, it should be correct for your corrected question. – Stephan Kolassa Jun 17 '20 at 11:40
  • Note that on Math.se someone suggests in a comment that this is NP-hard. –  Jun 17 '20 at 12:35
  • If the $d$-dimensional analogue is NP-hard (per the commenter), that doesn't mean it is so in one dimension. The simplest solution is just to calculate the sums, which is polynomial in $n$, and do this for $m=1, \dots, n$. This is completely polynomial, even without looking at online/streaming algorithms. – Stephan Kolassa Jun 17 '20 at 12:55