Say I have a sentence consisting of n words.
That word can be
I like horse and I also like cats and kitten.
I want to compute the number of sequential words.
For example,
I, Like, Horse, are phrases with one word.
I Like, Like Horse, Horse and, are phrases with two words.
The total number of such phrases should be
n + (n-1) + (n-2) + ... +1
There are n phrases with 1 word, (n-1) phrases with 2 words, and 1 phrase with n words. So there are n sets and each set contain on average (n+1)/2 words.
So total there is n(n+1)/2 such phrases.
Say I want to limit the number of such sequences.
For example, do not compute every such sequences but only sequences less or equal to k.
For example, k =2 we got
I, Like, Horse, are phrases with one word.
I Like, Like Horse, Horse and, are phrases with one word.
I like Horse, like horse and, and everything else doesn't count.
So we got k sets and the number of element in each sets are n, n-1, n-2, ... n-k+1
When k=2, for example, we stop at n-1.
Hence the total number of such phrases are
k*(n+n-k+1)/2 = k*(2n-k+1)/2
I think my math so far is correct. When n is equal to k, it reduces to our previous formula of (n)(n+1)/2, when k is equal to 1, it reduces to simply n.
Now works on the formula we got
k*(2n-k+1)/2 = (2nk - k^2 +k)/2 = nk - .5*(k^2 +k)
Now this gets a little unintuitive.
Naturally we will think that the higher the k, the more phrases we have.
Yet the coefficient of k^2 is negative. It seems to me at some point, the higher the k, the less phrases we get.
How do we reconcile this problem?