2

I have asked questions about Number of unique integer in randomly generated arrays. Suppose we have $10^6$ random generated numbers, we should have about $~6*10^5$ unique numbers. I wrote a custom sort algorithm based on this assumption.

Pseudocode :

P := 0

RecursiveSort(Array A): If length of A <= 1: return A

Find maximum and minimum of A. Create Array B of length A.

P := P + length A // For evaluating purpose

For each number I in A: I := (I - min) / (max - min) // Rescaling I based on max and min of A K := I multiply by the length of A // 0<=K<=Length of A K := round(K) // Make K an index in Array B Put I in B[K]. // Each B[K] may hold several I

Create Array Result of length 0. For each array L in B: For each number I in RecursiveSort(Array L): Put I in Result

Return Result

After several runs, I noticed that P / length Awas approximately 1.73 for uniform distribution and 2.19 for normal distribution. Why is that so?

For uniform distribution, after each iterations the size of Array A reduce by $e$ based on the assumption, shouldn't P / length A be approximately 1.58 ($ 1 + e^{-1} + e^{-2} + e^{-3} + .. + e^{-n}$) ?

Here is actual python code:

import numpy as np
import random
A = []
lim = 1000000
#Generating normal distribution array.
# a_normal = np.random.normal(lim * 1000 / 2, lim * 1000 / 6, lim) 
# a_normal = np.random.normal(lim * 1000 / 6, lim * 1000 / 4, lim)
# a_normal = np.random.normal(0, lim , lim)
# A = a_normal.tolist()

for _ in range(lim): A.append(random.randrange(lim * 100, lim * 1000 - 1))

P = 0

def custom_sort(A): if (len(A)) <= 1: return A

mi = min(A) ma = max(A)

global P P += len(A)

if ma == mi: # if all numbers are the same return A

B = [[] for _ in range(len(a))] for I in A: K = int((i - mi) / (ma - mi) * (len(a) - 1)) B[K].append(I)

result = [] for L in B: for I in custom_sort(L): result.append(I) return result

sort(A) print(P / len(A))

Era
  • 43

1 Answers1

1

This is a bucket sort with $k$ (the number of buckets) equal to the input length. It runs in linear time if the input data is uniformly distributed, since all the buckets are small after the first pass.

You can optimize a bucket sort for any input distribution by using that distribution's CDF as the bucketing function; if the input matches the expected distribution, then the CDF values will be uniformly distributed between 0 and 1.

So e.g. if you expect input data to be normally distributed, you could start by computing the mean and variance of the data and then bucketing using a normal CDF with those parameters.

Karl
  • 11,446
  • 1
    I guess you're also asking about your theoretical estimate of the runtime constant factor for your recursive algorithm. Maybe someone else can answer that part. – Karl Aug 06 '21 at 17:51
  • I am curious about that. After setting array size to be $10^6$, and generated range from $(0, 10^6)$ to $(0, 10^{16})$, the sort functions took the same time (2 seconds) and return the same P / length A ~ (1.73). – Era Aug 06 '21 at 17:58
  • 2
    Hmm, if the input is perfectly uniform then the runtime should be $T(n)=n+kT(n/k)$, so $T(n)=n\log_kn$. It's like a Quicksort with $k-1$ pivots (instead of 1) at each stage. – Karl Aug 06 '21 at 18:06