Given three numbers determine the fourth

Question

This is eluding me. I have four different databases with the following row/record counts (specified in the titles) with corresponding total database sizes. I need to determine what the database sizes will be when there are 50 million records.

Database   Empty     5k records   10k records   50 million records
A          7626 kB   20 MB        31 MB         ?
B          6298 kB   6298 kB      6298 kB       ?
C          7801 kB   7914 kB      7914 kB       ?
D          12 MB     1209 MB      2405 MB       ?

The sizes for databases B and C are obvious.

I don't think I can simply take the values found for 10k - 5k and use that as a delta and then multiply by the difference between 50 million and 10k (using 50 million would be close enough if I am on the correct path). Figuring this out will help our database administrator pre-size our databases.

I apologize that I have no idea what tag to use for this. =/

Have you ever heard of interpolation and extrapolation? Your data set is far too small to make any reasonable guess about the order at which the size will grow, but linear growth seems the likeliest assumption. Do you know how to find the equation of a line given two points? — JMoravitz, Aug 24 '16 at 19:42
When you say the size for database C is obvious, you're implicitly saying that you can just extrapolate the delta between 10k and 5k and multiply it by the ratio between 50 million and 5k. — Erick Wong, Aug 24 '16 at 19:42
What makes you think that simply taking the values found for 10k - 5k and using that as a delta and then multiply (by about 10k) might be worse than any other approach? — Henry, Aug 24 '16 at 19:43
This question is not very mathematical as there is potential for implementation-specific dependencies. For instance, a database could choose to use narrower datatypes for indices when the number of rows falls below a certain threshold, making the interpolation from 10k to 50 million non-linear to an extent that can't be measured from the data you have given. In addition, many file formats will be block-based so that you won't see a change in file size unless you create a delta sufficiently large. — Erick Wong, Aug 24 '16 at 19:44
@ErickWong makes a good point. Do you really think that the size of the database for $C$ would still be $7914$ kB if it were being used for $10^{10^{100}}$ records? More records than bits? Seems rather fishy. — JMoravitz, Aug 24 '16 at 19:45
@Henry I believe that's because I so rarely work on even simple problems like this. It's been a long time since I've had to touch problems like this. — D-Klotz, Aug 24 '16 at 19:46
@ErickWong and JMoravitz, I think you are both correct. Having a data sample of 1 million records would be ideal, and I may end up doing this. — D-Klotz, Aug 24 '16 at 19:53

score 1 · Accepted Answer · answered Aug 24 '16 at 19:45

I think you are intended to say for A that the increase is roughly 23 MB per 10k records, or about 2.3 kB/record. This leads to about 115 GB at 50 million records. For D, the increase is about 2,400 MB per 10k records, or about 240 kB/record, roughly a hundred times that of A, which leads to about 12 TB at 50 million records. These, and your answers for B and C, assume that the size can be modeled as a fixed size to create the database plus an approximately linear increase per record. This may or may not be correct.
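
As a rough illustration of the linear model described here (a fixed creation size plus a constant cost per record), the sketch below extrapolates each database from its empty and 10k measurements. The names `MEASURED_MB` and `extrapolate_mb` are hypothetical, the kB figures are converted assuming 1 MB = 1000 kB, and nothing here accounts for the block-size or datatype effects raised in the comments.

```python
# Sizes in MB taken from the question's table (kB values divided by 1000,
# which is an assumption about the units). Keys are record counts.
MEASURED_MB = {
    "A": {0: 7.626, 5_000: 20.0, 10_000: 31.0},
    "B": {0: 6.298, 5_000: 6.298, 10_000: 6.298},
    "C": {0: 7.801, 5_000: 7.914, 10_000: 7.914},
    "D": {0: 12.0, 5_000: 1209.0, 10_000: 2405.0},
}


def extrapolate_mb(points, target_records):
    """Linear estimate: fixed overhead plus a constant cost per record.

    The line is fixed by the empty measurement and the largest measurement.
    """
    n_max = max(points)                       # largest measured record count
    base = points[0]                          # size of the empty database
    per_record = (points[n_max] - base) / n_max
    return base + per_record * target_records


if __name__ == "__main__":
    for name, points in MEASURED_MB.items():
        estimate = extrapolate_mb(points, 50_000_000)
        print(f"{name}: about {estimate / 1000:,.1f} GB")
```

Fitting the line through the 5k and 10k points instead would change the estimates, most visibly for C, whose size does not change between those two measurements. That sensitivity is one reason a larger sample is worth measuring before pre-sizing.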

Ross Millikan

Thank you. There will be variance in the size of each record but it isn't critical that we be exact. – D-Klotz Aug 24 '16 at 19:48
