Fun with statistics

I can’t say why I needed to know, but let’s just say I have been a little distracted recently while working on a problem. It turns out I could map my problem onto the German tank problem.

I had actually kind of pulled an equation out of the air and proclaimed (to myself), “this looks and feels right”. But I needed something more than my gut telling me that. It turns out that for a uniform distribution (which, for the most part, my problem is) the best estimate, in the minimum-variance unbiased sense, of the true population size based on a limited sample of the population numbers is:

N = m + m/k - 1

Where ‘N’ is the population estimate, ‘m’ is the largest serial number of the samples you have and ‘k’ is the number of samples you have.
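
As a minimal sketch, here is that formula in Python. The function name and the sample serial numbers are made up for illustration, and it assumes the serial numbers start at 1, run sequentially, and were sampled without repeats.

```python
def estimate_population(serials):
    """Estimate the population size N from observed serial numbers.

    Uses N = m + m/k - 1, where m is the largest observed serial
    number and k is the number of samples. Assumes serials start at 1
    and are sequential (an assumption, not something the data proves).
    """
    if not serials:
        raise ValueError("need at least one observed serial number")
    m = max(serials)  # largest serial number among the samples
    k = len(serials)  # number of samples
    return m + m / k - 1


# Hypothetical sample of four observed serial numbers.
observed = [19, 40, 42, 60]
print(estimate_population(observed))  # 74.0
```

With four samples and a largest serial number of 60, the estimate is 60 + 60/4 - 1 = 74.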

This could be used, presuming the serial numbers are sequential, to estimate the number of iPhones or Androids sold. This is far, far from my application but still a fun application of statistics.

In my application I could substitute in an expression for ‘m’ which made my problem identical to the German tank problem. After rearranging the resulting equation I came up with the exact equation I had, essentially, pulled out of the air!

I’m still marveling at the implications of that result. In a few days I have a meeting with people who may or may not be thrilled to know that much of the work they have done for the past couple of years is bogus and that I have the solution to make it all better.


3 thoughts on “Fun with statistics”

  1. Interesting, the more so in that I am worrying with a human population problem at the moment.

    What do you do when your serial numbers are skewed toward the first year (day, month) of production? As in a collector of Remington M51s, who collects only the early variants.

    Or a devastating event has taken out an age range? For example, a run of bad electrolytics on a computer motherboard has taken out six months of X company’s production and you need MTBF to comply with DOD regs.

    Or, more to the point, what do you do when the camps have swallowed but not disgorged an unknown number of victims born between 1940 and 1956?


  2. Joe,

    How would you account for multiple serial number schemes for the same product? Factory proof marks make it relatively easy to say that a Mauser rifle was manufactured in Oberndorf, Spandau, or Zastava, which means you just set up a different estimate equation to track the output of each factory.

    However, when factories put out the same product with a non-integrated serial number schema (or with a schema where factory A starts at 1, factory B starts at 1,000,000, and factory C starts with A1, and factory D starts with A1,000,000) that complicates the issue unless you know the number of factories producing items, and the particular schema of serialization.

    I’m not sure how you would account for an unknown number of suppliers in the process, but now I have something to think about all day….

  3. Stranger,

    I would have to know more about the population probability density function in order to even take a guess.


    Yes. Of course. In those cases you have to know more about the factories and the numbering schemes.

    Nowadays more sophisticated manufacturers will assign random numbers, choosing an alternate random number if there is a collision with a previously assigned one. Computers make it easy to generate random or otherwise non-sequential, non-repeating numbers. The Germans, not having computers, probably didn’t see the threat and/or didn’t think it was worth the extra work.
