Thursday, December 02, 2004

BruceZ on Std Deviation

Here's a repost of a detailed and simple explanation of standard deviation which I posted some time ago. Mason has an essay in the essay section which shows you how to compute it when your sessions are different numbers of hours. Later, I will update this thread with detailed information on how to construct an Excel spreadsheet to compute this easily, as well as a derivation of Mason's formula, and an alternative form of the formula. There is a thread in the probability forum right now that explains EV.

The formulas in the essay section may look fearsome, but I'll give a ridiculously simple example to illustrate what is really being computed.

Suppose you play three 4 hour sessions.

In the first session you win $200.
In the second session you win $400.
In the third session you lose $300.

Your average win or EV for these sessions is ($200+$400-$300)/3 = $100/session or $25/hour.

In the first session, you won $100 more than average.
In the second session you won $300 more than average.
In the third session you won $400 less than average.

Now take the SQUARE of these 3 differences from your average (100, 300, -400) to get

100^2 = 10,000
300^2 = 90,000
(-400)^2 = 160,000.

Note that it doesn't matter whether your differences are positive or negative since we are squaring them.

Now average these numbers to get your variance per session.

session variance = (10,000+90,000+160,000)/3 = 86,667.

Take the square root of this to get your standard deviation per session denoted by the Greek letter sigma.

session sigma = sqrt(86,667) = 294

Normally people refer to their standard deviation for 1 hour. A session here is 4 hours, but you cannot divide 294 by 4 to get your standard deviation for 1 hour. You have to divide it by the square root of 4, which is 2, because your standard deviation increases as the square root of the number of hours you play. So your standard deviation for 1 hour is:

sigma = 294/2 = $147 for 1 hour.
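
For anyone who would rather check the arithmetic by machine, here is a minimal Python sketch of the steps above. It assumes equal-length sessions as in the example, and the variable names are just for illustration.

```python
import math

# Session results from the example, in dollars (three 4-hour sessions).
wins = [200, 400, -300]
hours_per_session = 4

# Average win per session.
mean_win = sum(wins) / len(wins)

# Differences from the average, squared and then averaged, give the variance.
deviations = [w - mean_win for w in wins]
session_variance = sum(d ** 2 for d in deviations) / len(wins)

# Standard deviation per session, then scaled down to 1 hour by sqrt(hours).
session_sigma = math.sqrt(session_variance)
hourly_sigma = session_sigma / math.sqrt(hours_per_session)

print("average win per session:", mean_win)
print("average win per hour:", mean_win / hours_per_session)
print("session variance:", round(session_variance))
print("session sigma:", round(session_sigma))
print("sigma for 1 hour:", round(hourly_sigma))
```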

An equivalent way we could have computed the variance is to simply average the square of our actual wins, rather than the square of our differences from our average, and then subtract from this the square of our average win.

session variance = (200^2 + 400^2 + (-300)^2)/3 - 100^2 = 86,667 as before.
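
A quick sketch in Python, using the same three sessions, shows that the two ways of computing the variance agree. The names here are illustrative only.

```python
# Same three session results, in dollars.
wins = [200, 400, -300]
n = len(wins)
mean_win = sum(wins) / n

# Method 1: average of the squared differences from the average.
var_from_deviations = sum((w - mean_win) ** 2 for w in wins) / n

# Method 2: average of the squared wins, minus the square of the average win.
var_from_squares = sum(w ** 2 for w in wins) / n - mean_win ** 2

print(round(var_from_deviations), round(var_from_squares))
```

Both numbers come out to about 86,667. The second form is handy in a spreadsheet because you only need running totals of the wins and of the squared wins.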

I glossed over an important point that is often misunderstood. Many people refer to their standard deviation in units of bb/hr. This is incorrect; what they really mean is their standard deviation for exactly 1 hour, as we computed here. Standard deviation does not have units of bb/hr, because if it did, that would imply that you could simply multiply this number by the number of hours played to get your standard deviation for any number of hours. You actually must multiply it by the square root of the number of hours, so it has units of bb/sqrt(hr) or bb/hr^.5. You don't normally see it written this way, but you can see from the above that this is correct. The variance we computed has units of dollars^2/hr, so the standard deviation, which is the square root of the variance, has units of dollars/sqrt(hr).

So if our true standard deviation were $147 for 1 hour, and we are going to play for 100 hours, our standard deviation for that period is sqrt(100)*147 = 10*147 = $1470. The standard deviation only increases by a factor of 10 in 100 hours, but our average win increases by a factor of 100 to 100*$25 = $2500. So our average win increases faster than our standard deviation.

This is why gambling "works" when you have an edge. In the beginning, your average win will be small compared to fluctuations caused by luck. Over time, your average win will grow relative to the fluctuations, and your results will be determined primarily by your edge, and the effect of luck will be proportionately smaller. The effect of luck will still be larger in absolute dollars, but it will be a smaller proportion of your win, which will also be larger in absolute dollars.
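
To see the two growth rates side by side, here is a small Python sketch. It assumes the $25/hour average win and the $147 standard deviation for 1 hour from the example are your true figures; the function name totals is made up for this illustration.

```python
import math

hourly_ev = 25.0      # average win in dollars per hour, from the example
hourly_sigma = 147.0  # standard deviation in dollars for 1 hour, from the example

def totals(hours):
    """Return (average win, standard deviation) in dollars for this many hours."""
    average_win = hourly_ev * hours               # grows in proportion to hours
    sigma = hourly_sigma * math.sqrt(hours)       # grows with the square root of hours
    return average_win, sigma

for hours in (1, 25, 100, 400):
    average_win, sigma = totals(hours)
    print(hours, "hours: average win", average_win,
          "sigma", round(sigma), "ratio", round(average_win / sigma, 2))
```

The ratio in the last column is the average win measured in standard deviations, and it grows as the square root of the hours played.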

Note in this example that your average win for 100 hours is already greater than your standard deviation for 100 hours. When your average win becomes exactly equal to your standard deviation, you will be ahead about 84% of the time. This is because your results will lie within +/- 1 standard deviation of average 68% of the time, so 32% of the time they will lie outside this interval: 16% of the time more than 1 standard deviation below average, and 16% of the time more than 1 standard deviation above average. You are behind only when your result is more than 1 standard deviation below average, so you are ahead the other 84% of the time. Assuming $147 represents your true standard deviation for 1 hour, after 100 hours your average win will be $2500/$1470 = 1.7 standard deviations. From a table of the standard normal distribution, or from Excel, we can determine that you will be ahead about 95.5% of the time at the end of this period. This is not a very realistic example for most people, and with a different standard deviation this situation could be much different.
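
If you don't have a normal table handy, the same lookup can be sketched in Python with the error function from the standard library. This assumes your results are normally distributed and that the $25/hour and $147 figures are your true values; the function name prob_ahead is hypothetical.

```python
import math

def prob_ahead(hours, hourly_ev, hourly_sigma):
    """Probability of being ahead after `hours`, assuming normal results."""
    average_win = hourly_ev * hours
    sigma = hourly_sigma * math.sqrt(hours)
    z = average_win / sigma  # average win measured in standard deviations
    # Standard normal CDF written in terms of the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The 100-hour example from the text.
print(prob_ahead(100, 25, 147))

# The number of hours at which the average win equals exactly 1 standard deviation.
one_sigma_hours = (147 / 25) ** 2
print(one_sigma_hours, prob_ahead(one_sigma_hours, 25, 147))
```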

To determine how long it will take for your average win to be 1 standard deviation, divide the square of your standard deviation for 1 hour by the square of your hourly rate.

hours to be ahead 84% of the time = (sigma/ev)^2.

This is one way to define the "long run". To find out how long it will take for your average win to equal, let's say, 1.6 standard deviations, this is simply

hours to be ahead 95% of the time = (1.6*sigma/ev)^2.

At this point you have roughly a 95% probability of being ahead.
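
As a rough sketch, the two formulas above are the same function with a different multiplier. The Python below plugs in the $147 and $25/hour figures from the example; hours_until is a made-up name, and z is the number of standard deviations you want your average win to equal.

```python
def hours_until(z, hourly_ev, hourly_sigma):
    """Hours of play until the average win equals z standard deviations."""
    return (z * hourly_sigma / hourly_ev) ** 2

hourly_ev = 25.0      # dollars per hour, from the example
hourly_sigma = 147.0  # dollars for 1 hour, from the example

print(round(hours_until(1.0, hourly_ev, hourly_sigma), 1))  # 1 standard deviation
print(round(hours_until(1.6, hourly_ev, hourly_sigma), 1))  # 1.6 standard deviations
```

With these numbers it takes roughly 35 hours to reach 1 standard deviation and roughly 89 hours to reach 1.6. A smaller hourly rate or a larger standard deviation stretches these out quickly, since both are squared.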

When you use your standard deviation for 1 hour to compute your swings for longer lengths of time as the other posters have described, the results will be more accurate than when you only use it to estimate your swings for 1 hour. The reason is that in the long term, your results will closely resemble a normal or Gaussian distribution (bell curve), but in the short term this is not exactly the case. For example, a true normal distribution has tails that go off to infinity and minus infinity. You can't actually win or lose infinity in 1 hour, or even close to infinity. The result is that the extra probability that would normally be in the tails of the curve gets pushed in, making the tails thicker. This means that your swings for short periods of time are likely to be a little larger than what your standard deviation would suggest.

If your results were truly normal, your swings would lie within +/- 1 standard deviation of your average 68% of the time, and within +/- 2 standard deviations 95% of the time. Your average swing, which I computed recently on the probability forum, will be +/- 0.8 standard deviations. Your median swing, which is the swing you exceed exactly half the time, will be +/- .67 standard deviations. These estimates can give you a rough sense of how you are doing without a lot of calculations. Just remember that results in the short term are just crude estimates, and they are less reliable for reasons that have to do partly with statistics, and partly with your particular circumstances, such as being in a particularly wild game.
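
The 68%, 95%, 0.8 and .67 figures can be checked with a short Python sketch, assuming your results are exactly normal. It uses NormalDist from the standard library's statistics module (Python 3.8 or later); the variable names are illustrative.

```python
import math
from statistics import NormalDist

std_normal = NormalDist(0, 1)

# Fraction of results within +/- 1 and +/- 2 standard deviations of the average.
within_1 = std_normal.cdf(1) - std_normal.cdf(-1)
within_2 = std_normal.cdf(2) - std_normal.cdf(-2)

# Average swing: the mean absolute deviation of a normal, sqrt(2/pi) sigmas.
average_swing = math.sqrt(2 / math.pi)

# Median swing: the size exceeded half the time, so 75% of results lie below
# +median_swing and 25% lie below -median_swing.
median_swing = std_normal.inv_cdf(0.75)

print(round(within_1, 3), round(within_2, 3))
print(round(average_swing, 3), round(median_swing, 3))
```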

Now the above calculation does not produce your true standard deviation, but rather an estimate of it. 3 sessions is obviously not enough for this estimate to be very accurate, and in reality you would compute it over many more sessions. The more sessions you use, the more accurate this estimate will become. On the other hand, it takes relatively few sessions to determine an accurate estimate for your standard deviation compared to the number of sessions required to determine an accurate estimate of your hourly rate. This is a good reason to compute this statistic. While your average win may be somewhat uncertain, there is really no good reason we cannot have an accurate estimate of our standard deviation after a relatively small number of sessions.

Notice above that when I computed the variance, I divided by the number of sessions. Sometimes you will see people divide by the number of sessions minus 1, and sometimes by the number of sessions + 1. These have to do with different types of estimates, and these different estimates have different properties. These differences need not concern you, since the difference is small after the number of sessions becomes large. The estimate used in the essay section is intended to be a "maximum likelihood" estimate of the standard deviation given the data.
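
To get a feel for how much the choice of divisor matters, here is a small Python sketch that applies all three to the three-session example. With only 3 sessions the gap is large; it shrinks as the number of sessions grows. The names are illustrative only.

```python
import math

wins = [200, 400, -300]  # the three sessions from the example, in dollars
n = len(wins)
mean_win = sum(wins) / n

# Sum of squared differences from the average (260,000 for this data).
sum_sq = sum((w - mean_win) ** 2 for w in wins)

sigmas = {}
for divisor in (n - 1, n, n + 1):
    sigmas[divisor] = math.sqrt(sum_sq / divisor)
    print("divide by", divisor, "-> session sigma", round(sigmas[divisor]))

# The same comparison if the sum of squares came from 100 sessions instead of 3:
# the three divisors 99, 100 and 101 differ by only about 1% each way.
print(round(math.sqrt(100 / 99), 3), round(math.sqrt(100 / 101), 3))
```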

Here we have assumed that all sessions are the same length. The formula in the essay section allows you to adjust for variable length sessions. You simply need to log your win and the number of hours played for each session. We have assumed constant length sessions here for the purpose of clearly illustrating what standard deviation means. Namely, it is the "square root of the average of the squares of the differences from your average", sometimes called a "root mean square" or rms average. Perhaps you have heard of rms voltage, or rms power on stereo speaker specs, the latter of which is actually something of a misnomer. The 120 volts AC you hear about is an rms average of the voltage used in the US. Since the average voltage is zero, it is actually the standard deviation of the voltage, which is a sine wave with peaks at +/- 170 volts.
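
The voltage claim is easy to check numerically. This Python sketch samples one full cycle of an idealized sine wave with a 170 volt peak (the sample count is an arbitrary choice) and computes its average and its rms value.

```python
import math

peak = 170.0     # peak voltage from the text
samples = 10000  # evenly spaced points over one full cycle (arbitrary choice)

values = [peak * math.sin(2 * math.pi * k / samples) for k in range(samples)]

# The average over a full cycle is essentially zero, so the rms value
# and the standard deviation are the same thing here.
mean_voltage = sum(values) / samples
rms_voltage = math.sqrt(sum(v * v for v in values) / samples)

print(round(mean_voltage, 6), round(rms_voltage, 1))
```

The rms value comes out to about 120 volts, which is the peak divided by the square root of 2.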
