August 12, 2010

Ts and Hs

Imagine I'm a professor teaching the basics of probability. One day, I give the class this assignment: sit down at home, toss a coin a hundred times and write down each result, Heads or Tails. Bring me that sequence of Hs and Ts tomorrow.

Now I suspect most of the class is too lazy to toss a coin a hundred times. It's likely a lot of them will simply write out what looks like a random sequence of Hs and Ts and turn that in.

But I am almost sure that, with a quick look at each submitted sequence, I will be able to tell which ones were produced in this fake way, and which ones are actually a faithful record of a hundred coin tosses.

How do I tell the fake sequences from the genuine ones?

(Thanks to my charming cousin for reminding me of this last month.)


Kovendhan has a comment with the right answer: "In order to appear genuine, students will avoid long sequences (6 or more) of consecutive heads or tails. In reality, such long sequences almost always occur in long trials."

Which of course begs the bonus question: in a sequence of 100 coin tosses, what's the probability that there will be at least one sequence of 6 heads or tails?

All right, bonus question #2: what if I changed the "6" immediately above to "6 or more"?


Boskoe said...

The 'truly random' sequences will have about 50-50 distribution of Hs and Ts.

Subruk said...

i think i know the answer, but maybe i am expected to, having studied probability theory in a little more detail than your average reader.

since i don't want to ruin it for the other readers, am not posting it. but nice question anyway. i have a nice related story that i shall comment once people start posting answers.

Jai_C said...

It will be difficult for anybody to come up with a fake record that they know has to be close to 50-50.

If they try they will fall into patterns of some kind that can be easily discerned. If they dont they will be off by some considerable margin.

They may even realize it and then you see at around 90+, a long sequence of H (or T) to get the total closer to 50-50?


Kovendhan said...

In order to appear genuine, students will avoid long sequences (6 or more) of consecutive heads or tails. In reality, such long sequences almost always occur in long trials.

Just shows that, in reality, randomness is less (or more) random than our notion (even educated) of randomness (as illustrated by Benford's law).

Dilip D'Souza said...

Someone has just hit on the correct answer: I am holding it in moderation (sorry K) so that others can give it a thought.

Subruk, I'm looking forward to your story.

Boskoe, fair enough, but I said a "quick look" -- meaning hardly enough to get an idea if it is 50-50.

Dilip D'Souza said...

Excellent, Kovendhan!

Please check the bonus questions added to the post.

Kovendhan said...

At any step, the probability that there will be 6 consecutive heads or tails is (0.5)^6 = 0.016.
The probability of the above event not happening at a particular step is 1 - 0.016 = 0.984.
The probability of the event not happening in any of the first 94 steps is (0.984)^94 = 0.219.
The answer is therefore 1 - 0.219 = 0.781.

The answer is the same for "6 or more". For more than 6 heads (or tails), you need to get 6 in the first place.

Thanks for the questions, Dilip.

Rahul Siddharthan said...

Nice questions, but I have to quibble with the answers.

Minor quibble: Kovendhan's answer is good, but is an approximation since the different sets of 6 successive tosses in 100 are not independent. Eg, "HTTTTT" is a bad sequence but there is a 50% chance that the next 6-mer will be "TTTTTT". So it somewhat underestimates the probability of getting 6 or more heads. My numerical experiments suggest the true probability is above 0.8 (sorry, offhand I don't see an easy way to calculate it exactly).

He is also right that "six or more" is the same as six, unless you meant exactly 6, which was not obvious from your statement.

Major quibble: This leaves a nearly 20% chance that no such sequence will occur, which is hardly a reliable way to distinguish a fake answer. (If the entire class were honest, you'd unfairly mark one in five as a cheat). If you look for 5 or more heads/tails, there is a less than 5% of that not occurring, which is still not very convincing. Plus, students could well put in five or more by chance.

In fact you can very reliably determine human-generated sequences from truly random sequences, but you have to be more sophisticated than this. A combination of several statistical properties of random sequences can do it (even if each property has a 20% chance of occurring in an honest sequence, if you test 10 such properties the chances of *all* of them being satisfied in a random sequence are vanishingly small.)

Lastly -- if I were a lazy student, I'd use a pseudorandom number generator to get the sequence...

Subruk said...

Let me address Rahul's quibbles. There is more to the minor quibble. It could well be overestimating too, for a sequence like HTHTTH cannot generate 6 heads or tails with the next toss.

The major quibble isn't that big an issue I think. I think that a student would try to "enforce" randomness, and I am not sure if he would put in a sequence of heads/tails of length more than 3. A visual inspection would be more than sufficient. One could also count the number of 4-sequences and 5-sequences. I feel that these might not appear at all/appear very infrequently in the concocted random sequence.

Let me give the promised story. It's about a famous Computer Scientist/Mathematician who has a penchant for gadgets. He once constructed this gadget which, on the press of a button, would generate a random 2-digit number. He was using this to generate a long random sequence of 2-digit numbers for an experiment. He was copying down the numbers after each button press.

After he had reached sufficient length, he looked at his piece of paper and thought a while. He realized his folly and threw the paper into trash. Basically, his list of numbers had no consecutive repeats, whereas if the numbers were truly random it should have had many such repeats. The issue was not with the gadget. While taking down the numbers, when he pressed the button, if he saw the same number, he thought that he had forgotten to press, or not pressed the button hard enough for it to register. And pressed it again to get a fresh number. So every natural repetition of the numbers didn't show up in his final list. It's interesting to note how our perceived notion of randomness is different from the actual notion, and can lead us to interesting actions.

Anonymous said...

I think a much simpler intiution for the non-mathematically inclined reader is that its very hard to manufacture a fair coin. Diaconis places special orders to get his fair coins manufactured in the lab. Ergo, every coin out there is unfair. Ergo, you'll get long series of H's & T's if you use a real coin. Most students don't know of these things & will simply intersperse H's & T's.

Kovendhan said...

Rahul, you are right. I overlooked conditional probabilities. The answer is an underestimate.

Subruk, the probability calculated at each step is for the next 6 steps, starting with the current one. Not as you have implied.

Dilip D'Souza said...

While I admit the veracity of the quibbles (I'm ashamed of that sentence!), the point is not to definitively identify what's a genuine sequence and what's not, but what subruk gets at above. Somebody concocting what she thinks is a random sequence of 100 H and T will almost certainly not put down 6 H or 6 T in a row (and for that matter, probably not 5 or 4 in a row either). Because we think to ourselves that that is "not random".

It would be interesting to see someone knowledgeable about probability concoct such a sequence, and then subject that one to tests of "randomness". What signs would such a person leave, if any, of it being concocted?

Rahul Siddharthan said...

Sukrut is right, it is not all that obvious whether 0.781 is an overestimate or underestimate (but the numerics say it is an underestimate).

I think I can see how to calculate it exactly but it requires careful counting: will update if I get around to doing it later.

I think a generic human-generated sequence can be detected by "eyeballing" (if the one with the eyeballs has some expertise). Whether a human can train himself to generate random sequences that fools an expert at a glance, I don't know. I'm pretty sure that nobody can generate large sets of random numbers that would fool the statistical tests...

Rahul Siddharthan said...

OK, my exact answer for "the probability of 6 or more successive heads (or tails) in 100 tosses" is
= 0.8054 approximately.
This is in very good agreement with what I got numerically and not very different from Kovendhan's approximate answer.

But the explanation probably requires a blog post in itself, which I may write in a day or two. Judging by that formula, there is no simple way to obtain it -- unless (a) I missed something and my answer is therefore an approximation too (which is entirely possible); and (b) the real answer is much easier to obtain (which I doubt).

Thanks for the interesting problem!

Anonymous said...

Why is the answer to the first bonus question of exactly six successive heads or tails not

P = (2*2^94)/2^100 ?

RS to explain why I am wrong after he checks me with a simulation and then to show me the right answer.

Rahul Siddharthan said...

Anon: why would that answer be right? It looks like the probability that the first six are heads (or tails). Or a specific six. And even then, it is six or more - not exactly six.

And no, multiplying by 95 (the number of ways you can pick 6 successive tosses) doesn't work and in fact gives you a number greater than 1. Figuring out why is an exercise for the reader.

The probability that there is exactly one instance of exactly six successive heads (or tails) somewhere in the sequence isn't any easier than the 'six or greater' problem: one can do it but I don't think it will be very illuminating.

as said...

The gods of google can give us a cryptic answer:

Apparently there exists a general formula to determine the probability of any consecutive run within a given coin toss

Rahul Siddharthan said...

as - thanks, very interesting. Perhaps Feller's textbook derives the formula (but I wonder if it will be an example of Jaynes' complaint about Feller's approach: that Feller, being a clever man, derives most of his results using some beautiful trick or other, and therefore the student concludes that probability theory can only be done by clever people using clever tricks and not by following some basic principles systematically).

Note that this still doesn't answer the question of 6 or more successive heads OR tails. The answer to this is
P(6+ heads) + P(6+ tails) - P(6+ heads AND 6+ tails).
The formula you gave can be used for the first two terms, but not the third.

as said...

Rahul- absolutely correct that it does not provide the answer. I look forward to you blog post.

As to Feller, I have only perused him occasionally; my knowledge of probability is too non axiomatic and heuristic to appreciate him.

Anonymous said...

RS: "But the explanation probably requires a blog post" You mean this blog ?

Rahul Siddharthan said...

OK, I posted my take on this. It turns out that Kovendhan's approach gives a very wrong answer (an overestimate, as Subruk suggested) but he made a couple of errors above which "compensated"...

Ketan said...

Not being a mathematician/statistician, I'd thought this would be very difficult. And no wonder, it is! But coincidentally, was recently taught about binomial & Poisson distributions, and I guess one has to apply the equation for the latter.

But one thing I can point out is, you can only suspect if a given sequence is fake, but never be certain! ;)