The MongoDB Pepsi Challenge

Introduction

Back in February, there was a beer & wine tasting at MongoDB. I decided to run a soda tasting (with some data collection) and piggyback on that. If you’re interested in the culture of MongoDB, this will give you a sense of some of the stuff that goes down at the end of the day from time to time.

I once read (and I think this comes from Malcolm Gladwell in Blink, but I can’t find the quote) that if you give someone 3 glasses of soda (one of Coke, one of Pepsi, and one that is either Coke or Pepsi) and ask them, not just which is which, but which one is different from the other two, they can’t do it.

I’d been repeating this claim in conversation for several years, and the typical response I got was for the listener to accuse me of not knowing what the hell I was talking about. Apparently, I was repeating a claim that was, literally, unbelievable to a lot of people.

I wanted to know for sure, and this is the story of how I did some science to find out.

Summary of Results

You want to know the punchline, so here it is: MongoDB employees seem to be, on average, random number generators when it comes to determining which soda is the odd one out. We were just as bad at figuring out whether a particular glass was Coke or Pepsi (that wasn’t the main scientific question I had, but I added it since we were doing the science anyway). Here are the results on the individual questions:

Can MongoDB employees determine which of the 3 sodas is the odd one out?

  • Success Rate: 36%
  • Expected Rate from Randomized Guesses: 33%
  • Margin of Error: 17% (at two standard deviations)
  • Conclusion: MongoDB employees are not significantly better or worse than a randomized guessing machine in determining which soda is different from the other two.

Can MongoDB employees determine if the cup they chose was Coke or Pepsi?

  • Success Rate: 62%
  • Expected Rate from Randomized Guesses: 50%
  • Margin of Error: 17% (at two standard deviations)
  • Conclusion: MongoDB employees are not significantly better or worse than a randomized guessing machine at determining if a cup holds Coke or Pepsi.

Can MongoDB employees recognize regular soda?

  • Success Rate: 81%
  • Expected Rate from Randomized Guesses: 50%
  • Margin of Error: 13% (at two standard deviations)
  • Conclusion: MongoDB employees are significantly better than a randomized guessing machine at determining if a cup holds regular or diet soda (though the study didn’t anticipate asking this question and there may be methodological issues).
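
For the statistically curious, the two-sigma margins quoted above follow from the usual binomial formula, 2 * sqrt(p * (1 - p) / n). Here’s a minimal sketch of that calculation in Python; I’m assuming n = 36 test runs per question, and the quoted 17% figures appear to come from the conservative worst case of p = 0.5:

```python
from math import sqrt

def two_sigma_margin(p, n):
    """Two-standard-deviation margin of error for a binomial
    proportion p observed over n independent trials."""
    return 2 * sqrt(p * (1 - p) / n)

n = 36  # total test runs (an assumption; not every run may have answered every question)
print(two_sigma_margin(0.5, n))   # ~0.167 -> the conservative "17%" margin
print(two_sigma_margin(0.81, n))  # ~0.131 -> the "13%" margin at the observed 81% rate
```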

Were Confident People Better at Any of These Tests?

People were asked if they were confident in their answers. These are the results for the subset that answered yes. In short: yes, they did significantly better than chance when it came to recognizing whether their chosen cup was Coke or Pepsi.

  • 76% +/-21% (two sigma), vs. 50% for random chance
  • In a quirk of statistics, this still wasn’t “significantly better” than the “not confident” group, and the result is only at the edge of significance relative to the “randomized guessing” hypothesis. More testing is recommended before we base any theories on this data.
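
To see how a 76% success rate can still fail to separate from the not-confident group at these sample sizes, here’s a sketch using Fisher’s exact test. The subgroup counts below are hypothetical, chosen only to roughly match the quoted rates (the real counts are in the raw data [3]):

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table: rows are (confident, not confident),
# columns are (correct, incorrect). Counts are illustrative only.
table = [[13, 4],   # 13/17 correct, ~76%
         [9, 10]]   # 9/19 correct, ~47%
odds_ratio, p_value = fisher_exact(table)
print(p_value)  # comfortably above 0.05, despite the large gap in rates
```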

At recognizing whether it’s diet, they did about as well as the rest of the group.

  • 79% +/- 19% (two sigma) vs. 50% for a randomized guessing machine.

How about the not-confident people?

  • Nothing was significantly different.

Methodology

For those of you who’ve never tried to design an experiment: methodology can easily make or break one. A well designed experiment will answer your questions about the universe, while a poorly designed one will seem to answer them while really telling you nothing (if you’re smart and lucky enough to realize it), or else it will tell you something that has more to do with your biases than with the universe, and you won’t even realize that that’s the case. To ensure that I was measuring what I thought I was measuring, I designed the experiment as follows (deviations from the plan and known design flaws are also described below):

The drinks were poured in a different room than the room in which they were served.

  • This prevents test subjects from seeing which beverage was poured into which cup.

The person serving the drinks didn’t know which beverage was in which cup.

  • This prevented the researchers from communicating information to the subjects.
  • Towards the end of the experiment, this protocol wasn’t followed for the last 4-5 tests because a researcher left.

The person taking the drinks to the server wasn’t the researcher.

  • This was a goal, but wasn’t followed due to low headcount.
  • If implemented, it would have been one additional layer preventing information about the beverages from being communicated to the subjects.

The contents of each cup were determined by a pseudo-random algorithm.

  • This prevents the researchers from, say, always putting the odd one out in the 3rd cup; if we had done that, and early tasters had let it be known that the third cup is the “odd one out,” this might have biased later subjects.
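
The original script isn’t reproduced here, so the following Python sketch is only an illustration of that kind of assignment; the function name and everything else about the real script’s internals are my own invention:

```python
import random

CUPS = ["A", "B", "C"]

def assign_cups():
    """Pseudo-randomly assign beverages so that exactly one cup
    (the odd one out) differs from the other two."""
    majority, minority = random.sample(["Coke", "Pepsi"], 2)  # random order
    odd_cup = random.choice(CUPS)
    return {cup: (minority if cup == odd_cup else majority) for cup in CUPS}

print(assign_cups())  # e.g. {'A': 'Pepsi', 'B': 'Coke', 'C': 'Pepsi'}
```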

We had planned to let each subject choose regular or diet, but instead all subjects were given regular Coke/Pepsi and were asked if they thought it was diet.

  • This was due to a miscommunication between the server/data recorder (who asked test subjects the diet question and began marking their answers down) and the soda pourer, who was expecting to get requests for diet soda from time to time, but who never did, because he had never told the server/data recorder to send those requests to him.

The soda consisted of 2-liter plastic bottles of standard Coke and Pepsi, purchased that day from the Times Square Duane Reade.

The cups were glass, and were washed between uses.

  • We had intended to use paper cups, and thought there would be lots available, but we were wrong, so we had to improvise at the last minute.

The test subjects were self-selected, and were recruited with an enticing email [1]. They should not be treated as a representative sample of either MongoDB employees or any larger demographic group.

Subjects were anonymous, and were issued subject_ids that they could later use to look up their results.

Code quality controls:

  • No unit tests were written
  • The code was written in haste over the course of an hour or two by yours truly. I initially felt moderately confident that I hadn’t introduced any bugs severe enough to invalidate the results.
  • A brief code review was given, with the results of (and I’m quoting from memory four months later) “Going forward, you might want to try to write code that you have a chance of reading a week from now. Still, it’s never going to be used by our team or by the company for anything that will in any way affect any of our jobs, and I have real work to do, so LGTM.”
  • When running the code, a series of major, crippling bugs became obvious. In each case, I then proceeded to change the script more or less at random until the outputs “looked right” to me.

Safety Procedures:

  • Because this test dealt with human experimentation, it is typical for a researcher to write up a research plan and submit it to an IRB (Institutional Review Board) to ensure that the research is ethical, that it has at least a small chance of producing useful results, and that the test subjects will not be harmed. These reviews are mandatory for all institutions (typically universities) that receive federal research grants.
  • This protocol wasn’t even attempted, here. Instead, we fed human research subjects a substance associated with weight gain in order to satisfy our own curiosity about whether or not they could tell which harmful substance they were ingesting.

Actual procedure:

  • In one room, the soda was poured (around 2 ounces per cup, but we did not measure it precisely), and placed on a plate with an experiment_id.
  • The cups were placed on one of 3 circles on the plate, labelled “A”, “B”, or “C”. At least one (and at most two) was Pepsi, and at least one (and at most two) was Coke, and they all contained High Fructose Corn Syrup.
  • The plate was then taken to the server/data recorder.
  • A subject who wished to take the test was issued a subject_id.
  • At the beginning of the trial, the server / data recorder would record the subject_id in the appropriate place for the experiment_id, and the subject was issued their plate.
  • The subject would taste the beverages in any order desired, at any speed desired.

When the subject was ready, the data recorder would ask the subject the following questions and record the answers:

  • Which cup was unique? - In other words, which of the 3 cups contained the odd beverage out: the beverage that was in that cup and no other?
  • What was in that unique cup? - Was it Coke? Or was it Pepsi?
  • Are you confident in your answer? - Yes or No
  • Was it diet soda? - Yes or No; remember, all soda was regular, and no one was given diet soda.
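
Since this is MongoDB, the natural way to store one of these trials is as a document. Our answers actually went onto paper and then into a CSV, so the schema below is purely illustrative, and every field name is hypothetical:

```python
from pymongo import MongoClient

client = MongoClient()  # assumes a mongod running locally
trials = client.pepsi_challenge.trials

# One trial as a document; field names are invented for illustration.
trials.insert_one({
    "experiment_id": 17,
    "subject_id": 4,
    "cups": {"A": "Coke", "B": "Pepsi", "C": "Coke"},  # ground truth from the pourer
    "answers": {
        "unique_cup": "B",           # which cup was the odd one out?
        "unique_beverage": "Pepsi",  # what was in that cup?
        "confident": True,           # are you confident in your answer?
        "thought_diet": False,       # was it diet soda?
    },
})
```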

Experimenter’s Field Notes

In this section, I will discuss things that happened, subjective experiences, and tips for anyone trying something similar.

First, if you plan to try this, make sure you have all materials. My understanding, going in, was that we had cups, but we didn’t, and this alone nearly doomed this experiment. Salvaging the experiment involved making human test subjects wash dishes after each trial.

Second, have at least 3 people making beverages, and at least 2 people taking data. This is a very labor intensive experiment.

Third, there was a human element I didn’t anticipate: most test subjects wanted to know, immediately after taking the test, if they were “right”. A single test run isn’t sufficient to determine if they knew what they were doing (even a random answer would be correct one time in 3), but people still wanted to know. I didn’t release that data until the test was done (and they had to hold onto their subject_id to look up their results), but I wish I’d given the test subjects the ability to learn the answer immediately.

Fourth, some test subjects complained that the test was “too easy,” because the carbonation levels of the sodas weren’t the same, making it obvious which soda was the “odd one out.” Therefore, as hard as this is to believe, our experiment may actually overestimate the ability of MongoDB employees to tell one cola from another.

Finally, the experimenters noticed that LOTS of people were extremely confident in their ability to perform well on this test before they took their first sip, but very unsure after their 3+ sips. It was immediately clear to the test subjects that this test was much harder than they thought it would be.

Data

I’ve put the results in CSV format, along with 3 charts [2]. I’ve also attached pictures of our raw data, on paper [3].

In no case can we make claims about individuals’ abilities to determine the type of soda. We just didn’t have enough data from any individual to say with any statistical confidence that they did better than chance.

Those who were highly confident about their choices actually did slightly worse at choosing the “odd one out,” and slightly better at determining whether their choice was Coke or Pepsi, but neither difference was statistically significant, partly because our error bars were pretty big.

Our data set was 32 people, with 36 test runs, so most people took only one test. Because of the small number of test subjects, even two people failing the test would have invalidated our ability to say, with p < 0.05, that we can tell Coke from Pepsi. 3 people failing to tell the odd drink out would have invalidated our ability to determine whether we were better than chance at p < 0.05. In neither case did we come even close to that level of performance. Unfortunately, we didn’t have nearly enough data to make solid claims, except that this test is much harder than people thought it would be.
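
If you’d like to check the aggregate results against chance yourself, an exact binomial test is the standard tool. Here’s a sketch in Python; the counts are reconstructed from the quoted percentages over 36 runs, so treat them as approximate:

```python
from scipy.stats import binomtest

# Counts reconstructed from the quoted rates; treat as approximate.
print(binomtest(13, n=36, p=1/3).pvalue)  # odd one out: 36% vs. 1/3 -> far from significant
print(binomtest(29, n=36, p=0.5).pvalue)  # regular vs. diet: 81% vs. 1/2 -> well below 0.05
```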

Conclusion

The one thing we can say with certainty is that the task (determining which of 3 cups of Coke or Pepsi was different) is difficult. Beyond that, we didn’t have enough data to say much of anything definitively. First, only one individual took the test the minimum number of times (3) needed to say whether or not they personally could determine the odd beverage out. Second, collectively, we did very close to chance. Third, confidence on the part of the test subject may have either helped or hurt, but our analysis doesn’t reveal which.

Everyone I spoke to about this had a lot of fun. There was a TON of interest, which is how I ended up writing this up in a blog post. When I ran into problems, people stepped up to help. When I was looking for research subjects, tons of people were jazzed to (1) show off their tasting abilities, and (2) do it for SCIENCE. When the results came out, that started a whole new conversation, and lots of people, from our interns to the board of directors, weighed in.

If you’d like to be a test subject the next time we do something like this, you can apply below:

Work at MongoDB

Acknowledgments

First, I’d like to thank our own Richard Kreuter and David Percy, the former for serving drinks & recording data, the latter for assisting in preparing drinks when it became apparent that we were understaffed. I’d also like to thank Maria Hecheverry, who totally cleaned up my mess when I was doing some data analysis at the end of the evening, so that when I came back to take care of things, I found that all of the cleaning had already been done.

About the Author - William (Will) Cross, Ph.D
Will is a curriculum developer on the Education team. Among other things, he:

  • Maintains and updates the online courses at MongoDB University
  • Builds certification exams
  • Leads trainings from time to time

His background is in physics; he spent his twenties measuring gravity, most recently publishing a value of G. He has also taught physics at Science Park High School in Newark, NJ.

Notes

[1] Email sent to MongoDB Employees to recruit test subjects

Human Test Subjects, One Minion Needed for Human Experimentation

Summary: I want to perform taste tests to determine who can truly tell Coke from Pepsi (not necessarily which is which) by taste. I will be serving both, in small cups, to people who come by. I require at least one ~~minion~~ laboratory assistant (for reasons stated below), plus as many ~~human test subjects~~ soda drinkers as are sufficiently indifferent to their own health as to be interested in taking the test.

Time: tomorrow night (February 5) from 6 PM to 7 PM

Venue: MongoDB's New York Headquarters at the Craft Beer & Artisanal Soda Tasting

Experimental set-up:

I will be in a conference room pouring soda into small cups. This will be out of sight of all test-takers and my most senior lab assistant, and I will be labeling the cups "A," "B," and "C," and pouring Coke and Pepsi into them using a ~~completely random~~ pseudo-random method. I will be recording ~~an ObjectId()~~ a sample id for each set of three, along with which beverage is in which container. When all of this is recorded, my lab assistant will then bring the beverage to the taster. The lab assistant will then record the taster's name, choice of which ~~overpriced sugar water~~ tasty beverage is different, and, if the taster is feeling ~~particularly deserving of humiliation~~ confident, whether it is Coke or Pepsi. You will also be asked if you are sure, so that we can determine if ~~you are a poster child for the Dunning-Kruger effect~~ the test subjects' aggregate degree of confidence is associated with greater success or not.

Answers will be revealed at 7 PM. Tasters will have to wait to learn of their ~~degree of humiliating ignorance~~ success rate.

Scientifically demonstrable knowledge, for you personally, will be defined as either of the following: you took 3 tests that were later all determined to be successfully carried out, and then stopped; or you took more than 3 tests, maybe got some wrong, but achieved an overall p-value of < 0.05.

I will also look at some aggregate statistics and see what I can find about the quality of our co-workers.

-Will

[2] CSV results
[3] View the raw data here and here.