In using discrimination tests (like the triangle test) one has to be very careful about how the test is constructed and how the data are interpreted. In a triangle test M panelists are each presented a triplet of beer samples of which two are the same and one is different. They are asked to determine which of the 3 is the odd beer; N is the number who are able to do so. They are then asked to report how the two beers before them compare with respect to some parameter of interest, which depends on the reason behind the experiment. The parameter could be very broad or very specific. For example, the panelist might be asked which beer he prefers (which is better in his opinion) or whether one beer contains noticeably more diacetyl than the other. Some number, m < M, will report a preference for (or less diacetyl in) one or the other of the beers. But a smaller number, n < N, of those who 'qualified' in the first test, i.e. were able to correctly identify the odd beer, will find one of the beers to be better (or to contain less diacetyl).
The test returns a number, p(M,N,n), which is the probability that a panel whose members hadn't a clue and picked the odd beer by rolling a die (if 1 or 2 comes up choose A, if 3 or 4 choose B, if 5 or 6 choose C) and then determined preference (diacetyl) by tossing a coin would have produced N or more qualified members out of whom n or more would have preferred one of the beers. This is the probability that more than N-1 panelists qualified and more than n-1 of the qualified picked one of the beers by random chance, i.e. without any ability to taste beer and irrespective of whether one beer was better or contained more diacetyl. If this probability is low enough (typically 1%) we conclude that M, N and n were arrived at by some process other than random chance and that the fraction of preferences, n/N, is a valid measure of the goodness or relative diacetyl content of the beers being tested.
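One plausible way to compute such a p(M,N,n) is sketched below. The exact convention (one-sided vs. two-sided, joint vs. conditional) isn't spelled out here, so this is an illustration under explicit assumptions only: qualification by guessing is Binomial(M, 1/3), and preference among the k who qualified is Binomial(k, 1/2). The function name is mine, and this sketch won't necessarily reproduce the particular p values quoted later in this discussion.

```python
from math import comb

def p_value(M, N, n):
    """Null-hypothesis probability that N or more of M guessing panelists
    pick the odd sample (chance 1/3 each) AND that n or more of those
    qualifiers pick a particular one of the two beers (chance 1/2 each)."""
    total = 0.0
    for k in range(N, M + 1):
        # probability that exactly k of M qualify by pure guessing
        p_qual = comb(M, k) * (1 / 3) ** k * (2 / 3) ** (M - k)
        # probability that at least n of those k qualifiers pick one
        # beer by coin toss (one-sided)
        p_pref = sum(comb(k, j) for j in range(n, k + 1)) / 2 ** k
        total += p_qual * p_pref
    return total
```

As a sanity check, the value should be 1 when no qualifications or preferences are demanded, and it should shrink as N or n is raised.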
Note, and this is very important: we are testing the panel AND the beer. Just as one cannot measure voltage with an uncalibrated voltmeter, one cannot measure diacetyl with a panel that contains members who are insensitive to diacetyl (some people just don't taste it).
The panel must be calibrated for the parameter being tested. This is obviously relatively easy to do for a single specific parameter like diacetyl. One takes a beer with a known diacetyl level and divides it into two parts, one of which is spiked with diacetyl at some level, ∆, determined by the desired sensitivity of the test. A, B and C cups are prepared from these two beers and the test carried out. If a large fraction of the qualifying panelists finds the spiked beer to be higher in diacetyl and p < 0.01, we are confident that this panel can perceive diacetyl differences of ∆ or more and can use it to see if, for example, using a particular malt richer in valine than a control malt improves a particular beer.
Here we need to interject that it is extremely important to control what the panelist is exposed to. If, in the diacetyl example, the valine-rich malt was also darker in color, so that the beer produced from it was darker, panelists would obviously have no trouble distinguishing that beer as the odd one by color alone and the test would be invalid. It should be pretty clear that instructing panelists to ignore color isn't going to work, so it would be necessary to mask color in some way, such as serving in opaque cups with opaque lids or masking the color with Sinamar.
This is also a good place to mention that panelists should be in comfortable surroundings free from distractions and isolated from one another and that only the data processing people should know which beer is which (i.e. the servers don't). See the ASBC MOA for how to set up a triangle test.
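For the qualification stage on its own, the standard significance calculation (the kind behind the critical-value tables used with triangle tests) is just a binomial tail probability at a 1/3 guessing rate. A minimal sketch, with a helper that finds the critical number of correct identifications; the function names are mine and the details of the ASBC MOA procedure should be taken from the MOA itself:

```python
from math import comb

def qualify_p(M, N):
    """P(N or more of M panelists pick the odd sample by pure guessing;
    the chance of a correct guess in a triangle test is 1/3)."""
    return sum(comb(M, k) * (1 / 3) ** k * (2 / 3) ** (M - k)
               for k in range(N, M + 1))

def min_correct(M, alpha=0.05):
    """Smallest number of correct identifications out of M that is
    significant at level alpha."""
    for N in range(M + 1):
        if qualify_p(M, N) <= alpha:
            return N
    return None
```

For example, with a 24-member panel, 13 correct identifications are needed for significance at the 5% level.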
If the question is broader ("Which beer is the better beer?") we again must consider the panel. If the two samples were a lager and an ale and we tested with a panel made up of Germans, we know what answer we'd get, as we would if the panel were 100% Englishmen. Suppose it consisted of 10 of each who, as we are serious about this, were drawn from the quality control staffs of breweries in their respective homelands. I think we could anticipate that they all would qualify and that half would prefer lager. Thus M = 20, N = 20, n = 10 and p(20,20,10) = 0.0006. Our data were certainly not arrived at by chance and are therefore valid: lager is not better than ale to a panel composed of equal numbers of lager preferrers and ale preferrers. I wouldn't dispute that finding, but it doesn't tell us anything useful. It would be more meaningful in this case to compose a panel of randomly drawn consumers in a particular market of interest. It is possible, in such a case, that we could obtain a ratio n/N associated with a very low probability that the null hypothesis should be accepted (that's what p(M,N,n) measures; the null hypothesis is that M, N and n are the results of random guesses), and that ratio would show a preference for lager or ale.
That's part of what makes the fact that so many cannot pick the odd one out so interesting to me.
That relates directly to the selection of the panel, which must be driven by what one is trying to measure. In the last part of the example above we assumed a panel drawn from the man on the street, presumably of beer-buying age. We would expect the number of qualifying panelists to be lower than in the quality control department panel. This isn't necessarily a disaster. Assume we have a panel of 20 such people, that only 8 of them qualify, but that 7 of those prefer lager. p(20,8,7) = 0.007, so we can be confident that 87.5% (7/8) of the qualified panelists prefer lager at better than the 1% confidence level. Thus a low qualification rate is not a disaster in and of itself, but it does focus attention on the panel. For a test of public acceptance we want a panel that represents the public and must live with low qualification rates, but it doubtless would make sense, in this case, to increase the number of panelists. p(40,16,14) = 0.0038: the estimated preference ratio (14/16 = 87.5%) is the same but the variance in the estimate is smaller.
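The point about the variance can be made concrete with the usual normal-approximation standard error of a proportion. This is a rough sketch (at such small N an exact binomial interval would be more appropriate), but it shows why doubling the number of qualifiers tightens the estimate:

```python
from math import sqrt

def se_ratio(n, N):
    """Normal-approximation standard error of the preference ratio n/N."""
    p = n / N
    return sqrt(p * (1 - p) / N)
```

For 7 of 8 the standard error is about 0.117; for 14 of 16 it is about 0.083, smaller by a factor of √2 even though the estimated ratio (0.875) is unchanged.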
Then we get into questions of how well these 20 guys represent the general public (or whatever demographic the experimenter is interested in - presumably home brewers). The test should be repeated and the results combined.
I fully expect that most of the brickbats that are hurled at the Brulosopher experiments would derive from considerations such as these. I doubt they have ever calibrated a panel. I don't see how, with the resources at hand, they could replicate the experiments.
And that's why I'm interested in knowing the conditions under which the tasters did the triangle test. Are there conditions external to the test which are masking results, i.e., palate fatigue, just had a big onion and garlic burger, tasters have had six beers already and are three sheets to the wind, you know the list.
Those are certainly factors that one would expect to influence the qualification rate and I mentioned a couple of other things above.