Do "professional" brewers consider brulosophy to be a load of bs?

I'm not knocking *anyone* who happens to be a BJCP judge, but I'm not real impressed by the quality and consistency of them, either.

Case in point: Last beer comp I entered, I sent in a Foreign Extra Stout with my other entries. A buddy of mine paid for an entry but didn't have any finished beer to submit, so I gave him two bottles of *the exact same beer.*

One came back with a score in the low 30's with notes of 'too much caramel, mouthfeel too light.'

The other came back with a low 40's score and no significant flaws noted.

These were the *same* beers, in the *same* flight, by the *same* judges.

More or less killed my desire to enter competitions and renders the feedback horribly suspect.

Good post, and I suppose it answers a question I've been kicking around...why don't the brulosophers submit samples to competitions and add the scores to the discussion? The ferm temp experiments are a good example. It would be interesting to see whether the guided taster...the BJCP judge with a style guide...might rate the beers differently. It would not validate or invalidate the primary finding, but it might add additional anecdotal information of interest.
 
This comparison isn't valid because 4 out of 10 preferring A is the same as 6 out of 10 preferring B. In fact p(20,10,4) isn't meaningful because of this symmetry. I should have realized that, and I can assure you that these discussions have taught me quite a bit beyond just that.

I need to think more about how I'm computing these probabilities. This is turning into a tar baby.
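In case it helps to see that symmetry in numbers, here's a quick check (a minimal sketch, assuming SciPy is available):

```python
from scipy.stats import binom

# Under H0 (no real preference) each of the 10 qualifiers picks A or B with
# probability 1/2, so the preference count is Binomial(10, 0.5).
print(binom.pmf(4, 10, 0.5))   # P(4 of 10 prefer A) -> ~0.205
print(binom.pmf(6, 10, 0.5))   # P(6 of 10 prefer B) -> ~0.205, identical
```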

Yeah, it is. There are a number of things that make this problem a....problem.

One is what those panels of beer drinkers represent. I understand the statistics (believe me, I do), but I don't think they're properly used to produce actionable intelligence. I use that phrase with my students; so something is "significant;" what have you learned of value about the world if it's significant? If you can't say, significance is not useful.

I know that people are guessing when they can't tell, and that's certainly fine for the statistical element of this, but there's an issue with it. People who guessed correctly simply by luck can't tell the difference. I don't see the point of asking such people about preference, as the preference is just as random. When one does that, the preference data is contaminated by guessing. Like trying to see something through fog.

I'd feel better about the panels--and those who "qualified"--if they could reproduce their choice repeatedly. That would tell me they truly were qualified, and not qualified purely on the basis of a lucky guess.

One of my areas of interest/expertise is measurement (though it's in the social science world, not the biological/chemical world). Measures--instruments--need to be valid but to be valid they also must be reliable. I have no indication in any of this that the guessers--who are providing "preference" data--are doing anything in that realm except guessing.

And to a guy in my field, "guessing" is the antithesis of reliability. Without reliability you cannot have validity--and it's very hard for me to see either here.
 
Good post, and I suppose it answers a question I've been kicking around...why don't the brulosophers submit samples to competitions and add the scores to the discussion? The ferm temp experiments are a good example. It would be interesting to see whether the guided taster...the BJCP judge with a style guide...might rate the beers differently. It would not validate or invalidate the primary finding, but it might add additional anecdotal information of interest.

He has! His warm ferment Vienna lager went to round two or something at NHBC. It scored a 41 in the first round. Further, on all the podcasts I've heard where he serves beer to people, the people, often famous in homebrewing, are very complimentary of his beer. He brews a lot of beer; I am sure it's good.
 
He has! His warm ferment Vienna lager went to round two or something at NHBC. Further, on all the podcasts I've heard where he serves beer to people, the people, often famous in homebrewing, are very complimentary of his beer. He brews a lot of beer; I am sure it's good.

Confirmation bias?

What would they think of the beer if they didn't know he'd brewed it?

I'm not saying it's bad, not at all. I'm just noting there are other explanations.
 
Confirmation bias?

What would they think of the beer if they didn't know he'd brewed it?

I'm not saying it's bad, not at all. I'm just noting there are other explanations.

Maybe that's why he doesn't brew all the experiments. There are a few brulosophers now. You had already quoted me before I added that it scored a 41. I don't know how competitions are scored, but I have no reason to believe that this man's beers aren't great.
 
I'm not knocking *anyone* who happens to be a BJCP judge, but I'm not real impressed by the quality and consistency of them, either.

Case in point: Last beer comp I entered, I sent in a Foreign Extra Stout with my other entries. A buddy of mine paid for an entry but didn't have any finished beer to submit, so I gave him two bottles of *the exact same beer.*

One came back with a score in the low 30's with notes of 'too much caramel, mouthfeel too light.'

The other came back with a low 40's score and no significant flaws noted.

These were the *same* beers, in the *same* flight, by the *same* judges.

More or less killed my desire to enter competitions and renders the feedback horribly suspect.

Have you ever poured a glass of beer and taken a couple sips and thought it was good/bad, only then to change your opinion by the time you finished the beer? And that's when you know it's the same beer, and have not tasted 5 other beers in the meantime.

Hopefully, with multiple judges, you're getting feedback that's in the ballpark most of the time. I find it helpful to confirm what I taste, or to give me some ideas as to what I may be missing. Sometimes it's not helpful. With something as subjective as taste, you can't really hope for much more than that. The fact that some people score consistently well tells me it's not a worthless endeavor.
 
He has! His warm ferment Vienna lager went to round two or something at NHBC. It scored a 41 in the first round. Further, on all the podcasts I've heard where he serves beer to people, the people, often famous in homebrewing, are very complimentary of his beer. He brews a lot of beer; I am sure it's good.


Yes, but the question isn't whether it's good or great beer. It's about whether doing X makes it different. "Better" is in the eye of the beholder.
 
Yes, but the question isn't whether it's good or great beer. It's about whether doing X makes it different. "Better" is in the eye of the beholder.

Actually no...I was thinking of submitting the beer to a competition as a test of whether it was actually a good beer or something mediocre. If the tasters could not tell the beers apart and both scored in the high thirties, I'd think differently about the outcome than if both scored in the 20s. The warm ferment is a good example, but I'd like to see him submit both beers to the same comp in the same category and see how they do. Maybe that tinge of diacetyl doesn't show badly in a triangle test with no information provided about the style or the test, but does stand out in a lineup of 10 beers judged against the style.
 
I liked this post thinking I understood your point. Then I realized I did not know what pedagogy meant and had to look it up. Not sure "brewing pedagogy" makes sense now that I have, but thanks for the $10 word!
IMO ingrained, deep-rooted, or fixed beliefs would be better descriptors.
 
Yes, but the question isn't whether it's good or great beer. It's about whether doing X makes it different. "Better" is in the eye of the beholder.

This is one of my main points in all these discussions. Just because there is a difference, it doesn't represent better or worse, just a difference. Difference or not, the dude can brew very well and everyone has complimented him. He likes, makes, and drinks good beer. I wonder if some haven't read or heard much from him. He is more shocked than anyone, and he brews by and believes in standard practices for sure.
 
This is one of my main points in all these discussions. Just because there is a difference, it doesn't represent better or worse, just a difference. Difference or not, the dude can brew very well and everyone has complimented him. He likes, makes, and drinks good beer. I wonder if some haven't read or heard much from him. He is more shocked than anyone, and he brews by and believes in standard practices for sure.

I noticed that when he did a 60 minute vs. 30 minute mash, there was no difference in the outcome, but he still does a 60 minute mash.

So he is innovative, and he's come up with some information we didn't have before. Especially stuff that runs counter to the prevailing ways of doing things. That makes it interesting.

But I also wonder: if people can't hold to the level of precision he can, are his findings less useful? I mean, I aim for 152 for a mash temp, and that could be 150-154 sometimes. He does two mashes side by side and they are 150.2 and 150.0. And I assume he was aiming for 150.

So the next step is replication - can others replicate his experiments and results.

As others have mentioned, his articles aren't peer reviewed or vetted. He runs experiments (once as far as I can tell) and publishes the results. If you compare that to actual lab work, it is very lightweight.
 
This touches on another aspect that didn't get much (if any) mention in my previous post. The results, of course, depend on the panel, but they also depend on the instructions given to the panel. Where the instructions call for marking one or the other of the beers based on opinion rather than something more concrete (such as whether one beer tastes more strongly of diacetyl than the other) we have quite (IMO) a different situation. I know (or think I know) how to calibrate a panel to see if it can detect diacetyl but I don't know how to calibrate one to see if it can detect 'better' beer.


But Brulosophy clearly states that they are just looking for the tasters' opinions in this round of questions.
And they only ask their opinions if they correctly chose the odd beer out.

This isn't actually part of the test.
 
But Brulosophy clearly states that they are just looking for the tasters' opinions in this round of questions.
And they only ask their opinions if they correctly chose the odd beer out.

This isn't actually part of the test.

That's a really important thing to understand about what they do. They mostly test processes and some ingredients to see if there is a difference.

They aren't really even asking "is there more diacetyl?" They're just saying, "Here are 3 glasses. Two are the same as each other, and one is different from those two. Which is different?" And from there, they discuss which they like better.
 
Have you ever poured a glass of beer and taken a couple sips and thought it was good/bad, only then to change your opinion by the time you finished the beer? And that's when you know it's the same beer, and have not tasted 5 other beers in the meantime.

Hopefully, with multiple judges, you're getting feedback that's in the ballpark most of the time. I find it helpful to confirm what I taste, or to give me some ideas as to what I may be missing. Sometimes it's not helpful. With something as subjective as taste, you can't really hope for much more than that. The fact that some people score consistently well tells me it's not a worthless endeavor.

Additionally, two beers on opposite ends of the flight can show significant palate fatigue, as well as a scoring shift relative to other beers in the flight. I try to teach newer judges to account for that- there needs to be reason(s) I am scoring a beer better or worse than the beers that preceded it. Sometimes it can mean adjusting the score of the current beer. Sometimes it can mean adjusting previous scores.

While that's only marginally relevant to the topic at hand, the palate issues are very relevant.

And at the end of the day, statistics aside, if Brulosophy experiments cannot be replicated then they are fundamentally worthless and there is some variable or combination of variables, known or unknown, that are not being accounted for.

Some folks really need to hop off the bandwagon, stop misquoting people and putting words in others' mouths, and actually listen to other people.
 
i thoroughly enjoy their site but would never consider it rigorous scientific method. but isn't that kind of the whole point? in a lab setting and with proper analytical techniques, no doubt it could be determined and demonstrated that beers are different...but we don't exist in a lab. our bodies don't have the sensitivity of advanced lab equipment so does it even matter? if anything, they are just reinforcing what papazian has been saying for decades (rdwhahb).

i like how the exbeeriments also reinforce human nature. like the one where they split a blonde ale and added flavorless, odorless colorant to half the batch and had folks compare (they could see the color difference). sure enough, folks described traditional characteristics when comparing the dark and light versions, even though they were identical. they categorized the light beer as a cream ale, pale ale, light lager, etc. and the dark as a dark lager, brown ale, porter, etc. some testers were served both samples blindfolded and not surprisingly, they couldn't tell the difference. just goes to show the power of our senses and associated preconceived biases. they had another one with an ipa up against pliny. folks thought the pliny tasted pretty good but once they were told it was pliny, they couldn't get enough of it. again, the power of persuasion.

i also like the ones where they take the tests themselves and can't tell the difference. they know the variable, know what to hunt for and still can't tell them apart. yes, yes, perception is in the eye of the beholder and it is just one person but still pretty interesting...
 
Have you ever poured a glass of beer and taken a couple sips and thought it was good/bad, only then to change your opinion by the time you finished the beer? And that's when you know it's the same beer, and have not tasted 5 other beers in the meantime.

Hopefully, with multiple judges, you're getting feedback that's in the ballpark most of the time. I find it helpful to confirm what I taste, or to give me some ideas as to what I may be missing. Sometimes it's not helpful. With something as subjective as taste, you can't really hope for much more than that. The fact that some people score consistently well tells me it's not a worthless endeavor.

I've run into this inconsistency enough times to make me believe it's more widespread than believed. One would hope that if you were actually judging beers, you'd try to do it with consistency. I know palate fatigue is a real thing, but in ideal circumstances the person doing the judging should be aware of it and make allowances for it. To have the same beer, bottled on the same day in the same way, come back with such a disparity in results just tells me that the method of measurement (the judges) is faulty. I'm not saying *all* judges' palates are flawed, I'm just saying that maybe there should be a bit more rigor in how judges are selected and ranked.
 
I noticed that when he did a 60 minute vs. 30 minute mash, there was no difference in the outcome, but he still does a 60 minute mash.

So he is innovative, and he's come up with some information we didn't have before. Especially stuff that runs counter to the prevailing ways of doing things. That makes it interesting.

But I also wonder: if people can't hold to the level of precision he can, are his findings less useful? I mean, I aim for 152 for a mash temp, and that could be 150-154 sometimes. He does two mashes side by side and they are 150.2 and 150.0. And I assume he was aiming for 150.

So the next step is replication - can others replicate his experiments and results.

As others have mentioned, his articles aren't peer reviewed or vetted. He runs experiments (once as far as I can tell) and publishes the results. If you compare that to actual lab work, it is very lightweight.

Others have! There are 3 or 4 of them. The way I see it, lightweight or not, it's what we have. On the fermentation temperature reproach thread, I asked time and time again for other research, other data, and you know how much showed up. Zero, zip, zilch. We would all be willing to consider any other data, so where is it? Btw, he has used labs in a few, and the DMS came back nil on a 30-minute boil of a German pilsner. Hot side aeration, DMS from a short boil, lid off, mash temp, fermentation temp, autolysis, a lot to consider.
 
I've run into this inconsistency enough times to make me believe it's more widespread than believed. One would hope that if you were actually judging beers, you'd try to do it with consistency. I know palate fatigue is a real thing, but in ideal circumstances the person doing the judging should be aware of it and make allowances for it. To have the same beer, bottled on the same day in the same way, come back with such a disparity in results just tells me that the method of measurement (the judges) is faulty. I'm not saying *all* judges' palates are flawed, I'm just saying that maybe there should be a bit more rigor in how judges are selected and ranked.

People do overestimate the difficulty of attaining BJCP Recognized or Certified rank. The bar is lower than many think. However the leap between Certified and National is huge, and the leap between National and Master/GM is even larger. My experience over years of entering has confirmed that too.

Judges get biases too. Newer judges often think they're super-tasters and hunt down imaginary off flavors. Higher ranked seasoned judges get arrogant and let their preconceived notions of style bias them. However I am more inclined to trust their palates.

I'd be curious to see the two sets of scoresheets in question and any other info (if flight size/order were marked on the cover sheet).

BJCP ranking, especially Recognized or Certified, isn't strong enough to make me trust their palate on its own in most contexts.
 
If you think these concepts are going to make better beer you couldn't be further from the mark, imo.

So to the short & shoddy experiment, you're saying that a longer mash, longer boil, pitching a proper amount of yeast and controlling fermentation temperature *won't* make better beer than a short mash, short boil, under-pitched yeast and no temp control?

I'm not sure which of those elements is most important among the four, but the experiment clearly showed both a statistically significant difference between the two batches AND a preference for the "traditional" beer.
 
i thoroughly enjoy their site but would never consider it rigorous scientific method. but isn't that kind of the whole point? in a lab setting and with proper analytical techniques, no doubt it could be determined and demonstrated that beers are different...but we don't exist in a lab. our bodies don't have the sensitivity of advanced lab equipment so does it even matter? if anything, they are just reinforcing what papazian has been saying for decades (rdwhahb).

i like how the exbeeriments also reinforce human nature. like the one where they split a blonde ale and added flavorless, odorless colorant to half the batch and had folks compare (they could see the color difference). sure enough, folks described traditional characteristics when comparing the dark and light versions, even though they were identical. they categorized the light beer as a cream ale, pale ale, light lager, etc. and the dark as a dark lager, brown ale, porter, etc. some testers were served both samples blindfolded and not surprisingly, they couldn't tell the difference. just goes to show the power of our senses and associated preconceived biases. they had another one with an ipa up against pliny. folks thought the pliny tasted pretty good but once they were told it was pliny, they couldn't get enough of it. again, the power of persuasion.

i also like the ones where they take the tests themselves and can't tell the difference. they know the variable, know what to hunt for and still can't tell them apart. yes, yes, perception is in the eye of the beholder and it is just one person but still pretty interesting...

Wow, thanks for this insight. Yeah thinking of the way we Homebrew in garages juxtaposed with a lab says a lot. I couldnt agree more and perception is king. I try vvveeeerrryyyy hard to keep my perception out of things, even though its hard.
 
So to the short & shoddy experiment, you're saying that a longer mash, longer boil, pitching a proper amount of yeast and controlling fermentation temperature *won't* make better beer than a short mash, short boil, under-pitched yeast and no temp control?

I'm not sure which of those elements is most important among the four, but the experiment clearly showed both a statistically significant difference between the two batches AND a preference for the "traditional" beer.

Yes, I am, especially not in terms of better in the way you're describing it, and definitely not in terms of great the way you want to make great beer. I actually don't think I had seen this one, only heard the original, which was a podcast. You skewed the numbers a little and once again totally failed to pick up on the qualitative and empirical data. Yeah, of 22 tasters, 13 could tell a difference. It reached a level of confidence, but it wasn't like all 22 of them could. Then you quote 6 versus 2 in preference. Well, the other five couldn't tell a difference or didn't have a preference. That means seven either liked the short one or didn't care which one. No one described the short one as bad, and the person who made it was startled at how similar they were. Now, I make beers in two and a half hours, so your definition of better and mine might be a little different. Either way, if one was so unbelievably much better than the other, more than 13 would have seen the difference, the person who brewed it would have said it was way better, and many more of the 13, a majority, would have preferred it. Imo, this is not the road to beer-making Nirvana, and if you want to split hairs at this level to be right, then you can be right.
 
Others have! There are 3 or 4 of them. The way I see it, lightweight (your opinion) or not, it's what we have. On the fermentation temperature reproach thread, I asked time and time again for other research, other data, and you know how much showed up. Zero, zip, zilch. We would all be willing to consider any other data, so where is it? Btw, he has used labs in a few, and the DMS came back nil on a 30-minute boil of a German pilsner. Hot side aeration, DMS from a short boil, lid off, mash temp, fermentation temp, autolysis, a lot to consider.

But I think you're wrong on fermentation temp, based on the fact that it has been replicated so many times. Of his ferm temp experiments, 7 of 8 were deliberately testing warm vs cool.

1) I already showed the results. Notable was that if the answers were purely arrived at by guessing in a triangle test, it would be expected that in some experiments there would be <33% of the testers picking the odd beer. I noted that the worst case result was 33%, but every other test was higher than 33% picking correctly. There was a trend, and that trend was CONSISTENTLY one direction. The number of testers in the experiments, however, would have required typically 50% to achieve significance.

2) He showed that temperature gap matters. In experiment 5, he deliberately skewed the temp all the way up to 82 degrees. This was one of the experiments that achieved significance with 21 testers.

3) If you take those 7 experiments in the aggregate, 76 of 172 testers correctly picked the odd beer. Statistically, that's a p-value of 0.002. That is not only significant, that is HIGHLY significant.

And you can't accuse me of cherry-picking, as the other experiment (#7) actually achieved significance with 50% of the testers correctly choosing the odd beer. So if I had included it in my aggregate analysis, it would have strengthened the result. But I can't do that as it was testing a different variable than simply colder vs warmer ferment.

So point #1 above suggests that the trend is in one direction, indicating that even if individual experiments don't achieve significance, there's no reason to believe the testers are simply "guessing" - pure guessing wouldn't consistently land on one side of 33%.

Point #2 shows that magnitude matters. When you increase the magnitude of temperature difference, it's easier to achieve significance. Perhaps when you're a very good brewer, slight changes in fermentation temperature create a difference but it's below many people's tasting threshold. Increasing the magnitude of the difference brings it within more people's tasting threshold.

Point #3 shows that with a larger sample size (yes, constructed from separate experiments), a bunch of non-significant results achieve significance. In many ways this is simply a more statistically valid restatement of point #1, since if the results fell both above and below 33%, the aggregate wouldn't be likely to achieve significance.

--------------------------------------

Have I made an error here? You seem to have taken it at face value that fermentation temp is not worth worrying about. I would think that the above description might convince you that although individual experiments didn't achieve significance, there is still a difference between the two.
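If anyone wants to check that aggregate figure, it's a one-sided binomial test of 76 correct out of 172 against a 1/3 guessing rate. A minimal sketch, assuming SciPy is available:

```python
from scipy.stats import binom

n, correct, p_guess = 172, 76, 1/3   # the seven pooled ferm temp triangle tests

# One-sided tail: the chance of 76 or more correct picks if every taster
# were purely guessing (the expected count would be about 57 of 172).
p_value = binom.sf(correct - 1, n, p_guess)
print(p_value)   # roughly 0.002
```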
 
Getting back on topic, my book by Mike Karnowsky, who is a monster brewer in the field, has quite a few little Brulosophy-style experiments that he did and reports on. Also, on a pro forum I saw discussion on various topics. So I think there's reason to believe that some professionals like to experiment and find other ways of brewing.
 
But I think you're wrong on fermentation temp, based on the fact that it has been replicated so many times. Of his ferm temp experiments, 7 of 8 were deliberately testing warm vs cool.

1) I already showed the results. Notable was that if the answers were purely arrived at by guessing in a triangle test, it would be expected that in some experiments there would be <33% of the testers picking the odd beer. I noted that the worst case result was 33%, but every other test was higher than 33% picking correctly. There was a trend, and that trend was CONSISTENTLY one direction. The number of testers in the experiments, however, would have required typically 50% to achieve significance.

2) He showed that temperature gap matters. In experiment 5, he deliberately skewed the temp all the way up to 82 degrees. This was one of the experiments that achieved significance with 21 testers.

3) If you take those 7 experiments in the aggregate, 76 of 172 testers correctly picked the odd beer. Statistically, that's a p-value of 0.002. That is not only significant, that is HIGHLY significant.

And you can't accuse me of cherry-picking, as the other experiment (#7) actually achieved significance with 50% of the testers correctly choosing the odd beer. So if I had included it in my aggregate analysis, it would have strengthened the result. But I can't do that as it was testing a different variable than simply colder vs warmer ferment.

So point #1 above suggests that the trend is in one direction, indicating that even if individual experiments don't achieve significance, there's no reason to believe the testers are simply "guessing" - pure guessing wouldn't consistently land on one side of 33%.

Point #2 shows that magnitude matters. When you increase the magnitude of temperature difference, it's easier to achieve significance. Perhaps when you're a very good brewer, slight changes in fermentation temperature create a difference but it's below many people's tasting threshold. Increasing the magnitude of the difference brings it within more people's tasting threshold.

Point #3 shows that with a larger sample size (yes, constructed from separate experiments), a bunch of non-significant results achieve significance. In many ways this is simply a more statistically valid restatement of point #1, since if the results fell both above and below 33%, the aggregate wouldn't be likely to achieve significance.

--------------------------------------

Have I made an error here? You seem to have taken it at face value that fermentation temp is not worth worrying about. I would think that the above description might convince you that although individual experiments didn't achieve significance, there is still a difference between the two.


The data speaks for itself and you are free to see it how you want. Once again I feel you left out the empirical data, and brought up an experiment that goes against your arguments. The way I see it, the 82 degree ferment xbmt strengthens my argument. Yep, you are right, it showed significance at an 82 degree ferment. Who ferments at 82 anyway? But he did. What you're missing is that it only shows there was a difference. Using your perception and deep-rooted beliefs, you assume that difference was bad. However, seven preferred the warm fermented one to two for the cool. And if you include the other four who didn't care either way, it's still 7 to 6. That's not enough for me to go around making claims about warm fermenting. And if I was going to, I guess I would have to say warm ferment is better, as six of the eight tests didn't even show a difference, the 82 deg xbmt showed a difference with preference on warm, and not having to buy or have a bunch of junk in my house is a no-brainer. But I won't go saying that warm ferment is better; I will say this isn't the road to beer-making Nirvana, imo.
 
Additionally, two beers on opposite ends of the flight can show significant palate fatigue, as well as a scoring shift relative to other beers in the flight. I try to teach newer judges to account for that- there needs to be reason(s) I am scoring a beer better or worse than the beers that preceded it. Sometimes it can mean adjusting the score of the current beer. Sometimes it can mean adjusting previous scores.

While that's only marginally relevant to the topic at hand, the palate issues are very relevant.

And at the end of the day, statistics aside, if Brulosophy experiments cannot be replicated then they are fundamentally worthless and there is some variable or combination of variables, known or unknown, that are not being accounted for.

Some folks really need to hop off the bandwagon, stop misquoting people and putting words in others' mouths, and actually listen to other people.

^This!
 
1) I already showed the results. Notable was that if the answers were purely arrived at by guessing in a triangle test, it would be expected that in some experiments there would be <33% of the testers picking the odd beer. I noted that the worst case result was 33%, but every other test was higher than 33% picking correctly. There was a trend, and that trend was CONSISTENTLY one direction. The number of testers in the experiments, however, would have required typically 50% to achieve significance.

This is a beautiful example of someone who is really thinking about what this all means. I really mean that!

It's a wonderful "Hmmmm...." moment. I still have my issues with how panels are constituted and whether there's palate fatigue or what prior drinking/eating does to panelists, but this is interesting.


3) If you take those 7 experiments in the aggregate, 76 of 172 testers correctly picked the odd beer. Statistically, that's a p-value of 0.002. That is not only significant, that is HIGHLY significant.

A meta-analysis! Nicely done. Of course, they're different, but what remains is whether one is preferable to the other.

And you can't accuse me of cherry-picking, as the other experiment (#7) actually achieved significance with 50% of the testers correctly choosing the odd beer. So if I had included it in my aggregate analysis, it would have strengthened the result. But I can't do that as it was testing a different variable than simply colder vs warmer ferment.

I'm half tempted to use this in my classes; good thinking here, and a great way to show tilting the evidence away from a particular outcome and still achieving it.

So point #1 above suggests that the trend is in one direction, indicating that even if individual experiments don't achieve significance, there's no reason to believe the testers are simply "guessing" - pure guessing wouldn't consistently land on one side of 33%.

Point #2 shows that magnitude matters. When you increase the magnitude of temperature difference, it's easier to achieve significance. Perhaps when you're a very good brewer, slight changes in fermentation temperature create a difference but it's below many people's tasting threshold. Increasing the magnitude of the difference brings it within more people's tasting threshold.

Well, I'd say that point #2 "suggests" something....there's still a lot of looseness in how people do these tests.

Point #3 shows that with a larger sample size (yes, constructed from separate experiments), a bunch of non-significant results achieve significance. In many ways this is simply a more statistically valid restatement of point #1, since if the results fell both above and below 33%, the aggregate wouldn't be likely to achieve significance.

The only issue with this approach is that all of the samples came from different processes/recipes. I do a coin-flip exercise in my classes--20 students flip a coin, we record the number of heads. We do it again and again, 10 times. Of course, the number of heads oscillates around 10/20, and I ask the students, what would it look like with a sample of 200? Of course, 10 trials of flipping 20 coins is a sample of 200.

But the parameters in flipping coins don't change; the parameters in the 7 exbeeriments did.
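Here's a rough sketch of that coin-flip exercise in code, purely for illustration (the seed and counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed

# Ten classes of 20 students each flip a fair coin.
heads = rng.binomial(n=20, p=0.5, size=10)
print(heads)                                   # each count oscillates around 10/20
print(heads.sum(), "heads in the pooled 200 flips")

# Pooling the ten trials is equivalent to one sample of 200 only because
# p = 0.5 is the same in every trial. The seven exbeeriments each had their
# own recipe, temperatures and panel, so a pooled 172 mixes different
# underlying processes -- that's the caveat.
```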

--------------------------------------

Have I made an error here? You seem to have taken it at face value that fermentation temp is not worth worrying about. I would think that the above description might convince you that although individual experiments didn't achieve significance, there is still a difference between the two.

I think your approach is more valuable than most above. I still have my issues w/ the panels (who here doesn't get that by now :)), but I find this to be better "out of the box" thinking than a blind statistical analysis.

Bravo!
 
One is what those panels of beer drinkers represent. I understand the statistics (believe me, I do),
I don't doubt that you understand them far better than I.

but I don't think they're properly used to produce actionable intelligence. I use that phrase with my students; so something is "significant;" what have you learned of value about the world if it's significant? If you can't say, significance is not useful.

So let's say a brewer thinks he has a diacetyl problem and wants to know if using a proportion of valine rich malt will improve his beer with respect to diacetyl. He brews a test batch (B) and wants to know if it is better than his regular beer (A), which is the same as B except that B contains some portion of the valine rich malt. To see if it's better he gives 40 tasters a sample of each and 18 report that beer B is better. He goes to a table (or whatever) and finds that the probability that 18 or more tasters guessing randomly prefer B is 78.5%. He concludes that, as fewer than half of his tasters preferred B and as it's more likely than not that the data he obtained could have been obtained by flipping a coin, B is very likely not better than A, and he doesn't adopt the new process. He takes no action. Let's assume, at this point, that indeed the new malt does improve the beer by reducing diacetyl but that 22 members of the panel are diacetyl taste deficient. Thus the brewer accepts H0 when H1 is true, and we see that this test isn't very powerful because of the panel composition.

Along comes a guy who says "Hey, that's not a very powerful test. Give them three cups...." i.e. advises him to try a triangle test. Under H1 the 18 that picked the lower diacetyl beer in the simple test should be able to detect the difference between A and B and so we would have 18 out of 40 qualifying. The probability of this happening under H0 is 8.3%. That's enough for the brewer to start to think 'maybe this makes a difference' but not below the first level of what is usually statistically significant. And he still doesn't know whether the new process improves the beer. Being this close to statistical significance his action is perhaps to perform additional tests or empanel larger panels or test his panel to see if some of its members are diacetyl sensitive.

The consultant comes back and says "Did you ask the panelists which they preferred?" and the brewer says "Yes, but I didn't do anything with the data because this is a triangle test." The consultant advises him to process the preference votes, which reveals that 11 of the 18 who qualified preferred B. The probability that 18 qualify and 11 prefer under the null hypothesis is 1.6%. Using this data the brewer realizes he is below the statistical significance threshold, confidently rejects the null hypothesis and takes the action of adopting the new process. Note that under the assumptions we made above more than 11 out of 18 should find B to be lower in diacetyl. If 14 do, then p < 0.1%.
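The two marginal numbers in that example are just one-sided binomial tails. A minimal sketch for checking them, assuming SciPy is available (I've left out the combined 1.6% figure since it depends on how the joint event is defined):

```python
from scipy.stats import binom

# Simple preference test: 40 tasters, and under H0 each picks A or B at random.
# Probability that 18 or more prefer B:
print(binom.sf(17, 40, 1/2))   # ~0.785, the 78.5% above

# Triangle test: probability that 18 or more of 40 pick the odd beer out
# purely by guessing (1/3 chance each):
print(binom.sf(17, 40, 1/3))   # ~0.083, the 8.3% above
```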


I know that people are guessing when they can't tell, and that's certainly fine for the statistical element of this, but there's an issue with it.
You seem to be saying that while we are trying to make a decision about H1 by the only means available to us, i.e. rejecting H0 if the probability of what we observe is low under H0, a test which produces a lower p than another test isn't necessarily a better test. The lower the p, the more likely we are to reject H0 when H1 is true (and p does not depend on any of the conditions that pertain when H1 is true), and the probability that we do so is, AFAIK (I'm no statistician for sure), the definition of the 'statistical power' of the test. The two-stage test is more powerful than the triangle-alone test.


People who guessed correctly simply by luck can't tell the difference. I don't see the point of asking such people about preference, as the preference is just as random.
That's really not a flaw of the technique but rather a feature of it. Yes some unqualified votes (guesses) come in but 2/3 of them are eliminated. Compare to just asking panelists to pick the better beer. 0% of the guessers are eliminated in that case. The power of the two stage triangle test derives from this very feature.


When one does that, the preference data is contaminated by guessing. Like trying to see something through fog.
So let's turn down the contamination level by presenting quadruplets of cups with 3 beers the same and 1 different. In that case only 1/4 of guessers qualify, p(40,18,11) = 0.09% and the test is seen to be even more powerful.


I'd feel better about the panels--and those who "qualified"--if they could reproduce their choice repeatedly.
As I mentioned in a previous post adding the preference part is really asking the panelists to distinguish the beers again by choosing which has less diacetyl than the other. This is sort of similar to multiple runs.

That would tell me they truly were qualified, and not qualified purely on the basis of a lucky guess.
Depending on the nature of the investigation, qualification by guessing may be exactly what you are looking for. If you want to see if your market is insensitive to diacetyl creep, then you want to see if they have to guess when being asked to distinguish (or, more importantly, prefer) beers lower in diacetyl. Keep in mind that to pick one out of three correctly there must be both a discernible difference AND the panelist must be able to detect it. If both those conditions are not met, then every panelist must guess (the instructions require him to). These tests are a test of the panel and the beer. I keep saying that.

But where we are investigating something specific, as in the example in this post, we want to qualify our panel by presenting it standard samples for two-part triangle testing.


One of my areas of interest/expertise is measurement (though it's in the social science world, not the biological/chemical world). Measures--instruments--need to be valid but to be valid they also must be reliable. I have no indication in any of this that the guessers--who are providing "preference" data--are doing anything in that realm except guessing.
As noted, even the 'best' panelists have to guess when the beers are indistinguishable, and that's exactly what we want them to do. As I said above, guessing is an important feature of this test - not a flaw.


And to a guy in my field, "guessing" is the antithesis of reliability. Without reliability you cannot have validity--and it's very hard for me to see either here.

I've explained it as clearly as I can, and if you can't see it then I would, depending on your level of interest, suggest pushing some numbers around or even doing a Monte Carlo or two if you are so inclined. The main disconnect here is that you are arguing that a statistical test, even though more powerful than another, is less valid than the other. That can only mean it could lead us to take the wrong action which, in this case, would imply that our hypothetical brewer, using the more powerful test, decides against using the valine rich malt even though it does improve the beer (defining 'improve' as a reduction in diacetyl). I don't see how that could possibly happen.
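To make the Monte Carlo suggestion concrete, here is a minimal sketch under H0, using the panel of 40 with 18 qualifiers and 11 preferring from the example above (the joint-tail convention here is just one reasonable choice, not necessarily the one behind p(40,18,11)):

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed
trials = 200_000
panel, qualify, prefer = 40, 18, 11     # numbers from the example above

# Stage 1 under H0: every panelist guesses among three cups (1/3 chance each).
q = rng.binomial(panel, 1/3, size=trials)

# Stage 2 under H0: each qualifier then picks a preference at random.
pref = rng.binomial(q, 1/2)

# Triangle stage alone: fraction of panels with 18 or more correct picks (~0.083).
print((q >= qualify).mean())

# Joint event under this convention: 18+ qualify AND 11+ of them prefer B.
# The p(40,18,11) figure above uses its own definition of the joint tail,
# so the two numbers need not coincide exactly.
print(((q >= qualify) & (pref >= prefer)).mean())
```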
 
After reading this whole thread plus the Brulosophy exbeeriments, I will say that it does cause one to wonder about one's brewing habits. As for whether some parts of my brewing day should be changed as a result of all of this - probably not.

One thing that comes to mind is that when all these tests are done, the tasting happens in a more or less controlled environment. But the average beer drinker who is out tasting flights or drinking these beers with a meal will probably never notice a difference. Their palate is no longer clean and uncontaminated by other variables, which makes it hard to tell if there was a slight change. Yes, if the change was great enough, even a dirty palate should notice a difference.

For now though I will carry on as always, and stick with the thought of, "if it ain't broke, don't fix it".
 
I've explained it as clearly as I can, and if you can't see it then I would, depending on your level of interest, suggest pushing some numbers around or even doing a Monte Carlo or two if you are so inclined. The main disconnect here is that you are arguing that a statistical test, even though more powerful than another, is less valid than the other. That can only mean it could lead us to take the wrong action which, in this case, would imply that our hypothetical brewer, using the more powerful test, decides against using the valine rich malt even though it does improve the beer (defining 'improve' as a reduction in diacetyl). I don't see how that could possibly happen.

We're going to have to agree to disagree about all this. Partly you want these tests to be about specific elements of the beer (diacetyl, e.g.), whereas I'm looking for areas that potentially confound the results.

And allowing the guessers to be part of preference trials may be necessary for the statistical elements of tests to be met, but it is the antithesis of good measurement to include tasters who can't tell a difference. They've already indicated they have no preference since they can't even tell them apart. Then a forced choice introduces noise into the data, which makes no sense at all.

Frankly, I think some of this testing is an attempt to hide, under the veneer of "scientific" statistics, the fact that the tasting panels are suspect. I'd be much more inclined, if I wanted to do a preference test, to just ask people which they preferred, and bag the triangle test. Randomly assign which beer was tasted first, then see what you get.

If I can ever figure out how to do one of these exbeeriments at a level which will satisfy my desire to make it well-controlled, I'll do some of this. Problem is, I have a single 5-gallon system. Can't split a 10-gallon batch which is what I'd really like to do. I'd do one of these "ferment warm versus ferment cool" exbeeriments and then see what we can see. And I'd put some rules on tasters as to the conditions under which they can taste the beers. Not after a lot of other beers, not after eating a garlic-and-onion burger, like that. Repeated triangle tests, something that would validate their inclusion via ability to repeat their results.

Maybe this is all just a justification for an upgrade to my system? :) :)
 
Lemme put it this way: you can have taproom staff and your regular customers (none of whom are trained tasters) notice - consistently enough to be a pattern - smaller tweaks to beers (before they're aware they're there) than the variables Brulosophy experiments test.

Maybe that's scale, maybe not. So, grain of salt. That doesn't make what they're doing invalid, just that you need to see the limitations of one group doing unreplicated experiments with small samples of partially unknown composition in unknown settings. If you're blindly trusting it, as some here seem to, you're a fool.
 
Most seriously: their experiments cannot be replicated reliably by others. This is a key element of the scientific process.

I'd be quite interested in hearing about this. Who has failed to replicate their results?

Anecdotally, I tried Marshall's quick ale fermentation schedule and am not terribly pleased with the result. I might change my mind once the beer is done dryhopping, but I get a weird ester taste which I _think_ is a temp-related fruit flavor, which I didn't expect from fermenting at 64 degrees with US-05.
 
Perfect. Because I dry hopped a beer for 2 days and it is the best IPA I have made so far, but when I told some "pros" they were giving me crap while I just think the proof is in the pudding. Had I not told them I dry hopped for 2 days vs a week or whatever I think they would have had a completely different opinion on the beer and commended me for the brew :p :tank:

Interesting. I am in a brewing school and when Matt Brynildson (Brewmaster at Firestone Walker) presented his Hops lectures, he did show studies that indicate after 48 hours you really get nothing of value from dry hopping. Your 2 days makes perfect sense.

When Phil Leinhart (10 years at AB and Brewmaster at Ommegang) presented Wort boiling and Cooling, he very specifically covered Cold Break as part of that.

               Hot Trub    Cold Trub
Proteins       50-60%      50-70%
Tannins        20-30%      20-30%
Hop Resins     15-20%      6-12%
Ash            2-3%        2-3%
Particle Size  20-80 μ     0.5-1.0 μ

So pros do pay attention to ongoing studies, research, etc. You have to understand that commercial brewers need to look beyond whether the beer tastes good through the life of a fresh 5 gallon batch....their beer needs to maintain quality for several months at least. They are concerned with hot side aeration, dissolved oxygen post fermentation, dissolved oxygen in packaging, beer staling precursors, storage temperatures, etc. and most of all CONSISTENCY in a product line. To change a recipe or process is a HUGE thing and is not done without considerable thought, research, and sensory analysis.
 
Lemme put it this way: you can have taproom staff and your regular customers (none of whom are trained tasters) notice - consistently enough to be a pattern - smaller tweaks to beers (before they're aware they're there) than the variables Brulosophy experiments test.

Maybe that's scale, maybe not. So, grain of salt. That doesn't make what they're doing invalid, just that you need to see the limitations of one group doing unreplicated experiments with small samples of partially unknown composition in unknown settings. If you're blindly trusting it, as some here seem to, you're a fool.

I think you are the fool... for calling others fools in what has been a decent conversation.
 
So pros do pay attention to ongoing studies, research, etc.


SOME do. In my city, we have a lot of breweries, but only a handful are actually making good beer. I've talked to brewers at some of the bad breweries and they've brashly admitted they don't pay attention to the industry or any developments.

A few of these places have closed since I spoke to them. We're not talking about breweries who need to bottle each batch the same, we're talking about breweries who don't distribute, other than growlers, and can afford to take some chances and make things a little differently batch to batch.
 
Ugh.....

We are ALL fools.

So are we all chumps or are we all fools? Ugh to you; what great additions you've made to this conversation. You insulted everybody famous and now you're insulting all of us. Welcome to my ignore list and have a nice life, chump.
 
Interesting. I am in a brewing school and when Matt Brynildson (Brewmaster at Firestone Walker) presented his Hops lectures, he did show studies that indicate after 48 hours you really get nothing of value from dry hopping. Your 2 days makes perfect sense.

Yeah, I believe the brewing folks at Oregon State University did some research on this a few years back. It definitely changed my process on dry-hopping. I rarely dry-hop longer than 3-4 days, then cold crash. Given that the hop presence is one of the quickest things in a beer to fade, leaving it on the hops for up to 14 days seems like it would have allowed the hop character to fade unnecessarily if all the oil extraction happens in 48ish hours.

I wonder how this applies to a somewhat standard homebrew process of dry hopping in the keg. I've heard a lot of brewers do this and anecdotally they say that it preserves hop character throughout the life of the brew. But if all the oils are extracted in 48 hours, it seems strange that it would continue contributing to flavor over this time frame...
 
We're going to have to agree to disagree about all this.
I'm not really sure there is that much disagreement.

Partly you want these tests to be about specific elements of the beer (diacetyl, e.g.),
Actually I want it to be about whatever the investigator is interested in. If H1 is "Valine rich malt decreases diacetyl" then the focus is on diacetyl. If H1 is "Valine rich malt improves customer acceptance in the demographic represented by this panel" then the focus is on preference. As I've said several times before, these are different tests with regard to the panel selection but not the beer. In the diacetyl case you want a panel sensitive to diacetyl, which you verify is the case by doing the test with diacetyl spiked samples. You can't do that with a preference testing panel, but you can take steps to ensure that the panel is representative of the demographic you are interested in.

...whereas I'm looking for areas that potentially confound the results.
They abound, and that was the point of my original post in this thread. If the beer differs detectably in any attribute other than the one we are interested in, the triangle part of the two-stage test is immediately invalidated. The example I have used before is color. If use of the valine rich malt changes the color in addition to the diacetyl, and the panelists can see the color, the test is invalid. The probability under H0 is small and the investigator is lulled into rejecting it because of something that has nothing to do with the parameter he is interested in, whether the question is about perceived diacetyl or preference. This is why my original and several follow-on posts emphasized that the investigators have to be very thoughtful about the design and conduct of the tests, and why I suggested that if there were a flaw in Brulosopher's approach it might well lie in this area.

And allowing the guessers to be part of preference trials may be necessary for the statistical elements of tests to be met, but it is the antithesis of good measurement to include tasters who can't tell a difference.
That depends on what the investigator is interested in. If he wants to know about diacetyl he shouldn't empanel a group that doesn't have demonstrated sensitivity to diacetyl (as demonstrated by triangle tests with diacetyl spiked beers). But in the preference case (H0: "Valine rich malt does not improve customer acceptance in the demographic represented by this panel") we want the panel to include people who can't tell the difference if we are interested in selling beer to a demographic which includes people who can't tell the difference. For such a panel it is possible that H0 may be true. If asked about preference a diacetyl sensitive panel would probably enthusiastically endorse the beer made with the valine enhanced wort causing the investigator to reject H0 and, given that he is interested in a market that has a decent proportion of people who can't tell the difference, thereby commit a type I error.

They've already indicated they have no preference since they can't even tell them apart. Then a forced choice introduces noise into the data, which makes no sense at all.
As noted above, sometimes it does. Type I errors can be as damaging as Type II. It appears that rejecting H0 when we should accept it (Type I) is the threat in preference (subjective) investigations, whereas Type II errors (failing to reject H0 when we should) are the threat in tests in which an objective (e.g. more or less diacetyl) answer is sought. In those cases guessers do introduce noise, but as noted in my last post we could easily reduce that noise by using quadruplets rather than triplets. The fact that this is not done indicates (to me, anyway) that the amount of noise injected in a triplet test is not problematical, or at least that the reduction from going to a quadruplet test is not justified by the extra difficulty of manipulating quadruplets.


Frankly, I think some of this testing is an attempt to hide, under the veneer of "scientific" statistics, the fact that the tasting panels are suspect.
My impression, as an engineer, is that people in fields like biology, medicine, the social sciences, finance and many others go to college and are given a tool kit of statistical tests which they then apply in their careers, eventually coming to a pass where they are plugging numbers into some software package without remembering what they learned years ago in college and thus without fully understanding what the results they get mean. Engineers do this too, BTW. In homebrewing I think you find people scratching their heads over what the data from their homebrewing experiments mean, and then they discover an ASBC MOA into which they can plug their numbers and get a determination as to whether they are 'statistically significant' or not, without having a real idea as to what that means.

This is kind of tricky stuff. If I go away from it for even a short period of time I have to sit down and rethink the basic concepts. Maybe it's just that I am not intrinsically good at statistics or don't have much experience with it, but as the discussion here shows there are many subtle nuances in how experiments are conducted and how the data are analyzed. As engineers say, "If all the statisticians in the world were laid end to end they wouldn't reach a conclusion." It's supposed to be a joke but it is true because of the fundamental nature of statistics: it is a guessing game. That's why a statement
And to a guy in my field, "guessing" is the antithesis of reliability. Without reliability you cannot have validity--and it's very hard for me to see either here.
from a statistician kind of surprises me. Everything we observe is corrupted by noise. We cannot measure voltage with a voltmeter. We can only obtain from it an estimate of what the voltage is and must recognize that the reading is the true voltage plus some error. Statistics is the art of trying to control or at least quantify that error so that the guesses we are ultimately forced to report represent the truth at least fairly well. Well, that's my engineer's perspective on it.

I'd be much more inclined, if I wanted to do a preference test, to just ask people which they preferred, and bag the triangle test. Randomly assign which beer was tasted first, then see what you get.
Interesting that you say that, as just this morning I came up with a test. A number of participants are presented with n objects, one of which is different from the others. The instructions to the participants are:

"You will be given a number of objects and a die. One of the objects is different from the others. Identify it. If you cannot, use the die to randomly pick one of the objects. Separate the object you picked and one other object from the group. Now choose which of these two objects you prefer. If you cannot decide on one or the other, use the die again (or a coin) to randomly select one."

Thus the test you propose is the first part of my test with n = 2, and the triangle test is the first part of my test with n = 3. The following numbers show the probabilities, under the null hypothesis, that 10 out of 20 testers will choose the different object correctly AND that 5 of those 10 will prefer one or the other.
3 TR(20,10,5,1/3,1/2) means that n = 3, the panel size is 20, 10 correctly pick, 5 prefer one or the other, the probability of picking correctly is 1/n = 1/3, and the probability of preferring is 1/2. The first number following each TR(...) is the confidence level for the triangle part of the test and the second the confidence level for the two-part test.

2 TR(20,10,5,1/2,1/2)   0.588099    0.268503
3 TR(20,10,5,1/3,1/2)   0.0918958   0.0507204
4 TR(20,10,5,1/4,1/2)   0.0138644   0.00802728
5 TR(20,10,5,1/5,1/2)   0.00259483  0.00153428

These numbers clearly show that a triangle test is a more powerful test than a pick-one-of-two test (which is why triangle tests are performed rather than pick-one-of-two tests) and that a quadrangle test is more powerful than a triangle test. They also show that the two-part test is more powerful than the triangle or quadrangle by itself, in that it increases one's confidence in what the data show. This is all, of course, under the assumption that the investigator does not step into one of the many potential pitfalls we have discussed.
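The first number in each pair is just the upper binomial tail for the relevant guess rate; a quick check, assuming SciPy is available (the second number folds in the preference stage and comes from the joint calculation described above, so it isn't reproduced here):

```python
from scipy.stats import binom

# Probability that 10 or more of 20 panelists pick the odd object correctly
# when each is guessing among n objects with probability 1/n.
for n in (2, 3, 4, 5):
    print(n, binom.sf(9, 20, 1/n))
# -> 0.588, 0.0919, 0.0139, 0.0026, matching the first column above
```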

If I can ever figure out how to do one of these exbeeriments at a level which will satisfy my desire to make it well-controlled, I'll do some of this.
I don't think you will ever be able to do enough experiments to get you past your conception that allowing guesses is a detriment. I think Monte Carlo is a much more promising approach.

Maybe this is all just a justification for an upgrade to my system? :) :)
If H0 is "You shouldn't upgrade your system" the level of support for that is p << 1.
 