Author Topic: Interpretation of Beer Experiment Results (Read 2063 times)

lupulus · « **on:** August 23, 2017, 10:36:17 am »

There is an abundance of literature on interpretation of experimental results but it seems that many homebrewers continue to ignore this literature and misinterpret experimental results.

EXAMPLES OF MISINTERPRETATION
As examples of this misinterpretation that motivated me to write this topic:
- Stan Hieronymus's recent article on First Wort Hopping
- A friend commenting on a post that there is no difference between step mashing and infusion mashing (and sending a link to the Brulosophy experiment on the subject)

I will do my best to avoid scientific jargon in my discussion below...

POSITIVE RESULTS
A positive result is when the investigator reports a statistically significant difference (commonly at the 95% confidence level or higher, i.e. p < 0.05) between the treatments.
Statistical outcome: under the experimental conditions proposed (very important caveat) it is highly likely the treatments are different.
Practically it means that it is likely that (if you reproduce the design and use the same beer style) your beer will be different with one vs. the other treatment.
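To make "statistically significant" concrete, here is a minimal sketch of how a p-value could be computed for a tasting panel. It assumes a triangle test (each taster tries to pick the odd beer out of three, so pure guessing succeeds 1/3 of the time); the function name and the panel numbers are hypothetical, not taken from any experiment mentioned here.

```python
from math import comb

def binomial_p_value(correct: int, tasters: int, chance: float = 1 / 3) -> float:
    """One-sided p-value: probability of seeing at least `correct` right
    answers out of `tasters` if every taster were purely guessing."""
    return sum(
        comb(tasters, k) * chance**k * (1 - chance) ** (tasters - k)
        for k in range(correct, tasters + 1)
    )

# Hypothetical panel: 24 tasters, 13 of them pick the odd beer out.
p = binomial_p_value(13, 24)
print(f"p = {p:.4f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

A small p-value only says the panel did better than guessing under these particular conditions; it says nothing about how large or how pleasant the difference is.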

NEGATIVE / NULL RESULTS
A negative or null result is when the investigator reports that there is NO statistically significant difference (at the chosen level, commonly 95% confidence) between the treatments.
Statistical outcome: None. No statistical conclusions can be drawn.
Practically it means that you should ignore the results until further information is collected.

REASONS FOR NEGATIVE / NULL RESULTS
-   A combination of the below
-   The treatment studied has no or minimal effect on beer (this is what most people think it means)
-   The sample size chosen to test the beers was too small, and if larger, they would have detected a difference (insufficient power in stats jargon)
-   Beer style tested not the correct style for the experiment
-   Experimental design had one or more imperfections. This includes not only the brewing ingredients and process itself but also the testing conditions.
-   Experimental design was not correctly executed. (do not take offense, all investigators must consider this possibility)
-   Beer quality not very good, confounding the experimental variable (do not take offense, all investigators must consider this possibility) (this is related to both experimental design and its execution).
-   Random error (aka chance)
-   Others
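
The "insufficient power" item in the list above can be illustrated with a rough sketch. It again assumes a triangle test with a 1/3 guessing rate and a 0.05 significance level; the "true" proportion of correct answers (0.45) is a made-up value chosen only for illustration, and the function names are hypothetical.

```python
from math import comb

def upper_tail(k_min: int, n: int, p: float) -> float:
    """Probability of at least k_min successes in n binomial trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

def critical_count(n: int, chance: float = 1 / 3, alpha: float = 0.05) -> int:
    """Smallest number of correct answers that would count as significant."""
    for k in range(n + 1):
        if upper_tail(k, n, chance) < alpha:
            return k
    return n + 1  # panel too small: no count can reach significance

def power(n: int, true_p: float, chance: float = 1 / 3, alpha: float = 0.05) -> float:
    """Chance of detecting a real difference when tasters are truly
    correct with probability true_p."""
    return upper_tail(critical_count(n, chance, alpha), n, true_p)

# Assumed real effect: tasters pick the odd beer 45% of the time.
for n in (20, 40, 80, 160):
    print(f"{n:4d} tasters -> power {power(n, true_p=0.45):.2f}")
```

Under these assumptions a small panel misses a real difference most of the time, which is one reason a null result by itself supports no conclusion.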

WRONG INTERPRETATION OF NEGATIVE/ NULL RESULTS
Many brewers keep interpreting null results as “The treatment studied has no or minimal effect on beer”. This is not correct. Any of the listed reasons and at various “weights” could lead to negative/ null results.

SHOULD NEGATIVE/ NULL RESULTS BE PUBLISHED
The answer is almost always YES. The only caveat is that the experimental design must be correctly executed. If the investigator aimed for two 1.050 OG worts, and one of the worts unexpectedly ended at 1.040, the investigator must repeat the experiment. If it happens two or more times, then the investigator may be in the presence of an unexpected finding that warrants further study.

CAN SIMILARITY / EQUIVALENCY BE STATISTICALLY TESTED?
The answer is theoretically yes, but the experimental design would be much more complex, time-consuming and costly. Because the results, even if statistically proven, would apply only to the experimental design tested, there is no valid rationale to design equivalency experiments. Experimental designs whose statistical goal is to reject the null hypothesis are much simpler.
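
For readers curious what such a test might look like, here is a rough sketch of one way to frame it. It is an assumption for illustration, not a method proposed in this thread: suppose at most a fraction `max_discriminators` of tasters are allowed to truly detect the difference while the rest guess in a triangle test, and ask whether the panel got few enough correct answers to rule out anything larger. All names and numbers are hypothetical.

```python
from math import comb

def lower_tail(k_max: int, n: int, p: float) -> float:
    """Probability of at most k_max successes in n binomial trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_max + 1))

def similarity_p_value(
    correct: int,
    tasters: int,
    max_discriminators: float = 0.3,
    chance: float = 1 / 3,
) -> float:
    """p-value for the hypothesis that the share of true discriminators is
    at least max_discriminators; a small value supports similarity."""
    # Expected proportion correct if exactly max_discriminators detect the
    # difference and everyone else guesses.
    p_boundary = max_discriminators + (1 - max_discriminators) * chance
    return lower_tail(correct, tasters, p_boundary)

# Hypothetical panels in which exactly one third of tasters are correct.
for n in (24, 60, 120):
    print(f"{n:4d} tasters -> p = {similarity_p_value(n // 3, n):.4f}")
```

Even in this simplified form the conclusion depends on an arbitrary choice of `max_discriminators`, and any conclusion of similarity still applies only to the design tested.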

SUGGESTIONS FOR IMPROVEMENT
On experimental design, the best suggestion for beer investigators is to perform a thorough literature search. Designing experiments without knowing previous experimental designs, their successes and flaws, will not improve our understanding of a question; it will just create confusion.
On interpretation of experiments, please refer to the REASONS FOR NEGATIVE / NULL RESULTS. Do not overinterpret results.
Cheers,

erockrph · « **Reply #1 on:** August 23, 2017, 11:27:06 am »

These are all great points. I'll add that the more narrowly an experiment is designed to target a positive result, the less value a null result from that experiment holds.

And I will also agree that this isn't a knock on the citizen science that is going on at sites like Brulosophy and Experimental Brewing. It's just not feasible for citizen science to present tightly controlled experiments to a large number of testers. For experiments like this, having a larger number of data points is like having a more powerful microscope or telescope - it allows you to see finer detail in the results.

thcipriani · « **Reply #2 on:** August 23, 2017, 11:33:32 am »

Quote from: lupulus on August 23, 2017, 10:36:17 am

Statistical outcome: None. No statistical conclusions can be drawn.
Practically it means that you should ignore the results until further information is collected.

So much this ^^^!

Put another way ( from Wikipedia: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Interpretation )

Quote

If the p-value is not less than the required significance level [...] then the test has no result. The evidence is insufficient to support a conclusion.

Great write-up and super important to keep in mind when reading the results (or non-results, as the case may be) from Brulosophy and Experimental Brewing, thanks for the literacy lesson!

charles1968 · « **Reply #3 on:** August 23, 2017, 12:25:17 pm »

One thing notably missing from the OP in this thread is that a null hypothesis is presumed true until rejected. This isn't the same as being proved true, of course - just as "not guilty" is not equivalent to being proved innocent. (As an aside, there is no such thing as proof in science - only weight of evidence.) But while an experiment that fails a significance test (and therefore fails to reject the NH) does not demonstrate the null hypothesis is true, the presumption nevertheless remains that the NH is true. If many experiments fail to reject the NH, the presumption of truth becomes stronger. If this were not the case, it would be impossible to demonstrate that ineffective drugs do not work.

One valid criticism of many brewing experiments is that the NH is often an unconventional position, e.g. fermentation temperature does not affect flavour. Ideally the NH should be the conventional belief (temperature affects flavour), and the alternative hypothesis the surprising one (temperature has no effect). The tendency of the experiments to take an unconventional position as the NH is the cause of much of the controversy around the results.

dmtaylor · « **Reply #4 on:** August 23, 2017, 12:33:57 pm »

Quote from: lupulus on August 23, 2017, 10:36:17 am

- Beer quality not very good, confounding the experimental variable (do not take offense, all investigators must consider this possibility) (this is related to both experimental design and its execution).

This is precisely what happened to my experiment in late 2016. I went through all the trouble of brewing two slightly different batches of lager, fermenting them out, gathering tasters who were completely blind to the whole thing........ but the beers were bad enough that the differences were perhaps much more obvious than they should have been. I was planning to publish results, but.... because of this very thing, beer quality, I just couldn't bring myself to bother with it.

More experiments are needed. However, I have come to terms with the (perhaps) fact that I personally am NOT the person to run any more experiments. I'm just not that great of a brewer. I've been lucky sometimes, have won a few awards, but overall my beers are just barely better than mediocre on average. But I continue to work on it. Right now I'm toying with low oxygen brewing (yes, it's true!) to see if that helps at all. My guess is it won't. But we'll soon see. But I digress.

Thanks for bringing up this topic. Just this week I've spent a LOT of hours pondering statistics, browsing a thread on HBT, and I think I finally have a fairly better-than-mediocre understanding of all the statistics and p-value stuff, false negatives and positives, etc. Way too many people are misinterpreting results, and somehow we've got to communicate better to make this stop. Maybe. If we feel like it.

Linkies for those who might be interested:

http://www.homebrewtalk.com/showthread.php?t=632633

http://editorbar.com/upload/ReBooks/2013-4/39a2699a23608d62a95ece703b059e4b.pdf

Wilbur · « **Reply #5 on:** August 23, 2017, 12:55:20 pm »

Quote from: erockrph on August 23, 2017, 11:27:06 am

These are all great points. I'll add that the more narrowly an experiment is designed to target a positive result, the less value a null result from that experiment holds.

And I will also agree that this isn't a knock on the citizen science that is going on at sites like Brulosophy and Experimental Brewing. It's just not feasible for citizen science to present tightly controlled experiments to a large number of testers. For experiments like this, having a larger number of data points is like having a more powerful microscope or telescope - it allows you to see finer detail in the results.

I'm going to have to disagree on your second point. I think that may have been true in the past, but increasingly the knowledge and equipment available mitigate this. There are still some limitations (DO meters are still very expensive, for example), but professional grade pH meters, cheap temperature control and logging, and precise brewing equipment narrow the gap between researchers and homebrewers. This does assume that homebrewers are investing in this equipment though. With large conferences like Homebrew Con/NHC, there's also a lot more opportunity for both groups to get large data sets. Take the recent first wort hopping research article: the researcher there sampled 35 participants, I believe.

bjanat · « **Reply #6 on:** August 23, 2017, 02:15:56 pm »

Good points, although my view on these common experiences is that the variable tested will most likely drown among the large number of factors that make up a drinking experience. Some are beer related, like malt intensity, hop aroma, esters, off flavors, mouthfeel, temperature and carbonation, etc. Others are related to surroundings: crowded vs quiet, people gathered after dinner or not, excited about free beers, already drunk. There are too many factors that come into play that would make it hard to pick up a small process variable, like squeezing a malt bag that might extract tannins.

Breweries with tasting panels have routines for having experienced people sit down regularly in a controlled environment and pick up small differences.

I don't think quality managers in serious breweries dismiss best practices after seeing exbeeriments. Even with the disclaimer that these should not be considered proof, they come across as bro science.

Andy Farke · « **Reply #7 on:** August 23, 2017, 05:47:24 pm »

Quote from: lupulus on August 23, 2017, 10:36:17 am

There is an abundance of literature on interpretation of experimental results but it seems that many homebrewers continue to ignore this literature and misinterpret experimental results.
<snip>
- The sample size chosen to test the beers was too small, and if larger, they would have detected a difference (insufficient power in stats jargon)

I would caution, on the flip side, that just because there is a difference, it doesn't mean that it is meaningful. If you have a sample size of 5,000,000 tasters, you're quite likely to get a significant difference, but it could be so incredibly subtle that it's not practical to worry about.
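
A quick sketch of that point, using a normal approximation to the binomial and assuming a triangle test with a 1/3 guessing rate. The counts are invented for illustration, and the function name is hypothetical.

```python
from math import erfc, sqrt

def z_test_p_value(correct: int, tasters: int, chance: float = 1 / 3) -> float:
    """One-sided p-value from a normal approximation to the binomial;
    reasonable only for large panels."""
    p_hat = correct / tasters
    standard_error = sqrt(chance * (1 - chance) / tasters)
    z = (p_hat - chance) / standard_error
    return 0.5 * erfc(z / sqrt(2))  # upper-tail probability of a standard normal

tasters = 5_000_000
correct = 1_672_000  # hypothetical: 33.44% correct versus 33.33% by guessing
print(f"proportion correct = {correct / tasters:.4f}")
print(f"p = {z_test_p_value(correct, tasters):.2e}")
```

With that many tasters, a shift of about a tenth of a percentage point above guessing clears the usual significance threshold, even though almost nobody in the room could actually tell the beers apart.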

Quote from: lupulus on August 23, 2017, 10:36:17 am

WRONG INTERPRETATION OF NEGATIVE/ NULL RESULTS
Many brewers keep interpreting null results as “The treatment studied has no or minimal effect on beer”. This is not correct. Any of the listed reasons and at various “weights” could lead to negative/ null results.

Along the same lines, I often see phrases like "almost significant". The thinking is that it is so close that if there were only a few more data points, the results would be conclusively different! Of course, this neglects that the new data could just as easily swing the results in the other direction....

Quote from: lupulus on August 23, 2017, 10:36:17 am

SUGGESTIONS FOR IMPROVEMENT
On experimental design, the best suggestion for beer investigators is to perform a thorough literature search. Designing experiments without knowing previous experimental designs, their successes and flaws, will not improve our understanding of a question; it will just create confusion.
On interpretation of experiments, please refer to the REASONS FOR NEGATIVE / NULL RESULTS. Do not overinterpret results.

If I may toot my own horn a bit, I refer interested readers to a recent blog post (including comments by lupulus), which parallels many of the points made by lupulus here: https://andybrews.com/2017/02/07/are-homebrew-experiments-scientific/

kramerog · « **Reply #8 on:** August 24, 2017, 09:17:28 am »

A thorough literature search, which is difficult for amateurs, would be helpful in designing experiments and establishing the correct null hypothesis. I've seen more than one experiment which doesn't seem to be advancing homebrewing except perhaps to rebut beliefs held superstitiously.

denny · « **Reply #9 on:** August 24, 2017, 09:50:20 am »

For me, the big thing people need to realize and remember is that a single experiment doesn't mean there's a conclusive answer. It's only an open door to more experimentation. For any experiment to have any validity at all, it needs to be repeated multiple times by other people.

pkrone · « **Reply #10 on:** August 24, 2017, 05:16:41 pm »

Quote from: denny on August 24, 2017, 09:50:20 am

For me, the big thing people need to realize and remember is that a single experiment doesn't mean there's a conclusive answer. It's only an open door to more experimentation. For any experiment to have any validity at all, it needs to be repeated multiple times by other people.

I both agree and disagree with this.

I think repeated results on your own system have validity. But the same process on a different system can have different results. That's just homebrewing: every system is a little different. If I'm doing something that works great for me, I don't really care if other people are getting the same results. My test group is my family and friends, and if they say, "Damn, this beer is really good," well, that's good enough for me.

American
Homebrewers Association
