I can think of a few reasons why the results aren't as black and white as we'd (ideally) like to see:
A) You have multiple tasters all giving their own responses to the experiment. That's like trying to weigh a series of items using a different scale each time. You will likely see a trend appear over time, but since none are calibrated to each other you won't hit an ideal level of precision across the board. A bowling ball will read fine on a bathroom scale or a fruit scale, but will max out a gram scale. An aspirin would be fine on the gram scale, but not the larger ones.
B) Each experiment tries to hold as many variables constant as possible and tests one specific variable. So the results from that experiment only apply under those specific conditions. The whirlpool temp experiment may give different results in a blond ale vs a stout vs an IPA. Maybe different varieties of hops give different results. Maybe different yeast strains would have an effect, or base malt, or fermentation profile, etc. Basically, you have to view each experiment as a starting point, or a few data points, and then add that to your own experience to decide how it is going to impact your brewing.
C) By nature, these experiments have a relatively small sample size. As you add more and more data, trends start to become clearer. Denny and Drew have taken an approach that gets more data points, but at the expense of control over the testing because of the crowdsourcing aspect. We've already seen some potential issues in the first two experiments. I applaud them for taking the right approach: calling suspect data into question and analyzing the numbers both with and without the outliers.
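To put a rough number on the sample size point, here is a quick sketch (my own illustration, not anything from the experiments themselves). It assumes a simple yes/no taster response and uses the standard normal approximation for the margin of error on a proportion. The function name and the taster counts are made up for the example.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for an observed proportion.

    Uses the normal approximation, which is only a rough guide
    for small panels. p_hat is the observed fraction, n the
    number of tasters (both hypothetical inputs here).
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical panel sizes, each with half the tasters picking one option
for n in (10, 40, 160):
    moe = margin_of_error(0.5, n)
    print(f"{n:>4} tasters: 50% +/- {moe * 100:.0f} points")
```

The takeaway is that you need four times as many tasters to cut the uncertainty in half, which is why a small panel so rarely gives a clear-cut answer.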
I was fortunate enough to have done a test of the 120F whirlpool before the most recent XB podcast was posted. I feel that I have gotten some good results from this, and the recent IGOR experiment gave me some confidence that there is at least something to this technique. That makes me want to explore it further to see how I want to apply it to my own brewing. That's all I could want and more out of an experiment like this.