This should be tested the same way as everything else - brew the beer (following the instructions exactly as written with no shortcuts or substitutions) and set up a blind triangle test.
I like the blind triangle test for any significant brewing hypothesis because I think it answers the most important question -- "can a group of people pick out the different beer more often than random chance (one in three for a triangle test) would predict?"
However, I feel there is a second question that is often not given enough attention -- "does it make a difference in terms of objective measurements?" Sensory evaluation is certainly king, but if that were all that mattered, we would simply package our beer when it tasted right and would have no need for a hydrometer. We use devices like thermometers and hydrometers because they give us consistent, repeatable results within a known range of error.
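For anyone who wants to put a number on "better than random chance", here is a minimal sketch of the usual one-sided binomial calculation for a triangle test, assuming each taster has a 1-in-3 chance of picking the odd beer by pure guessing. The function name and the panel numbers in the usage line are my own illustration, not anything from the paper or from an actual tasting.

```python
from math import comb


def triangle_test_p_value(correct: int, tasters: int, chance: float = 1 / 3) -> float:
    """One-sided binomial p-value: probability of seeing at least `correct`
    right answers out of `tasters` if every taster is purely guessing."""
    if not 0 <= correct <= tasters:
        raise ValueError("correct must be between 0 and tasters")
    # Sum the binomial probabilities for k = correct, correct + 1, ..., tasters.
    return sum(
        comb(tasters, k) * chance**k * (1 - chance) ** (tasters - k)
        for k in range(correct, tasters + 1)
    )


# Hypothetical panel (illustrative numbers only): 24 tasters, 13 pick the odd beer.
print(triangle_test_p_value(correct=13, tasters=24))
```

A small p-value (0.05 is the common cutoff) suggests the panel did better than guessing alone would explain. It does not say which beer is better or why they differ.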
Speaking more generally than just the topic at hand, I feel most confident in any practice when I can take sensory and objective measurements and I see that the evidence points in the same direction.
For example, if I forget to add acid to my mash, I will measure a higher mash pH and a higher extract yield at the end of the mash. I can add some water to fix the yield, and the rest of the process will helpfully erase most of the measurable evidence of the pH issue. However, the sensory analysis at the end tells the final story. I will end up with a beer that tastes "sharp" and with a "harsh bitterness". By adding acid to lower my mash pH, I can measure changes in both pH and extract efficiency, and then confirm that the tasting results are in agreement.
To test the hypothesis presented in the paper, one would just need to measure and record dissolved oxygen (DO) readings at various stages of the process and report that data alongside the sensory results. The control batch could be any process you want to compare against, but opting for the "roughest practical wort handling" would probably give a decent comparison (I don't know anybody who aerates their mash using an O2 wand, though that could be fun).
Now that I've written all of this, I just had an interesting thought. Maybe the authors of the paper wanted to engage Marshall and Denny in a "peer review". Maybe even get some IGORs or Brulosophy to run an experiment on this topic. Maybe they never really did any of this testing...