Jamie was at the HWT for the last week (Week 5) of the testbed (31 May - 3 June)
Notes from Day 1 overview:
- CAPS radar assimilation ended up not running, so there will be no GSD vs. CAPS comparison
 - Plan to continue the controlled ensemble experiment (CLUE, the Community Leveraged Unified Ensemble) in future HWTs - it has been successful but also had a few "lessons learned" in this inaugural year that can be improved upon next year
 - CAM hail size evaluation was a focus for this year
- Three hail algorithms: HAILCAST, direct output from the microphysics (mp) scheme (developed by G. Thompson), and a machine learning (statistical) technique (Gagne)
 
 - Ensemble sensitivity (Texas Tech work - Brian Ancell)
- Features in the flow early in the forecast that impact the ensemble response later (predictability)
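A minimal sketch of how an ensemble-sensitivity calculation like this can be done (my own illustration, not the Texas Tech code; the toy ensemble, variable names, and numbers are assumptions): regress the later response function J onto an early-forecast state variable x across the members, so dJ/dx ~ cov(J, x) / var(x). In practice this is repeated at every grid point of the early field to map where initial differences project onto the later response.

```python
import numpy as np

# Toy ensemble: one early-forecast variable per member (e.g., 500-hPa height at a point)
# and a later scalar response function J per member (e.g., area-average precipitation).
# These numbers are placeholders, not SFE2016 output.
rng = np.random.default_rng(0)
n_members = 40
x_early = rng.normal(5700.0, 30.0, size=n_members)
j_response = 2.0 + 0.05 * (x_early - 5700.0) + rng.normal(0.0, 0.5, size=n_members)

# Ensemble sensitivity: regression slope of J onto x across members,
# dJ/dx ~ cov(J, x) / var(x)
sensitivity = np.cov(j_response, x_early, ddof=1)[0, 1] / np.var(x_early, ddof=1)
print(f"estimated dJ/dx = {sensitivity:.3f}")
```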
 
 - So far they have noticed that if the CAMs don't handle overnight convection well, they have problems the next day
- There is a wide range of solutions among the CAMs in this type of pattern
 
 - Week 5 had more of a weaker-shear/multi-cell storm pattern - this is a challenge for CAMs
 - This blog entry is a good overview of what we did each day (http://springexperiment.blogspot.com/2016/05/data-driven.html#more)
 
Notes from sitting with forecasters each day
- opHRRR tends to have
- PBL too warm/dry
 - too much convection
 
 - parallel HRRR (going operational in early July)
- has been decent during HWT
 
 - 5-day MPAS
- performance is region dependent
 - strongly forced systems easier
 - general temporal/spatial coverage OK but not specific storm location
 
 - Thompson mp
- less aggressive cold pools to slow propagation (this was an intentional design choice based on feedback from previous experiments)
 - you can see the result of this in the statistics
 
 - For verification (subjective during the experiment) they used LSRs (local storm reports), WFO warnings (especially in rural areas where no reports are received), and MESH (maximum estimated size of hail - MRMS hail product)
- Forecasters generally like the MESH - seems to be pretty accurate
 
 - If we draw a 5% poly, we would want 5 reports for every 100 grid boxes (at 80-km resolution) within the area (see the rough arithmetic sketch after this group of notes)
 - General comment from HWT coordinators over the past 4 weeks
- ARW (HRRR) generally has (incrementally) better performance than NAMRR - but on cases when NAMRR is better, it tends to be much better
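Rough arithmetic behind the 5%-poly rule of thumb a few lines up (my own back-of-the-envelope sketch; the poly area below is made up, not from the experiment):

```python
# Back-of-the-envelope check of the "5 reports per 100 grid boxes" rule of thumb.
grid_spacing_km = 80.0
box_area_km2 = grid_spacing_km ** 2          # 6,400 km^2 per grid box
poly_area_km2 = 200_000.0                    # hypothetical 5% poly area
n_boxes = poly_area_km2 / box_area_km2       # ~31 grid boxes
expected_reports = 0.05 * n_boxes            # -> roughly 1.6 reports to verify
print(f"{n_boxes:.0f} boxes, ~{expected_reports:.1f} reports expected")
```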
 
 - When evaluating probabilities of 40 dBZ or greater they used observed reflectivity > 40 dBZ as the comparison field
 - In operations, NAM has poor sounding structure near convective initiation
 - Forecasters need to be aware of the CWAs (county warning areas) covered by what they issue
- Don't want to change their poly just enough to include a CWA if it wasn't in there previously (unless warranted)
 - They joke that they could put so-and-so's house in a slight risk!
 
 - There is no reward to the forecaster for keeping the poly smaller (to reduce FAR), but they are punished if the area is too small and they miss reports
- Every bust makes them draw larger polys
 - Only need a handful of reports to verify
 
 - Don't care so much about FARs
 - Hard to decrease probabilities once they are issued to the public ("Thou shalt not downgrade...")
- They tend to err on the side of too low early on to avoid this problem
 
 - How do you evaluate a hail forecast if the storms are in the wrong spot?!
 - At the start of SFE2016, this post talks a bit about CLUE (http://springexperiment.blogspot.com/2016/05/the-2016-spring-forecasting-experiment.html); the final blog entry wrapping up SFE2016 is here (http://springexperiment.blogspot.com/2016/06/sfe-2016-wrap-up.html)
 
On a few of the days they took ~2-5 minutes to show some objective statistics from the experiment
- Aggregated ROC for SFE2016 to-date (3-hrly ROC area by forecast lead time)
- Assess mixed core vs. single core - In general, the mixed (ARW+NMMB) core beats any single (ARW or NMMB) core; for single core, ARW generally beats NMMB
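A minimal sketch of how a ROC area like this can be computed from forecast probabilities and a 0/1 observed field (my own illustration, not the SFE2016 verification code; the probability thresholds and toy data are assumptions):

```python
import numpy as np

def roc_area(prob_fcst, obs_binary, thresholds=np.arange(0.05, 1.0, 0.05)):
    """Trapezoidal ROC area from forecast probabilities and 0/1 observations."""
    pod = [1.0]    # threshold -> 0: everything forecast "yes"
    pofd = [1.0]
    for t in thresholds:
        yes = prob_fcst >= t
        hits = np.sum(yes & (obs_binary == 1))
        misses = np.sum(~yes & (obs_binary == 1))
        fals = np.sum(yes & (obs_binary == 0))
        nulls = np.sum(~yes & (obs_binary == 0))
        pod.append(hits / max(hits + misses, 1))
        pofd.append(fals / max(fals + nulls, 1))
    pod.append(0.0)    # threshold -> 1: everything forecast "no"
    pofd.append(0.0)
    x = np.array(pofd)[::-1]   # POFD increasing from 0 to 1
    y = np.array(pod)[::-1]
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Toy data: probabilities loosely tied to a 0/1 "observed >= 40 dBZ" field
rng = np.random.default_rng(1)
probs = rng.uniform(0.0, 1.0, size=5000)
obs = (rng.uniform(0.0, 1.0, size=5000) < probs).astype(int)
print(f"ROC area = {roc_area(probs, obs):.2f}")
```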
 
 - When looking at FSS: the mixed (ARW+NMMB) core beats any single (ARW or NMMB) core; for single core, NMMB generally beats ARW at shorter lead times and ARW beats NMMB at longer lead times
- When they compute FSS they do the following:
- Make the obs 0/1 and apply a smoother to get continuous values between 0 and 1 in the obs
 - Apply a 40-km radius to the forecast field
 - Difference the forecast probabilities from the observations and look at the squared difference
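A minimal sketch of that FSS recipe (my own reading of the steps above: the 40-km radius comes from the notes, but the 3-km grid spacing, the square uniform filter standing in for the smoother/radius, and the toy fields are assumptions; the 1 - FBS/FBS_worst normalization is the standard Roberts and Lean definition, not spelled out in the notes):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fss(fcst_prob, obs_binary, radius_km=40.0, dx_km=3.0):
    """Fractions Skill Score following the steps in the notes above."""
    width = max(int(round(2 * radius_km / dx_km)) + 1, 1)  # neighborhood width in grid points

    # Step 1: turn the 0/1 obs into continuous 0-1 fractions with a smoother
    obs_frac = uniform_filter(obs_binary.astype(float), size=width)

    # Step 2: apply the same ~40-km neighborhood to the forecast field
    fcst_frac = uniform_filter(fcst_prob.astype(float), size=width)

    # Step 3: mean squared difference between forecast and observed fractions (FBS),
    # then normalize to get FSS
    fbs = np.mean((fcst_frac - obs_frac) ** 2)
    fbs_worst = np.mean(fcst_frac ** 2) + np.mean(obs_frac ** 2)
    return 1.0 - fbs / fbs_worst if fbs_worst > 0 else np.nan

# Toy 0/1 "obs" and a forecast probability field on a made-up 3-km grid
rng = np.random.default_rng(2)
obs = (rng.uniform(size=(200, 200)) > 0.97).astype(int)
fcst = uniform_filter(obs.astype(float), size=9)   # a smoothed-obs stand-in for a forecast
print(f"FSS = {fss(fcst, obs):.2f}")
```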
 
 
 - Does the influence of DA extend longer when looking at probabilities rather than deterministic forecasts?
 - They compared PQPF to observations by applying the same threshold to both for a single case (see the sketch at the end of these notes)
 - This blog entry has an example of the ROC curves and PQPF comparison that we looked at (http://springexperiment.blogspot.com/2016/05/clue-comparisons.html#more). I can't seem to find a link to these plots on the testbed webpage (http://hwt.nssl.noaa.gov/Spring_2016/), however.
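Finally, a minimal sketch of the kind of PQPF-vs-observations comparison mentioned above (my own illustration; the ensemble, the QPE field, and the 0.5-inch threshold are all assumptions, not SFE2016 output — in the experiment this was shown graphically for one case):

```python
import numpy as np

# Hypothetical data: ensemble QPF (members x ny x nx) and an observed QPE field (ny x nx),
# both in inches.
rng = np.random.default_rng(3)
ens_qpf = rng.gamma(shape=0.5, scale=0.4, size=(10, 120, 120))
obs_qpe = rng.gamma(shape=0.5, scale=0.4, size=(120, 120))

threshold = 0.5  # inches; apply the same threshold to forecast and obs

# PQPF: fraction of members exceeding the threshold at each grid point
pqpf = np.mean(ens_qpf >= threshold, axis=0)

# Observed exceedance field with the same threshold
obs_exceed = (obs_qpe >= threshold).astype(float)

print(f"mean PQPF           = {pqpf.mean():.3f}")
print(f"obs exceedance freq = {obs_exceed.mean():.3f}")
print(f"Brier score         = {np.mean((pqpf - obs_exceed) ** 2):.3f}")
```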