Monday, 23 January 2017

Dark Sky / Forecast.io: avoid using the temperature, humidity, or pressure values of the first hourly block

Here is yet another post about something highly specific that is totally unrelated to everything else on this blog. Enjoy!

If you rely on Forecast.io to obtain weather prediction data, you may also have implemented fancy things like a graph of how the temperature will evolve over the next few hours. Or maybe just a rising/falling trend indicator. This seems very easy to do: just take the “temperature” value of each entry in the “hourly” data block, and either plot those values or compare them. This will work, but you may notice that as the time gets nearer to the next hour change (e.g., 09:57), the first point in your graph starts wobbling around wildly every time you refresh the data. Why is this?

A plot made at 16:55, current reported temperature was 0.1°C.

The first hourly data block represents the current, ongoing hour. For instance, if you retrieve predictions at 09:57, the first block will represent 09:00. One would expect the folks at Dark Sky to rely on their historical data to fill in the values for that block, given that it represents a moment in the past. For some reason, however, they merely take the current observations (in this example from 09:57) and extrapolate them into the past using the data for the next predicted hourly block. In other words, the value for 09:00 is chosen such that a straight line from 09:00 to 10:00 passes through the current observation. For instance, if it is currently 15.12 degrees and the temperature for 10:00 is predicted to be 15.79 degrees, then the value for 09:00 will be calculated as (15.12 - 15.79*57/60)/(1 - 57/60) = 2.39 degrees.
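To make that arithmetic concrete, here is a minimal Python sketch of the back-extrapolation described above. The function name and parameters are my own invention for illustration. This reproduces the behaviour as I observed it, not Dark Sky's actual code.

```python
def back_extrapolated_value(current, next_hour, minutes_past):
    """Value at the start of the hour such that a straight line to
    next_hour passes through the current observation.

    Hypothetical helper: mimics the observed behaviour only.
    """
    frac = minutes_past / 60.0  # how far into the hour we are (0 <= frac < 1)
    if not 0.0 <= frac < 1.0:
        raise ValueError("minutes_past must be in [0, 60)")
    # Solve current = start * (1 - frac) + next_hour * frac for start.
    return (current - next_hour * frac) / (1.0 - frac)


# The example from the text: 15.12 degrees at 09:57, 15.79 predicted for 10:00.
start_of_hour = back_extrapolated_value(15.12, 15.79, 57)
print(round(start_of_hour, 2))
```

Note the division by (1 - 57/60): at 09:57 any change in the current observation is multiplied by 20, so a shift of a mere 0.01 degrees moves the 09:00 value by 0.2 degrees. That is where the wobbling comes from.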

Extrapolation works fine for data that is known to be relatively stable, provided the extrapolated point is near the interval of known data. In this case, though, only one point is an actual observation. The other one is a prediction, which is an extrapolation in its own right. This makes it doubly dangerous to extrapolate yet another value out of them, especially one that lies so far away from the two given points.

The moral of the story is: do not rely on the temperature, humidity, or pressure values from the very first hourly data point, unless the hour has only just started. If you want to make a temperature graph, it makes more sense anyway to start the plot at the current time and not at the start of the current hour. Even if at some point the people at Forecast.io update their system to really put historical data in that block, merely dumping the values from all blocks into a graph would still show visitors a curve that starts almost an hour ago when they load your page at 09:57, which is not very intuitive.
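As a rough sketch of what this looks like in practice, the following Python function builds the list of points to plot, starting at the current observation and skipping every hourly block that does not lie in the future. It assumes the parsed JSON response has a "currently" block and an "hourly" block whose "data" entries each carry a UNIX "time" next to the values. The function name is hypothetical and the sample numbers are made up.

```python
def plot_series(response, field="temperature"):
    """Return (time, value) pairs starting at the current observation.

    Assumed response shape: response["currently"] and
    response["hourly"]["data"], each with a UNIX "time" and the field.
    """
    now = response["currently"]
    series = [(now["time"], now[field])]
    for block in response["hourly"]["data"]:
        # Drop the block for the ongoing hour (and anything older).
        if block["time"] > now["time"]:
            series.append((block["time"], block[field]))
    return series


# Made-up sample: times are seconds since 09:00, "now" is 09:57.
sample = {
    "currently": {"time": 3420, "temperature": 15.12},
    "hourly": {"data": [
        {"time": 0, "temperature": 2.39},     # unreliable first block
        {"time": 3600, "temperature": 15.79},
        {"time": 7200, "temperature": 16.79},
    ]},
}
print(plot_series(sample))
```

The same function works for humidity or pressure by passing a different field name.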

The same plot as above, but now starting at the actual observation.

After figuring this out, I now start my plot with the current observation, and I calculate the trends by comparing the current observation to a linear interpolation at the current time plus one hour. Interpolating is not ideal either, but it is a lot more defensible here than a wild extrapolation.
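For the curious, the trend calculation can be sketched in Python roughly as follows. This is not my exact code: the function names are made up for this example, and the points are assumed to be (UNIX time, value) pairs sorted by time, taken from the hourly blocks.

```python
def interpolate_at(points, t):
    """Linearly interpolate a value at time t from sorted (time, value) pairs."""
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            weight = (t - t0) / (t1 - t0)
            return v0 + weight * (v1 - v0)
    raise ValueError("t lies outside the range covered by the points")


def trend(current_time, current_value, hourly_points, horizon=3600):
    """Difference between the value interpolated one horizon ahead and now.

    Positive means rising, negative means falling.
    """
    return interpolate_at(hourly_points, current_time + horizon) - current_value


# Made-up sample: times are seconds since 09:00, "now" is 09:57.
hourly = [(0, 2.39), (3600, 15.79), (7200, 16.79)]
change = trend(3420, 15.12, hourly)
print("rising" if change > 0 else "falling", round(change, 2))
```

Because the current time plus one hour always falls between the second and third hourly blocks, the unreliable first block never enters the calculation.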
