Hi All,
New evohome user here, and being a techy, of course I had to graph my temperatures, especially with winter approaching.
After looking at a few different options I've set up munin with a modified version of Inferno's munin plugin, which uses watchforstock's evohome Python library, all running on a disused Raspberry Pi.
One thing I immediately noticed is that I was getting a lot of gaps in my graphs. After a bit of debugging with some Python code calling watchforstock's library directly, I realised the issue is nothing local: approximately 20-30% of the calls to the servers are failing with HTTP 500 errors...
My first thought was rate limiting of course, however I'm only polling once every 5 minutes, and dropping the polling back to once every 10 minutes didn't help either. There is no particular pattern except that the failure rate is pretty consistent and it's always an HTTP 500 error. Another thing that suggests it's not related to rate limiting is that I can retry just a few seconds later and nearly every time it will go through. And it doesn't matter whether I wait 10 seconds or a minute to retry, the chance of success is about the same.
The symptoms seem awfully like one or more bad servers behind a load balancer - you have a statistical chance of being directed to the bad server, but an immediate retry will probably send you to a different, working server and thus succeed.
I modified Inferno's script to try a second time on failure after 20 seconds and that helped a lot, but there were still a few gaps. Even with a 3rd attempt after another 40 seconds I was still seeing the occasional gap. Of course I'm not very happy with this as a workaround.
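For anyone wanting to try the same workaround, here is a rough sketch of the retry logic described above. It is not Inferno's actual code: `poll_with_retries`, `expected_gaps_per_day` and the `fetch` callable are names made up for illustration, and `fetch` stands in for whatever call your script makes to the evohome servers. The second function is just the arithmetic behind the load balancer theory, assuming each attempt fails independently at a fixed rate.

```python
import time


def poll_with_retries(fetch, delays=(20, 40), sleep=time.sleep):
    """Call fetch(); on failure, wait and retry once per entry in delays.

    fetch is a hypothetical zero-argument callable that returns the zone
    data or raises on a server error. delays=(20, 40) mirrors the schedule
    in the post: a 2nd attempt after 20 s, a 3rd after another 40 s.
    """
    last_error = None
    for delay in (0,) + tuple(delays):
        if delay:
            sleep(delay)
        try:
            return fetch()
        except Exception as exc:  # a real script should catch only the HTTP error
            last_error = exc
    # Every attempt failed: this poll becomes a gap in the graph.
    raise last_error


def expected_gaps_per_day(failure_rate, attempts, polls_per_day=288):
    """Expected polls per day where every attempt fails.

    Assumes attempts fail independently; 288 is one poll every 5 minutes.
    """
    return polls_per_day * failure_rate ** attempts
```

Under that independence assumption, a 25% failure rate at 5 minute polling works out to 72 gaps a day with no retry, 18 with one retry and 4.5 with two, which would fit with retries helping a lot but never quite eliminating the gaps.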
On a hunch I tried switching the script to use the V1 API instead of the V2 API that it had been using, and removed the retry mechanism I'd added so that it wouldn't mask any failures. All I'm polling for is zone name, set point and temperature, and both APIs provide this, so I don't actually need to use the V2 API.
Presto - no more random server failures. So far it's succeeding every time, although I haven't given it a full 24 hours yet to see if there are any gaps in the graph.
Is anyone else noticing that the V2 API is a bit unreliable? I also notice similar unreliability in the Honeywell app on my iPhone. Although the app doesn't show errors, a manual refresh often doesn't show any change even though the data has had plenty of time to propagate from the controller, and despite my graph already showing the updated data (proving it has reached the servers). A 2nd manual refresh on the phone seconds later does show the updated data. It's as if the phone app silently ignored a failure to connect to the server...
I've also had 10 "Alert: Failed Zone Change" emails from Honeywell in just 4 days of use, in response to trying to send remote commands from the iPhone app, and this was before I ever started polling the servers to graph temperatures... It's a bit disappointing that the servers seem to be this unreliable. Without the email alert I would never know that the command I just sent from the phone to turn off the heating never actually went through, because the phone UI shows that the command succeeded. Only if you wait a few seconds and do a forced refresh (drag down) will you see that it did not.