GreenSense improvements and fixes

John CC

Aug 18, 2019, 6:13:43 PM

to sensorica-ecg, SENSORICA

Sorry to everyone for not replying to emails lately. I've been focused on improvements and fixes to GreenSense.

If you urgently need my help with something, feel free to add me on Google Hangouts and send me a message. I'll do what I can to help. I've just been a bit distracted and not paying much attention to emails.

Most of the fixes I've been making lately are for intermittent issues, which only show up occasionally or only after a particular period of time. That's why tests pass and systems seem to work, yet these issues still appear now and then. It also makes them harder to detect, and harder to confirm as fixed.

In a way it's a good sign I'm down to fixing these types of issues mostly, because the systems do generally work. I'm starting to run out of problems to fix and that's a good thing.

I figured it's time to explain some of the work I've done...

1) Email reports

For quite a while, a number of elements of GreenSense, such as the MQTT bridge (which allows communication between Arduino devices and the MQTT broker) and the plug and play system, have sent an email report whenever there was any kind of error/exception.

When you install the system you can optionally provide an SMTP server and email address to receive those error reports.

I decided a number of the other scripts involved in the GreenSense system needed this same email reporting functionality, particularly the supervisor scripts.

The supervisor script takes care of things like checking at a regular interval that all devices are online and communicating, and that the MQTT broker is online, as well as automatically upgrading the system and each device if a newer version is detected online, among other things. That auto upgrade functionality means I can roll out bug fixes and any system connected to the web will automatically upgrade itself, unless you disable auto upgrades.

I recently implemented email reporting so you'll receive an email when any of the following happens:

- When the GreenSense system gets auto updated

- When a device sketch/code gets auto updated

- When a device update/upload fails for some reason

- When a device is offline

- When a device is back online after having previously been offline

So now you should be able to install a system, configure the settings for each device, then pretty much forget about it unless you get an email error report.

Now I wouldn't recommend completely forgetting about a system. I would advise checking every now and then that it's still operating and that you've used the right settings (eg. the threshold at which the pump turns on). But it's pretty close to set and forget, until/unless you get a report of something going wrong.

If someone provides install/maintenance services then they can set up email reporting for the customer/client so if something goes wrong they get email reports instead of the customer, and can fix it before the customer even notices there's a problem.
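As a rough illustration of the "device is offline" and "device is back online" reports listed above, here is a minimal sketch of the decision involved. It assumes a report is sent only when a device's state changes between two status checks, which the actual supervisor scripts may or may not do. The enum and function names are hypothetical and not taken from GreenSense.

```cpp
#include <cassert>

// Hypothetical names: none of these come from the GreenSense scripts.
enum class Report { None, DeviceOffline, DeviceBackOnline };

// Decide which email report (if any) a status check should trigger,
// given the device's state at the previous check and at this one.
// Assumption: a report is only sent when the state changes.
Report reportForTransition(bool wasOnline, bool isOnline) {
    if (wasOnline && !isOnline) return Report::DeviceOffline;
    if (!wasOnline && isOnline) return Report::DeviceBackOnline;
    return Report::None;  // no change since the last check, so no email
}

// Small desktop check of the two transitions that produce a report.
void checkTransitions() {
    assert(reportForTransition(true, false) == Report::DeviceOffline);
    assert(reportForTransition(false, true) == Report::DeviceBackOnline);
}
```

Reporting only on a change of state avoids sending the same "offline" email at every check interval while a device stays down.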

2) Memory overflow issues

With the new "device offline" email reports I started getting device offline errors from my live systems when I shouldn't have been.

At first I improved the status check script and thought this fixed it. But that just made the offline reports less frequent. (I think the status check script actually had a bit of a flaw in it making it sometimes send false offline reports.)

I then realised that there were a few places in the device sketches which, after a while, were causing memory overflow issues and causing the devices to crash. They would work for a while, which is why they passed all the tests, but eventually once the memory overflow issue hit they would just stop.

In C every variable you declare reserves a fixed amount of memory, so an int or long has a maximum value it can hold. If a number gets too big for its type, it overflows: instead of growing, the stored value wraps around or turns negative, and any code relying on it misbehaves. That is what eventually stopped the devices.

This was fixed by changing a number of variables from "long" to "unsigned long" and making a few other tweaks to ensure no memory overflow issues occur.
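To give a feel for why such a bug only appears after weeks, here is a small sketch that works out how long a millisecond counter can run before it exceeds the maximum of its type. It assumes the affected variables held millisecond values and that "long" is 32 bits wide, as it is on common Arduino boards. Both are assumptions for illustration, and the helper name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Milliseconds in one day.
constexpr double kMsPerDay = 24.0 * 60.0 * 60.0 * 1000.0;

// Days a millisecond counter of type T can run before passing its maximum.
// int32_t stands in for a 32-bit "long", uint32_t for "unsigned long".
template <typename T>
constexpr double daysUntilOverflow() {
    return static_cast<double>(std::numeric_limits<T>::max()) / kMsPerDay;
}

// Small desktop check: the unsigned type lasts about twice as long.
void checkRanges() {
    assert(daysUntilOverflow<int32_t>() < 25.0);   // roughly 24.9 days
    assert(daysUntilOverflow<uint32_t>() > 49.0);  // roughly 49.7 days
}
```

Switching to the unsigned type doubles the usable range, and unsigned arithmetic wraps around in a well-defined way rather than going negative.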

3) Millis overflow/reset issue

Various functionality within device sketches happens at regular intervals. To make this happen it uses a bit of maths and the millis() function (which gives you the number of milliseconds since the device started running) to trigger certain code only after the specified time/interval.

The value millis() returns, though, eventually gets too big for the "unsigned long" type and resets to zero. That happens after about 50 days (roughly 49.7), which is why devices were passing tests and working for quite a while, and so it went unnoticed.

I wasn't aware of this issue until I started looking into the other overflow issues. Having fixed them I predicted the millis() function must eventually overflow (ie. get too large for the "unsigned long" type it returns) so I googled it and indeed it does, and a search showed me what to do about it so it doesn't cause any problems.

Fixing this just required altering the timing equation slightly so even when millis() resets the maths still works and the timing functionality doesn't break down.
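The exact equation isn't shown here, but a common rollover-safe form is to subtract the last trigger time from the current time and compare the difference to the interval. A minimal sketch, using uint32_t in place of Arduino's 32-bit "unsigned long" so it can be tried on a desktop (function names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Rollover-safe check: true once at least intervalMs have passed since lastMs.
// Unsigned subtraction wraps modulo 2^32, so the difference is still the real
// elapsed time even when nowMs has reset to near zero and lastMs has not.
bool intervalElapsed(uint32_t nowMs, uint32_t lastMs, uint32_t intervalMs) {
    return static_cast<uint32_t>(nowMs - lastMs) >= intervalMs;
}

// Rollover-unsafe form, for comparison: lastMs + intervalMs can itself wrap
// around to a small number, which makes the check fire too early.
bool intervalElapsedUnsafe(uint32_t nowMs, uint32_t lastMs, uint32_t intervalMs) {
    return nowMs >= static_cast<uint32_t>(lastMs + intervalMs);
}

// Small desktop check: the counter has wrapped between the two readings,
// and 512 ms have really passed (256 ms before the reset, 256 ms after).
void checkRollover() {
    assert(intervalElapsed(256u, 4294967040u, 500u));
}
```

In a sketch this would be called as intervalElapsed(millis(), lastTrigger, interval), with lastTrigger set to millis() each time the code runs.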

I still need to leave some systems running for over 50 days without resetting to double check this is fixed, but as far as I can tell it's resolved.

4) MQTT bridge code flaw

I found a flaw in the MQTT bridge code which caused it to crash occasionally. It would sometimes attempt to publish data even when the device hadn't output any.

Because it only happened occasionally it went unnoticed most of the time and passed all tests.

I fixed this issue and now it seems more reliable.
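The kind of guard involved can be sketched as a check that a line read from the device actually contains something before it is published. This is only an illustration: the function name is hypothetical, and the real bridge code may detect missing output differently.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true only if a line read from the device contains
// something other than whitespace, so there is real data to publish.
bool hasPublishableData(const std::string& line) {
    return line.find_first_not_of(" \t\r\n") != std::string::npos;
}

// Small desktop check: an empty read or a bare line ending is skipped.
void checkGuard() {
    assert(!hasPublishableData(""));
    assert(!hasPublishableData("\r\n"));
}
```

The publish step would then only run when hasPublishableData() returns true for the latest line.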

5) WiFi/MQTT disconnect issue

Occasionally WiFi or MQTT connections will drop out and code needs to detect this and trigger a reconnect.

While I managed to fix the issues with Arduino devices going offline (due to the memory overflow issues) I was still getting offline errors for the WiFi soil moisture monitor and WiFi irrigator.

Eventually I realised that their connection was dropping out occasionally and I hadn't added the code to detect this and trigger a reconnect.

I updated the code so after a disconnect it will reconnect and now I'm not getting offline error reports anymore for the WiFi versions.

I also updated the MQTT bridge and system UI controller app with the same ability to detect an MQTT disconnect and trigger a reconnect.

Now this issue seems to be fixed.
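A detect-and-reconnect step along these lines can be sketched as below. It assumes a client object exposing connected() and connect(), which is a hypothetical interface for illustration and not the actual device, bridge, or library code. The fake client exists only so the helper can be tried on a desktop.

```cpp
#include <cassert>

// Hypothetical sketch: Client is any type with connected() and connect().
// Tries to reconnect up to maxAttempts times and reports whether the
// client is connected at the end.
template <typename Client>
bool ensureConnected(Client& client, int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (client.connected()) return true;  // nothing to do
        client.connect();                     // dropped out: try again
    }
    return client.connected();
}

// Fake client for desktop testing: fails a set number of connect() calls,
// then succeeds.
struct FakeClient {
    int failuresLeft;
    bool online;
    explicit FakeClient(int failures) : failuresLeft(failures), online(false) {}
    bool connected() const { return online; }
    bool connect() {
        if (failuresLeft > 0) { --failuresLeft; return false; }
        online = true;
        return true;
    }
};

// Small desktop check: two failed attempts, then a successful reconnect.
void checkReconnect() {
    FakeClient client(2);
    assert(ensureConnected(client, 5));
}
```

On a real device this would be called from the main loop, ideally with a delay between attempts so a failed connection doesn't block the rest of the sketch.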

6) Improving WiFi device tests

A number of the automated tests for the WiFi versions of devices don't need the device to be connected to WiFi/MQTT for the test to run (such as sending serial commands via USB or checking the timing of serial output). The WiFi/MQTT connection process takes time, so waiting for the connection to complete before running each test slowed the tests down.

Also if the connection dropped out during a test sometimes the test would fail intermittently just because it threw the timing out, despite the fact the device actually was working.

The auto tests verify that the timing of every aspect of functionality is accurate (to within less than half a second margin of error) so if the timing was thrown out by a disconnect/reconnect the overall test would be considered a fail, and need to be rerun. This was incredibly annoying and slowed down development.

So I made it possible for tests to disable the WiFi/MQTT connection on a WiFi/ESP device if it's not needed, to speed up tests, and to prevent the connection causing intermittent test failures.

This change has been completed for the WiFi soil moisture monitor. Now I just need to merge those changes into the WiFi irrigator which should be a fairly easy task.

Considering I will be soon making WiFi/ESP versions of the ventilator and illuminator which will reuse a lot of the same code, it's important to get the automated tests to the point they're as reliable as possible.

These intermittent test failures didn't mean the devices didn't work. It just meant I often had to rerun the tests multiple times before they passed, all because of occasional connection dropouts throwing out timing.
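To show why a brief reconnect is enough to fail a test, here is a minimal sketch of the kind of timing check described above, using the half-second margin mentioned earlier. The function name and the example intervals are hypothetical, not taken from the actual test suite.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical check: true if the measured time differs from the expected
// time by less than toleranceMs.
bool withinTolerance(uint32_t expectedMs, uint32_t actualMs, uint32_t toleranceMs) {
    uint32_t diff = expectedMs > actualMs ? expectedMs - actualMs
                                          : actualMs - expectedMs;
    return diff < toleranceMs;
}

// Small desktop check: a 300 ms drift passes, but a reconnect that delays
// output by 2 seconds fails even though the device itself is fine.
void checkTiming() {
    assert(withinTolerance(5000u, 5300u, 500u));
    assert(!withinTolerance(5000u, 7000u, 500u));
}
```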

7) Breaking update to the plug and play system

The way the plug and play system previously identified a device was by the group name (eg. monitor, irrigator, ventilator, illuminator) and the board type.

Knowing the group name and board type it then knew what script to call to add a device, trigger an update/upload, etc.

The limitation with this is that the "monitor" group no longer contains only a soil moisture monitor. I've also created a temperature/humidity monitor and a light monitor device.

To include these devices into the index and make them plug and play compatible I needed a way to differentiate them from the soil moisture monitor device.

So I made some updates. Devices now output a "script code" value and the plug and play system will detect this and send it to the scripts which are run when a device is auto added.

This change means that devices can tell the plug and play system which scripts it needs to use to add the device, and which scripts are needed to trigger an auto upgrade.

Not only does it allow the temp/humidity and light monitor devices to be supported, it also allows slight variants in devices to be supported.

For example it would be possible for a device to support multiple types of sensors, and it can tell the plug and play system to launch the scripts which configure the device with that type of sensor. Also I can now add support for various versions of the 1602 LCD display on the embedded system UI and have the plug and play system trigger the scripts which enable support for that particular version.
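The idea of a device naming its own scripts can be sketched as a simple lookup from script code to script name. Every code and file name below is made up for illustration and does not reflect the real GreenSense values or how the plug and play system is actually implemented.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical lookup: maps the "script code" a device outputs to the
// script used to add that device. All entries are invented examples.
std::string addScriptFor(const std::string& scriptCode) {
    static const std::map<std::string, std::string> scripts = {
        {"example-soil-monitor", "add-example-soil-monitor.sh"},
        {"example-temp-humidity-monitor", "add-example-temp-humidity-monitor.sh"},
        {"example-light-monitor", "add-example-light-monitor.sh"},
    };
    auto it = scripts.find(scriptCode);
    // An empty result means the code is unknown and the device can't be added.
    return it == scripts.end() ? std::string() : it->second;
}

// Small desktop check: three devices in the same "monitor" group can now
// resolve to different scripts.
void checkLookup() {
    assert(addScriptFor("example-light-monitor") == "add-example-light-monitor.sh");
    assert(addScriptFor("unknown-code").empty());
}
```

With the old group-name-plus-board-type scheme, all three monitors would have mapped to the same script.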

It's a breaking change because it's not backwards compatible. Any system set up before this change can't auto update itself. The system needs a reinstall (using the reinstall script) and devices need their sketches manually uploaded with the latest sketch which outputs the script code.

This is why I figured it was critical to implement it now rather than later after I start sending out systems to people. Otherwise it would end up breaking live systems.

If you have a live system set up already, unfortunately you will need to perform the reinstall and manual sketch uploads now.

8) Infrastructure upgrades

I've been planning some fairly major infrastructure upgrades lately and ordered a bunch of hardware to implement them.

I'll post another email discussing what I've done with that because it's a separate topic.

9) Website based GreenSense system UI

Ken and I have been working on developing a website/web app which can be used to manage the GreenSense system and all attached devices.

I've created a static mockup using a Bootstrap template and now we are working on how to implement the backend to connect it to the system itself.

This web app will be a far better way to manage a GreenSense system than the Linear MQTT Dashboard Android app, and will provide a lot more functionality.