Spark 4 Rebooting

gmayer · July 6, 2022, 3:26pm

My Spark 4 has been rebooting at irregular intervals over the last couple of days. The firmware version is 2022-01-21, and the UI keeps telling me there is a firmware update available, but the update does not seem to apply.

It is connected over wifi, shows an address on the display, and responds to a browser with the prompt for getting a QR code, but the info block does not show an IP address.

What would be the appropriate troubleshooting steps for this issue?

Bob_Steers · July 6, 2022, 3:36pm

Could you please run brewblox-ctl log? The Spark service logs both expected and actual firmware version.

gmayer · July 6, 2022, 3:46pm

Firewall blocked first send to termbin, made change to allow out…

https://termbin.com/51bo

Bob_Steers · July 6, 2022, 4:00pm

The firmware version seems in order, but there are repeated connection resets in the Spark service.

The Tilt service has its own issues, with repeated bluetooth errors (100 Network is Down). The Tilt service restarting after every error could explain the Spark connection issues.

If you remove the Tilt service for now with brewblox-ctl service remove tilt (this will not remove tilt configuration or history data), does that stabilize the Spark? We can then hunt down the bluetooth errors.

gmayer · July 6, 2022, 4:16pm

Ugh, well should have done that when I stopped fermentation! I’ve removed the tilt service and restarted the services. Will see if this resolves the problem!

I’m hopeful. What’s weird is I’ve not used the fermentation chamber for a while and it was fine, then a couple days ago the spark4 started rebooting. Still wondering…

Thank you for the suggestion and I will remember to disable the Tilt service from now on!

gmayer · July 6, 2022, 4:17pm

Forgot to ask, any thoughts on why the UI keeps throwing a notice that there is an update for the Spark 4?

Bob_Steers · July 6, 2022, 4:44pm

Normally, the Tilt service should function just fine if the Tilt isn’t broadcasting. It’s passively listening, not making an active connection.

There are multiple steps and checks when a Spark service connects to a controller. One of them is the verification that the controller firmware matches the service firmware. The service broadcasts its current connection state to the eventbus, where the UI is listening.

I suspect that when the service loses connection and immediately reconnects, the UI gets confused by the state changes. After all, is_latest_firmware is false when the service is disconnected.
I’ll have a look at making this information more explicit. Right now the UI has to juggle too many variables to interpret the data it has (Is the eventbus connected? Is the spark service available? Is the controller connected with outdated firmware, or not connected at all?)

gmayer · July 6, 2022, 7:29pm

I will keep an eye on it and how the behavior changes. The last restart was 50 minutes ago, while I was outside. My system really isn’t doing that much, runs the kegerator and fermenter. The Pi seems underwhelmed in terms of CPU and memory usage. The Pi 4b is a little overkill for just Brewblox, but was available and my other Pi 4b is very busy with various tasks.

Seems like it would be an improvement to have event handlers manage all of that and simply update data the UI reads, rather than having the UI do it. I’ve not looked at the source, so can’t really be sure. After 35 years in IT, I’m loath to dig in.

gmayer · July 6, 2022, 7:47pm

Bob,

Here’s the latest log after the most recent restart; about 1 min ago.

https://termbin.com/b24b

Bob_Steers · July 6, 2022, 8:16pm

The service has to be aware of connection state as well (it should not be reading/writing blocks if firmware is incompatible). I’m probably just replacing various boolean flags with enums, so we can have firmware_status: 'DISCONNECTED' | 'INCOMPATIBLE' | 'OUTDATED' | 'OK'.
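As a rough illustration of the idea (not the actual Brewblox code), here is a minimal Python sketch of collapsing several boolean flags into one status value. The flag names is_connected and is_compatible are assumptions made up for the example; only is_latest_firmware and the four status names come from this thread.

```python
from enum import Enum


class FirmwareStatus(str, Enum):
    DISCONNECTED = 'DISCONNECTED'
    INCOMPATIBLE = 'INCOMPATIBLE'
    OUTDATED = 'OUTDATED'
    OK = 'OK'


def firmware_status(is_connected: bool,
                    is_compatible: bool,
                    is_latest_firmware: bool) -> FirmwareStatus:
    """Derive a single status from three flags (two of them hypothetical)."""
    # Check the connection first: while disconnected, the firmware flags
    # carry no information and must not be read as "outdated".
    if not is_connected:
        return FirmwareStatus.DISCONNECTED
    if not is_compatible:
        return FirmwareStatus.INCOMPATIBLE
    if not is_latest_firmware:
        return FirmwareStatus.OUTDATED
    return FirmwareStatus.OK


def should_offer_update(status: FirmwareStatus) -> bool:
    # A consumer such as a UI only needs one comparison
    # instead of interpreting several flags together.
    return status is FirmwareStatus.OUTDATED
```

With a single value like this, a brief disconnect yields DISCONNECTED rather than a false is_latest_firmware, so a listener has no reason to show an update notice during a reconnect.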

It’s definitely restarting the controller itself. Could you please connect it over USB, and run brewblox-ctl coredump?

gmayer · July 6, 2022, 10:17pm

Bob,

Will do, it will likely be tomorrow.

gmayer · July 7, 2022, 3:02pm

Here is the coredump:

https://termbin.com/oeh1

gmayer · July 7, 2022, 3:03pm

Should have mentioned it ran for 16 hours before restarting this time. This was taken a few minutes after the restart.

Bob_Steers · July 7, 2022, 3:59pm

Cheers. I’m between a wedding and the party now though, so will look at it tomorrow.

gmayer · July 7, 2022, 4:09pm

Sounds good! Enjoy the festivities!

Bob_Steers · July 8, 2022, 12:12pm

~~It looks like it crashed on simultaneous use of wifi and bluetooth. I believe this is part of the laundry list of fixes that are waiting for release, but I’ll look into it to be sure.~~

Correction: there was a memory allocation failure. From your log, it looks like you have 108 blocks, the vast majority of which have default assigned names (they start with New|). This is significantly more than what you’d expect for a setup that runs a kegerator and fermenter.

I extracted the blocks that were not hardware-related (not a sensor, not a GPIO module), and listed them in a quick and dirty python script that will delete them. You’ll want to run the script on your Pi (or edit the localhost in the url).

clear_blocks.py (2.6 KB)
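For readers without the attachment, the selection logic described above could look roughly like the sketch below. This is not the contents of clear_blocks.py: the block dictionaries with id and type keys, the example hardware type names, and the delete_block callback are all assumptions for illustration. The extra check on the New| prefix is also an added safety assumption. The actual call to the Spark service is deliberately left out and passed in as a function.

```python
from typing import Callable, Iterable, List

DEFAULT_NAME_PREFIX = 'New|'

# Assumed examples of hardware-backed block types; the real list may differ.
HARDWARE_TYPES = frozenset({'TempSensorOneWire', 'OneWireGpioModule'})


def select_removable(blocks: Iterable[dict]) -> List[str]:
    """Return IDs of default-named blocks that do not represent hardware."""
    return [
        block['id']
        for block in blocks
        if block['id'].startswith(DEFAULT_NAME_PREFIX)
        and block['type'] not in HARDWARE_TYPES
    ]


def clear_blocks(blocks: Iterable[dict],
                 delete_block: Callable[[str], None]) -> List[str]:
    """Delete every removable block through the supplied callback."""
    removed = select_removable(blocks)
    for block_id in removed:
        # delete_block is a hypothetical hook: plug in whatever
        # request your Spark service expects for removing a block.
        delete_block(block_id)
    return removed
```

Running select_removable on its own first gives a dry-run list to review before anything is deleted.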

gmayer · July 12, 2022, 8:50pm

Sorry I didn’t get back to you earlier, didn’t see the post! I’ve run the python script and here is the fresh core dump.

https://termbin.com/il7l

Bob_Steers · July 12, 2022, 10:12pm

It rebooted again after running the script? Did the script remove all unused blocks? (you can see them in the spark service page)

gmayer · July 12, 2022, 10:25pm

It did. I only see one now: New|TempSensorInterface-4. It was rebooting constantly throughout the day, but since the reboot right after the script ran, I’ve not heard any more. When it was rebooting, it would do it for a couple of days, then stop for a few days, and then start again. So it will take a few days to know that it has stopped.

Bob_Steers · July 12, 2022, 10:45pm

When generating the list of blocks it removed, I filtered to exclude blocks that represent hardware. From a quick check of your blocks, it looks like that’s an actual sensor, but only used by blocks that no longer exist.

Did it reboot immediately after the script ran, or minutes / hours later? When parsed, the new coredump shows the same error as before (memory allocation failure in the wifi thread).