<![CDATA[chendo]]>https://chen.do/https://chen.do/favicon.pngchendohttps://chen.do/Ghost 5.25Mon, 16 Mar 2026 22:42:09 GMT60<![CDATA[Fixing stuttering in iRacing with overlays, G-SYNC, and triple monitors]]>https://chen.do/fixing-stuttering-in-iracing-with-overlays-g-sync-and-triple-monitors/688d863708ad470001e62b49Sat, 02 Aug 2025 04:34:57 GMTThis article is provided without warranty or support, YMMV etc. However, if you are a professional racer (sim or otherwise), I'm willing to offer assistance on this issue for a fee. Email me at [gsync AT chen DOT do].

I got into sim racing a couple of months ago and have been struggling to get overlays to work performantly with iRacing. I believe I've figured out at least a bunch of the issues that can occur, and I hope this can help others.

My setup

  • AMD 9800X3D, PBO, IF 2000MHz
  • 32GB DDR5 @ 6000
  • NVIDIA 5080
  • 3x LG Ultragear 32GS75Q-B, running at 2560x1440 @ 180Hz
  • SimHub for overlays

The problem

G-SYNC was not reliably working, and this resulted in noticeable stuttering during races, especially through corners. Initially it was unclear why G-SYNC would stop working: it would work with overlays, but restarting iRacing would sometimes break it. Disabling Multi-Plane Overlays did not help here.

Using Nvidia Surround would generally fix G-SYNC, but this would break when overlays were used.

I used a couple of different tools here to help diagnose the issue, including Special K (understanding monitor MPO state), Intel PresentMon (understanding rendering mechanism), and CapFrameX (recording frametimes to visualise stuttering). I used NVIDIA Profile Inspector to enable the G-SYNC support indicator, which gives more detailed information on G-SYNC status.

Measurement of GPU wattage can cause stuttering

Any tool that measures system metrics, especially GPU metrics, can cause stuttering. People have mentioned this especially applies to reading GPU power draw. In my case, the offending tool was MSI Afterburner, which I was using to undervolt. CapFrameX visualised the stuttering quite clearly.

I was also using Nvidia's statistics overlay to render some metrics, including GPU power, which I have since disabled, but this doesn't seem to reliably cause stuttering in my case.

Integrated GPU can cause stuttering

I noticed that the Nvidia statistics overlay would show drastically different FPS and FPS 1% metrics compared to iRacing, which was a hint that something wasn't quite right, especially with FPS 1% going as low as 15fps!

Analysing iRacing frametimes with CapFrameX showed stable frametimes after resolving stuttering caused by GPU metrics, but I was still able to see stuttering, which did confuse me at first...

I noticed in Task Manager that DWM (the desktop window manager/compositor) was running on my iGPU, and realised that even though I had SimHub and iRacing using the dGPU, any composition of windows would require the dGPU to copy 7680x1440 worth of image data to the iGPU, perform composition on the iGPU, then copy the result back to the dGPU to render. I estimate a frame to be about 33MB worth of data, and at 180Hz, that's almost 12GB of framebuffer copies every second. When G-SYNC was working, the lower FPS meant fewer frames were being copied, so the stuttering was less noticeable.
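The copy overhead above can be sanity-checked with quick arithmetic (a sketch assuming 24-bit colour; at 32 bits per pixel the numbers are about a third higher):

```python
# Estimate dGPU <-> iGPU framebuffer copy traffic for triple 1440p @ 180Hz.
width, height = 7680, 1440      # 3x 2560x1440 side by side
refresh_hz = 180
bytes_per_pixel = 3             # assumption: 24-bit colour

frame_mb = width * height * bytes_per_pixel / 1e6
# Each composited frame crosses the bus twice: dGPU -> iGPU -> dGPU
traffic_gb_per_s = 2 * frame_mb * refresh_hz / 1e3

print(f"frame: {frame_mb:.0f} MB, copy traffic: {traffic_gb_per_s:.1f} GB/s")
```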

Unfortunately, it doesn't appear you can force Windows to use the dGPU for compositing. I tried adding DWM to the graphics settings and setting it to use the 5080, but this had no effect.

Disabling the iGPU fixed stuttering when compositing was required, which happens whenever overlays are used. The downside is that you are now unable to use the iGPU for anything else. I hope Windows provides a mechanism to change this in the future.

Multi-Plane Overlays

Multi-Plane Overlays are a relatively new (2018?) GPU technology that provides a mechanism for OSes and apps to render overlays reaaaaally fast. However, they can cause visual glitches, so most of the articles you find about them are about disabling them. MPOs seem pretty important to performant overlays, though, so I'm confused as to why iRacing recommends disabling them.

Interestingly enough, Special K says that DISPLAY1 supports MPO, but Display 2 and 3 don't support it. I ensured that DISPLAY1 was the center screen, as that's the screen I want overlays to be performant on.

G-SYNC and Overlays

I had some cases where G-SYNC was working (indicator was visible), but something was horribly broken, as iRacing was averaging about 30fps. It turns out the NVIDIA driver was trying to sync to both iRacing and SimHub frame updates at the same time, and this resulted in heavy stuttering.

The fix here is to ensure that the global G-SYNC options are for Fullscreen only, then use NVIDIA Profile Inspector to set iRacing to have G-SYNC support for Fullscreen and Windowed.

My monitors have an onboard FPS overlay. I used this to verify that G-SYNC was actually working, as the FPS display will fluctuate.

G-SYNC not working after sleep/wake

The final hurdle was G-SYNC not working after sleep/wake. I found some comments and articles mentioning that some monitors have issues with G-SYNC state on sleep/wake, so I tried disabling deep sleep on my monitors, which resolved the issue. My working theory is that deep sleep + wake on monitors does not properly restore G-SYNC state.

V-Sync off

I don't understand why people say to turn on V-sync in NVIDIA Control Panel but disable it in game. My understanding is that this effectively disables G-SYNC, and when I turn it on, iRacing's FPS is clamped to an integer divisor of the monitor refresh rate, in my case [180, 90, 45].
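A sketch of why the clamp lands on those values: with V-sync and no G-SYNC, a frame that misses a refresh deadline waits for the next vblank, so effective FPS snaps to integer divisors of the refresh rate.

```python
# Effective FPS under plain V-sync: a frame taking t seconds is displayed
# after ceil(t / refresh_period) refresh intervals.
import math

refresh_hz = 180
period = 1 / refresh_hz  # ~5.56ms per refresh at 180Hz

def effective_fps(frame_time_s: float) -> float:
    intervals = math.ceil(frame_time_s / period)
    return refresh_hz / intervals

for ft_ms in (5.0, 6.0, 20.0):  # illustrative frame times
    print(f"{ft_ms}ms frame -> {effective_fps(ft_ms / 1000):.0f} fps")
```

A 6ms frame just misses the ~5.56ms deadline, so it displays every second refresh: 90fps, not 167.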

Set in-game FPS limit to a smidge under your monitor refresh rate

You may see tearing if your FPS is higher than the monitor refresh rate. I set mine to 175.

Good luck!

There could be other factors at play in your setup that could cause issues.

If this article has helped you, please reference or link to this post.

]]>
<![CDATA[Using an existing DSN key in a fresh Sentry self-hosted instance]]>TL;DR: It's possible, and not too hard, but has oddities a restart will address.

Due to my self-hosted Sentry instance getting into a weird state (I couldn't get an upgrade to work due to not being able to make it past a hard-stop upgrade process), I

]]>
https://chen.do/using-an-existing-dsn-key-in-a-fresh-sentry-self-hosted-instance/67ecfad508ad470001e62acaWed, 02 Apr 2025 09:10:24 GMTTL;DR: It's possible, and not too hard, but has oddities a restart will address.

Due to my self-hosted Sentry instance getting into a weird state (I couldn't get an upgrade to work due to not being able to make it past a hard-stop upgrade process), I decided the best option forward was to build a new VM and reinstall a fresh Sentry setup. However, I wanted my existing apps out there to still report in (without updating them), which meant I had to ensure that the DSNs out in the wild would still work.

Thankfully, this wasn't too hard. Given that the Postgres database of my current instance wasn't in the right state, and I didn't have anything important in the existing install, I decided just ensuring the existing DSN URLs worked was what I wanted, rather than dealing with copying the database over.

Disclaimer: I don't work for Sentry, and this might not work for you, and I'm not providing warranty or support of the below.

This worked for me on Sentry 25.3.0.

Getting the data you need from the existing instance

Get the contents of the sentry_project and sentry_projectkey tables:

docker compose exec postgres psql -U postgres -c "select * from sentry_project;" > project.txt
docker compose exec postgres psql -U postgres -c "select * from sentry_projectkey;" > projectkey.txt

Set up new Sentry instance

Follow the instructions provided by Sentry to install this on your fresh VM.

Rebuild the projects

Now, you'll need to ensure the project ID and the public key are the same for this to work.

I used the Sentry interface to create the projects in the right order to ensure the project IDs were correct. However, I made a mistake and had to resort to resetting the sequence with SELECT setval('sentry_project_id_seq', <largest current id>);. You probably don't have to do this for sentry_projectkey, but I did.

Update the keys

To update the keys, set the public_key and secret_key to the values from the old instance (the secret key probably isn't needed, but I figured it couldn't hurt).

docker compose exec postgres psql -U postgres

UPDATE sentry_projectkey SET public_key = '<public key>', secret_key = '<secret key>' WHERE project_id = <project id>;

Restart Sentry

There's some kind of caching around the public keys, which makes sense as event ingestion is a hot path, so you'll need to restart the relevant process. I didn't know which one it was, so I restarted Sentry completely, which worked for me in the end. I did see it working without a restart after one project, but it wasn't reliable.

docker compose restart

# nginx sometimes needs a kick
docker compose restart web

# if still no go, complete restart
docker compose stop
docker compose up -d

That's it!

I hope you've found this useful.

]]>
<![CDATA[Fixing Selenium 4.x timeout errors when using multiple sessions]]>TL;DR: Set SE_NODE_MAX_SESSIONS=<N> and SE_NODE_OVERRIDE_MAX_SESSIONS=true if you're running into timeout errors when starting a new session on Selenium 4.x.

I ran into an odd issue while trying to upgrade our Selenium containers to 4.10

]]>
https://chen.do/selenium-timeout-multiple-sessions/64b8d008323f1c000130052fThu, 20 Jul 2023 08:33:03 GMTTL;DR: Set SE_NODE_MAX_SESSIONS=<N> and SE_NODE_OVERRIDE_MAX_SESSIONS=true if you're running into timeout errors when starting a new session on Selenium 4.x.

I ran into an odd issue while trying to upgrade our Selenium containers to 4.10 from 4.0.1-beta, where our Capybara-based test suite would raise Net::ReadTimeout errors that I narrowed down to tests using multiple sessions.

Turns out a change was made where the Selenium containers would limit themselves to a single session by default, so you'll need to set SE_NODE_MAX_SESSIONS to an appropriate number for your setup, and also SE_NODE_OVERRIDE_MAX_SESSIONS=true.
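In a docker-compose setup this looks something like the following (the service name, image tag, and session count are illustrative, not from my setup):

```yaml
services:
  chrome:
    image: selenium/standalone-chrome:4.10.0
    shm_size: 2gb
    environment:
      - SE_NODE_MAX_SESSIONS=4                # pick a number your hardware can handle
      - SE_NODE_OVERRIDE_MAX_SESSIONS=true    # allow exceeding the CPU-based default
```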

I still had some odd issues, but using seleniarm/standalone-chromium:114.0-20230615 worked, even though our CI is still amd64.

Good luck.

]]>
<![CDATA[Reliable background geotagging for Sony Alpha cameras]]>https://chen.do/background-geotagging-for-sony-alpha-cameras/64a0e182323f1c00013004dcSun, 02 Jul 2023 02:44:15 GMT

TL;DR: I got mad about Sony's unreliable geotagging apps, reverse engineered the protocol, and built my own: Geotag Alpha. Get the beta here.

I picked up a Sony a7 IV earlier this year to replace my aging a7R II, and I was initially pleased about the geotagging feature as I tend to shoot most of my photos while travelling, and it'll be nice to know exactly where I took them.

However, I quickly noticed that the Sony Creators' app struggled to maintain a connection to my camera, especially after I turned the camera off (to save battery) and turned it back on. Having the app in the foreground worked, but I'm not going to keep their app open in the foreground every time I want to connect.

I got real mad about it, to the point where I nerd-sniped myself the day before I was leaving for a trip to Japan and Taiwan, and set out to figure out how to build a more reliable geotagging solution. With a bunch of reverse engineering and help from posts by others who've tried to do something similar, I managed to hack together a working prototype (where the UI was the default "Hello world" screen) by 11pm that evening.

I've called it Geotag Alpha (currently in Testflight), and it improves upon the Creators' app functionality by:

  • Properly handling connecting to the camera while in the background
  • Better energy efficiency. Location updates are only pushed when the location changes or at the interval the camera needs, rather than the 7s interval Sony uses.
  • Support for geotagging multiple cameras simultaneously (coming soon!)

I've tested it with my own Sony a7 IV, and have confirmed it works on the Sony a7R IV and a7 III so far.

If you geotag your cameras and you also struggle with the official Sony Creators and Imaging Edge Mobile apps not being reliable, give Geotag Alpha a go!

Join the Testflight beta and let me know if it works for you!

]]>
<![CDATA[PMSA003I reporting 0 for all values fix]]>I have a couple of air quality sensors I've made by cobbling various Adafruit boards together (Magtag, Funhouse, ESP32-C3) that report to my Home Assistant instance for monitoring. One of these stopped working, where the PMSA003I appeared to be working (fan spinning, no errors), but was reporting 0

]]>
https://chen.do/pmsa003i-reporting-0-for-all-values-fix/6402c72b29517c00011382dbSat, 04 Mar 2023 04:28:33 GMTI have a couple of air quality sensors I've made by cobbling various Adafruit boards together (Magtag, Funhouse, ESP32-C3) that report to my Home Assistant instance for monitoring. One of these stopped working, where the PMSA003I appeared to be working (fan spinning, no errors), but was reporting 0 for all values.

I verified that the main sensor unit was the problem by switching out the StemmaQT board I have to connect it to the Adafruit boards.

I detached the unit from the board and gave it a blast of air from a Datavac duster, and it is now working again!

]]>
<![CDATA[Docker in Docker (DIND) MTU fix for docker-compose]]>If you're running into weird connection stalling issues when inside a Docker-in-Docker environment, it's rather likely MTU is the culprit. For example, when basic network connectivity works (ping works, curl example.com works) but curl to a https endpoint stalls at TLS handshake, this is likely

]]>
https://chen.do/docker-in-docker-dind-mtu-fix-for-docker-compose/63f441e129517c00011382bbTue, 21 Feb 2023 04:02:46 GMTIf you're running into weird connection stalling issues when inside a Docker-in-Docker environment, it's rather likely MTU is the culprit. For example, when basic network connectivity works (ping works, curl example.com works) but curl to a https endpoint stalls at TLS handshake, this is likely due your container unable to receive packets larger than a certain value.

Normally, the networking stack is able to discover the path MTU using ICMP; however, some endpoints choose to block ICMP, which breaks this discovery.
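For intuition on why small requests work while TLS handshakes (which carry large certificate chains) stall: only large packets exceed the path MTU, and the largest TCP payload per packet shrinks with the MTU.

```python
# Max TCP payload per packet (MSS) for a given MTU, ignoring IP/TCP options:
# MSS = MTU - IP header (20 bytes) - TCP header (20 bytes)
for mtu in (1500, 1450, 1400):
    mss = mtu - 20 - 20
    print(f"MTU {mtu} -> max TCP payload {mss} bytes")
```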

The solution is to change your container's MTU option by putting this in your docker-compose.yml:

networks:
  default: # or whatever your networks are named
    driver: bridge
    driver_opts:
      com.docker.network.driver.mtu: 1450 # You may need to lower this value further

The --mtu option passed to dockerd only affects the MTU used for pulls/pushes and does not affect containers themselves, which is rather annoying.

]]>
<![CDATA[Controlling LG MusicFlow Soundbars in Home Assistant]]>https://chen.do/controlling-lg-musicflow-soundbars-in-home-assistant/63a9569729517c0001138248Mon, 26 Dec 2022 10:46:32 GMTI wanted to automate most of the bits of playing a vinyl on my record player, and had successfully automated muting/unmuting of the IKEA Symfonisk speakers in my lounge when the record player was being used (power draw spikes to >1W during playback, and <0.3W on idle), however I wanted to automate switching the input on the LG LA855M sound bar that it's connected to as well.

I did some searching and stumbled across https://github.com/mafredri/musicflow, which is a CLI tool written in Go. It works well (although the interface is a bit eh), however I did not want to shave the yak that is getting that hooked up to Home Assistant, nor did I have the time to write a proper integration.

During my searching, I found https://community.home-assistant.io/t/sending-simple-tcp-packets/52070/2 where someone dumped the packets from the MusicFlow app and had success with using nc to send payloads.

I used tcpdump to capture the packets when I sent a command, then used Wireshark to inspect and grab the bytes I needed.

# on machine running mufloctl
sudo tcpdump -vv -XX -w portable.pcap port 9741

# in another shell, send the payload
echo '{"data":{"type":2},"msg":"FUNCTION_SET"}' | mufloctl -addr <IP of soundbar>

# ^C to end the tcpdump

Once I had the bytes, I wrote a quick script to convert the bytes to \xXX format, so I can use echo and nc to send to the soundbar.

# Switch to optical source

echo "\x10\x00\x00\x00\x30\xec\x01\x16\x5d\x79\x2c\xfc\xcd\x89\x02\x77\x39\x2f\x3a\x33\x5f\xb3\xd5\x76\x24\xad\x84\x59\x71\xa1\x8b\xcd\xbe\xd6\xae\x0f\xfa\x14\x9e\x24\xac\x48\x6a\x4f\x18\xc9\xf5\xed\x0b\x45\xd8\x4b\x85" | nc -vv <ip> 9741


# Switch to portable source

echo "\x10\x00\x00\x00\x30\xec\x01\x16\x5d\x79\x2c\xfc\xcd\x89\x02\x77\x39\x2f\x3a\x33\x5f\xac\xcb\x2f\xb1\xeb\x99\xf6\xa4\x47\xb1\xa8\x0a\x01\x13\x64\x0d\x1d\x6f\xe7\xf5\x94\x5e\xec\xc9\x2b\x47\xda\x1d\xc4\x4e\xc8\x6e" | nc -vv <ip> 9741

I was rather stoked that this actually worked! I skimmed the musicflow source code and it appears messages are JSON payloads which are AES encrypted with a static key and IV, which makes this possible.

Unfortunately, Home Assistant does not let you use pipes in shell_command, so you will need to create a script that does so, as per https://community.home-assistant.io/t/sending-simple-tcp-packets/52070/2.
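A small Python script works just as well as a shell wrapper here; a sketch using the optical-source payload from above (the soundbar IP is a placeholder, and I haven't run this exact script against the hardware):

```python
import socket

# Raw command payload captured above (switch to optical source).
OPTICAL = bytes.fromhex(
    "10000000" "30ec0116" "5d792cfc" "cd890277" "392f3a33"
    "5fb3d576" "24ad8459" "71a18bcd" "bed6ae0f" "fa149e24"
    "ac486a4f" "18c9f5ed" "0b45d84b" "85"
)

def send_command(payload: bytes, host: str, port: int = 9741) -> None:
    # Equivalent of `echo "..." | nc <host> 9741`
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)

if __name__ == "__main__":
    send_command(OPTICAL, "192.168.1.50")  # placeholder: use your soundbar's IP
```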

Hope this helps.

]]>
<![CDATA[SwiftUI @State/@Binding objects not updating in Release configuration]]>TL;DR: Reflection Metadata Level set to "Off" (potentially a default from older Xcode projects) can cause @State/@Binding to have issues around getting the variable to update.

After banging my head against the wall for a couple of hours, I managed to luck out on an issue

]]>
https://chen.do/swift-state-binding-objects-not-updating-in-release-configuration/622dccb3bffe29000112b831Sun, 13 Mar 2022 11:08:12 GMT

TL;DR: Reflection Metadata Level set to "Off" (potentially a default from older Xcode projects) can cause @State/@Binding to have issues around getting the variable to update.

After banging my head against the wall for a couple of hours, I managed to luck out on an issue that seemed extremely fucking weird and I had much hate for SwiftUI until I figured out the problem.

I was in the middle of building a release to test update mechanism on Shortcat when I noticed the onboarding screen's Next button would not appear to do anything.

The root symptom was the step binding was somehow not being set to .second, and was stuck in .first, confirmed by print statements (trusted tool) before and after:

Button(action: {
    withAnimation {
        print("before: \(step.rawValue)")
        step = .second
        print("after: \(step.rawValue)")
    }
}) {

This resulted in:

before: first
after: first

Which itself is weird as fuck, but what's weirder is it was only happening when being built for Release. For me, if a build setting is causing code to somehow have read-only variables, something is seriously cooked.

I cleaned my build folder, updated Xcode, made a new project and just put the relevant views into a fresh project, set it to Release, and... it worked fine.

What the fuck?

I tried cutting bits of code out that wasn't in the fresh project (dependencies etc) and nothing.

I had to resort to looking at what build options were different between Development and Release, and eventually happened to change the Reflection Metadata Level option to All... and it worked!?

After a quick search for swift reflection metadata level, I spotted a link that said "SwiftUI state not updated in older project" and figured that absolutely must be it.

A Google search result that had the link to the StackOverflow article.

The question was 1 year and 10 months old (at the time of writing). I gave the relevant answer an upvote, and figured I'd write a post about it so hopefully someone using the same keywords I used will also discover the answer one day.

]]>
<![CDATA[Fixing Service Topology with nginx-ingress]]>https://chen.do/using-service-topology-to-route-to-nearest-pod-in-kubernetes/60b0519c174f28000136b73bFri, 28 May 2021 02:21:36 GMTTL;DR: Service Topology wasn't working with nginx-ingress cause the Ingress needs nginx.ingress.kubernetes.io/service-upstream=true.

In order to ensure that our users have their requests serviced as fast as possible, we're moving towards a multi-region deployment where users are serviced by the nearest workloads.

Kubernetes introduced the concept of Service Topology routing in 1.17, which enables Services to define how they should be routed. My initial testing indicated that Service Topology worked as expected when using the Cluster IP of the Service; however, nginx-ingress appeared to ignore Service Topology and would do its usual round-robin behaviour.

I stumbled across a StackOverflow thread which explained that nginx-ingress looks at the Endpoint structures by default and thus bypasses the Service Topology mechanism provided by Service.

The magical fix for this was to add an annotation to the Ingress in question: nginx.ingress.kubernetes.io/service-upstream=true.

This was all that was needed for Service Topology to work.

]]>
<![CDATA[Fixing the bugged elevator for Nocturne OP55N1 in Cyberpunk 2077]]>https://chen.do/cyberpunk-2077-bugged-elevator-nocturne-op55n1/5ff461b5174f28000136b64cTue, 05 Jan 2021 13:52:43 GMT

This post describes how I managed to fix a friend's save where he was trying to meet Hanako at Embers, but the elevator/lift guarded by two bouncers is marked "Off" and he couldn't enter it. I've heard of another issue where the lift is open, but the bouncer stops you from entering. I'm not sure if this helps with that issue.

Please note that I am not providing support for this issue, and I am not responsible for any damage this can cause to your save game/computer/etc. Check out the CP77 Modding Tools Discord if you need help.

You need a recent CyberEngineTweaks installed and working (see other resources for how to install it, as that's beyond the scope of this post), as this workaround requires poking at game state in the console that CET provides. The instructions below were tested on v1.06 of the game and a CET build newer than the 5th of Jan, 2021.

  1. Go to the Embers lift with the two bouncers, and trigger the Point Of No Return dialog by walking up to the lift.
  2. Open the CET Console with tilde (`) or whatever it is on your keyboard
  3. Teleport inside the lift with: Game.TeleportPlayerToPosition(-1794.316040,-535.862915,10.11386)
  4. Look at the elevator control panel
  5. Run ts = Game.GetTargetingSystem(); lift = ts:GetLookAtObject(Game.GetPlayer(),false,false):GetDevicePS(); print(lift:GetDeviceState())
  6. The console should display EDeviceStatus : (4294967295)
  7. Look away from elevator control panel
  8. Run lift:SetDeviceState(1)
  9. Look at control panel, and it should update and allow you to go to the Embers floor.

Things that didn't work

  • Thanks to @SirBitesalot's fact dump, I found a q115_embers_elevator_unlocked flag which I was certain was the culprit, as it was not set in my friend's file. However, this didn't fix the issue.
  • Teleporting into Embers itself: NPCs were there, but did not react. I believe there's a trigger when the elevator doors open.
  • Gradually teleporting up the lift well: This does trigger a conversation with Johnny, but doesn't seem to progress further, even teleporting into Embers after that.

Thanks to the modding community for making this remotely possible. No thanks to C++ which still doesn't have a nice way to join an array of strings with a delimiter in its standard library.

]]>
<![CDATA[Mitigating Octoprint print quality issues with BufferBuddy]]>This is a continuation of my deep-dive into understanding print quality issues when printing over USB with Octoprint, and adding buffer monitoring to Marlin.

Now that I had an objective mechanism to measure planner underruns, which we know is the likely cause of print quality issues, we can attempt to

]]>
https://chen.do/mitigating-print-quality-issues-with-bufferbuddy/5f94b7b0174f28000136b3f7Tue, 27 Oct 2020 01:03:53 GMT

This is a continuation of my deep-dive into understanding print quality issues when printing over USB with Octoprint, and adding buffer monitoring to Marlin.

Now that I had an objective mechanism to measure planner underruns, which we know is the likely cause of print quality issues, we can attempt to mitigate the issue.

My previous investigation with Octoprint's comm.py, which manages the communication with the printer, indicated that the default behaviour is to wait for an ok from the printer before sending the next command, which means the command buffer is usually empty by the time the next command reaches the printer.

Generally, the planner buffer is kept full and does not underrun in the scenarios I have tested (apart from curves on Cura 4.7.1); however, any delay introduced by e.g. CPU load or resends (noise on the USB cable) can cause planner buffer underruns.

Making Octoprint send multiple commands inflight

The core algorithm to keep the command buffers full is as follows:

  • Check if the printer is reporting available capacity in the command buffer
  • Trigger Octoprint to send more commands

So at the minimum, we need a way to detect the available capacity in the command buffers, and a mechanism to trigger Octoprint to send more commands. We know we can use Marlin's ADVANCED_OK output for understanding the available buffer capacities, so we need to figure out a mechanism to trigger sends.

Octoprint's core logic for sending commands is inside comm.py's _send_loop, which runs in a thread, checks _send_queue for something to send, and once a command has been sent, waits on _clear_to_send.wait(), a CountedEvent / mutex mechanism that lets other threads tell the loop that it's cool to send another command.

_clear_to_send.set() in another thread is ultimately what causes the next command to be sent, so it looked like a good mechanism to start.

I wanted to add this buffer-filling functionality as a plugin because I feel a tad uncomfortable with introducing significant changes to comm.py, so I began with a rough naive plugin that inspects the ADVANCED_OK output and calls _clear_to_send.set() if there's capacity.

Turns out, this was too naive: the plugin would react to responses that did not reflect the current state of the buffers, and it didn't know which lines it triggered, so it would cascade into serial buffer overruns extremely quickly. I also discovered that the ok buffer size, which determines the maximum _clear_to_send can count up to, needs to be at least 2, otherwise calling _clear_to_send.set() won't do anything if _clear_to_send is already at 1.

My next attempt kept track of the number of commands inflight and used this to determine whether _clear_to_send.set() should be called, and added a minimum delay between triggered sends. This worked pretty well considering the amount of code required:

import re
import time

import octoprint.plugin

ADVANCED_OK = re.compile(r"ok (N(?P<line>\d+) )?P(?P<planner_buffer_avail>\d+) B(?P<cmd_buffer_avail>\d+)")

class BufferMonitorPlugin(octoprint.plugin.StartupPlugin):
    # ok buffer must be above 1
    def __init__(self):
        self.bufsize = 4
        self.max_inflight = self.bufsize
        self.last_cts = time.time()
        self.last_sent_line_number = 0

    def on_after_startup(self):
        self._logger.info("Hello World!")

    def gcode_sent(self, comm, phase, cmd, cmd_type, gcode, *args, **kwargs):
        # Track the last sent line number so we can compute how many are inflight
        self.last_sent_line_number = comm._current_line

    def gcode_received(self, comm, line, *args, **kwargs):
        if "ok " in line:
            matches = ADVANCED_OK.search(line)

            # Not an ADVANCED_OK response, or no line number to compare against
            if matches is None or matches.group('line') is None:
                return line

            line_no = int(matches.group('line'))
            buffer_avail = int(matches.group('cmd_buffer_avail'))
            inflight = self.last_sent_line_number - line_no

            if inflight > self.max_inflight:
                # too much in flight, scale it back a bit
                comm._clear_to_send.clear()
                self._logger.info("too much inflight, chill a bit")
                self._logger.info("Buffer avail: {} inflight: {} cts: {}".format(buffer_avail, inflight, comm._clear_to_send._counter))

            if buffer_avail >= 1 and (time.time() - self.last_cts) > 0.5 and inflight < self.max_inflight:
                self._logger.info("sending more")
                queue_size = comm._send_queue._qsize()
                self._logger.info("Buffer avail: {} inflight: {} cts: {} queue: {}".format(buffer_avail, inflight, comm._clear_to_send._counter, queue_size))
                self.last_cts = time.time()
                comm._clear_to_send.set()

        return line

I was able to confirm with buffer monitoring via M576 that it fills the buffers and decreases underruns. Graphs further below.

Note: I discovered that my Ender 3 v2 takes at minimum 9ms to respond to a command, measured by inspecting Octoprint's serial.log. This means the default behaviour of waiting for an ok before sending the next command was capped at ~111 commands a second, and I know that my Cura 4.7.1-sliced 3DBenchy will easily spike to 160 commands a second, which will most definitely cause underruns.
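The ceiling quoted above falls straight out of the measured latency:

```python
# Throughput cap of the one-command-per-ok protocol, given round-trip latency.
min_latency_s = 0.009    # measured minimum command -> ok time on the Ender 3 v2
max_cmds_per_s = 1 / min_latency_s
needed_cmds_per_s = 160  # observed spike for the Cura 4.7.1-sliced 3DBenchy

print(f"cap: {max_cmds_per_s:.0f} cmd/s, needed: {needed_cmds_per_s} cmd/s")
# Whenever demand exceeds the cap, the planner buffer must drain and underrun.
```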

Making a plugin in Octoprint

Now that the core idea has proven to be workable, I began refining the plugin logic and testing with both dry-run prints and real prints to determine reliability.

I followed the Octoprint plugin guide and developed the plugin against a Docker container on my local machine for speed, but ran into different behaviour with the Virtual Printer that comes with Octoprint, so I could only really test the plugin against my Ender 3 v2 on my Octopi setup.

I called the plugin "BufferBuddy" after some deliberation cause the working title "buffer-filler.py" sounded a bit shit.

I ran into issues getting the plugin to load initially which eventually turned out to be a Python version specification issue. Once this was sorted, I was able to copy across the core logic from my prototype and began fleshing it out.

Understanding how to implement a UI took probably three times as long as implementing the core logic, which was hampered by needing to restart Octoprint to see changes. However, I eventually managed to make the UI behave the way I wanted!

Introducing BufferBuddy

It's probably easiest to explain the impact of the plugin with graphs.

The graphs below are from graphing M576 output during a print of a 50% scale 3DBenchy sliced using Cura 4.6.2 and 4.7.1, and with BufferBuddy enabled/disabled, with the leading underrun artifacts caused by the purge line removed for clarity.

My printer is an Ender 3 v2 running Smith3D's Marlin fork which has improvements for the Ender 3 v2's LCD, with my own patches for M576, with BUFSIZE=16, BLOCK_BUFFER_SIZE=16, USART_RX_BUF_SIZE=64, USART_TX_BUF_SIZE=64.

Cura 4.6.2, BufferBuddy disabled

The graphs indicate that we consistently see command buffer underruns, but only see ~6 instances of planner buffer underruns, where the maximum detected period that the planner buffer was empty was under 50ms.

Cura 4.6.2, BufferBuddy enabled

With BufferBuddy enabled, command buffer underruns are mostly eliminated, with 13 instances of command underruns during print, and one planner buffer underrun with a max underrun period of 9ms.

BufferBuddy output for Cura 4.6.2 print. This includes underruns from the starting gcode, which I'm not sure how to best remove from the above statistics.

For Cura 4.7.1, which we know produces problematic gcode, it's another story.

Mitigating Octoprint print quality issues with BufferBuddy
Cura 4.7.1, BufferBuddy disabled

It shows severe planner underruns, with delays easily exceeding 75ms. The command rate easily surpasses 150 per second, which is above the maximum throughput of ~111 per second we calculated earlier by measuring command/ack latency.
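That ~111 per second ceiling falls straight out of the command/ack round trip; with send-on-ok flow control, one command per round trip is the best case (the 9ms figure here is illustrative):

```python
def max_throughput(ack_latency_ms: float) -> float:
    """With send-on-ok flow control, one command per round trip is the ceiling."""
    return 1000.0 / ack_latency_ms

# A ~9ms command/ack round trip caps us at ~111 commands per second,
# so gcode demanding 150+ commands per second can never keep up.
```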

Mitigating Octoprint print quality issues with BufferBuddy
Cura 4.7.1, BufferBuddy enabled

BufferBuddy eliminates most of the planner buffer underruns, but there are still command buffer underruns due to the sheer gcode density, although it halves the maximum time the command buffers remain empty.

Mitigating Octoprint print quality issues with BufferBuddy
BufferBuddy output for Cura 4.7.1 print. This includes underruns from the starting gcode, which I'm not sure how to best remove from the above statistics.

Uploading to SD appears to behave differently with respect to having multiple lines in flight for more throughput, as the command buffer never gets filled, and it seems more dependent on the serial RX buffer size, which we can't easily detect, so this needs more work.

Actual Print Quality

Now that we know BufferBuddy significantly mitigates planner underruns, we can see whether it resolves the print quality issues with Cura 4.7.1 on an actual print.

Turns out Octoprint can detect when I print from SD via the Ender 3's interface, so I was able to get a "control" print with a best-case scenario for buffer filling.

Mitigating Octoprint print quality issues with BufferBuddy
Cura 4.7.1, printing off SD card, with Commands Processed added in cause otherwise it would be a pretty boring and flat graph.

Printing from SD showed essentially zero planner and command underruns during the print (a tiny smidge at the end, probably from built-in print completion commands), but commands per second peaking above 300 shows that it's going to be extremely hard to keep buffers filled over serial due to latency when the gcode is this dense.

Mitigating Octoprint print quality issues with BufferBuddy
Cura 4.7.1 sliced 3DBenchy at 50%, printed over USB with BufferBuddy active
Mitigating Octoprint print quality issues with BufferBuddy
Cura 4.7.1 sliced 3DBenchy at 50%, printed directly off SD card

Printing off SD is a little bit better in the middle, but still exhibits over-extrusion on curves. The motors seem to make odd sounds during these curves, which is likely related.

For comparison, this is a 3DBenchy from Cura 4.6.2:

Mitigating Octoprint print quality issues with BufferBuddy
Cura 4.6.2 sliced 3DBenchy at 50%, printed over USB with BufferBuddy active.

Putting aside that iPhone cameras aren't amazing at capturing the detail I want (handheld, at least), the Cura 4.6.2 sliced Benchy looks pretty good by comparison still.

Summary.. so far

So, it looks like BufferBuddy doesn't fully address the Cura 4.7.1 problem, but it still mitigates potential planner underruns by keeping the command buffers full, which should help against the occasional blip of load on lower-powered devices.

Want to check out the plugin? It's on my GitHub, but it's still considered experimental and may cause your printer to lock up.

]]>
<![CDATA[Adding buffer monitoring to Marlin]]>This is a sequel of my post about diagnosing 3D print quality when printing with Octoprint.

Now that we have a working theory for the reduced print quality, the next step was to know for sure when the problem was occurring. If I could have the printer tell me when

]]>
https://chen.do/adding-buffer-monitoring-to-marlin/5f8163c80d35f400018e061cSat, 10 Oct 2020 09:53:31 GMT

This is a sequel of my post about diagnosing 3D print quality when printing with Octoprint.

Now that we have a working theory for the reduced print quality, the next step was to know for sure when the problem was occurring. If I could have the printer tell me when it has no more instructions, it would be the first step towards objectively measuring how often the issue was occurring.

I dived into the source code for Marlin to begin understanding where I can hook into the relevant events, and what kind of metrics I could easily extract.

I consider myself mediocre with C/C++ at best, and extremely rusty; embedded C/C++ also has its own shenanigans, so I may be wrong about how stuff works.

Understanding what to change

Marlin's core loop is as follows:

/**
 * The main Marlin program loop
 *
 *  - Call idle() to handle all tasks between G-code commands
 *      Note that no G-codes from the queue can be executed during idle()
 *      but many G-codes can be called directly anytime like macros.
 *  - Check whether SD card auto-start is needed now.
 *  - Check whether SD print finishing is needed now.
 *  - Run one G-code command from the immediate or main command queue
 *    and open up one space. Commands in the main queue may come from sd
 *    card, host, or by direct injection. The queue will continue to fill
 *    as long as idle() or manage_inactivity() are being called.
 */

I traced BUFSIZE to queue.cpp, where the logic for reading off serial comms and the command queue is handled. This is how I think it works:

  • idle() (this seems poorly named?) calls manage_inactivity(), which in turn calls queue.get_available_commands() if there's enough room in GCodeQueue::command_buffer, which has at most BUFSIZE elements.
  • queue.get_available_commands() pulls data from serial / SD card, performs basic parsing, checksum validation, early handling
  • If it's all good and it doesn't need to be handled early, it chucks it into the command_buffer ring buffer via _enqueue, with the say_ok flag for that index set to true.
  • The say_ok flag is set in another array and not immediately sent to the host, which is not what I expected.
  • Other core tasks are run, such as timer checks, UI, auto-reporting, etc
  • Finally, queue.advance() is called, which invokes process_next_command(), which parses and executes the command
  • process_parsed_command() is a behemoth of a switch statement, which figures out what function to run based on the gcode
  • Once it runs the relevant function, by default, it will call queue.ok_to_send(), which then sends the ok back to the host that Octoprint is waiting for.
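The enqueue/advance flow above can be modelled in a few lines; this is an illustrative Python toy, not Marlin code, and it deliberately ignores serial parsing, checksums, and the immediate queue:

```python
from collections import deque

class CommandQueue:
    """Toy model of Marlin's GCodeQueue: a fixed-size ring buffer where
    the ok is sent only once the command is processed, not when enqueued."""

    def __init__(self, bufsize=4):
        self.bufsize = bufsize
        self.buffer = deque()
        self.sent_oks = 0

    def enqueue(self, command):
        # _enqueue: only accepts the command if there's room
        if len(self.buffer) >= self.bufsize:
            return False
        self.buffer.append(command)  # say_ok recorded, but not sent yet
        return True

    def advance(self):
        # process_next_command + ok_to_send
        if not self.buffer:
            return None  # command buffer underrun
        command = self.buffer.popleft()
        self.sent_oks += 1  # the ok goes back to the host here
        return command
```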

While trying to understand how the core loops worked, I spotted the ADVANCED_OK block that exposes the planner and command buffer capacity, which were planner.moves_free() and BUFSIZE - [queue.]length, which is a great start to report.

The problem with using the ADVANCED_OK report to understand buffer underruns is that it can only report the state of those buffers at the moment the ADVANCED_OK is sent. We can infer, if it returns B(BUFSIZE - 1), that the command buffer was empty before we sent the command, but we don't get much other information from this.

I considered adding more instrumentation to the ADVANCED_OK response, however it would increase serial comm load on both ends cause it's sent on every command, so I figured I needed to add my own gcode that could report the data I wanted, optionally on an interval.
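For reference, extracting those numbers from an ADVANCED_OK response (format `ok N<line> P<planner free> B<command free>`) might look like this on the host side; the B == BUFSIZE - 1 inference is the one described above:

```python
import re

ADVANCED_OK_RE = re.compile(
    r"ok (?:N(?P<line>\d+) )?P(?P<planner_free>\d+) B(?P<command_free>\d+)"
)

def parse_advanced_ok(line, bufsize=16):
    """Return (planner_free, command_free, was_probably_empty) or None."""
    m = ADVANCED_OK_RE.search(line)
    if not m:
        return None
    planner_free = int(m.group("planner_free"))
    command_free = int(m.group("command_free"))
    # B == BUFSIZE - 1 right after our command was taken implies the
    # command buffer held nothing else when our command arrived.
    return planner_free, command_free, command_free == bufsize - 1
```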

Implementing a buffer monitoring gcode

I looked at the existing gcode list to see if there was anything similar to what I wanted to expose already, but there didn't appear to be any I could easily extend. I decided to use M576, as M575 was "Set baud rate", which was somewhat relevant to buffer monitoring.

I used the auto temperature reporting module as a base, and hooked into GCodeQueue::advance() for my logic. A few iterations later (each of which annoyingly required flashing via microSD), I had a working M576 command that returns M576 P<nn> B<nn> U<nn>, where:

  • P is planner buffer available
  • B is command buffer available (both from ADVANCED_OK)
  • U is number of command buffer underruns since last report

It also supported M576 S<n> where n is the number of seconds between automatic reports.

Testing the concept

I ran it through a dry-run version of a half-scale 3DBenchy gcode, where all the extrusion instructions were stripped out, and combined with a simple Octoprint plugin I had hacked together, I was able to observe the output of my newly minted gcode command!

Initial findings showed that there were many underruns when printing through Octoprint, anywhere from 5 to 30 per second. I realised it's not the number of buffer underruns that matters, but how long the queue sits in an empty state.

Iterating and adding more metrics

I added M<nn> to represent the maximum time in milliseconds that the command buffer was empty between commands. This gave much more useful information on how long Marlin may be waiting for a command.

However, I noticed that even when the max buffer empty time was as high as 100ms, the planner buffer generally remained full, which means the printer would still have movement queued, so the gap wouldn't actually manifest as stalled motion.

I decided to also add planner buffer underrun metrics. It was harder to figure out where to make this change: ideally we would detect whether the queue was empty immediately after it was processed, but that logic lives in a dedicated stepper ISR (an interrupt handler) and modifying it seemed like a bad idea, so I settled for hooking into auto_report_buffer_statistics, which runs in a fairly tight loop.
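The bookkeeping itself is simple; a sketch of the underrun and max-empty-time logic in Python (the real version is C++ inside Marlin, and the names here are mine):

```python
import time

class UnderrunTracker:
    """Track underrun count and the longest continuous empty period."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.underruns = 0
        self.max_empty_ms = 0.0
        self._empty_since = None

    def sample(self, buffer_length):
        """Call from a tight loop with the current buffer length."""
        now = self.clock()
        if buffer_length == 0:
            if self._empty_since is None:  # just became empty: new underrun
                self._empty_since = now
                self.underruns += 1
            empty_ms = (now - self._empty_since) * 1000.0
            self.max_empty_ms = max(self.max_empty_ms, empty_ms)
        else:
            self._empty_since = None

    def report_and_reset(self):
        """Counters reset each report, like the 'since last report' fields."""
        stats = (self.underruns, round(self.max_empty_ms))
        self.underruns, self.max_empty_ms = 0, 0.0
        return stats
```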

I changed the output to the following:

 * When called, printer emits the following output:
 * "M576 P<nn> B<nn> PU<nn> PD<nn> BU<nn> BD<nn>"
 * Where:
 *   P: Planner buffers available 
 *   B: Command buffers available
 *   PU: Planner buffer underruns since last report
 *   PD: Maximum time in ms planner buffer was empty since last report
 *   BU: Command buffer underruns since last report
 *   BD: Maximum time in ms command buffer was empty since last report

Now we can tell when and how long the motion planner buffer was empty for!

Testing methodology

To compare, I sliced a half-scale 3DBenchy with the same custom profile in both Cura 4.6.2 and Cura 4.7.1, as I know from personal experience that Cura 4.7 introduced a bug where it generates extremely dense gcode around curves.

Adding buffer monitoring to Marlin
Gcode size comparison between Cura 4.6.2 and 4.7.1.

Cura 4.7.1 generates as much as double(!!) the amount of gcode for the same model and settings, which will definitely cause issues if Octoprint can't stream the gcode to the printer at the speed it needs to be processed to print as designed.

I ran it through my dry-run.rb and "printed" these with M576 reporting every 2 seconds and chucked it into a log file for later processing.

Adding buffer monitoring to Marlin
Some output from the Cura 4.7.1 gcode.

I processed the logs into CSV and chucked it into Numbers to visualise the difference.

The results

Adding buffer monitoring to Marlin
Cura 4.6.2
Adding buffer monitoring to Marlin
Cura 4.7.1

The key metric here is Planner Max Empty time, represented in red and is in milliseconds, followed by Planner Underruns in yellow as a count.

In Cura 4.6.2, we see only 9 instances of planner buffer underruns, and the max time the buffer was empty was 36ms.

However, 4.7.1 is a whole 'nother story, with 200+ planner buffer underruns.

Adding buffer monitoring to Marlin
Histogram of planner max empty time, with zero values removed.

I'm not sure of the threshold at which an empty motion planner buffer results in visible print artifacts, but I feel like anything over 50ms is noticeable. Theoretically I should be able to test this by injecting pauses, but that's for another time.

Next steps

Now that we have hard data on when and for how long the motion planner buffer underruns, we can attempt to address the issue and measure whether it helped.

I'll be opening a pull request for the M576 gcode into Marlin.

Stay tuned for the next one!

]]>
<![CDATA[Diagnosing reduced 3D print quality when printing with Octoprint]]>Update: See my post on Adding buffer monitoring to Marlin.

I recently picked up a Creality Ender 3 v2 to finally check out 3D printing. Initially, I was very impressed with the quality of my prints for a budget printer, however, at some point during tinkering, I noticed zits on

]]>
https://chen.do/diagnosing-reduced-print-quality-with-octoprint/5f6c5d5c0d35f400018e04b6Fri, 09 Oct 2020 12:21:06 GMT

Update: See my post on Adding buffer monitoring to Marlin.

I recently picked up a Creality Ender 3 v2 to finally check out 3D printing. Initially, I was very impressed with the quality of my prints for a budget printer, however, at some point during tinkering, I noticed zits on my prints, specifically around curves.

I eventually tracked this down to a combination of printing over USB in Octoprint vs printing from SD card, as well as switching from Creality's slicer (Cura 4.2) to Cura 4.7.1.

A bug (#8321) was introduced in Cura 4.7 where it adds loads of tiny segments on curves, which generates far more gcode for the same curve. This, combined with how Octoprint streams gcode over the USB serial connection, causes the printer's buffers to empty, leading to brief pauses as it waits for more gcode. These pauses manifest as zits on the surface of prints, as residual pressure in the nozzle keeps extruding.

Even though printing via SD card is likely to resolve the issue, it removes a lot of the convenience and power of printing with Octoprint, such as the ability to selectively cancel certain regions of your print, saving time and reducing waste. Cura's segment issue is likely to be addressed at some point, which will reduce the likelihood of degraded print quality simply because less gcode will be generated; however, the issue with Octoprint's gcode streaming still exists, and has been reported since 2014.

What causes print artifacts when printing over USB?

There are a couple of factors at play that can affect the streaming of gcode to the printer when using Octoprint, especially on embedded-class devices like Raspberry Pis.

  • CPU load: If the process in Octoprint that's responsible for streaming gcode doesn't get to run due to load on the system, then the printer is likely to be waiting for additional instructions, causing movement stutter and print artifacts. This can occur when just loading the Octoprint interface. There is a reason why stuff like plugin management and timelapse processing is prevented while a print is in progress.
  • USB cable: I was using a longer cable I had lying around which occasionally would have disconnect issues, which was extremely noticeable when I experimented with Klipper (more below). Switching to a shorter cable resolved this particular issue.
  • Communication speed: It's possible that the gcode is simply too dense for it to be communicated over the USB/serial connection at a rate that it's needed to be processed by the printer to perform the print as desired.

Mitigations

After learning the issue is likely to do with gcode density around curves, I tried to use the ArcWelder plugin which makes gcode smaller by converting the straight G0/G1 commands into G2/G3 arc commands, which can reduce the resulting gcode by as much as 80%!
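The heart of that conversion is fitting circles through runs of tiny segments; a much-simplified sketch of that one geometric step (ArcWelder itself does considerably more, such as tolerance checks and arc validation):

```python
def circumcenter(a, b, c):
    """Centre of the circle through three 2D points (None if collinear)."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None  # collinear points: no arc, keep them as G1 moves
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return ux, uy

def as_g2(start, end, centre):
    """Emit a clockwise arc; I and J are the centre offset from the start."""
    i, j = centre[0] - start[0], centre[1] - start[1]
    return f"G2 X{end[0]:.3f} Y{end[1]:.3f} I{i:.3f} J{j:.3f}"
```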

However, the stock firmware I was using did not have ARC_SUPPORT enabled, so I gave Klipper a shot. The idea of Klipper is to move the motion calculation off the printer's (usually underpowered) microcontrollers, and on to a more powerful device like a Raspberry Pi (compared to a microcontroller, of course). It supplies its own firmware for the printer, which takes a compressed data stream that in turn tells it how to do what to its various stepper motors. Klipper then exposes its own serial port to Octoprint which receives gcode commands.

Using Klipper noticeably improved print quality for me, even with Cura 4.7.x; however, the printer display was just blank, and not being able to control the printer with its built-in controls was a significant negative. I also ran into severe extruder chattering, which caused filament grinding, when using its pressure advance feature to better handle corners.

Marlin has a DIRECT_STEPPING option which is similar to how Klipper works, however it recommends 250k-500k baud and it doesn't seem to be used very much just yet.

My BLTouch bed levelling sensor eventually arrived, so I followed Smith3D's Ender 3 v2 BLTouch guide and used their fork of Marlin, which has some nice improvements.

I downgraded Cura to 4.6.2 for the time being, which resolved most of my print quality issues.

However, the core issue of potential print artifacts due to printing over USB still remain, and I'm not willing to give up the convenience of Octoprint, so I investigated further.

Ideally, the code responsible for streaming to the printer should run at a much higher priority to ensure it gets scheduled. It appears Octoprint's sending_thread in comm.py is run as a daemon thread, which I initially got excited about cause it sounded like it would run as a separate process (somehow?), meaning we could just renice it to a higher priority, but that's not what it does.

My limited understanding of comm.py seems to indicate that Octoprint sends the next command to the printer once it's received an ok from the printer, which happens when Marlin commits the command into its ring buffer. The size of the ring buffer is defined by BUFSIZE, and it's usually set to 4 for most configurations.

This seems pretty small to me, but if Octoprint isn't filling the buffer reliably, then increasing BUFSIZE should decrease the likelihood of underruns without mitigating them completely.
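It's easy to put rough numbers on that intuition; a toy calculation (the per-move time here is purely illustrative) of how much host-side stall a full command buffer can absorb:

```python
def stall_tolerance_ms(bufsize: int, avg_move_time_ms: float) -> float:
    """How long the host can stall before a full command buffer drains.

    Once the buffer empties the printer pauses, so a bigger BUFSIZE buys
    more slack but can't eliminate underruns if the host stalls longer.
    """
    return bufsize * avg_move_time_ms

# Illustrative: with dense gcode executing at ~5ms per move, BUFSIZE=4
# covers only a ~20ms host stall, while BUFSIZE=16 stretches that to ~80ms.
```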

There is a WIP pull request that enables parsing ADVANCED_OK to fill buffers accordingly, however it was last touched September 2019, so it may never get merged in.

Detecting the issue

We should be able to discover, via a benchmark, the minimum speed at which we can communicate reliably with the printer, and in theory we should be able to look at a gcode file and determine whether there are parts that cannot be transmitted at the speed they're meant to be executed at.
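The static gcode check is fairly mechanical; a rough sketch that flags G0/G1 moves whose transmission time exceeds their execution time, ignoring acceleration and assuming 8N1 serial framing (10 bits per byte on the wire):

```python
import math
import re

G1_RE = re.compile(r"G[01]\b")

def flag_dense_moves(lines, baud=115200):
    """Yield gcode lines that take longer to transmit than to execute.

    Very rough: tracks absolute XY positions and feedrate only, and
    ignores acceleration entirely.
    """
    bytes_per_sec = baud / 10.0  # 8N1 framing: 10 bits per byte
    x = y = 0.0
    feed_mm_per_sec = None
    for line in lines:
        if not G1_RE.match(line):
            continue
        params = dict(re.findall(r"([XYZEF])([-\d.]+)", line))
        if "F" in params:
            feed_mm_per_sec = float(params["F"]) / 60.0  # gcode F is mm/min
        nx, ny = float(params.get("X", x)), float(params.get("Y", y))
        distance = math.hypot(nx - x, ny - y)
        x, y = nx, ny
        if feed_mm_per_sec is None or distance == 0:
            continue
        execute_s = distance / feed_mm_per_sec
        transmit_s = (len(line) + 1) / bytes_per_sec  # +1 for the newline
        if transmit_s > execute_s:
            yield line
```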

There is also an ADVANCED_OK configuration option which will report the line number of the command being acked, as well as the remaining command buffer and motion planning buffers.

I'm in the process of writing an Octoprint plugin that should be able to track these buffers, and hopefully calculate the median command-to-response latency to further diagnose this issue.

Ideally, there should be a way to detect when the command buffer is empty and report back to the host so users can be aware that there are issues happening. Initial research shows that void GCodeQueue::advance() is where the change should be made.

Next up: Adding buffer monitoring to Marlin.

]]>
<![CDATA[Valve Index Base Station power management]]>I managed to get a Valve Index a few weeks ago (which has been great), but the reliability of SteamVR could be a lot better. Especially with the base station power management feature, where it only puts the lasers on standby by default, which causes it to emit a high-pitch

]]>
https://chen.do/valve-index-base-station-power-management/5f38abf0fdd7af006b19e8dcSun, 16 Aug 2020 04:06:46 GMTI managed to get a Valve Index a few weeks ago (which has been great), but the reliability of SteamVR could be a lot better. Especially the base station power management feature, which by default only puts the lasers on standby, causing it to emit a high-pitched whine as the motors are still spinning.

Turning on the proper standby power management feature only sometimes works, and often SteamVR won't successfully wake base stations or put them on standby. I've tried to work around this by restarting, forcing a discovery, and retrying connections, all of which are extremely annoying to deal with.

There are tools on Github that allow management of these, namely https://github.com/nouser2013/lighthouse-v2-manager (Python/Windows) and https://github.com/jeroen1602/lighthouse_pm (Android, potentially works on iOS), however both seemed like more effort than it was worth.

Looking at the code in lighthouse-v2-manager, specifically https://github.com/nouser2013/lighthouse-v2-manager/blob/master/lighthouse-v2-manager.py#L146, it reveals that it's a simple BLE characteristic write to turn them on and off. Why this appears to be such a difficult thing to do for SteamVR, I'll never know.

I attempted to hack something together in Node with noble as it seemed like the most mature BLE library for the languages I'm familiar with, however I ran into some low-level looking errors so I gave up.

I used to have some TI Sensortags (very nifty devices) and I remembered their iOS app allowed BLE scanning and basic control.

Using the app, I was able to:

  • See my base stations as indicated by LHB-XXXXXXX
  • Tap on it to reveal the menu, and tap Service Explorer
  • Tap the service indicated by UUID 00001523-1212-efde-1523-785feabcd124
  • Tap the characteristic as indicated by 00001525-1212-efde-1523-785feabcd124 (the first part of the UUID is 1525 rather than 1523)
  • You can query its current value with Read characteristic. 0x00 means off, 0x01 means on
  • You can set the power state with Write w/response characteristic. Sending 0x00 will turn it off, and 0x01 will turn it on
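The same write can be scripted; an untested sketch using the third-party bleak Python library, where LHB_MAC is a placeholder for your base station's BLE address:

```python
import asyncio

try:  # bleak is a third-party library: pip install bleak
    from bleak import BleakClient
except ImportError:
    BleakClient = None

POWER_CHAR_UUID = "00001525-1212-efde-1523-785feabcd124"
LHB_MAC = "AA:BB:CC:DD:EE:FF"  # placeholder: your base station's address

def power_payload(on: bool) -> bytes:
    """0x01 turns the base station on, 0x00 turns it off."""
    return b"\x01" if on else b"\x00"

async def set_power(address: str, on: bool) -> None:
    """Write the power characteristic, exactly as done manually above."""
    async with BleakClient(address) as client:
        await client.write_gatt_char(POWER_CHAR_UUID, power_payload(on))

# asyncio.run(set_power(LHB_MAC, True))  # wake the base station
```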

This is obviously not a great solution, but considering I don't have to get either of those tools working or build my own, this will have to do until SteamVR can write a byte to their own BLE devices.

]]>
<![CDATA[TLS v1.3 performance compared to TLS v1.2]]>I was doing some certificate maintenance on our cluster and decided it had been a while since I ran an SSL Test, so I ran it against our edge cluster and it gave us a shiny A+.

However, I noticed that our max TLS version was 1.2 rather than

]]>
https://chen.do/tls-v1-3-performance-compared-to-tls-v1-2/5f0a7f945e832f00be6857ebSun, 12 Jul 2020 03:31:50 GMTI was doing some certificate maintenance on our cluster and decided it had been a while since I ran an SSL Test, so I ran it against our edge cluster and it gave us a shiny A+.

However, I noticed that our max TLS version was 1.2 rather than the newer and faster 1.3, as 1.3 removes an extra RTT for a faster handshake. Turns out the version of nginx-ingress we were using was still using 1.2 only as default. A quick ConfigMap change later, and we were on 1.3.

I wanted to know what the performance improvement was for such a simple change, so I did some fairly rudimentary tests with time curl --resolve <host>:<port>:<ip> --tls-max <version> against an endpoint being served by a Lua script inside nginx itself.
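The same comparison can be scripted without curl; a rough sketch using Python's ssl module to time handshakes pinned to a single protocol version (the commented-out call shows intended usage, and the host is a placeholder):

```python
import socket
import ssl
import time

def handshake_time(host, port=443, version=ssl.TLSVersion.TLSv1_3, tries=5):
    """Median wall-clock time for a TCP connect plus TLS handshake."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = version  # pin both ends of the version range
    ctx.maximum_version = version
    samples = []
    for _ in range(tries):
        start = time.monotonic()
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                pass  # handshake completes inside wrap_socket
        samples.append(time.monotonic() - start)
    return sorted(samples)[len(samples) // 2]

# for v in (ssl.TLSVersion.TLSv1_2, ssl.TLSVersion.TLSv1_3):
#     print(v.name, round(handshake_time("example.com", version=v) * 1000), "ms")
```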

From Australia to our Canadian edge node (217ms away), which represents higher latency setups (either cellular connections, or lack of closer edge termination nodes):

  • TLSv1.2: ~929ms, 3.4x RTT
  • TLSv1.3: ~707ms, 2.6x RTT

From Australia to Australia (17ms away):

  • TLSv1.2: ~112ms, 6.5x RTT
  • TLSv1.3: ~95ms, 5.5x RTT

Fairly easy performance win if you haven't enabled TLS v1.3 already!

]]>