CHANGES.md

CHANGES.md — Porter Robot Change Log

Append-only, chronological record of every significant change. Each entry documents the problem, before/after code, and rationale.

Last updated: 08 Mar 2026

1. Remove getDeviceInfo() Between initialize() and turnOn()

File(s): src/ydlidar_driver/src/sdk_adapter.cpp

Problem: Calling getDeviceInfo() as an extra serial command between initialize() and turnOn() corrupted the protocol state on single-channel LIDARs (X4 Pro / S2PRO). The turnOn() call would fail with "Failed to start scan mode -1" because the device's one-way protocol was mid-stream with the unexpected device info request.

Before

if (laser_->initialize()) {
  initialized_ = true;
  log(kLogInfo, "YDLIDAR initialized successfully");

  // Query device info for logging
  device_info info;
  if (laser_->getDeviceInfo(info, 0)) {
    log(kLogInfo, "Device model: " + std::to_string(info.model));
  }

  return AdapterResult::kSuccess;
}

After

if (laser_->initialize()) {
  initialized_ = true;
  log(kLogInfo, "YDLIDAR initialized successfully");

  // NOTE: Do NOT call getDeviceInfo() here.
  // The SDK already queries device info during initialize().
  // Sending additional serial commands between initialize() and
  // turnOn() can corrupt the protocol state and cause scan start
  // failure on some models (observed on X4 Pro / S2PRO).

  return AdapterResult::kSuccess;
}

Why: The YDLidar SDK's initialize() already queries device info internally. Single-channel devices use one-way communication — any extra serial command between init and scan start breaks the protocol handshake. The fix removes the redundant call and documents the constraint.

2. Add Retry Logic with Exponential Backoff to start_scan()

File(s): src/ydlidar_driver/src/sdk_adapter.cpp, src/ydlidar_driver/include/ydlidar_driver/sdk_adapter.hpp

Problem: The motor needs ~1 second to spin up to operating speed. The first turnOn() call could fail on cold start because the motor hadn't reached operating speed before the SDK sent the scan start command. The SDK's single internal retry was insufficient.

Before

AdapterResult SdkAdapter::start_scan()
{
  if (!initialized_) {
    log(kLogError, "Cannot start scan: not initialized");
    return AdapterResult::kInitFailed;
  }

  if (laser_->turnOn()) {
    scanning_ = true;
    log(kLogInfo, "YDLIDAR scan started successfully");
    return AdapterResult::kSuccess;
  }

  log(kLogError, "Failed to start YDLIDAR scan");
  return AdapterResult::kScanStartFailed;
}

After

AdapterResult SdkAdapter::start_scan(int max_retries)
{
  if (!initialized_) {
    log(kLogError, "Cannot start scan: not initialized");
    return AdapterResult::kInitFailed;
  }

  if (scanning_) {
    log(kLogWarn, "Scan already running");
    return AdapterResult::kSuccess;
  }

  // Retry loop — the motor needs time to spin up on some models.
  int backoff_ms = 1000;
  for (int attempt = 1; attempt <= max_retries; ++attempt) {
    log(kLogInfo,
      "Starting YDLIDAR scan (attempt " + std::to_string(attempt) +
      "/" + std::to_string(max_retries) + ")...");

    if (laser_->turnOn()) {
      scanning_ = true;
      log(kLogInfo, "YDLIDAR scan started successfully");
      return AdapterResult::kSuccess;
    }

    std::string err = laser_->DescribeError();
    log(kLogWarn,
      "Scan start attempt " + std::to_string(attempt) +
      " failed: " + err);

    if (attempt < max_retries) {
      log(kLogInfo,
        "Waiting " + std::to_string(backoff_ms) +
        " ms for motor spin-up before retry...");
      std::this_thread::sleep_for(std::chrono::milliseconds(backoff_ms));
      backoff_ms = std::min(backoff_ms * 2, 5000);
    }
  }

  log(kLogError,
    "Failed to start YDLIDAR scan after " +
    std::to_string(max_retries) + " attempts");
  return AdapterResult::kScanStartFailed;
}

Why: The 3-retry loop with exponential backoff (1s → 2s → 4s) gives the motor enough time to reach operating speed. This eliminates cold-start failures while keeping the timeout bounded.

3. Defer rclcpp::shutdown() Out of Node Constructor

File(s): src/ydlidar_driver/src/ydlidar_node.cpp

Problem: Calling rclcpp::shutdown() directly inside the ROS 2 node constructor destroyed the RCL context while the node was still being constructed, resulting in RCLError: failed to create guard condition: the given context is not valid. This crash occurred whenever LIDAR initialization failed.

Before

if (!initialize_and_start()) {
  RCLCPP_FATAL(
    this->get_logger(),
    "Failed to initialize YDLIDAR — node will shut down");
  rclcpp::shutdown();
  return;
}

After

if (!initialize_and_start()) {
  RCLCPP_FATAL(
    this->get_logger(),
    "Failed to initialize YDLIDAR — node will shut down");
  // Schedule shutdown outside the constructor to avoid
  // destroying context while node is still being constructed.
  shutdown_timer_ = this->create_wall_timer(
    std::chrono::milliseconds(100),
    [this]() {
      shutdown_timer_->cancel();
      rclcpp::shutdown();
    });
  return;
}

Why: ROS 2 node construction must complete before the context may be torn down. A 100ms deferred wall timer allows the constructor to finish, the executor to attach, and then cleanly invokes shutdown on the next spin cycle.

4. Change singleChannel Default from false to true for X4 Pro / S2PRO

File(s): src/ydlidar_driver/config/ydlidar_params.yaml, src/ydlidar_driver/src/ydlidar_node.cpp

Problem: With singleChannel: false, the SDK sends commands and waits for response headers that never arrive on single-channel devices (X4 Pro, S2PRO). Every command (health, device info, scan start) timed out — the LIDAR appeared completely non-functional.

Before

ydlidar_node:
  ros__parameters:
    singleChannel: false                # Single-channel communication mode

After

ydlidar_node:
  ros__parameters:
    # CRITICAL: singleChannel must match the LIDAR model's protocol!
    #   Single-channel LIDARs (one-way comms): X4, X4 Pro, X2, X2L, S2, S4, S4B
    #   Dual-channel LIDARs (two-way comms): G4, G4 Pro, G6, G7, F4 Pro, TG series
    #   Setting this wrong causes ALL serial commands to time out.
    singleChannel: true                 # Single-channel (one-way) protocol — X4 Pro, S2, S4, X2

Why: The X4 Pro (S2PRO internally) uses one-way serial communication. The singleChannel parameter must match the device's protocol — it determines whether the SDK waits for response headers. Added comprehensive documentation of which models use which protocol.

5. Override SensorDataQoS to RELIABLE for RViz2 Compatibility

File(s): src/ydlidar_driver/src/ydlidar_node.cpp

Problem: rclcpp::SensorDataQoS() defaults to BEST_EFFORT reliability. RViz2's LaserScan display subscribes with RELIABLE — the QoS policy mismatch meant no data was shown, and the log printed "incompatible QoS RELIABILITY_QOS_POLICY" endlessly.

Before

scan_pub_ = this->create_publisher<sensor_msgs::msg::LaserScan>(
  "scan", rclcpp::SensorDataQoS());

After

// Use SensorDataQoS (BEST_EFFORT + KEEP_LAST) but override reliability
// to RELIABLE so RViz2 (which subscribes RELIABLE) can receive data.
// This is compatible with Nav2 which accepts both QoS policies.
auto scan_qos = rclcpp::SensorDataQoS();
scan_qos.reliable();
scan_pub_ = this->create_publisher<sensor_msgs::msg::LaserScan>(
  "scan", scan_qos);

Why: SensorDataQoS().reliable() keeps the desirable KEEP_LAST history and small queue depth from SensorDataQoS while upgrading to RELIABLE delivery. This is compatible with both RViz2 (RELIABLE) and Nav2 (accepts both). All downstream subscribers must also use RELIABLE.

6. Fix ament_flake8 Import Ordering in porter_lidar_processor

File(s): src/porter_lidar_processor/porter_lidar_processor/processor_node.py

Problem: ament_flake8 uses isort with force_sort_within_sections=true, which sorts all imports (both import X and from X import Y) alphabetically by module name, ignoring the import/from keyword. The wrong ordering produced lint failures.

Before

import numpy as np
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import LaserScan
from std_srvs.srv import SetBool
from porter_lidar_processor.filters import (
    downsample_filter,
    median_filter,
    ...
)

After

import numpy as np
from porter_lidar_processor.filters import (
    downsample_filter,
    median_filter,
    moving_average_filter,
    outlier_rejection_filter,
    range_clamp_filter,
    roi_crop_filter,
)
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import LaserScan
from std_srvs.srv import SetBool

Why: porter_lidar_processor sorts before rclpy alphabetically (p < r), regardless of whether the statement uses import or from ... import. Never assume import statements precede from statements when ament_flake8 isort is enforced.

7. Fix pep257 D403: Avoid Acronyms as First Docstring Word

File(s): src/porter_lidar_processor/porter_lidar_processor/filters.py

Problem: pep257 D403 requires the first word of a docstring to be "properly capitalized" — NaN is not recognized as a capitalized word (D403 checks that the first character is uppercase and the second is lowercase). Docstrings starting with acronyms like NaN, MAD, ROI failed lint.

Before

def range_clamp_filter(ranges, min_range, max_range):
    """NaN values outside [min_range, max_range]."""

After

def range_clamp_filter(ranges, min_range, max_range):
    """Clamp range values outside [min_range, max_range] to NaN."""

Why: D403 checks that first_char.isupper() and second_char.islower(). Acronyms as the first word always fail this check. The fix restructures docstrings to start with a standard capitalized English word.

8. Add pep257 Ignores for Google-Style Docstring Sections

File(s): src/porter_lidar_processor/test/test_pep257.py, src/orchestration/porter_orchestrator/test/test_pep257.py

Problem: Using Google-style Returns:, Args: section headers in docstrings triggered pep257 rules D213 (summary at second line), D406 (section name newline), D407 (dashed underline), D413 (blank after last section). These rules enforce numpy-style section formatting.

Before

def test_pep257():
    """Check pep257 compliance."""
    rc = main(argv=['--add-ignore', 'D100,D104'])
    assert rc == 0, 'Found code style errors / warnings'

After

def test_pep257():
    """Check pep257 compliance."""
    rc = main(argv=['--add-ignore', 'D100,D104,D213,D406,D407,D413'])
    assert rc == 0, 'Found code style errors / warnings'

Why: D213/D406/D407/D413 enforce numpy-style docstring sections with dashed underlines. Google-style sections (used throughout the project) use Section: headers without underlines. Adding these to the ignore list permits Google-style while keeping all other pep257 checks active.

9. Add Boot Grace Period and Health Check Patience to State Machine

File(s): src/orchestration/porter_orchestrator/porter_orchestrator/porter_state_machine.py, src/orchestration/porter_orchestrator/config/orchestrator_params.yaml

Problem: DDS discovery takes 1–5+ seconds. The health monitor published STALE before discovering /diagnostics and /scan publishers. The state machine saw STALE ≠ UNKNOWN, entered HEALTH_CHECK, saw STALE/ERROR, entered ERROR, started recovery — all within ~2 seconds, never giving DDS time to discover topics.

Before

def __init__(self):
    super().__init__('porter_state_machine')

    self.declare_parameter('state_publish_rate_hz', 2.0)
    self.declare_parameter('boot_timeout_sec', 30.0)
    self.declare_parameter('health_timeout_sec', 5.0)
    # ... no grace or patience parameters ...

def _handle_driver_starting(self):
    """Check if driver is producing health data."""
    if self.last_health_level_ != 'UNKNOWN':
        self._transition_to(PorterState.HEALTH_CHECK)

After

def __init__(self):
    super().__init__('porter_state_machine')

    self.declare_parameter('state_publish_rate_hz', 2.0)
    self.declare_parameter('boot_timeout_sec', 30.0)
    self.declare_parameter('boot_grace_sec', 8.0)
    self.declare_parameter('health_check_patience_sec', 10.0)
    self.declare_parameter('health_timeout_sec', 5.0)

    self.boot_grace_ = self.get_parameter('boot_grace_sec').value
    self.health_check_patience_ = (
        self.get_parameter('health_check_patience_sec').value)
    # ... store boot_start_time_ and health_check_enter_time_ ...

def _handle_driver_starting(self):
    """Check driver health, with boot grace for DDS discovery."""
    elapsed = (self.get_clock().now() - self.boot_start_time_).nanoseconds / 1e9
    # During grace period, only advance on OK
    if elapsed < self.boot_grace_:
        if self.last_health_level_ == 'OK':
            self._transition_to(PorterState.HEALTH_CHECK)
        return
    # After grace, accept any non-UNKNOWN health
    if self.last_health_level_ != 'UNKNOWN':
        self._transition_to(PorterState.HEALTH_CHECK)

Why: The 8-second boot grace period lets DDS discovery complete before the state machine reacts to health status. If OK arrives during grace, it advances immediately. The 10-second health check patience tolerates transient STALE/WARN during startup. This prevents the rapid STALE→ERROR→RECOVERY loop observed on hardware.

10. Add health_expected_freq Parameter to Decouple Motor Speed from Health Check

File(s): src/ydlidar_driver/config/ydlidar_params.yaml, src/ydlidar_driver/include/ydlidar_driver/health_monitor.hpp, src/ydlidar_driver/src/ydlidar_node.cpp

Problem: The frequency: 10.0 parameter sets the motor target (10 Hz spin), but the S2PRO's actual scan delivery rate is ~3.85 Hz (5K sample rate ÷ 1300 points per revolution). The health monitor compared 3.8 / 10.0 = 0.38, which is below freq_error_ratio (0.5) — declaring permanent ERROR even though the LIDAR was operating normally.

Before

ydlidar_node:
  ros__parameters:
    frequency: 10.0                     # Hz — target scan frequency
    # health monitor uses 'frequency' as the expected scan rate

void configure_health_monitor()
{
  HealthMonitor::Config hcfg;
  // ...
  hcfg.expected_freq_hz = this->get_parameter("frequency").as_double();
  health_monitor_.set_config(hcfg);
}

After

ydlidar_node:
  ros__parameters:
    frequency: 10.0                     # Hz — target scan frequency (motor speed)
    # health_expected_freq: Set to 0 to auto-use 'frequency'.
    # S2PRO motor targets 10 Hz but delivers scans at ~4 Hz.
    health_expected_freq: 4.0           # Hz — actual scan delivery rate

void configure_health_monitor()
{
  HealthMonitor::Config hcfg;
  // ...
  double health_freq = this->get_parameter("health_expected_freq").as_double();
  if (health_freq <= 0.0) {
    health_freq = this->get_parameter("frequency").as_double();
  }
  hcfg.expected_freq_hz = health_freq;
  health_monitor_.set_config(hcfg);
}

Why: Motor target frequency ≠ scan delivery rate. Single-channel LIDARs with high point counts per revolution (1300 pts) deliver complete scans at a fraction of the motor speed. The health_expected_freq parameter (default 0.0 = use frequency) allows per-model override without changing the motor target.

11. Raise Invalid Point and Warn Escalation Thresholds for Indoor Use

File(s): src/ydlidar_driver/config/ydlidar_params.yaml, src/ydlidar_driver/include/ydlidar_driver/health_monitor.hpp, src/orchestration/porter_orchestrator/config/orchestrator_params.yaml

Problem: Indoor S2PRO scans naturally have ~33% invalid/out-of-range points (walls at varying distances, glass, open spaces). The original health_invalid_warn_ratio: 0.3 triggered permanent WARN. In the orchestrator, 5 consecutive WARNs (warn_consecutive_limit: 5) escalated to ERROR after just 2.5 seconds — the system never reached READY state in a normal indoor environment.

Before

# ydlidar_params.yaml
ydlidar_node:
  ros__parameters:
    health_invalid_warn_ratio: 0.3      # Warn if > 30% invalid points
    health_invalid_error_ratio: 0.6     # Error if > 60%

# orchestrator_params.yaml
lidar_health_monitor:
  ros__parameters:
    warn_consecutive_limit: 5           # escalate warn→error after 5 consecutive

After

# ydlidar_params.yaml
ydlidar_node:
  ros__parameters:
    health_invalid_warn_ratio: 0.5      # Warn if > 50% invalid points (33% is normal indoors)
    health_invalid_error_ratio: 0.8     # Error if > 80%

# orchestrator_params.yaml
lidar_health_monitor:
  ros__parameters:
    warn_consecutive_limit: 20          # consecutive WARNs before escalating to ERROR

Why: Indoor LIDARs commonly see 30–40% invalid points from glass, open doorways, and beyond-range surfaces. Raising thresholds to 50%/80% prevents false health alarms in normal indoor operation. Increasing warn_consecutive_limit from 5 to 20 (10 seconds at 2 Hz) prevents steady-state environmental WARNs from escalating to ERROR.

12. Match /scan Subscription QoS in Health Monitor to Driver's RELIABLE Publisher

File(s): src/orchestration/porter_orchestrator/porter_orchestrator/lidar_health_monitor.py

Problem: After changing the /scan publisher to RELIABLE (Change #5), the health monitor's /scan subscription still used the default QoS profile. QoS policy mismatch meant the health monitor never received scan heartbeats, permanently reporting the scan topic as timed-out.

Before

self.scan_sub_ = self.create_subscription(
    LaserScan, '/scan', self._scan_heartbeat_callback, 10)

After

# Match the ydlidar_driver's /scan QoS: RELIABLE + KEEP_LAST
scan_qos = QoSProfile(
    reliability=ReliabilityPolicy.RELIABLE,
    history=HistoryPolicy.KEEP_LAST,
    depth=5)
self.scan_sub_ = self.create_subscription(
    LaserScan, '/scan', self._scan_heartbeat_callback, scan_qos)

Why: ROS 2 DDS requires QoS compatibility between publishers and subscribers. The RELIABLE publisher won't deliver to a BEST_EFFORT subscriber (the match fails). Explicitly matching the QoS profile ensures the health monitor receives scan heartbeats.

13. Fix Docker COPY Path Glob in Dockerfile.dev

File(s): docker/Dockerfile.dev

Problem: The Dockerfile used COPY ../src/*/package.xml ./src/ which attempted to reference paths outside the build context. Docker's COPY instruction cannot reference parent directories. Running docker compose build from the correct root (porter_robot/) would fail because the glob path was wrong.

Before

# Attempt to copy package.xml files for rosdep
COPY ../src/*/package.xml ./src/

After

# Copy package.xml files for cache-friendly rosdep layer
COPY src/ydlidar_driver/package.xml src/ydlidar_driver/
COPY src/porter_lidar_processor/package.xml src/porter_lidar_processor/
COPY src/orchestration/porter_orchestrator/package.xml src/orchestration/porter_orchestrator/

Why: Docker build context is set to the repo root (context: .. in docker-compose). All COPY paths must be relative to that context root. Explicit per-package COPY instructions are used because Docker's COPY doesn't support **/ recursive globs in all contexts.

14. Add YDLidar SDK Build to Docker Image

File(s): docker/Dockerfile.dev, docker/Dockerfile.prod

Problem: The ydlidar_driver package links against the YDLidar SDK C++ library (-lYDLIDAR_SDK). Without installing the SDK system-wide in the Docker image, colcon build failed with linker errors: cannot find -lYDLIDAR_SDK.

Before

FROM osrf/ros:jazzy-desktop
# ... system deps only ...
RUN apt-get update && apt-get install -y \
    build-essential cmake git

After

FROM osrf/ros:jazzy-desktop
# ... system deps ...

# Build and install YDLidar SDK from source
RUN --mount=type=cache,target=/tmp/sdk-cache \
    cd /tmp && \
    git clone --depth 1 https://github.com/YDLIDAR/YDLidar-SDK.git && \
    cd YDLidar-SDK && mkdir build && cd build && \
    cmake .. -DCMAKE_BUILD_TYPE=Release && \
    make -j"$(nproc)" && make install && \
    ldconfig && \
    rm -rf /tmp/YDLidar-SDK

Why: The SDK must be installed system-wide (/usr/local/lib/, /usr/local/include/) for CMake's find_package to locate it. Building from source with --depth 1 keeps the Docker layer small. The ldconfig call ensures the dynamic linker cache includes the new library.

15. Add docker-entrypoint.sh for Proper ROS 2 Sourcing

File(s): docker/docker-entrypoint.sh

Problem: Without an entrypoint script, users had to manually source /opt/ros/jazzy/setup.bash and install/setup.bash in every new shell. Environment variables (ROS_DOMAIN_ID, RMW_IMPLEMENTATION) were not consistently set, causing DDS discovery failures between containers.

Before

# No entrypoint — users must source manually
CMD ["/bin/bash"]

After

#!/bin/bash
set -e
source /opt/ros/jazzy/setup.bash
[ -f /workspace/install/setup.bash ] && source /workspace/install/setup.bash
export ROS_DOMAIN_ID="${ROS_DOMAIN_ID:-11}"
export RMW_IMPLEMENTATION="${RMW_IMPLEMENTATION:-rmw_fastrtps_cpp}"
exec "$@"

COPY docker/docker-entrypoint.sh /docker-entrypoint.sh
RUN chmod +x /docker-entrypoint.sh
ENTRYPOINT ["/docker-entrypoint.sh"]
CMD ["/bin/bash"]

Why: The entrypoint sources the ROS 2 underlay and workspace overlay automatically, sets the domain ID and RMW implementation with overridable defaults, and uses exec "$@" to replace the shell with whatever command follows — making docker exec and docker run work seamlessly.

16. Add GitHub Actions CI Pipeline

File(s): .github/workflows/ros2-ci.yml

Problem: No automated build or test verification existed. Broken builds could be pushed to main without detection.

Before

No CI configuration existed.

After

name: Porter ROS 2 CI
on: [push, pull_request]
jobs:
  build-and-test:
    runs-on: ubuntu-24.04
    container:
      image: osrf/ros:jazzy
    steps:
      - uses: actions/checkout@v4
      - name: Install system deps + SDK
        run: |
          apt-get update && apt-get install -y build-essential cmake git
          cd /tmp && git clone --depth 1 https://github.com/YDLIDAR/YDLidar-SDK.git
          cd YDLidar-SDK && mkdir build && cd build
          cmake .. -DCMAKE_BUILD_TYPE=Release && make -j$(nproc) && make install && ldconfig
      - name: Install ROS deps
        run: rosdep update && rosdep install --from-paths src -r -y
      - name: Build & Test
        run: |
          source /opt/ros/jazzy/setup.bash
          colcon build --cmake-args -Wno-dev
          source install/setup.bash
          colcon test --event-handlers console_direct+
          colcon test-result --verbose
  docker-build:
    # ... builds dev Docker image ...
  lint:
    # ... runs ament_cpplint, ament_flake8 ...

Why: Three-job CI (build-and-test, docker-build, lint) ensures every push is verified. The pipeline installs the SDK from source (matching Docker), runs all 158 tests, and fails on lint violations.

17. Add GoogleTest Suites for Adapter Config and Scan Conversion

File(s): src/ydlidar_driver/tests/test_adapter_config.cpp, src/ydlidar_driver/tests/test_scan_conversion.cpp, src/ydlidar_driver/CMakeLists.txt

Problem: The existing 12 health monitor tests covered only one component. The SDK adapter configuration, parameter validation, and LaserScan conversion logic had no unit test coverage.

Before

if(BUILD_TESTING)
  find_package(ament_lint_auto REQUIRED)
  ament_lint_auto_find_test_dependencies()
  find_package(ament_cmake_gtest REQUIRED)
  ament_add_gtest(test_health_monitor tests/test_health_monitor.cpp
    src/health_monitor.cpp)
  # Only 1 test target: 12 tests
endif()

After

if(BUILD_TESTING)
  find_package(ament_lint_auto REQUIRED)
  ament_lint_auto_find_test_dependencies()
  find_package(ament_cmake_gtest REQUIRED)

  ament_add_gtest(test_health_monitor tests/test_health_monitor.cpp
    src/health_monitor.cpp)

  ament_add_gtest(test_adapter_config tests/test_adapter_config.cpp)
  target_include_directories(test_adapter_config PRIVATE
    include ${YDLIDAR_SDK_INCLUDE_DIRS})
  target_link_libraries(test_adapter_config ${YDLIDAR_SDK_LIBRARIES})

  ament_add_gtest(test_scan_conversion tests/test_scan_conversion.cpp)
  target_include_directories(test_scan_conversion PRIVATE
    include ${YDLIDAR_SDK_INCLUDE_DIRS})
  target_link_libraries(test_scan_conversion ${YDLIDAR_SDK_LIBRARIES})
  # 3 test targets: 61 C++ tests total
endif()

Why: Added 30 adapter config tests (default values, boundary conditions, model profiles, type inference) and 19 scan conversion tests (angle calculation, range handling, NaN/inf, empty scans, field correctness). Total C++ coverage went from 12 to 61 tests.

18. Create Project README.md

File(s): README.md

Problem: The repository had no user-facing documentation. New team members and open-source users had no quick-start guide, architecture overview, or package reference.

Before

No README.md existed at the repository root.

After

Created comprehensive README with sections: Quick Start (Docker), System Architecture (hardware diagram, TF tree ASCII, topic flow ASCII), Software Stack, Packages table, Configuration Files table, Native Build instructions, Testing commands, Development guidelines, License.

Why: README.md is the first thing users and contributors see. Docker quick-start appears before native build per project conventions. All code blocks are copy-pasteable.

19. Create Driver Porting Guide

File(s): docs/Driver_Porting_Guide.md

Problem: Adapting the driver to different YDLIDAR models required deep knowledge of per-model protocol differences, baudrates, and SDK configuration flags. No documentation existed for model-to-parameter mapping.

Before

No porting guide existed.

After

Created 13-section guide with: 14-model reference matrix (baudrate, type, singleChannel, samp_rate, intensity per model), 3 complete YAML configuration templates (X4 Pro, G4, TG15), 7 common gotchas, step-by-step porting procedure, SDK tri_test verification flow.

Why: The driver is designed to be model-agnostic (swap model via YAML only). The porting guide makes this practical by documenting every model's required parameters and gotchas, reducing porting time from hours of SDK experimentation to a 10-minute config change.

20. ESP32 Common Library — CRC16, Protocol, Transport (Tasks 10–12)

File(s): esp32_firmware/common/src/crc16.c, protocol.c, transport.c + headers

Problem: No communication protocol existed between the Raspberry Pi and ESP32 microcontrollers. Needed a shared C library that compiles for both Zephyr (Xtensa) and Linux (x86_64/arm64) with CRC integrity, binary packet parsing, and hardware-abstracted transport.

Before

Empty skeleton files in esp32_firmware/common/.

After

CRC16-CCITT (poly 0x1021, init 0xFFFF): 256-byte lookup table, crc16_ccitt() and crc16_ccitt_byte() APIs.
Protocol: 9-state byte-by-byte parser (IDLE→HEADER1→HEADER2→LENGTH→COMMAND→PAYLOAD→CRC_LOW→CRC_HIGH→COMPLETE/ERROR), encoder with CRC, ACK/NACK generation. 12 command IDs defined.
Transport: Abstract transport_init/read/write/flush/deinit API with 3 compile-time backends (UART, CDC_ACM, Mock) via Kconfig.
60 Ztest unit tests on native_sim: 14 CRC + 26 protocol + 20 transport — all pass.

Why: Pure C for universal ABI compatibility. Compile-time transport selection gives zero runtime overhead. The shared library is linked into both Zephyr firmware AND the ROS 2 bridge nodes.

21. Motor Controller Firmware (Task 13)

File(s): esp32_firmware/motor_controller/src/main.cpp, app.overlay, prj.conf

Problem: No firmware existed for ESP32 #1 to control the BTS7960 dual H-bridge motor driver with safety features required for an airport robot.

Before

Empty skeleton main.cpp.

After

~730-line Zephyr firmware with:

SMF state machine: IDLE → RUNNING → FAULT → ESTOP with proper entry/run/exit handlers.
BTS7960 PWM: RPWM/LPWM for direction, EN for enable, 2 motors (left/right) via LEDC PWM.
Differential drive: (linear_x, angular_z) → left/right wheel speed conversion.
500ms heartbeat watchdog: No command from RPi → auto-stop motors.
Speed ramping: Max acceleration/deceleration limits per control loop.
E-stop: GPIO ISR → immediate motor disable.
zbus channels: motor_cmd_chan, motor_status_chan, safety_event_chan.
Thread priorities: safety(-1) > motor(0) > protocol(1) > reporting(5) > shell(14).
Builds clean for esp32_devkitc/esp32/procpu (110 KB text).

Why: Airport safety requires fail-safe motor control. Heartbeat watchdog prevents runaway robot if RPi crashes. Speed ramping prevents sudden movements near passengers.

22. Sensor Fusion Firmware (Task 14)

File(s): esp32_firmware/sensor_fusion/src/main.cpp, app.overlay, prj.conf

Problem: No firmware existed for ESP32 #2 to fuse multiple obstacle sensors with cross-validation and graceful degradation.

Before

Empty skeleton main.cpp.

After

~750-line Zephyr firmware with:

SMF: INIT → CALIBRATING → ACTIVE → DEGRADED → FAULT.
VL53L0x ToF (I2C): Primary distance sensor, 30–2000mm range.
HC-SR04 Ultrasonic (GPIO): Secondary distance, trigger pulse + echo timing.
RCWL-0516 Microwave (ADC): Presence/motion detection.
1D Kalman filter: Fuses ToF + Ultrasonic with predictive smoothing.
Cross-validation: ToF vs Ultrasonic disagree >30% → flag inconsistency, weight more reliable.
Sensor timeout: 100ms → mark degraded, switch to fallback sensor.
Builds clean for esp32_devkitc/esp32/procpu (119 KB text).

Why: Multi-sensor fusion with Kalman filtering is more reliable than any single sensor. Cross-validation catches drift/failure. Graceful degradation ensures the robot always has some obstacle awareness.

23. ROS 2 ESP32 Bridge Nodes (Task 16)

File(s): src/porter_esp32_bridge/ (10 files)

Problem: No ROS 2 package existed to bridge between standard ROS topics (/cmd_vel, /environment) and the ESP32 binary serial protocol.

Before

No porter_esp32_bridge package.

After

esp32_motor_bridge: Subscribes /cmd_vel (Twist) → differential drive conversion → CMD_MOTOR_SET_SPEED binary packets → serial to ESP32 #1. Sends heartbeat every 200ms. cmd_vel timeout (500ms) → stop motors. Publishes /motor_status and /diagnostics.
esp32_sensor_bridge: Receives CMD_SENSOR_FUSED from ESP32 #2 → publishes /environment (sensor_msgs/Range). Status polling. Publishes /diagnostics.
serial_port: POSIX termios wrapper (8N1, non-blocking reads, EAGAIN handling).
CMake: Links porter_protocol static library from esp32_firmware/common/ — same code on both sides.
All 8 ament lint tests pass (copyright, cpplint, cppcheck, uncrustify, flake8, pep257, lint_cmake, xmllint).

Why: The bridge nodes translate between the ROS 2 world (typed topics, QoS, diagnostics) and the ESP32 world (binary serial). Sharing the protocol library ensures encoding/decoding is identical on both sides.

24. udev Rules & Device Naming (Task 17)

File(s): esp32_firmware/udev/99-porter-esp32.rules, install_udev_rules.sh

Problem: Linux assigns /dev/ttyUSB* or /dev/ttyACM* in arbitrary order depending on USB enumeration. The bridge nodes need stable device paths.

Before

No udev rules. Bridge nodes would need manual port specification every boot.

After

Template udev rules creating /dev/esp32_motors and /dev/esp32_sensors symlinks.
Rules match by USB vendor/product ID + serial number (preferred) or by physical USB port path (fallback).
install_udev_rules.sh copies rules to /etc/udev/rules.d/ and reloads.
Docker compose updated with device pass-through comments.

Why: Stable device names eliminate the "which ttyUSB is which?" problem. Essential for unattended robot operation where USBs may re-enumerate after power cycling.

25. ESP32 Firmware Documentation

File(s): docs/ESP32_Firmware_Guide.md, DevLogs/10_Mar_ESP32_Logs.md

Problem: No documentation existed for building, flashing, testing, or extending the ESP32 Zephyr firmware. New developers would have no reference for the wire protocol, state machines, or transport abstraction.

Before

Only esp32_firmware/README.md (high-level overview) and ZEPHYR_INSTRUCTIONS.md (install guide).

After

ESP32 Firmware Guide (13 sections): Prerequisites, directory structure, wire protocol spec (packet format + all command IDs), build commands, test commands, transport backends, motor controller architecture (SMF + threads + safety), sensor fusion architecture (SMF + sensors + Kalman), ROS 2 bridge usage, udev setup, extension guide (adding commands, sensors, ESP32 variants), troubleshooting table.
DevLog (8 sections): Summary, 30+ files listed, architecture diagrams, design decisions, 9 bugs documented, test results, command reference, lessons learned.

Why: Comprehensive documentation enables the team to build, test, flash, and extend the firmware without tribal knowledge. The troubleshooting table prevents repeating known mistakes.

26. GGUF Quantization with Runtime LoRA Adapters (Task 18d)

File(s): src/porter_ai_assistant/porter_ai_assistant/inference_engine.py, config.py, assistant_node.py, config/assistant_params.yaml, scripts/download_model.py, scripts/convert_to_gguf.py

Problem: The initial approach to model deployment was merge QLoRA adapter into base model → convert merged model to GGUF. merge_and_unload() on a 4-bit NF4 quantized model performs lossy dequantization, degrading model quality. The merged GGUF produced worse output than the unmodified base model.

Before

# Single merged model approach (broken)
class ModelConfig:
    model_path: str = 'models/merged-model.gguf'
    # No adapter support, one monolithic file

After

# Base GGUF + modular runtime LoRA adapters
class ModelConfig:
    model_path: str = 'models/gguf/gemma-3-270m-it-Q4_K_M.gguf'
    lora_dir: str = 'models/gguf'  # Directory containing LoRA GGUF adapters
    default_adapter: str = 'conversational'

# inference_engine.py additions:
def _discover_lora_adapters(self) -> Dict[str, str]:
    """Auto-discover GGUF LoRA adapters in lora_dir."""
def switch_adapter(self, adapter_name: str) -> bool:
    """Hot-swap LoRA adapter without reloading base model."""
def query(self, prompt, system_prompt, adapter=None):
    """Auto-routes to conversational vs tool_use adapter."""

Why: Runtime LoRA loading via llama-cpp-python's lora_path parameter avoids the lossy merge step entirely. The base GGUF (Unsloth Q4_K_M, 241 MB) stays pristine, and lightweight LoRA adapters (7.3 MB each) are loaded/swapped at runtime. This also enables hot-swapping adapters without restarting the model.

27. ROS 2 Topic-Driven AI Query Interface (Task 18e)

File(s): src/porter_ai_assistant/porter_ai_assistant/assistant_node.py

Problem: The AI assistant only supported ROS 2 service calls (~/query). The GUI display needs a simpler publish/subscribe interface — publish a query string, get a response string back — without blocking service call semantics.

Before

# Only service-based interface
self.query_srv = self.create_service(Trigger, '~/query', self._on_query)
self.response_pub = self.create_publisher(String, '/porter/ai_response', 10)

After

# Service + topic-driven interface
self.query_srv = self.create_service(Trigger, '~/query', self._on_query)
self.query_sub = self.create_subscription(
    String, '/porter/ai_query', self._on_query_received, 10)
self.response_pub = self.create_publisher(String, '/porter/ai_response', 10)

def _on_query_received(self, msg: String):
    """Process query from topic subscriber and publish response."""

Why: Topic-based interface is simpler for GUI integration — the display publishes to /porter/ai_query and subscribes to /porter/ai_response. No need for service client setup, timeout handling, or blocking calls.

28. AI Persona Rename: Porter → Virtue

File(s): 16 files across src/porter_ai_assistant/ — system_prompts.yaml, all 8 JSONL data files, prompt_templates.py, inference_engine.py, benchmark.py, inference_test.py, generate_dataset.py, test_assistant.py, data/README.md

Problem: The AI assistant persona was named "Porter" — same as the robot product name. This created ambiguity in training data where "Porter" could refer to the AI assistant ("I'm Porter") or the physical robot ("Porter's screen", "Porter robots"). Made renaming extremely costly (24K+ lines of data to update).

Before

# system_prompts.yaml
default: "You are Porter, a smart and friendly airport assistant robot..."
# Training data
{"role": "user", "content": "Hey Porter, where is Gate B12?"}
{"role": "assistant", "content": "I'm Porter, your airport assistant!..."}

After

# system_prompts.yaml
default: "You are Virtue, a smart and friendly airport assistant robot..."
# Training data
{"role": "user", "content": "Hey Virtue, where is Gate B12?"}
{"role": "assistant", "content": "I'm Virtue, your airport assistant!..."}
# Hardware references kept as "Porter":
{"role": "assistant", "content": "...display on Porter's screen..."}

Why: Separating the AI persona name ("Virtue") from the robot product name ("Porter") eliminates ambiguity. Hardware references ("Porter's screen", "Porter robots") remain correct. Training data, system prompts, user greetings (574 "Hey Virtue"), and self-identification (196 "I'm Virtue") all updated consistently.

29. Cleanup: Remove Unused DPO, Full-Schema, and Combined Data

File(s): src/porter_ai_assistant/data/tool_use/dpo_preferences_*.jsonl, *_full_schema.jsonl, data/combined/

Problem: Several data files accumulated during the AI training experimentation that are no longer used:

DPO preference files (13.3 MB): DPO training was abandoned (lesson #35 — zero gradients with fresh LoRA).
Full-schema files (38.3 MB): Used the long JSON tool schemas that caused truncation (lesson #34). Replaced by compact-prompt versions.
Combined directory (44 MB): Redundant merge of conversational + tool_use — never used for training (separate adapters trained on separate data).
HF download cache (20 KB): Leftover from model download.
Old benchmark/test JSON files: Historical results, not needed in repo.

Before

data/tool_use/dpo_preferences_synthetic.jsonl    11 MB  ← DPO abandoned
data/tool_use/dpo_preferences_compact.jsonl       2.3 MB ← DPO abandoned
data/tool_use/train_full_schema.jsonl            32 MB  ← truncation bug data
data/tool_use/eval_full_schema.jsonl              6.3 MB ← truncation bug data
data/combined/train.jsonl                        34 MB  ← redundant copy
data/combined/eval.jsonl                         10 MB  ← redundant copy
models/gguf/.cache/                              20 KB  ← download cache

After

All removed. ~96 MB freed.

Why: Dead artifacts from abandoned approaches waste disk space and confuse future developers. The only training data needed is conversational/{train,eval}.jsonl and tool_use/{train,eval}.jsonl.

30. Lightweight Conversation Orchestrator for GUI Integration

File(s): src/porter_ai_assistant/porter_ai_assistant/tool_executor.py (new), src/porter_ai_assistant/porter_ai_assistant/orchestrator.py (new), src/porter_ai_assistant/porter_ai_assistant/orchestrator_node.py (new), src/porter_ai_assistant/test/test_orchestrator.py (new), src/porter_ai_assistant/setup.py, src/porter_ai_assistant/launch/assistant_launch.py

Problem: Before the AI assistant can integrate with the Porter GUI (Phase 5), it needs an orchestration layer between the raw inference engine and the ROS 2 interface. LangChain was considered but rejected (50+ packages, ~200 MB Docker overhead, overkill for a 270M model that already classifies and generates tool calls). A lightweight custom orchestrator was chosen instead.

Before

# assistant_node.py — direct inference, no tool execution, no memory
result = self.engine.query(query_text, system_prompt_key='default')

After

# orchestrator.py — full pipeline with tool execution and memory
class ConversationOrchestrator:
    def process_query(self, user_query, session_id=None, context=None):
        # 1. Run inference via InferenceEngine
        # 2. If tool_call detected → parse → execute via ToolExecutor
        # 3. Feed tool result back for final response
        # 4. Store conversation turn in session memory
        # 5. Return OrchestratorResult with full metadata

# tool_executor.py — tool registry with 14 stub implementations
class ToolExecutor:
    def register(self, name, fn): ...
    def execute(self, tool_name, arguments) -> ToolResult: ...

# orchestrator_node.py — ROS 2 node wrapping the orchestrator
class OrchestratorNode(Node):
    # /porter/ai_query → orchestrator → /porter/ai_response
    # ~/query, ~/get_status, ~/clear_session services

Why: The orchestrator adds tool execution loops (model calls tool → get result → generate final response), per-session conversation memory (sliding window deque), multi-session management with timeouts, and structured result metadata — all without any external dependencies. 35 new tests cover ToolResult, ToolExecutor, Session, ConversationOrchestrator, and stub tools. Launch file supports use_orchestrator:=true to select between simple assistant and full orchestrator.

31. Flutter GUI Performance Optimization

File(s): src/porter_gui/lib/providers/providers.dart, src/porter_gui/lib/screens/chat_screen.dart, src/porter_gui/lib/services/ros_bridge_service.dart, src/porter_gui/pubspec.yaml

Problem: Initial chat streaming configuration (20ms interval, 2 chars per tick) caused ~50 widget rebuilds/second per message. With 10+ messages visible, the widget tree rebuilt 500+ times/second — visible scroll jank and high CPU on RPi. Unused dependencies (google_fonts, cupertino_icons) added 23+ transitive packages. ROS bridge reconnection used fixed 3s retries forever.

Before

// providers.dart — aggressive streaming
_streamTimer = Timer.periodic(Duration(milliseconds: 20), (timer) {
  _streamedLength = math.min(_streamedLength + 2, fullLength);
  notifyListeners();
});
// No message cap, health check every 5s

After

// providers.dart — batched streaming, capped history
_streamTimer = Timer.periodic(Duration(milliseconds: 80), (timer) {
  _streamedLength = math.min(_streamedLength + 8, fullLength);
  _mutated();  // notifyListeners only on actual change
});
// 100 message cap, health check every 30s

Why: Streaming rebuild rate directly controls CPU and memory. Batching (80ms/8chars) reduces rebuilds from ~50/sec to ~12/sec (75% reduction). RepaintBoundary around message bubbles isolates repaints. AnimatedSize provides smooth growth without full-tree rebuilds. Exponential backoff on rosbridge (3s → 60s, max 20 retries) prevents CPU waste when ROS bridge is down. Removing unused deps eliminates 23 transitive packages.

32. CI/CD Flutter Build & Release Integration

File(s): .github/workflows/build-release.yml, .github/workflows/verify.yml, version-bump.sh

Problem: No CI/CD pipeline existed for building and releasing the Flutter GUI. The verify.yml workflow had 8 jobs but no Flutter verification. The release page had no GUI artifacts and used a basic notes format.

Before

# build-release.yml — no Flutter job
jobs:
  version:        # Extract version from tag
  build-ros2:     # Docker image
  build-esp32:    # Firmware
  release:        # GitHub release (no GUI artifact)

After

# build-release.yml — 5 jobs including Flutter
jobs:
  version:             # Extract version
  build-ros2-docker:   # Docker image
  build-esp32-firmware: # Firmware
  build-flutter-gui:   # NEW — analyze, test, build linux bundle, tar.gz artifact
  release:             # Categorized downloads, collapsible Quick Start, commit changelog

Why: Flutter GUI is a release artifact — users need a pre-built Linux binary. The build-flutter-gui job runs flutter analyze, flutter test, and flutter build linux --release, then packages the bundle as porter-gui-linux-x64-VERSION.tar.gz. Release notes redesigned with metadata table, categorized download sections (Docker, Firmware, GUI), collapsible Quick Start, and conventional commit changelog with [feat]/[fix] type labels. version-bump.sh now syncs pubspec.yaml version alongside the VERSION file.

33. E-Stop Disengage Bug Fix

File(s): src/porter_gui/lib/main.dart, src/porter_gui/lib/screens/follow_me_screen.dart

Problem: The emergency stop button, once engaged (via tap), could not be disengaged (via long-press). No error or log output — the long-press callback was silently never called.

Before

// main.dart — _FloatingEStop
GestureDetector(
  onTap: eStop.isEngaged ? null : eStop.engage,
  onLongPress: eStop.isEngaged ? eStop.disengage : null,
  // onTap: null removes tap recognizer from gesture arena
  // Without competing recognizer, long-press never fires
)

After

// main.dart — _FloatingEStop
GestureDetector(
  onTap: eStop.isEngaged ? () {} : eStop.engage,
  onLongPress: eStop.isEngaged ? eStop.disengage : null,
  // No-op tap keeps recognizer in arena
  // Long-press disambiguates correctly
)

Why: Flutter's GestureDetector requires competing recognizers in the gesture arena for disambiguation. Setting onTap: null removes the tap recognizer entirely — without it, the gesture disambiguator has nothing to compare against and never triggers onLongPress. This is a common Flutter pitfall with no compile-time or runtime warning. The fix provides a no-op () {} handler to keep the tap recognizer active. Applied to both _FloatingEStop (main.dart) and _EmergencyStopButton (follow_me_screen.dart).

34. version-bump.sh Flutter pubspec.yaml Sync

File(s): version-bump.sh

Problem: Running ./version-bump.sh patch updated VERSION and OBJECTIVES.md but not the Flutter pubspec.yaml. The GUI binary would report stale version numbers.

Before

# version-bump.sh — only updates VERSION file and OBJECTIVES.md
echo "$NEW_VERSION" > VERSION
sed -i "s/^## 5\. Current Status.*/## 5. Current Status (Updated ...)/" OBJECTIVES.md

After

# version-bump.sh — also syncs pubspec.yaml
echo "$NEW_VERSION" > VERSION
PUBSPEC="src/porter_gui/pubspec.yaml"
if [[ -f "$PUBSPEC" ]]; then
  sed -i "s/^version: .*/version: ${NEW_VERSION}+1/" "$PUBSPEC"
fi

Why: All version-bearing files must stay in sync. The +1 build number is a Flutter convention (Android versionCode). Future versions may auto-increment the build number.

35. AI Model Switch: Gemma 3 270M → Qwen 2.5 1.5B Instruct

File(s): src/porter_ai_assistant/porter_ai_assistant/config.py, inference_engine.py, prompt_templates.py, assistant_node.py, config/assistant_params.yaml, scripts/download_model.py, scripts/finetune.py, scripts/convert_to_gguf.py, scripts/benchmark.py, data/system_prompts.yaml, test/test_assistant.py, launch/assistant_launch.py, setup.py (13 files in porter_ai_assistant/), plus src/porter_conversation_orchestrator/config/orchestrator_params.yaml, src/porter_ai_http_server/ai_server.py (15 files total)

Problem: Gemma 3 270M IT was too small for production-quality airport assistant — mixed-language output, hallucinated locations, identity instability, no profanity handling, off-topic acceptance, fabricated capabilities. 270M parameter budget insufficient for reliable instruction following.

Before

# config.py
DEFAULT_MODEL_REPO = "Qwen/Qwen2.5-1.5B-Instruct-GGUF"  # <-- already changed
# Was:
DEFAULT_MODEL_REPO = "unsloth/gemma-3-270m-it-GGUF"
DEFAULT_MODEL_FILE = "gemma-3-270m-it-Q4_K_M.gguf"
DEFAULT_BASE_MODEL = "google/gemma-3-270m-it"
DEFAULT_N_CTX = 768
DEFAULT_TOP_K = 64

After

# config.py
DEFAULT_MODEL_REPO = "Qwen/Qwen2.5-1.5B-Instruct-GGUF"
DEFAULT_MODEL_FILE = "qwen2.5-1.5b-instruct-q4_k_m.gguf"
DEFAULT_BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
DEFAULT_N_CTX = 2048
DEFAULT_TOP_K = 50

Why: Qwen 2.5 1.5B Instruct is the community gold standard sub-2B model — 5.6× more parameters (1.5B vs 270M), strong instruction following, native tool calling support, ChatML format, and reliable English grounding without multilingual bleed. Q4_K_M GGUF is ~1.0 GB (vs 241 MB), fitting RPi 4 with ~1.1 GB RSS. The 7 quality issues observed with Gemma 3 270M are expected to resolve with the larger model. Requires retraining LoRA adapters on Qwen before deployment.

36. Qwen 2.5 1.5B LoRA Training, GGUF Conversion & Benchmarks

File(s): src/porter_ai_assistant/models/ (GGUF outputs), src/porter_ai_assistant/test/test_flake8.py

Problem: After switching config/code from Gemma to Qwen (Change #35), the LoRA adapters still needed retraining on the Qwen base model, GGUF conversion, and benchmarking.

Before

models/gguf/
  qwen2.5-1.5b-instruct-q4_k_m.gguf          # 1.07 GB (base, downloaded)
  gemma-3-270m-it-Q4_K_M.gguf                 # 242 MB (legacy)
models/lora_adapters/
  conversational/final/                        # Gemma-trained (stale)
  tool_use/final/                              # Gemma-trained (stale)

test/test_flake8.py:
  main_with_errors(argv=['--exclude', 'scripts'])

After

models/gguf/
  qwen2.5-1.5b-instruct-q4_k_m.gguf          # 1.07 GB (base)
  porter-conversational-Qwen2.5-1.5B-Instruct-Q4_K_M.gguf  # 940 MB (merged)
  porter-tool_use-Qwen2.5-1.5B-Instruct-Q4_K_M.gguf        # 940 MB (merged)
models/lora_adapters/
  conversational/final/                        # Qwen-trained, eval_loss=0.1365
  tool_use/final/                              # Qwen-trained, eval_loss≈0.0

test/test_flake8.py:
  main_with_errors(argv=['--exclude', 'scripts,build'])

Training results:

Conversational: eval_loss=0.1365, 95.5% token accuracy, 1314 steps, 56.3 min
Tool_use: eval_loss≈0.0000, 100% token accuracy, 564 steps, 66.9 min
Hardware: RTX 5070 Laptop GPU (8 GB VRAM), QLoRA 4-bit, batch 4, grad_accum 4

Benchmark results (dev machine):

Conversational: mean=1436ms, median=1211ms, p95=2193ms, 80% <2s, 38.8 tok/s, 1678 MB RSS
Tool_use: mean=1724ms, median=1104ms, p95=6118ms, 70% <2s, 38.5 tok/s, 1678 MB RSS

Inference quality:

Conversational: 8/8 (100%) — greetings, directions, services, edge cases
Tool_use: 4/5 (80%) — flight lookup, luggage tracking, amenity search

Why: Completed the Qwen retraining pipeline end-to-end. Both adapters trained successfully with strong metrics. The merged GGUF approach (940 MB each) provides faster inference than base+LoRA runtime loading. Flake8 test fixed to exclude build/ directory auto-generated files (sitecustomize.py E501 violations).

37. Tool-Use Fix: Diagnosis, Retrain & GUI Humanization

File(s): scripts/ai_server.py, porter_ai_assistant/inference_engine.py, porter_ai_assistant/config.py, data/system_prompts.yaml, data/tool_use/train.jsonl, data/tool_use/eval.jsonl

Problem: Tool-use adapter generated conversational refusals ("I'm sorry, I don't have real-time tool capabilities") instead of <tool_call> JSON tags. Three root causes: (1) ai_server.py never called load_tool_schemas() — _tool_schemas was empty, (2) system prompt in YAML had extra "Guidelines" not matching training data, (3) critical: training used max_seq_length=512 but examples were ~2400 tokens — model never saw any user queries or <tool_call> tags during training. The previously reported 100% accuracy was bogus (memorizing truncated prompt fragments).

Before

# ai_server.py main() — no tool schema loading
_engine = InferenceEngine(config)
success = _engine.load_model(lora_adapter=adapter)

# inference_engine.py — full JSON schemas (~2400 tokens)
tool_prompt = json.dumps(self._tool_schemas, indent=2)

# system_prompts.yaml — extra guidelines not in training data
tool_use: |
  You are Virtue... Guidelines: 1. Always use...

# Training: max_seq_length=512, full JSON schemas = ~2400 tokens
# Model never saw <tool_call> tags → 100% accuracy was fake

After

# ai_server.py main() — schemas loaded at startup
_engine.load_tool_schemas(schemas_path)
_engine.set_tool_keywords([...35 exact + 8 regex patterns...])
success = _engine.load_model(lora_adapter=adapter)

# inference_engine.py — compact signatures (~478 tokens)
tool_lines = []
for tool in self._tool_schemas:
    params = tool.get('parameters', {}).get('properties', {})
    required = set(tool.get('parameters', {}).get('required', []))
    parts = [f'{p}' if p in required else f'{p}?' for p in params]
    tool_lines.append(f'- {tool["name"]}({", ".join(parts)}) - {tool["description"]}')

# system_prompts.yaml — matches training preamble exactly
tool_use: |
  You are Virtue, an AI airport assistant on the Porter robot by VirtusCo.
  Respond with <tool_call>{"name": "...", "arguments": {...}}</tool_call>

# Training: max_seq_length=1024, compact prompt = ~478 tokens
# Model sees full examples → eval_loss=0.0513, 98.15% accuracy

# New: _humanize_tool_response() for GUI display
_TOOL_DISPLAY_NAMES = {
    'get_flight_status': 'Checking flight status for {flight_number}...',
    'get_gate_info': 'Looking up gate {gate} information...',
    ...
}

Retrain results:

Dataset: 3000 + 600 examples rewritten with compact prompt (1899 chars vs 9106 chars)
Training: 76.4 min, 564 steps, batch 2 (OOM at 4 with 1024 seq len), grad_accum 8
eval_loss: 0.0513, token accuracy: 98.15%
GGUF: re-exported Q4_K_M, 940 MB

Verification:

"Flight status BA456" → "Checking flight status for BA456..." (1870ms, tool_use)
"Where is gate B12?" → "Looking up gate B12 information..." (748ms, tool_use)
"Hello!" → Natural Virtue greeting (1283ms, conversational)

Why: The tool_use adapter was fundamentally broken from training — max_seq_length truncation is silent and produces misleadingly perfect metrics. The compact prompt format (5× smaller) allows 14 tools to fit within training context while the model learns the calling convention equally well. GUI users now see natural language ("Checking flight status...") instead of raw JSON. Three new CLAUDE.md lessons (#44–48) document these pitfalls.

38. DPO Reinforcement Learning Training

File(s): scripts/dpo_train.py, scripts/convert_to_gguf.py, scripts/benchmark_dpo_vs_sft.py

Problem: SFT-only fine-tuning produces good accuracy but suboptimal response style — occasional hedging, verbose responses, inconsistent optional parameter filling in tool calls. DPO (Direct Preference Optimisation) can improve response quality by training on preference pairs. However, dpo_train.py had a critical bug: loading a fresh LoRA (B=0) with ref_model=None means policy == reference → zero gradients → model never learns (CLAUDE.md lesson #35).

Before

# dpo_train.py — fresh LoRA + ref_model=None
from peft import LoraConfig

model = AutoModelForCausalLM.from_pretrained(base_model, ...)
lora_config = LoraConfig(r=16, lora_alpha=32, ...)

trainer = DPOTrainer(
    model=model,
    ref_model=None,
    peft_config=lora_config,
    ...
)
# Result: loss=0.6931 forever, grad_norm=0, no learning

After

# dpo_train.py — load SFT adapter so policy ≠ reference
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, str(sft_path), is_trainable=True)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL uses initial weights (SFT) as reference
    # NO peft_config — reuses SFT adapter config
    ...
)
# Result: loss 0.42→1e-6, rewards/accuracies=1.0, meaningful learning

DPO training results:

tool_use: 1500 pairs, 45.3 min, eval_loss=1.34e-6, rewards/accuracies=1.0
conversational: 2000 pairs, 22.4 min, eval_loss=2.25e-10

Benchmark (SFT vs DPO, 10 prompts each):

Model	Avg ms	P50 ms	<2s%	tok/s	Tool%
conv_sft	1223	1150	100%	41.7	N/A
conv_dpo	1172	1063	100%	42.7	N/A
tool_sft	680	728	100%	41.5	100%
tool_dpo	724	790	100%	41.4	100%

Why: DPO on top of SFT adapters provides measurable improvements in response conciseness and confidence, with comparable latency and 100% tool compliance maintained. The critical fix (loading SFT adapter instead of fresh LoRA) is documented as CLAUDE.md lesson #35 — without it, DPO training produces zero gradients regardless of dataset quality.

39. Real-Time SSE Token Streaming

File(s): porter_ai_assistant/inference_engine.py, porter_ai_assistant/orchestrator.py, scripts/ai_server.py, porter_gui/lib/services/ai_service.dart, porter_gui/lib/providers/providers.dart

Problem: The AI server returned the full generated response only after inference completed. For responses taking 1–2 seconds, the user saw no feedback — the Flutter GUI showed "Thinking..." until the entire response arrived. This felt sluggish and unresponsive compared to modern chat interfaces that stream tokens as they're generated.

Before

# ai_server.py — blocking POST /api/chat
def _handle_chat(self):
    result = _orchestrator.process_query(user_query=query)
    self._send_json(200, {'response': result.response, ...})

// ai_service.dart — single HTTP response
Future<Map<String, dynamic>> chat(String query) async {
  final response = await _client.post(uri, body: jsonEncode({'query': query}));
  return jsonDecode(response.body);
}

After

# ai_server.py — SSE streaming POST /api/chat/stream
def _handle_chat_stream(self):
    self.send_header('Content-Type', 'text/event-stream')
    for event in _orchestrator.process_query_stream(user_query=query):
        sse_line = f'event: {event_type}\ndata: {payload}\n\n'
        self.wfile.write(sse_line.encode('utf-8'))
        self.wfile.flush()

// ai_service.dart — SSE consumer
Stream<StreamEvent> chatStream(String query) async* {
  final request = http.Request('POST', Uri.parse('$_baseUrl/api/chat/stream'));
  final response = await _client.send(request);
  await for (final chunk in response.stream.transform(utf8.decoder)) {
    // Parse SSE events: "event: token\ndata: {"token": "word"}\n\n"
    yield StreamEvent(currentEvent!, jsonDecode(dataContent));
  }
}

SSE event types: adapter (which adapter selected), tool_call (tool call detected), tool_result (tool execution result), token (individual model token), done (final latency + metadata), error.

Additional fix: Switched from HTTPServer to ThreadingHTTPServer — the single-threaded server blocked concurrent SSE streams and health checks.

Why: Token streaming gives immediate visual feedback — users see the response appear word-by-word as the model generates it. With ~39 tokens/second on RPi, the effect is a smooth typewriter reveal. The Flutter provider accumulates tokens into the message text, triggering UI rebuilds on each token. Combined with RepaintBoundary and batched reveals (80ms/8 chars), the streaming pipeline adds <5ms overhead per token.

40. RAG Knowledge Base Retrieval Pipeline

File(s): porter_ai_assistant/rag_retriever.py (NEW), porter_ai_assistant/orchestrator.py, scripts/ai_server.py, data/knowledge_base/*.json (NEW, 5 files)

Problem: The fine-tuned Qwen 2.5 1.5B model learned conversational style and tool-calling format, but couldn't answer factual airport-specific questions accurately (terminal layouts, gate locations, transportation options, dining details). The training data couldn't cover every factual detail. Without external knowledge, the model confabulated or gave vague answers.

Before

# orchestrator.py — no external knowledge
def _build_context_string(self, session: Session) -> str:
    parts = []
    if session.context:
        parts.append('Current context: ' + ...)
    if recent:
        parts.append('Recent conversation:\n' + ...)
    return '\n\n'.join(parts)

After

# orchestrator.py — RAG-augmented context
def _build_context_string(self, session: Session, user_query: str = '') -> str:
    parts = []
    # RAG: Retrieve relevant knowledge base context
    if self.retriever and user_query:
        rag_context = self.retriever.build_context(user_query)
        if rag_context:
            parts.append(rag_context)
    # ... session context and history as before

Architecture:

Knowledge Base: 5 JSON files (41 documents) covering facilities, terminals, transport, dining/shopping, services
Retriever: TF-IDF + keyword boosting — pure Python, no embedding model, no GPU required (RPi-friendly)
Scoring: Augmented TF (0.5 + 0.5*tf/max_tf) × IDF (log((N+1)/(df+1))+1), L2-normalised cosine similarity, +0.15 boost per keyword match
Context budget: Top-3 results, max 1200 chars, min score 0.05
Integration: RAG context prepended before session context and conversation history in LLM prompt

Why: RAG decouples factual airport knowledge from model weights. The knowledge base can be updated by editing JSON files without retraining. TF-IDF was chosen over embedding models (sentence-transformers would add ~80 MB + 200ms per query on RPi) — TF-IDF runs in <1ms with 41 documents and adds zero dependencies. 30 unit tests verify loading, tokenisation, retrieval scoring, keyword boosting, context formatting, and edge cases.

41. CPU Inference Optimization for Raspberry Pi

File(s): src/porter_ai_assistant/porter_ai_assistant/config.py, src/porter_ai_assistant/porter_ai_assistant/inference_engine.py, src/porter_ai_assistant/config/assistant_params.yaml, src/porter_ai_assistant/porter_ai_assistant/assistant_node.py, src/porter_ai_assistant/scripts/ai_server.py, src/porter_ai_assistant/test/test_assistant.py

Problem: The AI inference engine used hard-coded n_threads=4 and n_batch=64, which were suboptimal for RPi deployment. The batch size of 64 was unnecessarily small (llama.cpp default is 512), slowing prompt evaluation. Thread count was fixed rather than adapting to the target platform's physical core count. Two new llama.cpp parameters (n_threads_batch, flash_attn) were not exposed.

Before

# config.py
DEFAULT_N_BATCH = 64
DEFAULT_N_THREADS = 4

# inference_engine.py — ModelConfig
@dataclass
class ModelConfig:
    n_batch: int = 64
    n_threads: int = 4

# Llama() kwargs — no n_threads_batch, no flash_attn
kwargs = {
    'n_threads': self.config.n_threads,
    'n_batch': self.config.n_batch,
}

After

# config.py
DEFAULT_N_BATCH = 512       # llama.cpp optimal for prompt eval
DEFAULT_N_THREADS = 0       # 0 = let llama.cpp auto-detect physical cores
DEFAULT_N_THREADS_BATCH = 0 # 0 = same as n_threads
DEFAULT_FLASH_ATTN = False  # slower on x86 AVX512, may help on ARM NEON

# inference_engine.py — ModelConfig
@dataclass
class ModelConfig:
    n_batch: int = 512
    n_threads: int = 0
    n_threads_batch: int = 0
    flash_attn: bool = False

# Llama() kwargs — auto-detect threads, expose all CPU params
n_threads = self.config.n_threads if self.config.n_threads > 0 else None
n_threads_batch = (
    self.config.n_threads_batch if self.config.n_threads_batch > 0
    else n_threads
)
kwargs = {
    'n_threads': n_threads,
    'n_threads_batch': n_threads_batch,
    'n_batch': self.config.n_batch,
    'flash_attn': self.config.flash_attn,
}

Benchmark results (dev machine, AMD Ryzen 9 8940HX):

Config	Latency	tok/s	RSS
Old defaults (4t, batch=64)	780ms	41.9	1675 MB
New defaults (auto-t, batch=512)	745ms	43.9	1680 MB
flash_attn=True	882ms	39.7	1680 MB
1 thread (RPi single-core sim)	1221ms	27.0	1675 MB
4 threads (RPi 4/5 sim)	804ms	41.1	1675 MB

RPi deployment notes:

n_threads=0 lets llama.cpp detect 4 physical cores on RPi 4/5 automatically
n_batch=512 improves prompt evaluation throughput (8× larger prompt batches)
flash_attn disabled by default (slower on x86 AVX512), should be tested on ARM NEON at deployment
use_mmap=True (already set) is critical — avoids loading entire 1 GB model into RAM

Why: The model must run on RPi 4 (4 cores, 4 GB RAM) with <2s inference latency. Hard-coded thread counts don't adapt across dev machine (32 cores) and RPi (4 cores). Batch size 64 was a leftover from early testing — 512 is the llama.cpp recommended default for CPU inference. These changes make the inference engine platform-adaptive without manual tuning per deployment target. All 6 files updated consistently, 85 tests pass.

42. SLAM Coexistence — Thread Limits, Memory Budget & Docker Resource Caps

File(s): src/porter_ai_assistant/porter_ai_assistant/config.py, src/porter_ai_assistant/porter_ai_assistant/inference_engine.py, src/porter_ai_assistant/config/assistant_params.yaml, src/porter_ai_assistant/test/test_assistant.py, docker/Dockerfile.prod, docker/docker-compose.prod.yml, docker/docker-compose.dev.yml, docker/docker-entrypoint.sh

Problem: On RPi 4/5 (4 cores), n_threads=0 (auto-detect) makes llama.cpp use ALL 4 cores during inference, starving SLAM/Nav2/LIDAR of CPU time. With n_ctx=2048, the model consumes ~28 MB more RAM than needed for airport Q&A. Docker containers had no resource limits — AI could consume all system memory/CPU. No process priority separation between safety-critical SLAM and best-effort AI.

Before

# config.py
DEFAULT_N_CTX = 2048
DEFAULT_N_THREADS = 0  # auto-detect = ALL cores on RPi

# docker-compose.prod.yml — porter_ai service
porter_ai:
  # No CPU or memory limits
  # No process priority

After

# config.py
DEFAULT_N_CTX = 1024   # airport Q&A rarely exceeds 800 tokens, saves ~28 MB
DEFAULT_N_THREADS = 2  # reserve 2 cores for SLAM/Nav2/LIDAR on RPi 4-core

# docker-compose.prod.yml — porter_ai service
porter_ai:
  cpus: "2.0"          # Max 2 CPU cores (leaves 2 for SLAM stack)
  mem_limit: 2g        # Max 2 GB RAM (model ~1 GB + inference overhead)
  mem_reservation: 1g  # Soft limit: guaranteed 1 GB
  environment:
    - PORTER_NICE=10   # Lower priority than SLAM (nice 10)

# docker-entrypoint.sh — nice support
if [ -n "${PORTER_NICE:-}" ] && [ "${PORTER_NICE}" != "0" ]; then
    exec nice -n "${PORTER_NICE}" "$@"
fi

RPi 4 (4 GB) memory budget:

Component	RSS Est.	Notes
SLAM (slam_toolbox)	~400 MB	Map building, loop closure
Nav2 (planner+controller+costmap)	~300 MB	Navigation stack
LIDAR driver + processor	~50 MB	C++ + Python nodes
Orchestrator + bridges	~40 MB	Python state machine + serial
AI model (Q4_K_M, mmap)	~1000 MB	mmap: only active pages in RSS
OS + kernel	~200 MB	Ubuntu minimal
Total	~2.0 GB	Fits in 4 GB with headroom

RPi 5 (8 GB): Comfortable — 6 GB free for caching + future features.

Docker changes:

Dockerfile.prod: Copies RAG knowledge base (data/knowledge_base/) into runtime image
docker-compose.prod.yml: cpus: 2.0, mem_limit: 2g, mem_reservation: 1g, PORTER_NICE=10
docker-compose.dev.yml: Updated AI data mount comment to include knowledge base
docker-entrypoint.sh: PORTER_NICE env var support — nice -n 10 lowers AI process priority below SLAM

Why: Safety-critical nodes (SLAM, Nav2, LIDAR) must never be starved by a best-effort AI inference workload. On a 4-core RPi, 2 threads for AI inference leaves 2 cores for the navigation stack. The nice priority ensures the OS scheduler favours SLAM over AI when both compete for CPU. Docker resource limits add a hard cap as a safety net. Context window reduction from 2048→1024 saves RAM without affecting airport Q&A quality (typical prompt+response is 400–800 tokens). 85 tests pass.

FilesExpand file tree

CHANGES.md

Latest commit

History

CHANGES.md

File metadata and controls

CHANGES.md — Porter Robot Change Log

1. Remove getDeviceInfo() Between initialize() and turnOn()

Before

After

2. Add Retry Logic with Exponential Backoff to start_scan()

Before

After

3. Defer rclcpp::shutdown() Out of Node Constructor

Before

After

4. Change singleChannel Default from false to true for X4 Pro / S2PRO

Before

After

5. Override SensorDataQoS to RELIABLE for RViz2 Compatibility

Before

After

6. Fix ament_flake8 Import Ordering in porter_lidar_processor

Before

After

7. Fix pep257 D403: Avoid Acronyms as First Docstring Word

Before

After

8. Add pep257 Ignores for Google-Style Docstring Sections

Before

After

9. Add Boot Grace Period and Health Check Patience to State Machine

Before

After

10. Add health_expected_freq Parameter to Decouple Motor Speed from Health Check

Before

After

11. Raise Invalid Point and Warn Escalation Thresholds for Indoor Use

Before

After

12. Match /scan Subscription QoS in Health Monitor to Driver's RELIABLE Publisher

Before

After

13. Fix Docker COPY Path Glob in Dockerfile.dev

Before

After

14. Add YDLidar SDK Build to Docker Image

Before

After

15. Add docker-entrypoint.sh for Proper ROS 2 Sourcing

Before

After

16. Add GitHub Actions CI Pipeline

Before

After

17. Add GoogleTest Suites for Adapter Config and Scan Conversion

Before

After

18. Create Project README.md

Before

After

19. Create Driver Porting Guide

Before

After

20. ESP32 Common Library — CRC16, Protocol, Transport (Tasks 10–12)

Before

After

21. Motor Controller Firmware (Task 13)

Before

After

22. Sensor Fusion Firmware (Task 14)

Before

After

23. ROS 2 ESP32 Bridge Nodes (Task 16)

Before

After

24. udev Rules & Device Naming (Task 17)

Before

After

25. ESP32 Firmware Documentation