A non-intrusive data collector for the X-Environment (Integrated Control System for High-throughput Tomography). It aggregates control system data from Tango and/or TINE servers, tracks attribute availability, persists downtime intervals to MariaDB, and exposes live metrics via HTTP/Prometheus.
- Java 21
- Maven 3.8+
- Tango 9+ (for Tango attributes) / TINE (for TINE attributes)
- MariaDB 11 (optional — required for downtime persistence)
```bash
# Build fat JAR
mvn clean package -Dmaven.test.skip=true

# Start MariaDB (and optional Tango stack)
docker compose up -d status-server-db

# Run
java -jar target/status-server-*.jar path/to/config.xml [http-port]
# http-port defaults to 9190
```

Or run the published container image:

```bash
docker run -v /path/to/etc:/app/etc \
  -e SS_CONFIG=/app/etc/config.xml \
  -p 9190:9190 \
  ghcr.io/scientific-software-hub/status-server:latest
```

Configuration is an XML file. Minimal example:
```xml
<status-server stale-after="3" down-after="6">
  <devices>
    <device name="my/device/1" server="tango://host:10000">
      <attributes>
        <attribute name="Temperature" poll-delay="1000" interpolation="LINEAR"/>
      </attributes>
    </device>
  </devices>
</status-server>
```

Optional MariaDB section (omit to disable persistence):
```xml
<mariadb>
  <jdbc-url>jdbc:mariadb://localhost:3306/statusserver</jdbc-url>
  <user>ss</user>
  <password>ss</password>
</mariadb>
```

| Parameter | Description | Default |
|---|---|---|
| `stale-after` | Consecutive failures before UP→STALE | 3 |
| `down-after` | Consecutive failures before STALE→DOWN | 6 |
```
DeviceSource (XML)
        │
        ▼
EngineFactory ──► Engine
                    │
          ┌─────────┴──────────┐
          ▼                    ▼
      PollTask             EventTask
          │                    │
          ▼                    ▼
EventSink<SingleRecord<?>>  EventSink<TechnicalEvent>
      (telemetry)          (AvailabilityAnalyzer)
          │                    │
          ▼                    ▼
   InMemoryWriter         EventSink<DomainEvent>
  (Snapshot only)        (EventDispatcher fan-out)
          │                    │
          ▼               ┌────┴────┐
   MetricsServer       logger    MariaDbSink
  (HTTP /metrics)
```
- `EventSink<T>` — unified observer interface replacing the old `RecordWriter` and `TechnicalEventListener`. Everything that consumes events implements this single generic interface.
- `EventDispatcher<T>` — fan-out dispatcher. Calls all registered sinks and isolates failures per sink.
- `AvailabilityAnalyzer` — consumes `TechnicalEvent`s, maintains per-attribute state machines (UP/STALE/DOWN), emits `DomainEvent`s.
- `AttributeAvailability` — per-attribute state machine:
  - consecutive failures ≥ `stale-after`: UP → STALE
  - consecutive failures ≥ `down-after`: STALE → DOWN (+ `DowntimeOpened`)
  - any success: any state → UP (+ `DowntimeClosed` if coming from DOWN)
- `InMemoryWriter` — snapshot-only in-memory store backing the `/metrics` endpoint.
- `MariaDbSink` — persists domain events to MariaDB in ERPNext-compatible tables. Reconnects automatically on failure.
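The transition rules above can be sketched as a small counter-based state machine. This is a rough illustration with hypothetical names — the real `AttributeAvailability` class may look different:

```java
// Illustrative sketch of the consecutive-failure state machine described above.
// Class and method names are hypothetical, not the project's actual API.
public class AvailabilitySketch {
    public enum State { UP, STALE, DOWN }

    private final int staleAfter;
    private final int downAfter;
    private int consecutiveFailures = 0;
    private State state = State.UP;

    public AvailabilitySketch(int staleAfter, int downAfter) {
        this.staleAfter = staleAfter;
        this.downAfter = downAfter;
    }

    /** Record one failed read; returns the resulting state. */
    public State onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= downAfter) {
            state = State.DOWN;          // would emit DowntimeOpened on first entry
        } else if (consecutiveFailures >= staleAfter) {
            state = State.STALE;
        }
        return state;
    }

    /** Record one successful read; any success resets to UP. */
    public State onSuccess() {
        // would emit DowntimeClosed here if the previous state was DOWN
        consecutiveFailures = 0;
        state = State.UP;
        return state;
    }

    public State state() { return state; }
}
```

With the defaults `stale-after="3"` and `down-after="6"`, the third consecutive failure moves an attribute to STALE, the sixth to DOWN, and a single success returns it to UP.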
| Endpoint | Description |
|---|---|
| `GET /metrics` | Prometheus gauge format. Each attribute emits a value gauge and `_up` (1 = healthy, 0 = failing). |
| `GET /health` | Liveness — always 200. |
| `GET /ready` | Readiness — 503 until the engine has started. |
StatusServer classifies each read outcome as a technical event:
| Event | Trigger |
|---|---|
| `ReadSuccess` | Successful attribute read |
| `ReadFailure` | Client exception during read |
| `Timeout` | Read timed out |
| `Disconnect` / `Reconnect` | Connection lost / restored |
The AvailabilityAnalyzer aggregates these per attribute and emits domain events when thresholds are crossed:
| Domain Event | Meaning |
|---|---|
| `AvailabilityTransitioned` | State changed (UP↔STALE↔DOWN) |
| `DowntimeOpened` | Attribute entered DOWN state |
| `DowntimeClosed` | Attribute recovered from DOWN |
Three ERPNext-compatible tables (standard `tab{DocType}` naming, standard audit columns):

- `tabState Transition` — full history of every state change
- `tabCurrent State` — one row per attribute, UPSERT on every transition
- `tabDowntime Interval` — one open row per active downtime, closed on recovery
Apply schema to a fresh database:
```bash
# Fresh container (wipes existing data)
docker compose down -v
docker compose up -d status-server-db

# Or apply manually to a running container
docker exec -i status-server-db mariadb -u ss -pss statusserver < db/schema.sql
```

On startup, if MariaDB is configured, StatusServer reads `tabCurrent State` and seeds the in-memory state machines — attributes that were DOWN when the server last stopped resume tracking correctly.
The observability/ directory contains a pre-configured Prometheus + Grafana stack.
```bash
# Start the full stack (MariaDB + Prometheus + Grafana)
docker compose up -d status-server-db prometheus grafana

# Or bring everything up at once, including the Tango test stack
docker compose up -d
```

| Service | URL | Credentials |
|---|---|---|
| Prometheus | http://localhost:9090 | — |
| Grafana | http://localhost:3000 | admin / admin |
Prometheus scrapes StatusServer at `host.docker.internal:9190` every 5 seconds.
Two dashboards are provided in observability/grafana/. Import them via Dashboards → Import → Upload JSON.
Status Server Health (status-server-grafana-dashboard.json)
Operational overview — no time series. Designed for a monitoring wall or quick health check.
```
┌─────────────────────────────────────────┐
│ Monitored Attributes: N                 │
├────────────────────┬────────────────────┤
│ Max Signal Age     │ Freshness / Source │
├────────────────────┼────────────────────┤
│ Up Attributes      │ Failed Attributes  │
├────────────────────┴────────────────────┤
│ Device State & Status                   │
├─────────────────────────────────────────┤
│ Failed Attributes (table)               │
└─────────────────────────────────────────┘
```
Status Server Live Values (status-server-grafana-dashboard-v2.json)
Live signal explorer. Use the Alias variable to drill into a specific attribute.
```
┌─────────────────────────────────────────┐
│ Current Values Snapshot (table)         │
├─────────────────────────────────────────┤
│ All Signal Values (timeseries)          │
├─────────────────────────────────────────┤
│ Signal Freshness — All                  │
├────────────────────┬────────────────────┤
│ $alias — Value     │ $alias — Freshness │
├────────────────────┴────────────────────┤
│ Device State & Status                   │
└─────────────────────────────────────────┘
```
| Metric | Type | Description |
|---|---|---|
| `control_system_attribute_value` | gauge | Latest numeric value |
| `control_system_attribute_state` | gauge | Latest enum state (label `state=`) |
| `control_system_attribute_status` | gauge | Latest status text (label `status=`) |
| `control_system_attribute_up` | gauge | 1 = last read succeeded, 0 = failed |
| `control_system_attribute_age_seconds` | gauge | Age of the latest sample |
| `control_system_attribute_source_timestamp_seconds` | gauge | Source timestamp of the latest sample |
| `status_server_monitored_attributes` | gauge | Total attributes being monitored |
| `status_server_up_attributes` | gauge | Attributes with a successful last read |
| `status_server_failed_attributes` | gauge | Attributes with a failed last read |

All per-attribute metrics carry labels: `source`, `device`, `name`, `attribute`, `alias`.
```bash
# Build (skip environment-dependent tests)
mvn clean package -Dmaven.test.skip=true

# Run unit tests (no Tango/TINE required)
mvn test -Dtest="AvailabilityAnalyzerTest,StatusServerStatusServerConfigurationTest"

# Run all tests (requires live Tango)
mvn test
```

Safe unit tests (no external dependencies): `data2/`, `configuration/`, `engine2/AvailabilityAnalyzerTest`.
Frappe/ERPNext is used as a configuration front-end. It exports device/attribute lists as XML files. StatusServer reads that XML directly. There is no runtime Frappe dependency — this removes the HTTP round-trip on startup and makes the server runnable without an ERPNext instance.
RecordWriter (telemetry) and TechnicalEventListener (availability signals) were merged into a single EventSink<T> functional interface. Rationale: a writer is a listener — it reacts to events. The generic parameter carries the event type, keeping the type system honest while eliminating the duplicate observer hierarchies.
A single generic fan-out dispatcher replaces the telemetry-specific WriterDispatcher. Both the telemetry pipeline (SingleRecord<?>) and the domain event pipeline (DomainEvent) use the same implementation. Failures in one sink are logged and isolated; other sinks always receive the event.
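The pattern can be sketched as follows — illustrative names only (`Sink`, `FanOut`); the project's actual `EventSink`/`EventDispatcher` signatures may differ:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch: one generic observer interface plus a fan-out
// dispatcher that delivers each event to every sink, isolating failures.
@FunctionalInterface
interface Sink<T> {
    void onEvent(T event);
}

class FanOut<T> implements Sink<T> {
    private final List<Sink<T>> sinks = new CopyOnWriteArrayList<>();

    void register(Sink<T> sink) { sinks.add(sink); }

    @Override
    public void onEvent(T event) {
        for (Sink<T> sink : sinks) {
            try {
                sink.onEvent(event);
            } catch (RuntimeException e) {
                // A failing sink must not prevent delivery to the others.
                System.err.println("sink failed: " + e);
            }
        }
    }
}
```

Because the dispatcher itself implements the sink interface, pipelines compose: a `FanOut<DomainEvent>` can be registered anywhere a single sink is expected.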
AllRecords (full time-series history) was retired. The in-memory store now keeps only the latest value per attribute (Snapshot). Rationale: historical queries are served by MariaDB; keeping a second growing in-memory copy adds memory pressure with no remaining consumer.
stale-after and down-after are global values configured at the server level, not per attribute. Rationale: in this deployment all attributes are polled at similar rates; per-attribute thresholds add configuration complexity without practical benefit.
Tables follow ERPNext naming (tab{DocType}) and include the standard Frappe audit columns (name, creation, modified, modified_by, owner, docstatus, idx). Rationale: rows can be imported into or consumed by a Frappe/ERPNext instance without transformation, enabling ERP-level downtime reporting and billing workflows.
MariaDbSink uses a single JDBC connection with isValid() check before each use and silent reconnect on failure. Rationale: domain events are low-frequency (state changes, not every poll); a full connection pool (HikariCP etc.) adds a dependency and warm-up complexity for negligible benefit at this throughput.
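The "validate before use, reconnect on failure" idea can be sketched generically. The `Conn` interface below is a hypothetical stand-in for `java.sql.Connection` (whose real validity check is `isValid(int timeoutSeconds)`), so the sketch stays runnable without a database:

```java
import java.util.function.Supplier;

// Hypothetical sketch of a single-connection sink with silent reconnect.
// Conn stands in for java.sql.Connection; the supplier stands in for
// something like DriverManager.getConnection(...).
interface Conn {
    boolean isValid();   // analogous to Connection.isValid(timeoutSeconds)
    void close();
}

class ReconnectingSink {
    private final Supplier<Conn> openConnection;
    private Conn conn;

    ReconnectingSink(Supplier<Conn> openConnection) {
        this.openConnection = openConnection;
    }

    /** Returns a live connection, silently reopening it if the old one died. */
    Conn connection() {
        if (conn == null || !conn.isValid()) {
            if (conn != null) conn.close();
            conn = openConnection.get();   // reconnect on demand
        }
        return conn;
    }
}
```

Each write path asks for `connection()` first; at state-change frequency the extra validity check per event is negligible.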
The HTTP server and engine polling tasks use Thread.ofVirtual(). This allows a large number of concurrent blocking I/O operations (Tango/TINE reads) without the overhead of a large platform thread pool.
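As a rough illustration of the approach (not the project's actual code), virtual threads let each blocking read occupy its own cheap thread, so thousands of concurrent polls need no sized thread pool:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: simulate many concurrent blocking "reads" on virtual threads.
public class VirtualPollSketch {
    public static int runBlockingReads(int count) {
        AtomicInteger completed = new AtomicInteger();
        // One virtual thread per task; a blocking sleep parks cheaply
        // instead of pinning a platform thread.
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < count; i++) {
                pool.submit(() -> {
                    try {
                        Thread.sleep(10);   // stands in for a blocking Tango/TINE read
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    completed.incrementAndGet();
                });
            }
        } // ExecutorService.close() waits for all submitted tasks (Java 21)
        return completed.get();
    }
}
```

On Java 21, `Executors.newVirtualThreadPerTaskExecutor()` and the auto-closing `ExecutorService` are standard; a thousand such tasks complete in roughly the time of one, since the sleeps overlap.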