operator_wrong_operator_image firing weird alerts #712

@Saadmrp1038

When I run the operator_wrong_operator_image problem, I get 4 alerts with no namespace attached and one alert with the hotel-reservation namespace. I don't have the hotel-reservation namespace deployed!

I added a logger to print the firing alerts and their namespaces. Since none of the alerts match the expected namespace (tidb-cluster), they all get filtered out. As a result, the oracle reports [WAIT] No alerts yet and keeps waiting until it times out.
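To make the filtering behaviour concrete, here is a minimal sketch of how an exact-match namespace filter drops every alert in this run. It is not the actual sregym oracle code: the `filter_by_namespace` helper and the dictionary shape are assumptions for illustration, and the alert records simply mirror the log lines in this issue.

```python
from typing import Optional

# Hypothetical alert records mirroring the log output in this issue.
# "namespace" is None when the alert carries no namespace label.
alerts = [
    {"name": "HTTPProbeFailure", "state": "firing", "namespace": "hotel-reservation"},
    {"name": "TiDBInstanceNotReady", "state": "firing", "namespace": None},
    {"name": "TiDBStatefulSetNotReady", "state": "firing", "namespace": None},
    {"name": "TiDBStatefulSetNotReady", "state": "firing", "namespace": None},
    {"name": "TiDBStatefulSetNotReady", "state": "firing", "namespace": None},
]


def filter_by_namespace(alerts: list, namespace: Optional[str]) -> list:
    """Keep only firing alerts whose namespace label equals `namespace` exactly."""
    return [
        a for a in alerts
        if a["state"] == "firing" and a.get("namespace") == namespace
    ]


matched = filter_by_namespace(alerts, "tidb-cluster")
# No alert carries namespace="tidb-cluster", so the list is empty
# and a caller polling on it would keep waiting.
print(f"{len(matched)} of {len(alerts)} alerts match 'tidb-cluster'")
```

With an exact match, the four TiDB alerts are excluded only because their namespace label is missing, even though they clearly relate to the tidb-cluster workload.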

run logs

INFO - all.sregym.conductor - [ENV] Injected fault
INFO - all.sregym.conductor - [WAIT] Waiting for alerts to fire in namespace 'tidb-cluster' (timeout=600s)…
INFO - sregym.conductor.oracles.alert_oracle - All Prometheus alerts (5 total, filtering for namespace='tidb-cluster'):
INFO - sregym.conductor.oracles.alert_oracle -    [firing] HTTPProbeFailure — namespace=hotel-reservation
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBInstanceNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBStatefulSetNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBStatefulSetNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBStatefulSetNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle - All Prometheus alerts (5 total, filtering for namespace='tidb-cluster'):
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBInstanceNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBStatefulSetNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBStatefulSetNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBStatefulSetNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] HTTPProbeFailure — namespace=hotel-reservation
INFO - sregym.conductor.oracles.alert_oracle - All Prometheus alerts (5 total, filtering for namespace='tidb-cluster'):
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBInstanceNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBStatefulSetNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBStatefulSetNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBStatefulSetNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] HTTPProbeFailure — namespace=hotel-reservation
INFO - sregym.conductor.oracles.alert_oracle - All Prometheus alerts (5 total, filtering for namespace='tidb-cluster'):
INFO - sregym.conductor.oracles.alert_oracle -    [firing] HTTPProbeFailure — namespace=hotel-reservation
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBInstanceNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBStatefulSetNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBStatefulSetNotReady — namespace=None
INFO - sregym.conductor.oracles.alert_oracle -    [firing] TiDBStatefulSetNotReady — namespace=None
INFO - all.sregym.conductor - [WAIT] No alerts yet — 38/600s elapsed

Kubernetes namespaces (hotel-reservation is not there)

$ kubectl get namespaces
NAME              STATUS   AGE
default           Active   6d5h
fleetcast         Active   4m17s
ingress-nginx     Active   18m
kube-flannel      Active   6d4h
kube-node-lease   Active   6d5h
kube-public       Active   6d5h
kube-system       Active   6d5h
observe           Active   6d4h
openebs           Active   6d4h
sregym            Active   6d4h
tidb-cluster      Active   16m
tidb-operator     Active   16m

A more detailed alert print shows that the hotel-reservation alert is an HTTP probe against ingress-nginx. But the ingress-nginx namespace was deployed by the fleetcast app, so I am not sure why the alert it fires gets tagged as hotel-reservation.

more detailed alerts

  [?] TiDBInstanceNotReady [firing]
    Detail:    TiDB instance basic-tidb.tidb-cluster.svc.cluster.local:10080 has been unreachable or has no connections for more than 2 minutes.

  [?] TiDBStatefulSetNotReady [firing]
    Detail:    StatefulSet basic-pd has 1 ready replicas but desired is higher

  [?] TiDBStatefulSetNotReady [firing]
    Detail:    StatefulSet basic-tikv has 1 ready replicas but desired is higher

  [?] TiDBStatefulSetNotReady [firing]
    Detail:    StatefulSet basic-tidb has 1 ready replicas but desired is higher

  [hotel-reservation] HTTPProbeFailure [firing]
    Detail:    HTTP probe to http://ingress-nginx-controller.ingress-nginx/api/hotels?inDate=2015-04-09&outDate=2015-04-10&lat=38.0235&lon=-122.095 in hotel-reservation is failing.
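One way to see where each namespace value comes from is to dump the full label set of every alert, grouped by its `namespace` label. The sketch below works on an already-parsed response from the Prometheus `/api/v1/alerts` endpoint. The `group_alerts_by_namespace` helper and the sample payload are hypothetical. In particular, the label names other than `alertname` and `namespace` are made up for illustration and may not match what this cluster's rules actually emit.

```python
from collections import defaultdict


def group_alerts_by_namespace(payload: dict) -> dict:
    """Group alert label sets from a parsed /api/v1/alerts response by namespace label.

    Alerts without a namespace label are grouped under the key None.
    """
    groups = defaultdict(list)
    for alert in payload.get("data", {}).get("alerts", []):
        labels = alert.get("labels", {})
        groups[labels.get("namespace")].append(labels)
    return dict(groups)


# Hypothetical payload: only the overall shape follows the Prometheus API;
# the label values are illustrative, not captured from the cluster.
sample_payload = {
    "status": "success",
    "data": {
        "alerts": [
            {"state": "firing",
             "labels": {"alertname": "HTTPProbeFailure", "namespace": "hotel-reservation"}},
            {"state": "firing",
             "labels": {"alertname": "TiDBInstanceNotReady", "instance": "basic-tidb"}},
            {"state": "firing",
             "labels": {"alertname": "TiDBStatefulSetNotReady", "statefulset": "basic-pd"}},
        ]
    },
}

for namespace, label_sets in group_alerts_by_namespace(sample_payload).items():
    for labels in label_sets:
        # Listing the label keys shows whether a namespace label exists at all.
        print(namespace, labels.get("alertname"), sorted(labels))
```

If the HTTPProbeFailure alert shows a `namespace` label while the probed target lives elsewhere, the value is probably set by the rule or scrape configuration rather than taken from a deployed namespace, but that is a guess to verify against the rule definitions.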
