Inconsistency With Remote Predictions in Anomaly Detectors #302
Description
In datasets with many missing values in categorical fields, local predictions for anomaly detectors differ dramatically from remote predictions:
In [3]: from bigml.anomaly import Anomaly
In [4]: from bigml.api import BigML
In [5]: data = csvload('shapsplain/lc.csv')
In [6]: jt = jload('shapsplain/testtree.json')
In [7]: rid = jt['resource']
In [8]: api = BigML()
In [9]: api.create_anomaly_score(rid, data[1])['object']['score']
Out[9]: 0.47451
In [10]: local = Anomaly({'object': jt, 'resource': jt['resource']})
In [11]: local.anomaly_score(data[1])
Out[11]: 0.6618801118575603
Specifically, the issue appears to be here:
https://github.com/bigmlcom/python/blob/master/bigml/predicate.py#L242
The tree above has predicates on categorical variables that use the in operator. If the input value is null and null is also in the predicate's set of true values, the predicate still evaluates to false. Execution falls into the condition linked above, which relies on the predicate's .missing attribute, and that attribute is not set in this case.
The solution appears to be to set that attribute up here:
https://github.com/bigmlcom/python/blob/master/bigml/predicate.py#L136
Something like an additional clause:
...
elif operation == 'in' and None in value:
    self.missing = True
When I do that, consistency is restored:
In [2]: data = csvload('shapsplain/lc.csv')
In [3]: jt = jload('shapsplain/testtree.json')
In [4]: local = Anomaly({'object': jt, 'resource': jt['resource']})
In [5]: local.anomaly_score(data[1])
Out[5]: 0.4745112731029698
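To make the failure mode concrete, here is a minimal sketch of the behavior described above. ToyPredicate is a hypothetical, simplified stand-in written for this issue, not the actual class in bigml/predicate.py. The trailing "*" convention for operators that accept missing input and the patched flag are assumptions used only for illustration.

```python
class ToyPredicate:
    """Hypothetical, simplified stand-in for a tree predicate (not bigml's class)."""

    def __init__(self, operation, field, value, patched=False):
        self.operation = operation
        self.field = field
        self.value = value
        # Assumed convention: a trailing "*" marks operators that accept missing input.
        self.missing = operation.endswith("*")
        if self.missing:
            self.operation = operation[:-1]
        elif patched and operation == "in" and None in value:
            # The extra clause proposed in this issue.
            self.missing = True

    def apply(self, input_data):
        actual = input_data.get(self.field)
        if actual is None:
            # Missing input: the result depends only on the `missing` flag.
            return self.missing
        if self.operation == "in":
            return actual in self.value
        raise ValueError(f"unsupported operation: {self.operation}")


row = {"grade": None}          # null in the input data
values = ["A", "B", None]      # null in the set of true values

before = ToyPredicate("in", "grade", values).apply(row)
after = ToyPredicate("in", "grade", values, patched=True).apply(row)
print(before, after)
```

Without the extra clause the toy predicate rejects the null input even though null is listed among the accepted values. With the clause it accepts the null, which would send the row down the other branch of the tree and change the score.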
I can submit a PR, but I'm hesitant because this is fairly deep in the prediction logic. Let me know if this looks like a good solution.