Inconsistency With Remote Predictions in Anomaly Detectors

In datasets with many missing values in categorical fields, local predictions for anomaly detectors can differ dramatically from remote predictions:
```
In [3]: from bigml.anomaly import Anomaly                                       

In [4]: from bigml.api import BigML                                             

In [5]: data = csvload('shapsplain/lc.csv')                                     

In [6]: jt = jload('shapsplain/testtree.json')                                  

In [7]: rid = jt['resource']                                                    

In [8]: api = BigML()                                                           

In [9]: api.create_anomaly_score(rid, data[1])['object']['score']               
Out[9]: 0.47451

In [10]: local = Anomaly({'object': jt, 'resource': jt['resource']})            

In [11]: local.anomaly_score(data[1])                                           
Out[11]: 0.6618801118575603
```
Specifically, the issue appears to be here:

https://github.com/bigmlcom/python/blob/master/bigml/predicate.py#L242

The tree above has predicates on categorical fields that use the `in` operator. If the input value is `null` and `null` is also in the predicate's set of true values, the predicate still evaluates to false: a missing input drops into the condition linked above, which relies on the predicate's `.missing` attribute, and that attribute does not get set in this case.
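To make the failure mode concrete, here is a minimal, self-contained sketch of the behaviour described above. `SketchPredicate` is a hypothetical stand-in, not the actual class in `bigml/predicate.py`, and the field name `grade` is made up. It only assumes that a missing input is decided by the `missing` flag and that the flag is set solely from a trailing `*` on the operator.

```python
class SketchPredicate:
    """Hypothetical, simplified stand-in for a tree predicate."""

    def __init__(self, operation, field, value):
        self.missing = False
        # Assumption: a trailing "*" on the operator is the only thing
        # that marks the predicate as accepting missing values.
        if operation.endswith("*"):
            operation = operation[:-1]
            self.missing = True
        self.operator = operation
        self.field = field
        self.value = value

    def apply(self, input_data):
        actual = input_data.get(self.field)
        if actual is None:
            # Missing input: the outcome depends only on the flag,
            # so a None inside self.value is never consulted.
            return self.missing
        if self.operator == "in":
            return actual in self.value
        raise ValueError(f"unsupported operator: {self.operator}")


# None is one of the accepted values, yet a missing input does not match.
pred = SketchPredicate("in", "grade", ["A", "B", None])
result_missing = pred.apply({"grade": None})
result_present = pred.apply({"grade": "A"})
print(result_missing, result_present)
```

In this simplified model the missing input is rejected even though `None` is listed among the accepted values, which matches the symptom reported here.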

The solution appears to be to set `.missing` further up, here:

https://github.com/bigmlcom/python/blob/master/bigml/predicate.py#L136

Something like an additional clause:

```python
...
        elif operation == 'in' and None in value:
            self.missing = True
```

When I do that, consistency is restored:

```
In [1]: from bigml.anomaly import Anomaly

In [2]: data = csvload('shapsplain/lc.csv')                                     

In [3]: jt = jload('shapsplain/testtree.json')                                  

In [4]: local = Anomaly({'object': jt, 'resource': jt['resource']})             

In [5]: local.anomaly_score(data[1])                                            
Out[5]: 0.4745112731029698
```
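If this turns into a PR, a small regression check on the two scores might be useful. The sketch below uses only the numbers reported in this issue. `scores_agree` is a hypothetical helper, and the `1e-5` tolerance is an assumption based on the remote score being shown with five decimal places.

```python
import math


def scores_agree(remote_score, local_score, abs_tol=1e-5):
    """Compare a remote and a local anomaly score within a tolerance.

    Hypothetical helper; abs_tol is an assumed tolerance that allows
    for the remote score being rounded to five decimals.
    """
    return math.isclose(remote_score, local_score, rel_tol=0.0, abs_tol=abs_tol)


# Scores copied from the sessions in this issue.
remote = 0.47451
local_before_fix = 0.6618801118575603
local_after_fix = 0.4745112731029698

agree_before = scores_agree(remote, local_before_fix)
agree_after = scores_agree(remote, local_after_fix)
print(agree_before, agree_after)
```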

I can submit a PR, but I was a bit scared as that's fairly deep in the logic here.  Let me know if that looks like a good solution.

/cc @jaor @unmonoqueteclea 

Inconsistency With Remote Predictions in Anomaly Detectors #302
