Inconsistency With Remote Predictions in Anomaly Detectors #302

@charleslparker

In datasets with a significant number of missing categorical values, local anomaly scores can differ dramatically from remote ones. For the same input row, the remote score below is 0.47451 while the local score is about 0.66188:

In [3]: from bigml.anomaly import Anomaly                                       

In [4]: from bigml.api import BigML                                             

In [5]: data = csvload('shapsplain/lc.csv')                                     

In [6]: jt = jload('shapsplain/testtree.json')                                  

In [7]: rid = jt['resource']                                                    

In [8]: api = BigML()                                                           

In [9]: api.create_anomaly_score(rid, data[1])['object']['score']               
Out[9]: 0.47451

In [10]: local = Anomaly({'object': jt, 'resource': jt['resource']})            

In [11]: local.anomaly_score(data[1])                                           
Out[11]: 0.6618801118575603
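To see how many rows are affected rather than just `data[1]`, a small helper along these lines could be used. This is only a sketch: `score_gaps` and its parameters are hypothetical names, and the two scoring callables are passed in so the helper itself makes no assumptions about the API.

```python
def score_gaps(rows, remote_score, local_score, tolerance=1e-4):
    """Return (index, remote, local) for rows whose two scores disagree.

    remote_score and local_score are callables that take one input row
    and return a float. The tolerance absorbs rounding differences,
    such as a remote score reported with fewer decimal places.
    """
    gaps = []
    for index, row in enumerate(rows):
        remote = remote_score(row)
        local = local_score(row)
        if abs(remote - local) > tolerance:
            gaps.append((index, remote, local))
    return gaps
```

With the session above, `remote_score` could be `lambda row: api.create_anomaly_score(rid, row)['object']['score']` and `local_score` could be `local.anomaly_score`.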

Specifically, the issue appears to be here:

https://github.com/bigmlcom/python/blob/master/bigml/predicate.py#L242

The tree above has predicates on categorical fields that use the in operator. If the input value is null and the predicate's set of accepted values also contains null, the predicate should evaluate to true, but it still evaluates to false. With a missing input we drop into the condition linked above, which relies on the predicate's .missing attribute, and that attribute does not get set in this case.

The solution appears to be to set it, up here:

https://github.com/bigmlcom/python/blob/master/bigml/predicate.py#L136

Something like an additional clause:

...
        elif operation == 'in' and None in value:
            self.missing = True

When I do that, consistency is restored:


In [2]: data = csvload('shapsplain/lc.csv')                                     

In [3]: jt = jload('shapsplain/testtree.json')                                  

In [4]: local = Anomaly({'object': jt, 'resource': jt['resource']})             

In [5]: local.anomaly_score(data[1])                                            
Out[5]: 0.4745112731029698
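For anyone who wants to see the failure mode without the model file, here is a toy reproduction of the behavior described above. It is not the `bigml` class: `ToyPredicate`, the `patched` flag, and the `grade` field are all hypothetical, and the sketch assumes only what this issue states, namely that a missing input is resolved purely by the `missing` attribute.

```python
class ToyPredicate:
    """Hypothetical stand-in for a tree predicate; not bigml's Predicate."""

    def __init__(self, operation, field, value, patched=False):
        self.operation = operation
        self.field = field
        self.value = value
        # Assumption: nothing marks an `in` predicate as accepting
        # missing values unless the proposed extra clause is present.
        self.missing = False
        if patched and operation == 'in' and None in value:
            self.missing = True

    def apply(self, input_data):
        observed = input_data.get(self.field)
        if observed is None:
            # Missing input: the result depends only on the flag.
            return self.missing
        if self.operation == 'in':
            return observed in self.value
        raise ValueError('unsupported operation: %s' % self.operation)


row = {'grade': None}           # input with a missing categorical value
accepted = ['A', 'B', None]     # the predicate's set also contains null

unpatched_result = ToyPredicate('in', 'grade', accepted).apply(row)
patched_result = ToyPredicate('in', 'grade', accepted, patched=True).apply(row)
```

Here `unpatched_result` is `False` and `patched_result` is `True`, while inputs that are not missing behave the same either way.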

I can submit a PR, but I'm a bit wary since this change is fairly deep in the predicate logic. Let me know if it looks like a good solution.

/cc @jaor @unmonoqueteclea
