Improve pipeline level heuristics for detecting blocking #570

@hellais

Description

In terms of blocking detection, I think we effectively have 2 classes of blocking phenomena we can detect:

  1. Those which are potentially “confirmable” via a fingerprint. Basically either dns.inconsistent or http-diff.
    a. For detecting these, I think we can improve on the heuristics that currently exist in the probe and redo them better in the pipeline. Things that are currently missing from the probe heuristics and are trivial to do:
    - ground truthing the dns measurement using tls handshake towards inconsistent IPs
    - excluding http-diff measurements when the site is https

    b. Other potential areas for improvement, which are a bit trickier:
    - doing PTR lookups for DNS
    - checking if the returned IP matches the probe_asn of the measurement (the idea is that the block IP is most likely on the same network as the probe)
    - using simhash distance to the control page
    - looking for blockpage fingerprints by searching for the pattern “same page but for different unrelated domains”

    c. For these cases, we can implement better heuristics and we can also compute the false positive rate by re-running them against only the labeled data (i.e. the measurements that are confirmed blocked by DNS and/or HTTP fingerprints).
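
The checks listed under class 1 could be combined roughly as in the sketch below. This is only an illustration: the `Observation` fields, the `classify` and `agreement_with_labels` functions, and the returned labels are hypothetical names and do not reflect the actual pipeline schema. It assumes the TLS handshake result and the ASN of the returned IP have already been computed upstream, and it scores the heuristic simply as the share of fingerprint-confirmed observations it still flags.

```python
from dataclasses import dataclass
from typing import Iterable, Optional
from urllib.parse import urlsplit


@dataclass
class Observation:
    # Hypothetical fields for illustration, not the real pipeline schema.
    input_url: str
    anomaly: str                                # "dns.inconsistent" or "http-diff"
    answer_ip_tls_valid: Optional[bool] = None  # TLS handshake to the returned IP validated for the hostname
    answer_asn: Optional[int] = None            # ASN of the returned IP
    probe_asn: Optional[int] = None             # ASN of the probe
    fingerprint_confirmed: bool = False         # label from DNS/HTTP fingerprints


def classify(obs: Observation) -> str:
    scheme = urlsplit(obs.input_url).scheme
    if obs.anomaly == "http-diff":
        # Exclude http-diff measurements when the site is https.
        return "ok" if scheme == "https" else "anomaly"
    if obs.anomaly == "dns.inconsistent":
        # Ground truth: the inconsistent IP completed a valid TLS handshake.
        if obs.answer_ip_tls_valid:
            return "ok"
        # The returned IP sits on the same network as the probe.
        if obs.answer_asn is not None and obs.answer_asn == obs.probe_asn:
            return "likely_blocked"
        return "anomaly"
    return "unknown"


def agreement_with_labels(observations: Iterable[Observation]) -> float:
    """Share of fingerprint-confirmed observations that the heuristic still flags."""
    labeled = [o for o in observations if o.fingerprint_confirmed]
    if not labeled:
        return 0.0
    flagged = sum(1 for o in labeled if classify(o) != "ok")
    return flagged / len(labeled)
```

Keeping the network lookups out of `classify` means the same function can be re-run over stored, labeled measurements without repeating any measurement.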

  2. Those which can’t be confirmed via fingerprints. These are cases like dns.nxdomain, tls.connection_reset, tls.timeout, etc.
    a. For these I think we need to be looking at data in aggregate, but in some cases we can get better accuracy through improvements to the measurement. For example, we can confirm that there is SNI filtering by doing a follow-up measurement that checks whether the connection is also reset when doing a TLS handshake towards a test helper.

    b. Also here, we can come up with some heuristics that give us some greater level of confidence that the blocking is intentional, but we have the additional challenge that, at this level of granularity, we don’t have the concept of “labeled” data.
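
For class 2, looking at data in aggregate and the SNI follow-up check might look something like the sketch below. The tuple layout, the grouping key (country, ASN, domain), and the function names are assumptions made for illustration, not existing pipeline code. The failure strings are the ones mentioned above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# (probe_cc, probe_asn, domain, failure); failure is None when the measurement succeeded.
Measurement = Tuple[str, int, str, Optional[str]]
Key = Tuple[str, int, str]


def failure_rates(measurements: Iterable[Measurement]) -> Dict[Key, float]:
    """Fraction of failed measurements per (probe_cc, probe_asn, domain)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [failed, total]
    for cc, asn, domain, failure in measurements:
        key = (cc, asn, domain)
        counts[key][1] += 1
        if failure is not None:
            counts[key][0] += 1
    return {key: failed / total for key, (failed, total) in counts.items()}


def sni_filtering_suspected(target_failure: Optional[str],
                            helper_failure: Optional[str]) -> bool:
    """True when the handshake is reset both towards the real target and towards
    a test helper contacted with the same SNI (the follow-up measurement)."""
    reset = "tls.connection_reset"
    return target_failure == reset and helper_failure == reset
```

A high failure rate for one domain on one network, next to low rates for the same domain elsewhere, gives more confidence than any single measurement, although there is still no labeled data to validate it against.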

Metadata

Labels

data quality: Describes data/measurement quality issues
ooni/pipeline: Issues related to https://github.com/ooni/pipeline
priority/medium: Normal priority issue

Projects

Status

Done
