This repository was archived by the owner on Nov 6, 2023. It is now read-only.

Alexa ruleset checker #149

Merged
diracdeltas merged 11 commits into EFForg:master from flyingstar16:master on Apr 7, 2014


@flyingstar16
Contributor

This is the "Alexa ruleset merger" script. In its current state, it downloads the official Alexa Top 1M list, takes the first 1000 websites (the number is configurable), and outputs two lists: rules that were modified between stable and master, and rules that are in master but not in stable.

It works by checking every new and modified rule (the list is generated using git diff) for targets that match a website in the Alexa list.
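
The matching step described above could look roughly like the sketch below. It is not the code from this pull request: the function names `read_top_domains` and `matching_targets` are hypothetical, the CSV is assumed to have `rank,domain` rows, and the host comparison (dropping a leading `*.` or `www.`) is a deliberate simplification.

```python
import csv
import io
import itertools
import xml.etree.ElementTree as etree


def read_top_domains(csv_text, limit=1000):
    """Return the domains from the first `limit` rows of 'rank,domain' CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[1].strip().lower()
            for row in itertools.islice(reader, limit)
            if len(row) >= 2}


def matching_targets(ruleset_xml, top_domains):
    """Return the <target host="..."> values of a ruleset that hit the top list."""
    root = etree.fromstring(ruleset_xml)
    hits = []
    for target in root.iter("target"):
        host = target.get("host")
        if not host:
            continue
        # Simplification (assumption): compare the bare domain only.
        bare = host.lower()
        for prefix in ("*.", "www."):
            if bare.startswith(prefix):
                bare = bare[len(prefix):]
        if bare in top_domains:
            hits.append(host)
    return hits
```

A ruleset would then be reported when `matching_targets` returns a non-empty list for it.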

Merger script added.
TODO: fix it to work with the correct tree
For some reason, the file was not committed correctly. Re-doing the
commit.
Used Python subprocess.call() to automatically generate the git diff and
save it in /tmp/ with a random filename
The ruleset checker seems to be working: it downloads and unzips the
Alexa Top1M and automatically generates the git diff.
Comparing the two seems to be working as well.

Manually checked some rules: they seem to have been correctly identified
as in the Top 1M and not in stable
Rule limit was implemented using csvReader.max_lines and breaking
the loop when it's hit
Implemented recognition of edited rules via the "M" flag of git
diff.
Output was tidied up a bit to account for the different wording (used tabulation)
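
For readers following the commit notes, here is a minimal sketch of how new and edited rules can be told apart from `git diff --name-status` output. It is an illustration only: the commits above use `subprocess.call()` and a temporary file in /tmp/, whereas this sketch swaps in `subprocess.check_output()` to read the diff directly, and the function names are hypothetical.

```python
import subprocess


def parse_name_status(diff_text):
    """Split `git diff --name-status` output into (added, modified) path lists."""
    added, modified = [], []
    for line in diff_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        status, path = parts[0], parts[-1]
        if status == "A":      # file exists in the new tree only
            added.append(path)
        elif status == "M":    # file exists in both trees but differs
            modified.append(path)
    return added, modified


def changed_rules(repo_path, old="stable", new="master"):
    """Run git in `repo_path` and classify the files changed between two branches."""
    output = subprocess.check_output(
        ["git", "diff", "--name-status", old, new], cwd=repo_path)
    return parse_name_status(output.decode("utf-8"))
```

Other status letters, such as `D` for deleted files, are ignored here because only new and edited rules are of interest.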
@diracdeltas diracdeltas added this to the 4.0 stable milestone Feb 7, 2014
@diracdeltas
Contributor

Thanks! I get this error:

Traceback (most recent call last):
  File "./alexa-ruleset-checker.py", line 135, in <module>
    ruleText = etree.parse(gitRepositoryPath + ruleFile[1]) # ADJUST FILE PATH (here is '../') IF YOU MOVE THE SCRIPT - XXX: Obsolete warning?
  File "/usr/lib/python3.2/xml/etree/ElementTree.py", line 1223, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.2/xml/etree/ElementTree.py", line 669, in parse
    source = open(source, "rb")
IOError: [Errno 2] No such file or directory: '/home/yan/Documents/efforg/https-everywhere/"src/chrome/content/rules/Bort\\303\\241rsas\\303\\241g.xml"'

I think it's due to Python mishandling non-ASCII characters in filenames (which we really should get rid of in the future).
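
One possible reading of the traceback: the literal double quotes and the `\303\241` sequences in the failing path look like git's quoted-path output. By default (`core.quotePath`), git prints non-ASCII bytes in paths as octal escapes inside double quotes. Assuming that is what happened here, a small helper could turn such a string back into the real filename. `unquote_git_path` is a hypothetical name, not part of the script in this pull request.

```python
import re

# Single-character escapes git may emit inside a quoted path.
_SIMPLE = {
    b"a": b"\a", b"b": b"\b", b"f": b"\f", b"n": b"\n",
    b"r": b"\r", b"t": b"\t", b"v": b"\v", b'"': b'"', b"\\": b"\\",
}
_ESCAPE = re.compile(rb'\\([0-7]{3}|[abfnrtv"\\])')


def unquote_git_path(path):
    """Undo git's C-style path quoting; unquoted paths are returned unchanged."""
    if not (len(path) >= 2 and path.startswith('"') and path.endswith('"')):
        return path

    def replace(match):
        token = match.group(1)
        if len(token) == 3:                 # three octal digits encode one byte
            return bytes([int(token, 8)])
        return _SIMPLE[token]

    raw = _ESCAPE.sub(replace, path[1:-1].encode("utf-8"))
    return raw.decode("utf-8")              # assumes the filename is UTF-8
```

An alternative that avoids decoding altogether is to ask git for NUL-terminated output with `-z`, which leaves pathnames unquoted.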

@flyingstar16
Contributor Author

Hey Yan!

That's weird :/ there's a try/catch there that handles those...

It happened to me as well, that's why I put in an "except
FileNotFoundError" (or something similar, I can't remember exactly and I'm
using my phone at the moment) to avoid having those... When the exception
is raised, it should ignore it and proceed to the next file...

I'll look into it ASAP, and try to handle the filename encoding.. It might
take a while (I'm on vacation) but I'll sort it out (I hope) :-)

Thanks!

Claudio

In Python < 3.3 the FileNotFoundError exception is not present.
As some filenames have weird encodings, and it's too problematic to fix
the code to consider those, the FileNotFoundError approach has been
mirrored with IOError to work with Python < 3.3
@flyingstar16
Contributor Author

Hey Yan,
I edited the script to catch IOError; the problem, I believe, was that Python 3.2 does not have the FileNotFoundError I had with Python 3.4.4.

I tried to solve the encoded filenames problem, but it's too messy; it'd be best if we could rename the files...

What do you say?
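
As a sketch of the workaround described in this comment: since Python 3.3, `IOError` is an alias of `OSError` and `FileNotFoundError` is a subclass of it, so catching `IOError` covers the missing-file case on both 3.2 and newer interpreters. The function name `parse_rulesets` and its return shape are hypothetical, not taken from the script.

```python
import xml.etree.ElementTree as etree


def parse_rulesets(paths):
    """Parse each ruleset file, skipping any path that cannot be opened."""
    parsed, skipped = {}, []
    for path in paths:
        try:
            parsed[path] = etree.parse(path)
        except IOError:
            # Python 3.2 raises IOError here; 3.3+ raises FileNotFoundError,
            # which this clause also catches because it derives from OSError.
            skipped.append(path)
    return parsed, skipped
```

Returning the skipped paths makes it visible which rules were left out of the check, rather than ignoring them silently.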

@diracdeltas diracdeltas merged commit 421be4b into EFForg:master Apr 7, 2014
@diracdeltas
Contributor

Thanks! Worked fine once I upgraded to 3.3

@flyingstar16
Contributor Author

No problem!

Glad I could help!

