Merger script added. TODO: fix it to work with the correct tree
For some reason, the file was not committed correctly. Re-doing the commit.
Used Python's subprocess.call() to automatically generate the git diff and save it in /tmp/ with a random filename
The ruleset checker seems to be working: it downloads and unzips the Alexa Top1M and automatically generates the git diff. Comparing the two seems to be working as well. Manually checked some rules: they seem to have been correctly identified as in the Top 1M and not in stable
Rule limit was implemented using csvReader.max_lines and breaking the loop when it's hit. Implemented recognition of edited rules via the "M" flag of git diff. Output was tidied up a bit to account for the different wording (used tabulation)
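A minimal sketch of the steps these commits describe, not the script itself. It assumes the diff is produced with `git diff --name-status` (tab-separated status and path per line) and that the Alexa CSV has `rank,domain` rows. The function names, the `ruleset-diff-` prefix, and the `max_lines` parameter are hypothetical stand-ins for whatever the script actually uses.

```python
import csv
import io
import subprocess
import tempfile


def save_name_status_diff(base="stable", head="master", repo_dir="."):
    """Write `git diff --name-status base..head` to a randomly named temp file.

    Returns the path of the file (under /tmp on most Linux systems).
    """
    # delete=False keeps the file on disk after the handle is closed.
    with tempfile.NamedTemporaryFile(
        mode="w", prefix="ruleset-diff-", suffix=".txt", delete=False
    ) as out:
        subprocess.call(
            ["git", "diff", "--name-status", base + ".." + head],
            stdout=out,
            cwd=repo_dir,
        )
        return out.name


def classify_changes(name_status_text):
    """Split name-status output into (added, modified) path lists."""
    added, modified = [], []
    for line in name_status_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        status, path = parts[0], parts[-1]
        if status == "A":
            added.append(path)
        elif status == "M":  # the "M" flag marks an edited rule
            modified.append(path)
    return added, modified


def top_domains(csv_text, max_lines=1000):
    """Return the domain column of the first `max_lines` CSV rows."""
    domains = []
    reader = csv.reader(io.StringIO(csv_text))
    for row in reader:
        if reader.line_num > max_lines:
            break  # rule limit reached
        if len(row) >= 2:
            domains.append(row[1])
    return domains
```

Keeping the parsing separate from the `git` call makes it possible to check the "A"/"M" handling against a pasted diff without a repository at hand.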
|
Thanks! I get this error: I think it's due to Python mishandling non-ASCII characters in filenames (which we really should get rid of in the future). |
|
Hey Yan! That's weird :/ there's a try/catch there that handles those... It happened to me as well, that's why I put in an "except I'll look into it ASAP, and try to handle the filename encoding.. It might Thanks! Claudio
|
In Python < 3.3.4 the FileNotFoundError exception is not present. Some filenames have weird encodings, and fixing the code to handle them is too problematic, so the FileNotFoundError handling has been mirrored to also work with Python < 3.3.4
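One common way to get this kind of fallback, shown here only as a sketch and not necessarily what the commit does: alias `FileNotFoundError` to `IOError` when the name is undefined. `read_ruleset` is a hypothetical helper. On interpreters that use the fallback, the `except` clause is broader and also swallows other I/O errors.

```python
try:
    FileNotFoundError  # defined as a builtin on newer Python 3 releases
except NameError:
    # Assumed fallback for older interpreters: IOError is the closest match,
    # but it also covers errors other than a missing file.
    FileNotFoundError = IOError


def read_ruleset(path):
    """Return the raw bytes of a ruleset file, or None if it cannot be found."""
    try:
        with open(path, "rb") as handle:
            return handle.read()
    except FileNotFoundError:
        return None
```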
|
Hey Yan, I tried to solve the encoded filenames problem, but it's too messy; it'd be best if we could rename the files... What do you say? |
|
Thanks! Worked fine once I upgraded to 3.3 |
|
No problem! Glad I could help! On Mon, Apr 7, 2014 at 11:09 PM, Yan Zhu [email protected] wrote:
|
This is the "Alexa ruleset merger" script which, in its current state, pulls the official Alexa Top1M list, extracts the first 1000 websites (the number is configurable) and outputs two lists: rules that were modified between stable and master, and rules that are in master but not in stable.
It works by checking every new and modified rule (list generated using git diff) for targets that match a website in the Alexa list.
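The target check could look roughly like the sketch below. It assumes each rule file is an XML ruleset with `<target host="..."/>` elements. The matching is deliberately simplified: a leading `*.` or `www.` is stripped before comparing, which is an assumption about what counts as a match, not the script's confirmed behaviour. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def ruleset_targets(xml_text):
    """Return the host attribute of every <target> element in a ruleset."""
    root = ET.fromstring(xml_text)
    return [t.get("host") for t in root.iter("target") if t.get("host")]


def normalize(host):
    """Lower-case a host and drop a leading '*.' and/or 'www.' (simplification)."""
    host = host.lower()
    for prefix in ("*.", "www."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    return host


def matching_domains(xml_text, top_domains):
    """Return the sorted top-list domains that one of the ruleset's targets covers."""
    top = set(d.lower() for d in top_domains)
    hits = {normalize(h) for h in ruleset_targets(xml_text)} & top
    return sorted(hits)
```

For example, a ruleset whose targets are `example.com` and `www.example.com` would be reported for a top list containing `example.com`, while a ruleset with no targets in the list yields an empty result.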