Sync to linguist 7.2.0: heuristics.yml support #189
Signed-off-by: Alexander Bezzubov <[email protected]>
Force-pushed from 8999d31 to 73e84fd.
Includes only the generated code. Re-generated all `./data/*` heuristics matchers using GitHub Linguist [e761f9b013e5b61161481fcb898b59721ee40e3d](https://github.com/github/linguist/tree/e761f9b013e5b61161481fcb898b59721ee40e3d) commit:
- many new languages
- better vendoring detection

Signed-off-by: Alexander Bezzubov <[email protected]>
Includes:
- update to content heuristic generator
- generated code in `data/content.go`, to keep commits atomic

Signed-off-by: Alexander Bezzubov <[email protected]>
Includes generated code, to keep commits atomic. Consists of:
- code generator for aliases produces a new API
- retrofitting all clients to the new API
- generated code in `data/aliases.go`

Signed-off-by: Alexander Bezzubov <[email protected]>
Force-pushed from 73e84fd to df7844e.
Got back from vacation and kept debugging the case of the failing Bayesian classifier for content. It seems this has to do with a difference in how linguist and enry tokenize the content.

Linguist

Adding the following test:

    def test_classify_sql
      results = Classifier.classify(Samples.cache, fixture("SQL/drop_stuff.sql"), ["PLpgSQL", "SQL", "PLSQL", "SQLPL"])
      assert_equal "SQL", results.first[0]
    end

Enry

As seen above, the Bayesian classifier token weights for SQL are very different for the same language disambiguation case.

Resolution: fixing this is tracked under #194.
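To illustrate why a tokenization difference alone can shift the classifier weights, here is a minimal sketch in Go. It is not the actual linguist or enry tokenizer or classifier: the two tokenizers, the `score` function, and the training counts are all hypothetical and only show that the same content yields different per-token log-probabilities once the token boundaries differ.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode"
)

// tokenizeWhitespace is a hypothetical tokenizer that splits on whitespace only,
// so punctuation stays glued to the neighbouring word ("foo;").
func tokenizeWhitespace(s string) []string {
	return strings.Fields(s)
}

// tokenizeIdentifiers is a second hypothetical tokenizer that keeps only runs
// of letters, digits and underscores ("foo").
func tokenizeIdentifiers(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r) && r != '_'
	})
}

// score sums log P(token | language) using add-one smoothing.
// counts holds made-up training counts for one language; vocab is the
// assumed vocabulary size used for smoothing.
func score(tokens []string, counts map[string]int, vocab int) float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	sum := 0.0
	for _, t := range tokens {
		sum += math.Log(float64(counts[t]+1) / float64(total+vocab))
	}
	return sum
}

func main() {
	content := "DROP TABLE foo;"
	// Illustrative training counts, not taken from linguist samples.
	sqlCounts := map[string]int{"DROP": 3, "TABLE": 3, "foo": 1}
	const vocab = 4

	ws := tokenizeWhitespace(content)
	id := tokenizeIdentifiers(content)
	fmt.Println(ws, score(ws, sqlCounts, vocab))
	fmt.Println(id, score(id, sqlCounts, vocab))
}
```

With the whitespace tokenizer the last token is `foo;`, which the training counts have never seen, so it only gets the smoothed probability and the total score for SQL drops. A mismatch of this kind between two implementations would be enough to change which candidate language wins.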
CI passes, although there clearly are failing tests :/ The scope in the PR description has been updated.
Co-Authored-By: bzz <[email protected]>
@creachadair @juanjux thank you for taking a look while it's still WIP - all initial feedback addressed in 5fbadc8

What is super annoying is getting a different output than on CI. It clearly should fail on CI as well, as the tests do not pass.
Force-pushed from f7228d3 to ef9311e.
@creachadair @juanjux all feedback addressed, tests pass, ready to be merged. Sorry for such a long set of changes, but the scope of this PR was already limited to only a single part of the original #152 (see its description for the updated full scope of the github<->linguist sync).
@creachadair feedback addressed in c57bc4a and c4f3dbe

Thank you for the prompt reviews @juanjux, @creachadair 🚀 Also thanks for the kind explanations and for raising the concerns about the public API structure, @creachadair!
Force-pushed from ec00f1d to 97ab29a.
All feedback addressed, @creachadair, it's ready for another round 🙏
@creachadair thank you for your kind and useful feedback, I believe it all has been addressed and is ready for another pass.
A new v7.2.0 of linguist has been released. It includes a fix for one of the issues that affected us, so I will bump to e4560984058b4726010ca4b8f03ed9d0f8f464db
Sync to https://github.com/github/linguist/releases/tag/v7.2.0 and update instructions for test generation. Signed-off-by: Alexander Bezzubov <[email protected]>
Sync \w GitHub Linguist v7.2.0

Includes a new way of handling `heuristics.yml` and all `./data/*` re-generated using the GitHub Linguist [v7.2.0](https://github.com/github/linguist/releases/tag/v7.2.0) release tag.
- many new languages
- better vendoring detection
- update doc on update & known issues
Fixes part of #155 - generate heuristics from `heuristics.yml` instead of parsing `heuristics.rb`.

Major code changes include:
- `./internal/code-generator/heuristics.go` to consume `heuristics.yml` instead of `heuristics.rb` and produce a matchable rule tree
- `./data/`
- `./internal/code-generator/test_files/*.gold`

TODOs:
- `heuristics.yml`
- fix new Classifier strategy failures -> moved to "Bayesian classifier cann't distinguish "SQL" vs "PLpgSQL"" #194
- `.gold` test fixtures
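As a rough illustration of what a "matchable rule tree" for content heuristics could look like, here is a minimal sketch in Go. The types (`Matcher`, `Pattern`, `Not`, `And`, `Rule`), the `Disambiguate` function, and the example rules are all hypothetical; they are not the code generated by this PR, nor the exact rules in `heuristics.yml`.

```go
package main

import (
	"fmt"
	"regexp"
)

// Matcher is a node in the (hypothetical) rule tree: it reports whether
// the node applies to the given file content.
type Matcher interface {
	Match(content []byte) bool
}

// Pattern is a leaf that matches when its regexp is found in the content.
type Pattern struct{ re *regexp.Regexp }

func (p Pattern) Match(content []byte) bool { return p.re.Match(content) }

// Not inverts its child, similar in spirit to a negative pattern.
type Not struct{ m Matcher }

func (n Not) Match(content []byte) bool { return !n.m.Match(content) }

// And matches only when every child matches.
type And []Matcher

func (a And) Match(content []byte) bool {
	for _, m := range a {
		if !m.Match(content) {
			return false
		}
	}
	return true
}

// Rule pairs a candidate language with the matcher that selects it.
type Rule struct {
	Language string
	Matcher  Matcher
}

// Disambiguate returns the language of the first rule whose matcher applies.
func Disambiguate(rules []Rule, content []byte) (string, bool) {
	for _, r := range rules {
		if r.Matcher.Match(content) {
			return r.Language, true
		}
	}
	return "", false
}

// plRules is an illustrative rule set for one ambiguous extension.
// The patterns are simplified assumptions, not copied from heuristics.yml.
var plRules = []Rule{
	{"Prolog", Pattern{regexp.MustCompile(`(?m)^[^#]*:-`)}},
	{"Perl", And{
		Pattern{regexp.MustCompile(`\buse\s+strict\b`)},
		Not{Pattern{regexp.MustCompile(`\buse\s+v6\b`)}},
	}},
}

func main() {
	lang, ok := Disambiguate(plRules, []byte("use strict;\nprint 1;\n"))
	fmt.Println(lang, ok)
}
```

Rules are tried in order and the first match wins, so a declarative YAML file could plausibly be translated into such a structure by a generator, one `Rule` per entry. When no rule matches, the caller would fall back to another strategy, such as the classifier.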