Sync to linguist 7.2.0: heuristics.yml support by bzz · Pull Request #189 · src-d/enry

bzz · 2019-01-09T20:25:27Z

Fixes part of the #155 - generate heuristics from heuristics.yml instead of parsing heuristics.rb.

Major code changes include:

update code generator for heuristics strategy in ./internal/code-generator/heuristics.go to consume heuristics.yml instead of heuristics.rb and produce matchable rule tree
re-generated all datastructures in ./data/
re-generated gold test results in ./internal/code-generator/test_files/*.gold

TODOs:

update linguist version & re-generate data
adapt content heuristics generator for consuming new heuristics.yml
update doc in README
match language Alias resolution to github linguist one
~~fix new Classifier strategy failures~~ -> moved to Bayesian classifier cann't distinguish "SQL" vs "PLpgSQL" #194
fix new ByExtension strategy failures
re-generate .gold test fixtures
make CI fail on failing tests 2348fd2
verify godoc presence for new things (cleanup \w docs everything else Ensure exported names in Go packages are properly documented #195)

Signed-off-by: Alexander Bezzubov <[email protected]>

Includes only the generated code. Re-generated all `./data/*` heuristics matchers using Github Linguist [e761f9b013e5b61161481fcb898b59721ee40e3d](https://github.com/github/linguist/tree/e761f9b013e5b61161481fcb898b59721ee40e3d) commit - many new languages - better vendoring detection Signed-off-by: Alexander Bezzubov <[email protected]>

Includes: - update to content heuristic generator - generated code in data/content.go to keep commits atomic. Signed-off-by: Alexander Bezzubov <[email protected]>

Includes generated code, to keep commits atomic. Consits of: - code generator for alias produces new API - retrofiting all clients to a new API - generated code data/aliases.go Signed-off-by: Alexander Bezzubov <[email protected]>

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz · 2019-01-25T12:35:16Z

Got back from vacation and keep debugging the case of failing Bayesian classifier for content on samples/SQL/drop_stuff.sql - SQL (linguist) vs PLpgSQL (enry).

Seems like this has to do with difference in how linguist and enry tokenize the content.

Linguist

Adding to test/test_classifier.rb

  def test_classify_sql
    results = Classifier.classify(Samples.cache, fixture("SQL/drop_stuff.sql"), ["PLpgSQL", "SQL", "PLSQL", "SQLPL"])
    assert_equal "SQL", results.first[0]
  end

$ LINGUIST_DEBUG=2 bundle exec rake test TEST=test/test_classifier.rb TESTOPTS="--name=test_classify_sql -v"

                    #     PLSQL   PLpgSQL       SQL     SQLPL
               ;   10     19.914    22.974         -    21.265
            body    1      7.934         -     5.638         -
         cascade    1          -         -     5.638     8.044
            drop   10          -   102.321    79.408         -
        function    1      9.032    10.969     5.638         -
   functionname1    1          -         -     5.638         -
linguist_package    2          -         -    12.663         -
         package    2     17.254         -    12.663         -
       procedure    1      8.850         -     5.638     9.431
           table    2     16.678    15.632    14.049         -
            type    2      3.569     1.810         -     1.593
       typename1    1          -         -     5.638         -
       typename2    1          -         -     5.638         -
            view    2          -         -    13.474         -
       viewname1    1          -         -     5.638         -
       viewname2    1          -         -     5.638         -
   who_called_me    1      7.934         -     5.638         -
               x    1          -         -     6.331         -
               y    1          -         -     6.331         -

   PLpgSQL =   -330.972 +  -5.697 =   -336.668
       SQL =   -283.377 +  -5.158 =   -288.535
     PLSQL =   -393.512 +  -5.563 =   -399.075
     SQLPL =   -444.345 +  -5.851 =   -450.196

Enry

$ ENRY_DEBUG=1 ENRY_TEST_REPO="$PWD/.linguist" go test -run '^TestRegexpEdgeCases$'

                    #     PLSQL   PLpgSQL       SQL     SQLPL
               ;   10     34.270    37.049         -    35.420
            body    1      7.990         -     4.405         -
         cascade    1          -         -     4.405     8.217
            drop   10          -   103.600    67.077         -
        function    1      9.936    11.097     4.405         -
    functionname    1          -         -     4.405         -
linguist_package    2          -         -    10.197         -
         package    2     17.366         -    10.197         -
       procedure    1      8.906         -     4.405     9.603
           table    2     17.366    15.888    11.583         -
            type    2      7.169     5.554         -     5.426
        typename    2          -         -    10.197         -
            view    2          -         -    11.008         -
        viewname    2          -         -    10.197         -
   who_called_me    1      7.990         -     4.405         -
               x    1          -         -     5.098         -
               y    1          -         -     5.098         -

   PLpgSQL =   -334.671 +  -5.693 =   -340.364
       SQL =   -340.778 +  -5.154 =   -345.932
     SQLPL =   -449.195 +  -5.847 =   -455.042
     PLSQL =   -396.868 +  -5.559 =   -402.427

As seen above, Bayesian classifier token weights for SQL are very different for the same language disambiguation case.

Resolution: fixing this is tacked under #194

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz · 2019-01-28T14:15:38Z

CI passes, although there clearly are failing tests :/

--- FAIL: TestRegexpEdgeCases (0.00s)
	require.go:794: 
			Error Trace:	common_test.go:45
			Error:      	Received unexpected error:
			            	open .linguist/samples/ActionScript/FooBar.as: no such file or directory
			Test:       	TestRegexpEdgeCases
=== RUN   Test_EnryTestSuite
Cloning Linguist repo to '/tmp/linguist-735567248' as ENRY_TEST_REPO was not set

scope in PR description updated.

common.go

internal/code-generator/generator/documentation.go

internal/code-generator/generator/generator.go

internal/code-generator/generator/generator_test.go

internal/code-generator/generator/heuristics_test.go

Co-Authored-By: bzz <[email protected]>

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz · 2019-01-28T19:22:22Z

@creachadair @juanjux thank you for taking a look while it's still WIP - all initial feedback addressed in 5fbadc8

bzz · 2019-01-28T19:26:34Z

What is super annoying is that make test does fail on my local env but not on CI

2019/01/28 20:23:37 exit status 128
FAIL	gopkg.in/src-d/enry.v1	0.062s

with a different output, then on CI.

Allthough it clearly should fail on CI as well, as test do not pass

=== RUN   TestRegexpEdgeCases
--- FAIL: TestRegexpEdgeCases (0.00s)
	require.go:794: 
			Error Trace:	common_test.go:45
			Error:      	Received unexpected error:
			            	open .linguist/samples/ActionScript/FooBar.as: no such file or directory
			Test:       	TestRegexpEdgeCases

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz · 2019-01-29T20:26:38Z

@creachadair @juanjux all feedback addressed, tests pass, ready to be merged.

Sorry for such a long set of changes but scope of this PR was already limited to only a single part of the original #152 (see it's description for updated full scope of the github<->linguist sync)

CONTRIBUTING.md

README.md

common_test.go

data/heuristics.go

internal/code-generator/assets/alias.go.tmpl

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz · 2019-01-29T23:43:06Z

@creachadair feedback addressed in c57bc4a and c4f3dbe

bzz · 2019-01-30T17:13:39Z

Thank you for prompt reviews @juanjux, @creachadair 🚀

Also thanks for kind explanations and rising the concerns about public API structure, @creachadair !
Will address those first thing tomorrow and let you know.

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz · 2019-02-05T22:03:30Z

All feedback addressed, @creachadair it's ready for another round 🙏

data/rule/rule.go

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz · 2019-02-13T21:47:28Z

@creachadair thank you for your kind and useful feedback, I belive it all has been addressed and is ready for another pass.

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz · 2019-02-14T08:30:47Z

New v7.2.0 of linguist has been released, it includes a fix to one of the issues that affected us so will bump to e4560984058b4726010ca4b8f03ed9d0f8f464db

Sync to https://github.com/github/linguist/releases/tag/v7.2.0 and update instructions for test generation. Signed-off-by: Alexander Bezzubov <[email protected]>

Sync \w Github Linguist v7.2.0 Includes new way of handling `heuristics.yml` and all `./data/*` re-generated using Github Linguist [v7.2.0](https://github.com/github/linguist/releases/tag/v7.2.0) release tag. - many new languages - better vendoring detection - update doc on update&known issues.

bzz added 2 commits January 9, 2019 21:19

gen: small refactoring of generator.Documentation

9c8053f

Signed-off-by: Alexander Bezzubov <[email protected]>

generator: godoc update

ac13a24

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz mentioned this pull request Jan 9, 2019

Sync with github/linguist #155

Open

4 tasks

bzz force-pushed the maintenance/sync-to-linguist-7.1.3 branch from 8999d31 to 73e84fd Compare January 9, 2019 22:28

bzz added 7 commits January 9, 2019 23:40

gen: initial version of new content heuristics

d5b665b

Signed-off-by: Alexander Bezzubov <[email protected]>

gen: skip&report unsupported regexp syntax

6fd4849

Signed-off-by: Alexander Bezzubov <[email protected]>

gen: adjust Ruby regexp syntax to RE2

5505ed2

Includes: - update to content heuristic generator - generated code in data/content.go to keep commits atomic. Signed-off-by: Alexander Bezzubov <[email protected]>

gen: same alias/language name lookup as in linguist

3051773

Includes generated code, to keep commits atomic. Consits of: - code generator for alias produces new API - retrofiting all clients to a new API - generated code data/aliases.go Signed-off-by: Alexander Bezzubov <[email protected]>

gen: fix regexp Or syntax

dceb95a

Signed-off-by: Alexander Bezzubov <[email protected]>

test: add test for content heuristics edge cases

df7844e

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz force-pushed the maintenance/sync-to-linguist-7.1.3 branch from 73e84fd to df7844e Compare January 9, 2019 22:40

gen: fix aliases case

6472714

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz changed the title ~~WIP Sync to linguist 7.1.3~~ WIP Sync to linguist 7.1.3: heuristics.yml support Jan 28, 2019

bzz mentioned this pull request Jan 28, 2019

Bayesian classifier cann't distinguish "SQL" vs "PLpgSQL" #194

Open

bzz self-assigned this Jan 28, 2019

bzz added 3 commits January 28, 2019 14:10

doc: update to currenet state

43d1c6d

Signed-off-by: Alexander Bezzubov <[email protected]>

test: update to expect names + .gold generation

33bb5a6

Signed-off-by: Alexander Bezzubov <[email protected]>

test: update *.gold results

2049da9

Signed-off-by: Alexander Bezzubov <[email protected]>

juanjux reviewed Jan 28, 2019

View reviewed changes

common.go Outdated Show resolved Hide resolved

bzz requested review from creachadair, dennwc and juanjux January 28, 2019 18:15

creachadair reviewed Jan 28, 2019

View reviewed changes

creachadair and others added 2 commits January 28, 2019 19:49

Apply suggestions from code review

63f3661

Co-Authored-By: bzz <[email protected]>

Address review feedback

5fbadc8

Signed-off-by: Alexander Bezzubov <[email protected]>

tests: document edge case and fix the tests

ef9311e

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz changed the title ~~WIP Sync to linguist 7.1.3: heuristics.yml support~~ Sync to linguist 7.1.3: heuristics.yml support Jan 29, 2019

bzz force-pushed the maintenance/sync-to-linguist-7.1.3 branch from f7228d3 to ef9311e Compare January 29, 2019 17:02

gen: add gold test restuls generation + docs sync better

fb61eaa

Signed-off-by: Alexander Bezzubov <[email protected]>

creachadair reviewed Jan 29, 2019

View reviewed changes

bzz added 2 commits January 30, 2019 00:34

gen: add missing GoDoc

c57bc4a

Signed-off-by: Alexander Bezzubov <[email protected]>

cleanup, addressing code review feedback

c4f3dbe

Signed-off-by: Alexander Bezzubov <[email protected]>

juanjux approved these changes Jan 30, 2019

View reviewed changes

bzz mentioned this pull request Feb 5, 2019

Code generation depends on hard-coded fs path #201

Open

heuristics: refactoring, extracting rule package

97ab29a

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz force-pushed the maintenance/sync-to-linguist-7.1.3 branch from ec00f1d to 97ab29a Compare February 5, 2019 22:01

creachadair reviewed Feb 5, 2019

View reviewed changes

bzz added 4 commits February 13, 2019 22:23

review: remove indirection + expose Heuristic

ec54891

Signed-off-by: Alexander Bezzubov <[email protected]>

gen: update generated code

191aa8c

Signed-off-by: Alexander Bezzubov <[email protected]>

review: do not export pattens

6d601d6

Signed-off-by: Alexander Bezzubov <[email protected]>

review: nuke regexp dependency

bff3a15

Signed-off-by: Alexander Bezzubov <[email protected]>

creachadair approved these changes Feb 13, 2019

View reviewed changes

rule: small godoc cleanup

3872cbd

Signed-off-by: Alexander Bezzubov <[email protected]>

bzz changed the title ~~Sync to linguist 7.1.3: heuristics.yml support~~ Sync to linguist 7.2.0: heuristics.yml support Feb 14, 2019

sync \w linguist v7.2.0

fc48fc4

Sync to https://github.com/github/linguist/releases/tag/v7.2.0 and update instructions for test generation. Signed-off-by: Alexander Bezzubov <[email protected]>

bzz merged commit 3499750 into src-d:master Feb 14, 2019

bzz deleted the maintenance/sync-to-linguist-7.1.3 branch February 14, 2019 11:47

bzz mentioned this pull request Apr 3, 2019

Add IsGenerated function to the API #213

Closed

Conversation

bzz commented Jan 9, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bzz commented Jan 25, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Linguist

Enry

Uh oh!

bzz commented Jan 28, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bzz commented Jan 28, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bzz commented Jan 28, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bzz commented Jan 29, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bzz commented Jan 29, 2019

Uh oh!

bzz commented Jan 30, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bzz commented Feb 5, 2019

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bzz commented Feb 13, 2019

Uh oh!

bzz commented Feb 14, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

bzz commented Jan 9, 2019 •

edited

Loading

bzz commented Jan 25, 2019 •

edited

Loading

bzz commented Jan 28, 2019 •

edited

Loading

bzz commented Jan 28, 2019 •

edited

Loading

bzz commented Jan 28, 2019 •

edited

Loading

bzz commented Jan 29, 2019 •

edited

Loading

bzz commented Jan 30, 2019 •

edited

Loading

bzz commented Feb 14, 2019 •

edited

Loading