
Add rest of data pipeline#5

Merged
r0mainK merged 3 commits into src-d:master from kafkasl:master
Jul 18, 2018

Conversation

Contributor

@kafkasl kafkasl commented Jul 16, 2018

This PR is not finished, but the code2vec model and Vocabulary2Id are pretty much done, so I thought you could start reviewing it in case something is really off.

Currently I'm getting an error when trying to save the model and I'm not sure why. Any help would be appreciated:

  File "/home/hydra/projects/code2vec/src/../src/code2vec.py", line 23, in code2vec
    .link(Vocabulary2Id(args.keep_freq, args.output)) \
  File "/usr/local/lib/python3.5/dist-packages/sourced/ml/transformers/transformer.py", line 95, in execute
    head = node(head)
  File "/home/hydra/projects/code2vec/src/transformers/vocabulary2id.py", line 28, in __call__
    
  File "/usr/local/lib/python3.5/dist-packages/modelforge/model.py", line 269, in save
    self._write_tree(tree, output)
  File "/usr/local/lib/python3.5/dist-packages/modelforge/model.py", line 285, in _write_tree
    asdf.AsdfFile(final_tree).write_to(output, all_array_compression=ARRAY_COMPRESSION)
  File "/usr/local/lib/python3.5/dist-packages/asdf/asdf.py", line 899, in write_to
    self._post_write(fd)
  File "/usr/local/lib/python3.5/dist-packages/asdf/generic_io.py", line 313, in __exit__
    self._fd.__exit__(type, value, traceback)
  File "/usr/local/lib/python3.5/dist-packages/asdf/extern/atomicfile.py", line 113, in __exit__
    self.close()
  File "/usr/local/lib/python3.5/dist-packages/asdf/extern/atomicfile.py", line 109, in close
    atomic_rename(self._tmp_filename, self._filename)
NotADirectoryError: [Errno 20] Not a directory: '/home/hydra/projects/code2vec/results/.___atomic_writeg1zpv8zx' -> '/home/hydra/projects/code2vec/src/../results/'
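For reference, the traceback suggests the atomic rename at the end of `save` is being pointed at a directory: `--output` resolves to `.../results/` rather than a file path. A minimal sketch of that failure mode (paths here are illustrative, not from the PR):

```python
import os
import tempfile

# asdf's atomic_rename ultimately calls os.rename(tmp_file, output);
# renaming a regular file onto a path that names a directory fails
workdir = tempfile.mkdtemp()
tmp_file = os.path.join(workdir, ".___atomic_write_example")
open(tmp_file, "w").close()

try:
    os.rename(tmp_file, workdir + os.sep)  # target is a directory, not a file
    rename_failed = False
except OSError:  # NotADirectoryError / IsADirectoryError depending on the path
    rename_failed = True
```

Passing a file path (e.g. `results/code2vec.asdf`) instead of the directory itself should avoid this.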

Contributor

@r0mainK r0mainK left a comment

Ok, finished this first round. Most of it is good, but it can be improved. You should mostly move the path-context-to-index logic away from the model. I haven't reviewed the Model in detail as it will change after this review; ping me when it's done on Slack and I can have a second look tonight. Overall good job, thanks for your hard work 👍

src/code2vec.py Outdated
@@ -18,8 +20,10 @@ def code2vec(args):
.link(UastRow2Document()) \

This comment was marked as resolved.

src/code2vec.py Outdated
@@ -39,6 +43,10 @@ def main():
required=False)
parser.add_argument('-w', '--max_width', type=int, default=2, help="Max path width.",
Contributor

create a subparser; let's compartmentalize the data and ML pipelines -> I'd suggest the commands be extract_features and then train_model (for now only add the first and a TODO for the second; if you have a better name please propose it)

Contributor Author

not sure about this bit. Should I create two methods, like add_extract_features(parser) and add_train_model_args(parser), and add the respective options inside them?

Contributor

@r0mainK r0mainK Jul 17, 2018

I think you can omit the second one for now, as no logic for it exists. Anyway, yeah, you can do something like:

subparsers = parser.add_subparsers(help="Commands", dest="command")
extract_parser = subparsers.add_parser(
    "extract", help="Extract features from input repositories",
    formatter_class=ArgumentDefaultsHelpFormatterNoNone)

Contributor Author

I do not know how to do this; I've pushed my attempt (which does not work), but I'm not sure how these hierarchies of parsers work. The rest should be solved.
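For reference, a minimal self-contained sketch of how argparse parser hierarchies behave (the names match the snippet suggested above; the `-o` argument is just an illustration):

```python
import argparse

parser = argparse.ArgumentParser(prog="code2vec")
subparsers = parser.add_subparsers(help="Commands", dest="command")

# arguments added to a subparser only apply when its command is selected
extract_parser = subparsers.add_parser(
    "extract", help="Extract features from input repositories")
extract_parser.add_argument("-o", "--output", required=True)

# the first positional token selects the subcommand, the rest goes to it
args = parser.parse_args(["extract", "-o", "results/features.asdf"])
print(args.command, args.output)  # -> extract results/features.asdf
```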

src/code2vec.py Outdated
required=False)
parser.add_argument('-w', '--max_width', type=int, default=2, help="Max path width.",
required=False)
parser.add_argument('-k', '--keep_freq', type=bool, default=False, help="Keep frequencies "
Contributor

*keep-freq, and initialize the parser like this to get the conversion to keep_freq:

from sourced.ml.cmd import ArgumentDefaultsHelpFormatterNoNone
...

parser = argparse.ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatterNoNone)
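A quick check of the dash-to-underscore conversion mentioned above; as a general argparse caveat (not from the review itself), `type=bool` is also a pitfall, since every non-empty string is truthy, so a flag is usually safer:

```python
import argparse

parser = argparse.ArgumentParser()
# dashes in option names become underscores in the namespace
parser.add_argument("--keep-freq", action="store_true",
                    help="Keep frequencies when building vocabularies.")

args = parser.parse_args(["--keep-freq"])
assert args.keep_freq is True   # accessed as keep_freq

# why type=bool misbehaves: bool() on any non-empty string is True
assert bool("False") is True
```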

src/code2vec.py Outdated
parser.add_argument('-k', '--keep_freq', type=bool, default=False, help="Keep frequencies "
"when building vocabularies.", required=False)
parser.add_argument('-o', '--output', type=str, default=os.getcwd(), help="Output were the "
"model will be stored.", required=False)
Contributor

*"Output path for the Code2VecFeatures model "



@register_model
class Code2Vec(Model):
Contributor

*Code2VecFeatures, we will probably have a second Model for the trained ML model

Process rows to gather values and paths with/without their frequencies.
"""

r = rows \
Contributor

rename r to rows, and cache after reduceByKey or you will do it twice, which is time-consuming; don't forget to uncache after both collects

Contributor Author

not sure how things are cached/uncached in Spark, I'm pretty new to it; could you give more details?

Contributor

sure. In Spark you can either do transformations or actions on RDDs/DataFrames/etc. Those objects are lazily evaluated, which means transformations are not applied unless an action triggers them:

r.map(lambda x: x[0]).filter(my_condition)  # does nothing 
r.map(lambda x: x[0]).filter(my_condition).collect()  # the action collect triggers transformations  

in order to not repeat parts of the pipeline, you can cache the data in RDD or on disk, here:

rows = rows \
    .flatMap(self._get_path) \
    .reduceByKey(operator.add) \
    .persist()  # nothing happens yet

values = rows.filter(lambda x: type(get_elem(x)) == str).collect()
# after this, rows is cached, so the second collect starts from
# after the reduceByKey, not before the flatMap
paths = rows.filter(lambda x: type(get_elem(x)) == tuple).collect()
rows.unpersist()  # or this will stay in memory/disk

we usually add a persist argument to select the level of persistence; for more, see here
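The laziness described above can be mimicked in plain Python with generators (an analogy only; Spark's persist additionally keeps the data in memory/disk across the cluster):

```python
calls = []

def expensive(x):
    calls.append(x)   # record every time the "transformation" actually runs
    return x * 2

data = [1, 2, 3]

# like an RDD transformation: building the generator does no work yet
lazy = (expensive(x) for x in data)
assert calls == []

# like .persist() followed by an action: materialize once, reuse the result
cached = list(lazy)
first = [x for x in cached if x > 2]    # reuses cached, no recomputation
second = [x for x in cached if x <= 2]  # same here
assert calls == [1, 2, 3]   # expensive ran exactly once per element
```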

r = r.map(lambda x: x[0])
get_elem = lambda x: x

values = r.filter(lambda x: type(get_elem(x)) == str)
Contributor

collect directly here, same for paths

self._paths = paths
self._path_contexts = path_contexts

self._value2index = {w: i for i, w in enumerate(values)}
Contributor

you should not do this here; using the mapping this way means all the computation to map values/context_paths to indexes is done in the driver, not using Spark. You can do it in different ways, but the simplest is to create the indexes after the collect, broadcast them, then use them as mappings before collecting the (doc, [path_context_1, ...]) RDD.
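A driver-side sketch of the suggested flow, with hypothetical toy data (in real code the two dicts would be wrapped in sc.broadcast before being used inside executor-side maps):

```python
# collected on the driver after the Spark actions
values = ["foo", "bar"]
paths = [("A", "B"), ("A", "C")]

# build the indexes once, after collect
value2index = {w: i for i, w in enumerate(values)}
path2index = {p: i for i, p in enumerate(paths)}

# executor-side mapping is then just dictionary lookups
path_contexts = [("doc1", ("foo", ("A", "B"), "bar"))]
indexed = [(doc, (value2index[u], path2index[p], value2index[v]))
           for doc, (u, p, v) in path_contexts]
assert indexed == [("doc1", (0, 0, 1))]
```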

"""
return rows \
    .map(self._doc2pc) \
    .reduceByKey(operator.add)
Contributor

if you want to do this, you need to map the RDD to ((doc, [pc]), 1), no? In any case, I think for now it is best not to add calculations for document/feature freqs, so simply do a distinct after the map (the distinct should be done after mapping the path-context strings to indexes, see comment above)

Contributor Author

the idea was to add the lists for each key to have (doc, [pc_1, pc_2, ...]), because _doc2pc is a single path context like (doc, [pc])

Contributor

okay, I see now. The problem is this might give you cases like (doc, [pc1, pc2, pc1, pc3, ...]), so you ought to still do a distinct before the reduceByKey
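In plain Python terms (toy data), the distinct-then-group step being asked for looks like:

```python
from collections import defaultdict

# (doc, pc) pairs as they come out of the map, with a duplicate path context
pairs = [("doc1", "pc1"), ("doc1", "pc2"), ("doc1", "pc1")]

grouped = defaultdict(list)
for doc, pc in dict.fromkeys(pairs):  # dedups while keeping first-seen order
    grouped[doc].append(pc)

# pc1 appears once, like distinct() before reduceByKey
assert grouped["doc1"] == ["pc1", "pc2"]
```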

"""
code2vec model - source code identifier embeddings.
"""
NAME = "code2vec"
Contributor

same, add features here

@kafkasl kafkasl force-pushed the master branch 3 times, most recently from e259cfe to 4c06549 Compare July 17, 2018 12:49
Contributor

@r0mainK r0mainK left a comment

Went over the subparser part; tell me if it still does not work afterwards. I think you could create a __main__.py file for parsing commands, create a cmd folder, and put the rest of the code2vec code in an extract_features file (also rename the function holding the pipeline).

src/code2vec.py Outdated
required=False)
extract_parser.add_argument('-w', '--max_width', type=int, default=2, help="Max path width.",
required=False)
extract_parser.add_argument('-k', '--keep_freq', type=bool, default=False,
Contributor

*keep-freq

src/code2vec.py Outdated
required=False)
extract_parser.add_argument('-g', '--max_length', type=int, default=5, help="Max path length.",
required=False)
extract_parser.add_argument('-w', '--max_width', type=int, default=2, help="Max path width.",
Contributor

*max-width

src/code2vec.py Outdated
required=False)
parser.add_argument('-w', '--max_width', type=int, default=2, help="Max path width.",
required=False)
extract_parser.add_argument('-g', '--max_length', type=int, default=5, help="Max path length.",
Contributor

*max-length

src/code2vec.py Outdated
parser = argparse.ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatterNoNone)

# sourced.engine args
add_repo2_args(parser)
Contributor

this should be added to the subparser

src/code2vec.py Outdated
required=False)
extract_parser.add_argument('-k', '--keep_freq', type=bool, default=False,
help="Keep frequencies when building vocabularies.", required=False)
extract_parser.add_argument('-o', '--output', type=str, default=os.getcwd(),
Contributor

take out the default value, and use required=True

values = rows.filter(lambda x: type(get_elem(x)) == str).collect()
paths = rows.filter(lambda x: type(get_elem(x)) == tuple).collect()

value2index = {w: i for i, w in enumerate(values)}
Contributor

this is a problem if keep_freq is True: in one case you get the dictionaries, but in the other you get (key, value): index. I would move the index creation to the call, use self.docfreq to potentially create the frequency dict, and pass it as an optional dict parameter for the Model that defaults to None.

return value2index, path2index

def _doc2pc(self, row: Row):
(u, path, v), doc = Vocabulary2Id._unstringify_path_context(row), row[0][1]
Contributor

nest this function in build_doc2pc, and broadcast the indexes:

def build_doc2pc(self, rows: RDD):
    value2index = self.sc.broadcast(self.value2index)
    path2index = self.sc.broadcast(self.path2index)

    def _doc2pc(row: Row):
        (u, path, v), doc = Vocabulary2Id._unstringify_path_context(row), row[0][1]
        return doc, (value2index.value[u], path2index.value[path], value2index.value[v])

    ...
    value2index.unpersist(blocking=True)
    path2index.unpersist(blocking=True)



class Vocabulary2Id(Transformer):
def __init__(self, keep_freq: bool, output: str, **kwargs):
Contributor

add sparkContext to __init__ -> self.sc used later

src/code2vec.py Outdated
.link(UastDeserializer()) \
.link(Uast2BagFeatures([UastPathsBagExtractor(args.max_length, args.max_width)])) \
.link(Collector()) \
.link(Vocabulary2Id(args.keep_freq, args.output)) \
Contributor

here add root.sc to initialize the transformer with the sparkContext

Contributor Author

I get this error:

  File "/home/hydra/projects/code2vec/src/../src/code2vec.py", line 26, in code2vec
    .link(Vocabulary2Id(root.sc, args.keep_freq, args.output)) \
AttributeError: 'Engine' object has no attribute 'sc'


def _load_tree(self, tree):
    self.construct(value2index=tree["value2index"].copy(),
                   path2index=split_strings(tree["path2index"]),
Contributor

don't use split_strings on path2index; use it on path_contexts


def _generate_tree(self):
    return {"value2index": self._value2index, "path2index": self._path2index,
            "path_contexts": self._path_contexts}
Contributor

use merge_strings here for path_contexts

Contributor

r0mainK commented Jul 18, 2018

@kafkasl added a couple more comments, you need to:

  • broadcast the indexes, it will increase Spark performance
  • rework your Model a bit to add the frequencies dict if asked for, and rework the inner methods; also add a dump method summarizing the data, check out other models to see the convention
  • rename code2vec, create main (last round of comments)

after this it should be good ^^" pls ping me if you need help!

@kafkasl kafkasl force-pushed the master branch 2 times, most recently from a3ae3e6 to 027ea18 Compare July 18, 2018 13:31
Contributor

@r0mainK r0mainK left a comment

Okay, I've added 2 comments for main (taken from sourced.ml), 1 typo, and one note on the broadcasting. Once this is done, I would like one last thing before merge: split the PR into 3 commits:

  • the first should be Add Code2VecFeatures Model and have only the model
  • the second should be Add Vocabulary2Id transformer and, likewise, have only the transformer
  • the last should be Add main and rework data pipeline and hold everything else

src/__main__.py Outdated
extract_parser.add_argument('--max-width', type=int, default=2, help="Max path width.",
required=False)
extract_parser.add_argument('-o', '--output', type=str,
help="Output path for the Code2VecFeatures mode", required=True)
Contributor

*model.

extract_parser = subparsers.add_parser("extract",
help="Extract features from input repositories",
formatter_class=ArgumentDefaultsHelpFormatterNoNone)

Contributor

add:

extract_parser.set_defaults(handler=code2vec_extract_features)

src/__main__.py Outdated

args = parser.parse_args()

code2vec_extract_features(args)
Contributor

replace this line with:

try:
    handler = args.handler
except AttributeError:
    def print_usage(_):
        parser.print_usage()

    handler = print_usage
return handler(args)
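Putting the two suggestions together, a minimal runnable sketch of the dispatch pattern (code2vec_extract_features here is a stand-in for the real handler):

```python
import argparse

def code2vec_extract_features(args):
    # stand-in for the real pipeline entry point
    return "extracting to %s" % args.output

parser = argparse.ArgumentParser(prog="code2vec")
subparsers = parser.add_subparsers(help="Commands", dest="command")
extract_parser = subparsers.add_parser("extract")
extract_parser.add_argument("-o", "--output", required=True)
extract_parser.set_defaults(handler=code2vec_extract_features)

args = parser.parse_args(["extract", "-o", "results"])
try:
    handler = args.handler
except AttributeError:
    # no subcommand given: fall back to printing usage
    def print_usage(_):
        parser.print_usage()
    handler = print_usage

result = handler(args)
```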

:param value2index_freq: value -> (id, freq)
:param path2index_freq: path -> (id, freq)
"""

Contributor

you are broadcasting frequencies uselessly here; better to use a dict comprehension beforehand:

value2index = self.sc.broadcast({key: idx for key, (idx, _) in value2index_freq.items()})
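The stripping step on its own, with hypothetical toy data (the frequencies stay on the driver; only the index mapping gets broadcast):

```python
# collected mapping: value -> (index, frequency)
value2index_freq = {"foo": (0, 12), "bar": (1, 3)}

# keep only the index before broadcasting
value2index = {key: idx for key, (idx, _) in value2index_freq.items()}

assert value2index == {"foo": 0, "bar": 1}
```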

kafkasl added 3 commits July 18, 2018 16:33
Signed-off-by: Pol Alvarez Vecino <[email protected]>
Signed-off-by: Pol Alvarez Vecino <[email protected]>
Signed-off-by: Pol Alvarez Vecino <[email protected]>
@r0mainK r0mainK changed the title WIP: Added Vocabulary builder and model Add rest of data pipeline Jul 18, 2018
@r0mainK r0mainK merged commit 19d8c02 into src-d:master Jul 18, 2018