Conversation
Ok, finished this first round. Most of it is good, but it can be improved. You mostly should move the path-context-to-index logic away from the model. I haven't reviewed the Model in detail as it will change after this review; ping me on Slack when it's done and I can have a second look tonight. Overall good job, thanks for your hard work 👍
src/code2vec.py
Outdated
@@ -18,8 +20,10 @@ def code2vec(args):
    .link(UastRow2Document()) \
src/code2vec.py
Outdated
@@ -39,6 +43,10 @@ def main():
    required=False)
parser.add_argument('-w', '--max_width', type=int, default=2, help="Max path width.",
create subparser, let's compartmentalize the data and ML pipelines -> I'd suggest the commands should be extract_features and then train_model (for now only add the first and a TODO for the second; if you have a better name pls propose)
Not sure about this bit. Should I create two methods like add_extract_features(parser) and add_train_model_args(parser) and add the respective options inside each of them?
I think for one you can omit the second one, as no logic for it exists yet. Anyway, yeah, you can do something like:
subparsers = parser.add_subparsers(help="Commands", dest="command")
extract_parser = subparsers.add_parser(
    "extract", help="Extract features from input repositories",
    formatter_class=ArgumentDefaultsHelpFormatterNoNone)
I do not know how to do this. I've pushed my attempt (which does not work), but I'm not sure how these hierarchies of parsers work; the rest should be solved.
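To make the subparser hierarchy concrete, here is a minimal, self-contained sketch of how argparse subcommands work. It follows the snippet suggested above (the "extract" command and its options), but the formatter class is omitted so it runs with the standard library alone; the option names mirror the ones in the PR and are otherwise illustrative.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    # dest="command" records which subcommand was chosen in args.command
    subparsers = parser.add_subparsers(help="Commands", dest="command")
    extract_parser = subparsers.add_parser(
        "extract", help="Extract features from input repositories")
    # Options are attached to the subparser, not the root parser.
    # Note "--max-length" becomes args.max_length: argparse converts
    # dashes in option names to underscores in the attribute name.
    extract_parser.add_argument("-g", "--max-length", type=int, default=5,
                                help="Max path length.")
    return parser

parser = build_parser()
args = parser.parse_args(["extract", "--max-length", "7"])
```

After parsing, `args.command` is `"extract"` and `args.max_length` is `7`; this dash-to-underscore conversion is also why the `--keep-freq` suggestion below still yields `args.keep_freq`.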
src/code2vec.py
Outdated
    required=False)
parser.add_argument('-w', '--max_width', type=int, default=2, help="Max path width.",
    required=False)
parser.add_argument('-k', '--keep_freq', type=bool, default=False, help="Keep frequencies "
*--keep-freq, and initialize the parser like this so the dash is converted to keep_freq:
from sourced.ml.cmd import ArgumentDefaultsHelpFormatterNoNone
...
parser = argparse.ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatterNoNone)
src/code2vec.py
Outdated
parser.add_argument('-k', '--keep_freq', type=bool, default=False, help="Keep frequencies "
    "when building vocabularies.", required=False)
parser.add_argument('-o', '--output', type=str, default=os.getcwd(), help="Output were the "
    "model will be stored.", required=False)
*"Output path for the Code2VecFeatures model "
src/models/code2vec.py
Outdated
@register_model
class Code2Vec(Model):
*Code2VecFeatures, we will probably have a second Model for the trained ML model
src/transformers/vocabulary2id.py
Outdated
Process rows to gather values and paths with/without their frequencies.
"""

r = rows \
rename r to rows, and cache after the reduceByKey or you will do it twice, which is time consuming; don't forget to uncache after both collects
Not sure how things are cached/uncached in Spark, I'm pretty new to it. Could you give more details?
Sure. In Spark you can either do transformations or actions on RDDs/DataFrames/etc. Those objects are lazily evaluated, which means transformations are not applied if an action does not trigger them:
r.map(lambda x: x[0]).filter(my_condition)  # does nothing
r.map(lambda x: x[0]).filter(my_condition).collect()  # the action collect triggers the transformations
In order not to repeat parts of the pipeline, you can cache the data in memory or on disk, here:
rows = rows \
    .flatMap(self._get_path) \
    .reduceByKey(operator.add) \
    .persist()  # nothing happens yet
values = rows.filter(lambda x: type(get_elem(x)) == str).collect()
# after this, rows is cached, so the second collect starts from
# after the reduceByKey, not before the flatMap
paths = rows.filter(lambda x: type(get_elem(x)) == tuple).collect()
rows.unpersist()  # or this will stay in memory/disk
We usually add a persist argument to select the level of persistence.
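To see why persist matters without needing a Spark cluster, here is a toy stand-in written in plain Python (not Spark, and not the project's code): a lazily evaluated pipeline where collect() re-runs everything upstream unless a persist() has cached an intermediate result. The class and function names are invented for the illustration.

```python
class ToyRDD:
    """Toy stand-in (plain Python, not Spark) for a lazily evaluated RDD."""
    def __init__(self, compute, stats):
        self._compute = compute   # zero-arg function producing a list
        self._cache = None        # filled on first action if persisted
        self._persisted = False
        self.stats = stats        # shared dict counting source evaluations

    def _eval(self):
        if self._cache is not None:
            return self._cache    # cached: skip everything upstream
        result = self._compute()
        if self._persisted:
            self._cache = result
        return result

    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self._eval()], self.stats)

    def filter(self, fn):
        return ToyRDD(lambda: [x for x in self._eval() if fn(x)], self.stats)

    def persist(self):
        self._persisted = True
        return self

    def unpersist(self):
        self._persisted = False
        self._cache = None
        return self

    def collect(self):
        return self._eval()

def make_source(data, stats):
    def compute():
        stats["source_reads"] += 1  # how often the expensive stage ran
        return list(data)
    return ToyRDD(compute, stats)

# Mirrors the pattern above: persist once, collect twice, then unpersist.
stats = {"source_reads": 0}
rows = make_source(range(10), stats).map(lambda x: x * 2).persist()
values = rows.filter(lambda x: x < 10).collect()
paths = rows.filter(lambda x: x >= 10).collect()
rows.unpersist()
```

With persist(), the source stage runs once even though two collects follow; drop the persist() and both collects recompute it from scratch, which is exactly the double work the review comment warns about.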
src/transformers/vocabulary2id.py
Outdated
r = r.map(lambda x: x[0])
get_elem = lambda x: x

values = r.filter(lambda x: type(get_elem(x)) == str)
collect directly here, same for paths
src/models/code2vec.py
Outdated
self._paths = paths
self._path_contexts = path_contexts

self._value2index = {w: i for i, w in enumerate(values)}
you should not do this here; using the mapping this way means all the computation to map values/context_paths to indexes is done in the driver, not in Spark. You can do it in different ways, but the simplest is to create the indexes after the collect, broadcast them, then use them as mappings before collecting the (doc, [path_context_1, ...]) RDD.
src/transformers/vocabulary2id.py
Outdated
"""
return rows \
    .map(self._doc2pc) \
    .reduceByKey(operator.add)
if you want to do this, you need to map the RDD to ((doc, [pc]), 1), no? In all cases I think for now it is best not to add calculations for document/feature freqs, so simply do a distinct after the map (the distinct should be done after the mapping of the path-context strings to indexes, see comment above)
the idea was to add the lists for each key to get (doc, [pc_1, pc_2, ...]), because _doc2pc yields a single path context like (doc, [pc])
Okay, I see now. The problem is this might give you cases like (doc, [pc1, pc2, pc1, pc3, ...]), so you ought to still do a distinct before the reduceByKey.
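The duplicate problem discussed above can be shown in a few lines of plain Python (not Spark) that emulate the map -> distinct -> reduceByKey sequence; the helper names and the sample data are made up for the illustration.

```python
# Each mapped row is (doc, [pc]); doc1 contributes pc1 twice.
mapped = [("doc1", ["pc1"]), ("doc1", ["pc2"]),
          ("doc1", ["pc1"]), ("doc2", ["pc3"])]

def reduce_by_key(pairs):
    # Emulates rdd.reduceByKey(operator.add) for list values
    out = {}
    for key, value in pairs:
        out[key] = out.get(key, []) + value
    return out

def distinct(pairs):
    # Emulates rdd.distinct(); lists are unhashable, so compare as tuples
    seen = set()
    out = []
    for key, value in pairs:
        marker = (key, tuple(value))
        if marker not in seen:
            seen.add(marker)
            out.append((key, value))
    return out

without_distinct = reduce_by_key(mapped)         # doc1 keeps the duplicate pc1
with_distinct = reduce_by_key(distinct(mapped))  # doc1 has each pc once
```

Without the distinct, doc1 ends up as ["pc1", "pc2", "pc1"]; with it, each path context appears once per document, which is the behaviour the review settles on.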
src/models/code2vec.py
Outdated
"""
code2vec model - source code identifier embeddings.
"""
NAME = "code2vec"
e259cfe to 4c06549
r0mainK
left a comment
Went over the subparser part, tell me if it still does not work afterwards. I think you could create a __main__.py file for parsing commands, create a cmd folder, and put the rest of the code2vec code in an extract_features file (also rename the function holding the pipeline).
src/code2vec.py
Outdated
    required=False)
extract_parser.add_argument('-w', '--max_width', type=int, default=2, help="Max path width.",
    required=False)
extract_parser.add_argument('-k', '--keep_freq', type=bool, default=False,
src/code2vec.py
Outdated
    required=False)
extract_parser.add_argument('-g', '--max_length', type=int, default=5, help="Max path length.",
    required=False)
extract_parser.add_argument('-w', '--max_width', type=int, default=2, help="Max path width.",
src/code2vec.py
Outdated
    required=False)
parser.add_argument('-w', '--max_width', type=int, default=2, help="Max path width.",
    required=False)
extract_parser.add_argument('-g', '--max_length', type=int, default=5, help="Max path length.",
src/code2vec.py
Outdated
parser = argparse.ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatterNoNone)

# sourced.engine args
add_repo2_args(parser)
this should be added to the subparser
src/code2vec.py
Outdated
    required=False)
extract_parser.add_argument('-k', '--keep_freq', type=bool, default=False,
    help="Keep frequencies when building vocabularies.", required=False)
extract_parser.add_argument('-o', '--output', type=str, default=os.getcwd(),
take out default value, and use required=True
src/transformers/vocabulary2id.py
Outdated
values = rows.filter(lambda x: type(get_elem(x)) == str).collect()
paths = rows.filter(lambda x: type(get_elem(x)) == tuple).collect()

value2index = {w: i for i, w in enumerate(values)}
this is a problem if keep_freq is True: in one case you get the dictionaries, but in the other you get (key, value): index. I would move the index creation to the call, use self.docfreq to potentially create the frequency dict, and pass it as an optional dict parameter for the Model that defaults to None.
src/transformers/vocabulary2id.py
Outdated
return value2index, path2index

def _doc2pc(self, row: Row):
    (u, path, v), doc = Vocabulary2Id._unstringify_path_context(row), row[0][1]
nest this function in build_doc2pc, and broadcast indexes:
def build_doc2pc(self, rows: RDD):
    value2index = self.sc.broadcast(self.value2index)
    path2index = self.sc.broadcast(self.path2index)

    def _doc2pc(row: Row):
        (u, path, v), doc = Vocabulary2Id._unstringify_path_context(row), row[0][1]
        return doc, (value2index.value[u], path2index.value[path], value2index.value[v])
    ...
    value2index.unpersist(blocking=True)
    path2index.unpersist(blocking=True)
src/transformers/vocabulary2id.py
Outdated
class Vocabulary2Id(Transformer):
    def __init__(self, keep_freq: bool, output: str, **kwargs):
add sparkContext to __init__ -> self.sc used later
src/code2vec.py
Outdated
.link(UastDeserializer()) \
.link(Uast2BagFeatures([UastPathsBagExtractor(args.max_length, args.max_width)])) \
.link(Collector()) \
.link(Vocabulary2Id(args.keep_freq, args.output)) \
here add root.sc to initialize transformer with sparkContext
I get this error:
File "/home/hydra/projects/code2vec/src/../src/code2vec.py", line 26, in code2vec
.link(Vocabulary2Id(root.sc, args.keep_freq, args.output)) \
AttributeError: 'Engine' object has no attribute 'sc'
src/models/code2vec_features.py
Outdated
def _load_tree(self, tree):
    self.construct(value2index=tree["value2index"].copy(),
                   path2index=split_strings(tree["path2index"]),
don't use split_strings on path2index, use it on path_contexts
def _generate_tree(self):
    return {"value2index": self._value2index, "path2index": self._path2index,
            "path_contexts": self._path_contexts}
use merge_strings here for path_contexts
@kafkasl added a couple more comments, you need to:
After this it should be good ^^ Please ping me if you need help!
a3ae3e6 to 027ea18
r0mainK
left a comment
Okay, I've added 2 comments for main (taken from sourced.ml), 1 typo, and one note on the broadcasting. Once this is done I would like one last thing before merge: split the PR into 3 commits:
- first should be "Add Code2VecFeatures Model" and have only the model
- second should be "Add Vocabulary2Id transformer" and same, have only the transformer
- last should be "Add main and rework data pipeline" and hold everything else
src/__main__.py
Outdated
extract_parser.add_argument('--max-width', type=int, default=2, help="Max path width.",
    required=False)
extract_parser.add_argument('-o', '--output', type=str,
    help="Output path for the Code2VecFeatures mode", required=True)
extract_parser = subparsers.add_parser("extract",
    help="Extract features from input repositories",
    formatter_class=ArgumentDefaultsHelpFormatterNoNone)
add:
extract_parser.set_defaults(handler=code2vec_extract_features)
src/__main__.py
Outdated
args = parser.parse_args()

code2vec_extract_features(args)
replace this line with:
try:
    handler = args.handler
except AttributeError:
    def print_usage(_):
        parser.print_usage()
    handler = print_usage
return handler(args)
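Putting the set_defaults suggestion and the dispatch snippet above together, here is a minimal runnable sketch of the whole pattern. It is an illustration, not the PR's actual code: extract_features is a hypothetical handler, and argv is passed explicitly so the example is testable.

```python
import argparse

def extract_features(args):
    # Hypothetical handler for the "extract" command
    return "extract:%s" % args.output

def main(argv):
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(help="Commands", dest="command")
    extract_parser = subparsers.add_parser("extract", help="Extract features")
    extract_parser.add_argument("-o", "--output", required=True)
    # Each subparser records its own handler; main() just dispatches on it
    extract_parser.set_defaults(handler=extract_features)
    args = parser.parse_args(argv)
    try:
        handler = args.handler
    except AttributeError:
        # No subcommand given, so args has no handler: fall back to usage
        def print_usage(_):
            parser.print_usage()
        handler = print_usage
    return handler(args)
```

The try/except is what makes `python -m package` with no subcommand print usage instead of crashing: parse_args succeeds, but the namespace only gains a handler attribute when a subparser's set_defaults has run.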
:param value2index_freq: value -> (id, freq)
:param path2index_freq: path -> (id, freq)
"""
you are broadcasting frequencies uselessly here, better to use dict comprehension beforehand:
value2index = self.sc.broadcast({key: id for key, (id, _) in value2index_freq.items()})
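The comprehension above strips the frequencies before the broadcast so only the ids are shipped to executors. A tiny standalone version (the sample dictionary is made up, and "idx" is used to avoid shadowing the builtin id()):

```python
# Hypothetical vocabulary with frequencies: value -> (id, freq)
value2index_freq = {"foo": (0, 12), "bar": (1, 7), "baz": (2, 3)}

# Keep only the ids before broadcasting, so the frequencies are not
# serialized and sent to every executor along with the mapping
value2index = {key: idx for key, (idx, _) in value2index_freq.items()}
```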
Signed-off-by: Pol Alvarez Vecino <[email protected]>
This PR is not finished, but the code2vec model and Vocabulary2Id are pretty much done, so I thought you could start reviewing it in case something is really off.
Currently I'm getting an error when trying to save the model and I'm not sure why. Any help would be appreciated: