Add a new compute model to Feathr #820
Merged
rakeshkashyap123 merged this pull request into main on Nov 30, 2022
Member
@rakeshkashyap123 can you also help resolve the conflicts?
Member
Could anyone provide some details and docs about this PR? Without docs, no one can even understand what you're trying to do.
bozhonghu (Collaborator) reviewed on Nov 16, 2022 and left a comment:
Overall looks good to me. The only major concern is that we need to ensure the sbt build continues to work and does not break any open source development.
feathr-compute/src/main/java/com/linkedin/feathr/compute/SqlUtil.java
Outdated
src/test/scala/com/linkedin/feathr/offline/TestFeathrUdfPlugins.scala
Outdated
feathr-impl/src/main/scala/com/linkedin/feathr/sparkcommon/FDSExtractor.scala
Outdated
jaymo001
reviewed
Nov 30, 2022
Collaborator
Synced offline. LGTM.
jaymo001
previously approved these changes
Nov 30, 2022
jaymo001
approved these changes
Nov 30, 2022
windoze
approved these changes
Nov 30, 2022
windoze (Member) left a comment:
As we cannot test CI before merging, let's merge and test.
Please keep an eye on the CI test/build process.
jaymo001 pushed a commit that referenced this pull request on Dec 7, 2022:
* Add working gradle build
* Set up pdl support
* Working PDL java code gen
* With pdl files from metadata models
* With pdl files from compute model
* Fix compile for all pdl files
* Add working gradle build
* Migrate frame-config module into feathr
* Migrate fcm graph module to feathr
* Add FCM offline execution code, includes FDS metadata code
* Add needed jars for feathr-config tests
* Switch client to FeathrClient2 for local tests and fix config errors
* Fix SWA test
* Add gradle wrapper jar
* Change name of git PR test from sbt to gradle
* Switch python client to use FCM client
* Exclude json from dependency
* Add hacky solution to handle json dependency conflict in cloud
* Add json to local dependency
* Add log to debug cloud jar
* Add json as dependency
* Another attempt at resolving json dependency
* Resolve json via shading
* Fix json shading
* Remove log
* Shade typesafe config for cloud jar
* Add maven publish code to build.gradle
* Add working local maven build and rename frame-config to feathr-config to avoid namespace conflict
* Modify sonatype creds
* Change so no need to sign if releasing snapshot version
* Update build.gradle to allow publishing of all modules
* Removed FDS handling from Feathr
* All tests working
* Deleted FR stuff
* Remove dimension and other tensor related stuff
* Remove mlfeatureversionurn from defaultvalueresolver
* Remove mlfeatureversionurn and featureref
* Remove featuredefinition files
* Remove featureRef and typedRef
* final cleanup
* Fix merge conflict bugs
* Fix guava error
* udf plugin for swa features
* row-transformations optimization
* fix bug
* fix another bug
* always execute agg nodes first
* Add SWA log
* reverse order of execution
* group by datasource
* Fix bug
* Merge main into fcm branch
* Remove insecure URLs
* Add back removed files
* Add back removed files
* Add back removed files
* Change PR build system to gradle
* Change sbt job to gradle jobb
* Change sbt workflow:wq
* Update maven github workflow to use gradle
* fix failing test
* remove sbt project module
* Remove sbt related files
* Change docs to reflect gradle
* Remove keywords
* Create a single jar
* 1. Fix jar not getting populated
  2. Fix documentation bugs
* pubishToMavenLocal Working
* With FFE integrated
* maven upload working
* Update docs and code clean up
* add gradle-wrapper file
* Push all dependency jars
* Update docs
* Docs cleanup
* Update github workflow commands
* Update github workflow
* Update workflow syntax
* Update version
* Add gradle version to github workflow
* Update gradle version w/o quotes
* Remove github gradle version
* Github workflow fix
* Github workflow fix-2
* Github workflow fix-4
Co-authored-by: Bozhong Hu <[email protected]>
Co-authored-by: rkashyap <[email protected]>
Description
This PR adds a new compute model and a new compute engine to Feathr for faster performance:
Compute Model
a. Convert the HOCON config model into a data model that is simple to understand. The HOCON config gets converted to:
i) AnchorConfig
ii) DerivationConfig
iii) SequentialJoin Config
All of the above code is in the feathr-config module.
b. The above data models are then converted into a compute graph. A compute graph consists of the following nodes:
   a. DataSource node - An entity that encapsulates the data source information of a feature. It has three kinds:
      i) Context - Represents the training dataset information.
      ii) Table - A table data source contains both a snapshot view and an update log.
      iii) Event - An event node contains append-only event logs whose records need to be grouped and aggregated (e.g. counted, averaged, top-K’d) over a limited window of time.
   b. Transformation node - An entity that encapsulates all the transformations to be applied on top of a data source or other nodes.
   c. Lookup node - Maps to the present-day sequential join object.
   d. Aggregation node - Maps to the current SWA join object.
   e. External node - A node used as a placeholder while computing other nodes; it is not to be used by the compute engine.
c. We generate a raw graph using only the feature definition config.
d. Once the join config is passed in, we optimize this graph by performing generic graph operations such as merge, delete, and prune.
All of the above information is in the feathr-compute module.
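As an illustration of the compute graph and the prune step described above, here is a minimal Python sketch. It is not the feathr-compute implementation: the names `NodeKind`, `Node`, `ComputeGraph`, and `prune` are hypothetical, and it assumes pruning means keeping only the nodes that the requested features depend on.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set


class NodeKind(Enum):
    """The five node kinds listed in the description (names are illustrative)."""
    DATA_SOURCE = "data_source"
    TRANSFORMATION = "transformation"
    LOOKUP = "lookup"
    AGGREGATION = "aggregation"
    EXTERNAL = "external"


@dataclass
class Node:
    node_id: int
    kind: NodeKind
    inputs: List[int] = field(default_factory=list)  # ids of upstream nodes


@dataclass
class ComputeGraph:
    nodes: Dict[int, Node]
    feature_names: Dict[str, int]  # feature name -> id of the node producing it


def prune(graph: ComputeGraph, requested: List[str]) -> ComputeGraph:
    """Keep only the nodes reachable, via inputs, from the requested features."""
    keep: Set[int] = set()
    stack = [graph.feature_names[name] for name in requested]
    while stack:
        node_id = stack.pop()
        if node_id in keep:
            continue
        keep.add(node_id)
        stack.extend(graph.nodes[node_id].inputs)
    return ComputeGraph(
        nodes={i: n for i, n in graph.nodes.items() if i in keep},
        feature_names={f: i for f, i in graph.feature_names.items() if i in keep},
    )


# Hypothetical raw graph built from a feature definition config.
raw = ComputeGraph(
    nodes={
        0: Node(0, NodeKind.DATA_SOURCE),
        1: Node(1, NodeKind.TRANSFORMATION, inputs=[0]),
        2: Node(2, NodeKind.DATA_SOURCE),
        3: Node(3, NodeKind.AGGREGATION, inputs=[2]),
        4: Node(4, NodeKind.TRANSFORMATION, inputs=[1]),
    },
    feature_names={"f_a": 1, "f_swa": 3, "f_derived": 4},
)

# A join config that requests only f_derived lets us drop the SWA branch.
pruned = prune(raw, ["f_derived"])
print(sorted(pruned.nodes))  # [0, 1, 4]
```

In this sketch the raw graph depends only on the feature definitions, while the requested feature list stands in for the join config that drives the optimization.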
Compute Engine
a. We pass the above compute graph into the compute engine, which sorts the graph topologically.
b. We group related nodes together (for example, all SWA nodes, or nodes with similar transformation functions) for faster execution by the engine.
c. Then, we execute the nodes and append the computed features to the original observation data.
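The sort-then-group steps above can be sketched roughly as follows. This is a simplified illustration, not the engine's actual code: the graph layout, the `execution_batches` helper, and the rule of grouping by node kind among nodes that are ready at the same time are all assumptions. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter
from itertools import groupby

# Hypothetical graph: node id -> (node kind, ids of input nodes).
graph = {
    0: ("data_source", []),
    1: ("data_source", []),
    2: ("aggregation", [0]),
    3: ("aggregation", [1]),
    4: ("transformation", [2, 3]),
}


def execution_batches(graph):
    """Return (kind, node ids) batches in an order that respects dependencies."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    sorter = TopologicalSorter({nid: inputs for nid, (_, inputs) in graph.items()})
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # All nodes whose inputs are already computed; sort so that nodes of
        # the same kind are adjacent and the output is deterministic.
        ready = sorted(sorter.get_ready(), key=lambda n: (graph[n][0], n))
        for kind, group in groupby(ready, key=lambda n: graph[n][0]):
            batches.append((kind, list(group)))
        sorter.done(*ready)
    return batches


batches = execution_batches(graph)
for kind, node_ids in batches:
    print(kind, node_ids)
```

Each batch could then be handed to a kind-specific executor, so that, for instance, both aggregation nodes are evaluated in one pass before the transformation that consumes them.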
We use PDL data objects to store all the data models. This requires us to migrate the build from SBT to Gradle.
How was this PR tested?
We are currently testing it with multiple flows. The flows either run faster or have similar performance. All integration and unit tests pass.
New integration and unit tests have been added.
Does this PR introduce any user-facing changes?
It is completely backward compatible. Also, this will not be the default engine for now; we will continue testing until we have complete confidence in it.