Clear Feathr UDF state and configuration template in work directory #557
Merged
xiaoyongzhu merged 6 commits into main, Aug 8, 2022
Yuqing-cat previously approved these changes on Aug 7, 2022
Yuqing-cat approved these changes on Aug 8, 2022
blrchen approved these changes on Aug 8, 2022
ahlag pushed a commit to ahlag/feathr that referenced this pull request on Aug 26, 2022
…eathr-ai#557)
* Remove udf state in feathr work directory
* Update _preprocessing_pyudf_manager.py
* Update _preprocessing_pyudf_manager.py
* update test
* fix join/gen job mix issues
* Update _preprocessing_pyudf_manager.py
The first issue
In the existing code, we didn't remove `generated_feathr_pyspark_metadata` when building the features. This is problematic in a few end-user scenarios. In particular, if UDFs are defined and later removed, this file is not cleared, so the code still thinks there are UDFs, which will either yield wrong results or cause the job to exit incorrectly.
This PR makes sure that we remove `generated_feathr_pyspark_metadata` and a few UDF files every time users build features.
The second issue (#559)
This issue isn't very obvious. Sometimes, after running `get_offline_features` and then the `materialize_features` API, the `materialize_features` call does not succeed, and in many cases there are no values in the online store, such as Redis.
This only happens when using Databricks.
It occurs when the Databricks configuration is not a string (i.e. end users provide all the required configurations as a dict). There is a line in the code:
`submission_params = self.config_template`
Since `self.config_template` is a dict, this assignment creates a reference to `self.config_template` rather than a copy. Later in the code, `submission_params` is modified, and the modified values carry over across jobs, so different jobs share the same state, which causes unexpected behavior.
Other issues
This PR also fixes a few OS compatibility issues (path parsing always assumed Linux-style paths, which isn't always true) and fixes a few typos.
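For reviewers who want to see the aliasing behind the second issue in isolation, here is a minimal sketch. The template contents and the two helper functions are hypothetical stand-ins, not Feathr's actual code, and `copy.deepcopy` is shown only as one common way to avoid the shared state, not necessarily the exact change made in this PR.

```python
import copy

# Hypothetical stand-in for a dict-based Databricks configuration template.
config_template = {
    "run_name": "",
    "spark_python_task": {"parameters": []},
}


def build_params_aliased(template, run_name, args):
    # Plain assignment: submission_params is the SAME object as template,
    # so every change below is also a change to the template.
    submission_params = template
    submission_params["run_name"] = run_name
    submission_params["spark_python_task"]["parameters"].extend(args)
    return submission_params


def build_params_copied(template, run_name, args):
    # deepcopy duplicates the nested dicts and lists as well, so each job
    # gets its own independent parameters.
    submission_params = copy.deepcopy(template)
    submission_params["run_name"] = run_name
    submission_params["spark_python_task"]["parameters"].extend(args)
    return submission_params


# Aliased: the second job inherits whatever the first job wrote.
aliased_template = copy.deepcopy(config_template)
join_job = build_params_aliased(aliased_template, "join", ["--join-config"])
gen_job = build_params_aliased(aliased_template, "gen", ["--gen-config"])
leaked_parameters = gen_job["spark_python_task"]["parameters"]

# Copied: each job starts from a clean template.
clean_join = build_params_copied(config_template, "join", ["--join-config"])
clean_gen = build_params_copied(config_template, "gen", ["--gen-config"])

print(join_job is gen_job)
print(leaked_parameters)
print(clean_gen["spark_python_task"]["parameters"])
```

Note that a shallow copy (`dict(template)` or `template.copy()`) would not be enough in this sketch, because the nested `parameters` list would still be shared between jobs.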