Refactor Task class to include multiple rolling window evaluations#33
Refactor Task class to include multiple rolling window evaluations#33shchur merged 15 commits intopre-v1.0.0from
Task class to include multiple rolling window evaluations#33Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
src/fev/__about__.py
Outdated
| @@ -1 +1 @@ | |||
| __version__ = "0.6.0rc2" | |||
| __version__ = "1.0.0" | |||
There was a problem hiding this comment.
I think we'll need to bump the version to v1.0.0 given how breaking the changes are
There was a problem hiding this comment.
I have created a new branch pre-v1.0.0. We can merge all PRs before the next release into this branch to avoid making the documentation on main out-of-sync with the latest stable release.
There was a problem hiding this comment.
I don't understand why we need to bump it? Are we sure fev is now stable? If not, I'd stick to 0.7 as per semver.
There was a problem hiding this comment.
also, is __about__.py standard / meaningfully consumed by setuptools or similar? why not just define in init.py?
There was a problem hiding this comment.
Makes sense, I moved this to __init__.py. Using __about__.py has the advantage of being able to check the version without loading the full package, but it's probably not worth it.
Task class to include multiple rolling window evaluationsTask class to include multiple rolling window evaluations
| if as_univariate: | ||
| if univariate_target_column in past.column_names and univariate_target_column != task.target_column: | ||
| # Raise error if column called `univariate_target_column` already exists and it's not the *only* target column | ||
| if univariate_target_column in past.column_names and window.target_columns_list != [univariate_target_column]: |
There was a problem hiding this comment.
Will we hit this often? If yes, maybe it makes sense to change the default to something which would lead to fewer collisions?
There was a problem hiding this comment.
We can only hit this if there is a columns called target in the dataset that is not actually the only target column. This looks like a really rare case to me, so I wouldn't worry about too much.
canerturkmen
left a comment
There was a problem hiding this comment.
Looking great! Dropping some comments.
| data loading. | ||
| window_step_size : int | str | None | ||
| Step size between consecutive evaluation windows. Must be an integer if `initial_cutoff` is an integer. | ||
| Can be an integer or pandas offset string (e.g., 'D', '15min') if `initial_cutoff` is a timestamp. |
There was a problem hiding this comment.
hmm. if this pandas candy is there, as a user I expect it to be there for horizon as well for consistency?
There was a problem hiding this comment.
Making horizon: int | str can be really annoying for the users since then they won't be able to set MyModel(prediction_length=task.horizon) anymore, and will require some complex pandas offset manipulations.
In contrast, I don't think that user will ever want to access task.window_step_size since this parameter just configures which windows will be produced by iter_windows().
src/fev/__about__.py
Outdated
| @@ -1 +1 @@ | |||
| __version__ = "0.6.0rc2" | |||
| __version__ = "1.0.0" | |||
There was a problem hiding this comment.
I don't understand why we need to bump it? Are we sure fev is now stable? If not, I'd stick to 0.7 as per semver.
| future: datasets.Dataset, | ||
| *, | ||
| target_column: str | list[str], | ||
| target_columns_list: list[str], |
There was a problem hiding this comment.
nit: if at all possible I would avoid type encodings in variables (i.e., ..._list) unless in local variables where they clearly serve to disambiguate.
There was a problem hiding this comment.
The name here was chosen to mimic the task property Task.target_columns_list
Lines 727 to 733 in 9a835d0
I think the current naming where target_column can be of type str | list[str] is a bit counterintuitive. How about we rename as follows in a follow-up PR?
Task.target_column->target: str | list[str]Task.target_columns_list->target_columns: list[str]
There was a problem hiding this comment.
up to you. I understand the problem, but the solution doesn't help me.
| raise ValueError("`window_step_size` must be an int if `initial_cutoff` is an int") | ||
| assert self.window_step_size >= 1 | ||
| max_allowed_cutoff = -self.horizon - (self.num_windows - 1) * self.window_step_size | ||
| if self.initial_cutoff < 0 and self.initial_cutoff > max_allowed_cutoff: |
There was a problem hiding this comment.
I don't quite follow this check. So if I understand correctly I can only specify initial cutoff to allow for at least the required number of steps (roughly horizon * nr_windows) in the future. if so, why is it not checked for timestamp as well?
There was a problem hiding this comment.
In case of timestamp-based cutoffs we cannot perform this check during Task.__post_init__ since we don't yet know which timestamps are available in the dataset.
If there is not enough data to generate the splits with a timestamp-based initial_cutoff, an error will be raised once the user accesses EvaluationWindow.get_input_data() for the window where not enough data is available.
For an integer-based negative initial_cutoff, we can immediately tell if the task configuration is invalid during __post_init__, so this check here is just to save user's time.
canerturkmen
left a comment
There was a problem hiding this comment.
Target branch should likely be master. Otherwise, LGTM!
@canerturkmen I will merge |
Issue #, if available:
Description of changes:
TaskGeneratorclassTaskEvaluationWindowclass that corresponds to a single cutoff in the original datasetTo do:
Task/EvaluationWindowmethodsBy submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.