
Sampling plus fast n accurate #45

Merged
ashwinzyx merged 21 commits into main from sampling-plus-fast-n-accurate
Sep 21, 2024

Contributor

@aravind10x commented Sep 21, 2024

  • Added Data sampling
Data sampling

Sampling works on a local file, a local directory, or a URL.
We start by estimating the original size and checking whether it exceeds SAMPLING_SIZE_THRESHOLD.
If it does, we default to sampling, but the user can override this and choose not to sample.
For a directory there are two scenarios: either the directory holds a small number of sizeable files, which need to be sampled at the individual file level, or it holds a large number of small files, in which case it is more sensible to pick a subset of files as the sample. Both cases are handled based on SAMPLING_FILE_SIZE_THRESHOLD.

So we check the average file size. If it is > SAMPLING_FILE_SIZE_THRESHOLD, we perform file-level sampling, meaning we sample each file in the directory individually.
If the average file size is <= SAMPLING_FILE_SIZE_THRESHOLD, we perform directory-level sampling, meaning we choose a subset of files in the directory as the sample set. If an individual chosen file exceeds the threshold, we sample it individually as well.

  • Added Optuna
  • Few minor fixes
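
The sampling rules described above can be sketched roughly as follows. This is an illustrative sketch only, not the code merged in this PR: the threshold values, the function names (`should_sample`, `choose_directory_strategy`, `plan_directory_sample`), the `skip_sampling` and `subset_fraction` parameters, and the choice of SAMPLING_FILE_SIZE_THRESHOLD as the per-file cutoff in directory-level sampling are all assumptions made for the example.

```python
import random
from typing import Dict, List

# Hypothetical values: the PR names these constants but does not state them.
SAMPLING_SIZE_THRESHOLD = 500 * 1024 * 1024      # total size, in bytes
SAMPLING_FILE_SIZE_THRESHOLD = 50 * 1024 * 1024  # per-file size, in bytes


def should_sample(estimated_size: int, skip_sampling: bool = False) -> bool:
    """Default to sampling above the size threshold unless the user opts out."""
    if skip_sampling:
        return False
    return estimated_size > SAMPLING_SIZE_THRESHOLD


def choose_directory_strategy(file_sizes: List[int]) -> str:
    """Pick file-level or directory-level sampling from the average file size."""
    if not file_sizes:
        raise ValueError("directory contains no files")
    average = sum(file_sizes) / len(file_sizes)
    if average > SAMPLING_FILE_SIZE_THRESHOLD:
        return "file-level"
    return "directory-level"


def plan_directory_sample(
    file_sizes: Dict[str, int],
    subset_fraction: float = 0.1,  # hypothetical knob, not from the PR
    seed: int = 0,
) -> Dict[str, bool]:
    """Map each file in the sample to whether it is sampled internally."""
    strategy = choose_directory_strategy(list(file_sizes.values()))
    if strategy == "file-level":
        # Few, sizeable files: keep every file and sample inside each one.
        return {path: True for path in file_sizes}
    # Many small files: keep a random subset of whole files.
    count = min(len(file_sizes), max(1, round(len(file_sizes) * subset_fraction)))
    chosen = random.Random(seed).sample(sorted(file_sizes), count)
    # A chosen file that is itself large is also sampled internally (assumed cutoff).
    return {path: file_sizes[path] > SAMPLING_FILE_SIZE_THRESHOLD for path in chosen}
```

Returning a plan (path to "sample inside this file?") rather than reading data keeps the decision logic separate from I/O, so the same rules could apply whether sizes come from a local directory listing or from an estimate for a URL.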

Contributor

@ashwinzyx left a comment


LGTM

@ashwinzyx merged commit ae461c2 into main Sep 21, 2024
@aravind10x deleted the sampling-plus-fast-n-accurate branch October 6, 2024 05:45