feat: add benchmark adapted for sift1m #301

Closed
davidbp wants to merge 11 commits into main from feat-benchmark-sift

Conversation

@davidbp (Contributor) commented Apr 25, 2022

No description provided.

@github-actions github-actions bot added size/l and removed size/m labels Apr 26, 2022

codecov bot commented Apr 26, 2022

Codecov Report

Merging #301 (1ce1433) into main (1482421) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #301      +/-   ##
==========================================
+ Coverage   86.51%   86.53%   +0.02%     
==========================================
  Files         134      134              
  Lines        6385     6388       +3     
==========================================
+ Hits         5524     5528       +4     
+ Misses        861      860       -1     
Flag        Coverage Δ
docarray    86.53% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                               Coverage Δ
docarray/__init__.py                         75.00% <100.00%> (ø)
docarray/array/storage/weaviate/find.py      86.66% <0.00%> (ø)
docarray/array/storage/annlite/find.py       93.33% <0.00%> (+10.00%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aad1e1f...1ce1433.



def run_benchmark(
    X_tr, X_te, dataset, n_index_values, n_vector_queries, n_query, storage_backends
):

Member: Let's use more meaningful variable names than X_tr and X_te; maybe just test and train?
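A minimal sketch of the suggested rename (the new parameter names are hypothetical; the rest of the signature is unchanged):

# Hypothetical sketch of the rename suggested above; the benchmark body is elided.
def run_benchmark(
    train, test, dataset, n_index_values, n_vector_queries, n_query, storage_backends
):
    ...  # per-backend indexing and query loop goes here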
table = Table(
    title=f'DocArray Benchmarking n_index={n_index_values[-1]} n_query={n_query} D={D} K={K}'

Member: I guess since n_index_values always contains one element, maybe it shouldn't be a list.
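A hedged sketch of what dropping the list could look like, assuming a single integer n_index replaces the one-element list (n_index is a hypothetical name):

# Hypothetical sketch: n_index is a single int, so no [-1] indexing is needed.
table = Table(
    title=f'DocArray Benchmarking n_index={n_index} n_query={n_query} D={D} K={K}'
)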

console.print(f'\treading {n_query} docs ...')
read_time, _ = read(
    da,
    random.sample([d.id for d in docs], n_query),

Member: Let's try to have the same query for all backends.
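One way to do that, sketched under the assumption that docs and n_query are known before the per-backend loop (query_ids is a hypothetical name): sample the ids once and reuse them for every backend.

# Hypothetical sketch: fix the query ids once so every backend reads the same documents.
query_ids = random.sample([d.id for d in docs], n_query)

for storage in storage_backends:
    ...  # build `da` for this backend, then:
    read_time, _ = read(da, query_ids)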

Comment on lines +136 to +138:

ground_truth = [
    x for x in dataset['neighbors'][0 : len(vector_queries)]
]

Member: Let's put this in a higher scope.
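A sketch of the hoisting, assuming dataset and vector_queries are already available at the outer scope:

# Hypothetical sketch: compute the ground truth once, outside the per-backend loop.
ground_truth = list(dataset['neighbors'][0 : len(vector_queries)])

for storage in storage_backends:
    ...  # recall for every backend is then measured against the same ground_truth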

find_by_condition_time, _ = find_by_condition(
    da, {'tags__i': {'$eq': 0}}
)
if idx == len(n_index_values) - 1:

Member: No need for this check once n_index_values becomes a single value instead of a list.
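If n_index_values becomes a single value there is no loop over index sizes, so the guard can be dropped (a sketch under that assumption):

# Hypothetical sketch: with a single n_index value, run the condition query unconditionally.
find_by_condition_time, _ = find_by_condition(da, {'tags__i': {'$eq': 0}})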

@alaeddine-13 (Member): By the way, it looks like the SIFT dataset needs Euclidean distance.
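For reference, SIFT1M ground-truth neighbors are defined by Euclidean (L2) distance, so recall should be measured against L2 nearest neighbors rather than cosine. A small NumPy sketch of exact L2 ground truth (function and array names are hypothetical; brute force, not memory-efficient at the full 1M scale):

import numpy as np

def l2_ground_truth(index_vectors, query_vectors, k=10):
    # index_vectors: (n_index, 128) float32, query_vectors: (n_query, 128) float32
    # squared Euclidean distance between every query and every indexed vector
    dists = ((query_vectors[:, None, :] - index_vectors[None, :, :]) ** 2).sum(axis=-1)
    # ids of the k closest indexed vectors per query, nearest first
    return np.argsort(dists, axis=1)[:, :k]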

@hanxiao linked an issue on Apr 28, 2022 that may be closed by this pull request
@JoanFM (Member) commented Jun 21, 2022

Closing until further notice



Development

Successfully merging this pull request may close these issues.

refactor benchmarks to use a dataset instead of random data

3 participants