Improve performance of put/get rows using pandas.DataFrame #19

Merged
knonomura merged 2 commits into griddb:master from dangtrungtin:master on Sep 21, 2020

@dangtrungtin
Contributor

  • Add function: void Container.put_rows(pandas.DataFrame input). In the Python layer, I convert the input from pandas.DataFrame to numpy.array because NumPy provides a C API. In the C++ layer, I use the NumPy API to put the data into GridDB. Compared with using Container.multi_puts(input : list[list]) to put large data of LONG type, run time is reduced by about 11% and memory usage by 20%. With STRING type, run time is reduced by about 10% and memory usage by about 8%.

  • Add function: pandas.DataFrame RowSet.fetch_rows(). In the C++ layer, it uses an Iterable object (RowList.h/cpp) to wrap the output data. In the Python layer, I convert the data from the Iterable object to pandas.DataFrame. Compared with using RowSet.next() to query large data of LONG type, run time is reduced by 17% and memory usage is the same. When querying large data of STRING type, run time is reduced by 11%.

  • Reduce function calls to check for NULL fields: when getting data, for each row field, the Python Client currently calls gsGetRowFieldNull() and then gsGetRowFieldAsXXX(). I changed this to call gsGetRowFieldAsXXX() first; only if the returned data is empty or null do I call gsGetRowFieldNull() to check whether the field is null.

  • A note on Container.put_rows(): to create a DataFrame, we use "frame = pandas.DataFrame(data)", where "data" is a list. However, when the list contains a None value, the pandas library automatically changes the value, for example from None to NaN. To prevent this, the Python code should use "frame = pandas.DataFrame(data, dtype=object)".
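The note on dtype=object can be checked with pandas alone. The following is a minimal sketch, assuming a small made-up list named data; it does not call GridDB or put_rows() and only shows how the default constructor turns None into NaN in a numeric column, while dtype=object keeps None.

```python
import pandas

# Hypothetical sample rows: the second field of the first row is missing.
data = [[1, None], [2, 3]]

# Default construction infers a numeric dtype for the second column,
# so the None value is stored as NaN.
frame_default = pandas.DataFrame(data)

# dtype=object keeps the original Python objects, including None.
frame_object = pandas.DataFrame(data, dtype=object)

print(frame_default.dtypes.tolist())  # inferred column dtypes
print(frame_default.iloc[0, 1])       # the missing value is now NaN
print(frame_object.iloc[0, 1])        # the missing value is still None
```

The trade-off is that object columns hold plain Python objects, so any numeric dtype information has to come from the container schema rather than from the frame.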

- Add function: void Container.put_rows(input: pandas.DataFrame)
- Add function: pandas.DataFrame RowSet.fetch_rows()
- Reduce call function to check NULL field
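
The reduced NULL check can be illustrated without the C client. This is only a sketch: FakeRow, field_is_null() and field_as_long() are hypothetical stand-ins for a row, gsGetRowFieldNull() and gsGetRowFieldAsXXX(), and it assumes that a NULL field reads back as an empty value (0 here). It counts simulated calls for both orderings.

```python
class FakeRow:
    """Hypothetical stand-in for a GridDB row; counts simulated C API calls."""

    def __init__(self, fields):
        self.fields = fields
        self.calls = 0

    def field_is_null(self, i):
        # Plays the role of gsGetRowFieldNull().
        self.calls += 1
        return self.fields[i] is None

    def field_as_long(self, i):
        # Plays the role of gsGetRowFieldAsXXX().
        # Assumption: a NULL field reads back as the empty value 0.
        self.calls += 1
        value = self.fields[i]
        return 0 if value is None else value


def read_check_first(row):
    """Previous order: NULL check first, then the value read."""
    out = []
    for i in range(len(row.fields)):
        if row.field_is_null(i):
            out.append(None)
        else:
            out.append(row.field_as_long(i))
    return out


def read_value_first(row):
    """New order: value read first, NULL check only when the value is empty."""
    out = []
    for i in range(len(row.fields)):
        value = row.field_as_long(i)
        if value == 0 and row.field_is_null(i):
            value = None
        out.append(value)
    return out


fields = [5, 0, None, 7]
old_row, new_row = FakeRow(fields), FakeRow(fields)
old_values = read_check_first(old_row)
new_values = read_value_first(new_row)
print(old_values == new_values, old_row.calls, new_row.calls)
```

The saving grows with the share of fields that are neither empty nor NULL, since those need only one call in the new order.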
@knonomura
Member

Oh, that's great!
I'll try to use it.

I have a question.
How much data did you use?

Thanks.

@dangtrungtin
Contributor Author

With LONG type, I put 1000 rows x 10000 fields. With STRING type, I used 1000 rows x 7552 fields.
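
For reference, a frame of the LONG size quoted above could be built roughly as follows. This is a sketch under assumptions: the actual values used in the measurement are not stated, so sequential integers are a placeholder, and LONG is assumed to map to a 64-bit integer. The last step mirrors the DataFrame to numpy.array conversion described for the Python layer.

```python
import numpy
import pandas

ROWS, FIELDS = 1000, 10000  # sizes quoted for the LONG measurement

# Placeholder data; assumption: LONG corresponds to int64.
values = numpy.arange(ROWS * FIELDS, dtype=numpy.int64).reshape(ROWS, FIELDS)
frame = pandas.DataFrame(values)

# Convert the frame back to a NumPy array, as the Python layer is said to do.
array = frame.to_numpy()
print(array.shape, array.dtype)
```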

@knonomura
Member

Thank you for your information.
I understand.

@knonomura
Member

I have a request.
Could you please add samples for the new functions?

- PutRowsWithDataFrame.py : sample for put rows.
- FetchRowsWithDataFrame.py : sample for fetch rows.
@dangtrungtin
Contributor Author

I added 2 samples:

  • PutRowsWithDataFrame.py : sample for put rows.
  • FetchRowsWithDataFrame.py : sample for fetch rows.

@knonomura
Member

Thank you for your samples.
I'll check them.

@knonomura
Member

I think this pull request is very useful.
So I'll merge it.
Thank you.

@knonomura knonomura merged commit eda9482 into griddb:master Sep 21, 2020