Skip to content

brownea/ezdata

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

169 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ezData - A simplistic column based data framework.

tested with github workflow:

compatible with many existing dataframes: e.g. pandas

requirements: numpy, matplotlib (for plotting only) conversion to other formats require the appropriate library.

.. notes::

* requirements: numpy, matplotlib
* conversion to other formats require the appropriate library.
* some additional wrappers around bokeh, plotly and holoviews are in dev.

:author: Morgan Fouesneau

Documentation and API: link

Example notebook in the examples directory

nbviewer

As this package is rapidly evolving some features are only presented in the examples.

Installation

pip install git+https://github.com/mfouesneau/ezdata (--upgrade)

Why?

I always found myself writing snippets around numpy, matplotlib, pandas and other file readers. These are often the same things: read file foo and plot a against b where something is takes some values. It gets always very complex when you want to make something non-standard, for instance, for each of the 10 classes given according to this selection, make a scatter plot with these specific markers and color coded by another column.

I was basically tired of all the packages doing fancy things and not allowing basics or requiring a lot of dependencies.

This package initially focused on easily manipulating column oriented data. In particular this package allows easy conversions to many common dataframe containers: dict, numpy.recarray, pandas.DataFrame, dask.DataFrame, astropy.Table, xarray.Dataset, vaex.DataSetArrays.

I extended this package to allow myself to plot these data in the a very simple manner. Of course this was not covering all needs and thus I added interfaced to holoviews/datashader.

What is this package?

Based on the most basic functions and in particular methods of dict, I wrote this package. This basically builds advance-ish access to column oriented data through 4 main classes, 2 of which handle data and the others plotting shortcuts. This may not fit all needs, nor large data access.

  • dictdataframe: an advanced dictionary object. A simple-ish dictionary like structure allowing usage as array on non constant multi-dimensional column data. The :class:DataFrame container allows easier manipulations of the data but is basically a wrapper of many existing function around a dictionary object.

  • simpletable: a simplified version of ezTables The :class:SimpleTable allows easier manipulations of the data but is basically a wrapper of many existing function around a numpy.recarray object. It implements reading and writing ascii, FITS and HDF5 files. The :class:AstroTable built on top of the latter class, adds-on astronomy related functions, such as conesearch

  • plotter: this package implements :class:Plotter, which is a simple container to dictionary like structure (e.g. :class:dict, :class:np.recarray, :class:pandas.DataFrame, :class:SimpleTable). It allows the user to plot directly using keys of the data and also allows rapid group plotting routines (groupy and facets). Note that is also allows expressions instead of keys. This interface should basically work on any dictionary like structure

  • DSPlotter: extends :class:Plotter, to use datashader for some plots allows expressions instead of keys. This interface requires holoview and datashader

Both data structures implements common ground base to line and column access in the same transparent manner. These objects implement for instance array slicing, shape, dtypes on top of which they implement functions such as: sortby, groupby, where, join and evaluation of expressions as keys. (see examples below). Both also have a direct access to a Plotter attribute. DSPlotter is experimental and requires more than basic libraries so that it needs to be called on the side.

The data classes allows easy conversions to many common dataframe containers: numpy.recarray, pandas.DataFrame, dask.DataFrame, astropy.Table, xarray.Dataset, vaex.DataSetArrays.

Examples

  • Some data manipulation basics
    >>> t = SimpleTable('path/mytable.csv')
    # get a subset of columns only
    >>> s = t.get('M_* logTe logLo U B V I J K')
    # set some aliases
    >>> t.set_alias('logT', 'logTe')
    >>> t.set_alias('logL', 'logLLo')
    # make a query on one or multiple column
    >>> q = s.selectWhere('logT logL', '(J > 2) & (10 ** logT > 5000)')
    # note that `q` is also a table object
    # makes a simple plot (see :module:`plotter`)
    >>> q.Plotter.plot('logT', 'logL', ',')
    # export the initial subtable to a new file
    >>> s.write('newtable.fits')
    # or 
    >>> s.write('newtable.hd5')
  • Convert to other dataframe structures
    >>> t = SimpleTable('path/mytable.csv')
    >>> t.to_pandas()
    >>> t.to_dask(npartitions=5)
    >>> d = DictDataFrame(t)
  • Make a single plot of 'RA', 'DEC' on which each region 'BRK' is represented by a different color (colormap or other) and different marker.

    >>> p = t.Plotter.groupby('BRK', markers='<^>v.oxs', colors='parula')
    >>> p.plot('CRA', 'CDEC', 'o')
    >>> import pylab as plt
    >>> plt.legend(loc='best', numpoints=1)
    >>> plt.xlim(plt.xlim()[::-1])
    >>> plt.xlabel('RA')
    >>> plt.ylabel('DEC')
  • make a more complex plot: plot the histogram distribution of 'AV' per region given by 'BRK', with given color scheme per region value and individual plots with shared axis

    >>> t.Plotter.groupby('BRK', facet=True, \
            colors=plt.cm.parula, sharex=True, \
	    sharey=True).hist('AV', 
	    bins=np.linspace(t.AV.min(), 
	    t.AV.max(), 20), normed=True)
    >>> for ax in plt.gcf().axes[-3:]: ax.set_xlabel('AV')

About

simplistic column based data framework. interface to tables/dict with plotting function

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 100.0%