Refactor the analysis result loading and saving mechanism #5

KamranBinaee · 2021-07-08T00:43:18Z

The data load/save mechanism and formats work, but they are very slow and inefficient. For instance, saved lists of dicts or dicts of dicts are extremely slow to write and read back.

Figure out whether the format that pupil saves to is the way to go at run time, or whether simple pandas pickling is the better choice.
Which data formats should be preserved and which should be changed, e.g. the pupil and marker position dictionaries?
What about the other test code I used before, which utilizes the hdf5 format?
[ ] Performance evaluation of different formats, at least the top 2-3 candidates
[ ] Implement the saving first
[ ] Implement the reading next
[ ] Test the failure cases, i.e. a mid-analysis failure and resuming from where it left off
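
As a starting point for the performance evaluation, here is a minimal sketch of a timing harness, assuming synthetic data and three candidate formats (pickle, csv, npz). The column names and the helpers `make_fake_gaze`, `load_npz`, `time_round_trip`, and `benchmark_formats` are hypothetical and not part of the existing pipeline. hdf5 is left out only to avoid an extra dependency; it could be added as another entry in `candidates`.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd


def make_fake_gaze(n_samples=10_000, seed=0):
    """Build a synthetic gaze table (hypothetical columns, for timing only)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "timestamp": np.arange(n_samples) / 120.0,
        "norm_pos_x": rng.random(n_samples),
        "norm_pos_y": rng.random(n_samples),
        "confidence": rng.random(n_samples),
    })


def load_npz(path):
    """Read every array in an .npz archive into a DataFrame."""
    with np.load(path) as npz:
        return pd.DataFrame({key: npz[key] for key in npz.files})


def time_round_trip(save, load, path):
    """Return (save_seconds, load_seconds, loaded_object) for one format."""
    t0 = time.perf_counter()
    save(path)
    t1 = time.perf_counter()
    loaded = load(path)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, loaded


def benchmark_formats(df, folder):
    """Time saving and loading `df` as pickle, csv, and npz under `folder`."""
    folder = Path(folder)
    candidates = {
        "pickle": (df.to_pickle, pd.read_pickle, folder / "gaze.pkl"),
        "csv": (lambda p: df.to_csv(p, index=False), pd.read_csv, folder / "gaze.csv"),
        "npz": (
            lambda p: np.savez(p, **{c: df[c].to_numpy() for c in df.columns}),
            load_npz,
            folder / "gaze.npz",
        ),
    }
    results = {}
    for name, (save, load, path) in candidates.items():
        save_s, load_s, loaded = time_round_trip(save, load, path)
        results[name] = {
            "save_s": save_s,
            "load_s": load_s,
            "bytes": path.stat().st_size,
            "n_rows": len(loaded),
        }
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, stats in benchmark_formats(make_fake_gaze(), tmp).items():
            print(name, stats)
```

Timings from synthetic data only indicate relative cost; the same harness should be pointed at a real session file before choosing a format.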

marklescroart · 2021-07-08T04:02:18Z

If loading the outputs of the gaze pipeline is slow, the problem may lie with the loader. The following code should efficiently load the .npz files that are saved by some steps of gaze analysis. It takes ~1.5 seconds on the file I tested; it runs faster (~650 ms) if the data are converted to a dict of arrays before being saved with np.savez() as output from the gaze pipeline (the existing function data_analysis.gaze.gaze_utils.dictlist_to_arraydict can be used to convert from lists of dicts to dicts of arrays). Loading a csv file with pandas takes ~550 ms, which is slightly faster, likely because csv files are not compressed; they are also ~60% bigger on disk (~78 MB vs ~49 MB).

The final decision for format should optimize the utility of the output for whatever analyses come next. DataFrames are a fine format.

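import pandas
import numpy as np
# Import the submodule explicitly so the attribute lookup below cannot fail
import data_analysis.gaze.gaze_utils

def load_gaze(gaze_file):
    """Load binocular gaze data from an .npz file into a pandas DataFrame."""
    # allow_pickle is needed because the file stores a list of dicts
    gaze_data = np.load(gaze_file, allow_pickle=True)
    dict_list = gaze_data['gaze_binocular']
    # Convert the list of dicts to a dict of arrays
    data = data_analysis.gaze.gaze_utils.dictlist_to_arraydict(dict_list)
    # Split norm_pos into separate x and y columns
    data['norm_pos_x'], data['norm_pos_y'] = data.pop('norm_pos').T
    return pandas.DataFrame(data=data)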
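
The dict-of-arrays route mentioned above can be sketched as follows. This is a minimal illustration on synthetic records, not the pipeline's actual code: `dicts_to_arrays` is a hypothetical stand-in for `dictlist_to_arraydict`, and the record fields are made up. It shows that once each field is a plain numeric array, the file can be read back without `allow_pickle=True`, because nothing in it has to be unpickled as a Python object.

```python
import io

import numpy as np


def dicts_to_arrays(dict_list):
    """Stack a list of same-keyed dicts into a dict of arrays.

    Hypothetical stand-in for dictlist_to_arraydict; assumes every dict
    has the same keys and numeric values.
    """
    keys = dict_list[0].keys()
    return {key: np.array([d[key] for d in dict_list]) for key in keys}


# Synthetic gaze records (made-up values, for illustration only)
records = [
    {"timestamp": 0.01 * i, "norm_pos": (0.5, 0.25 + 0.01 * i), "confidence": 0.9}
    for i in range(5)
]
arrays = dicts_to_arrays(records)

# Save one array per field; an in-memory buffer stands in for a file path
buffer = io.BytesIO()
np.savez(buffer, **arrays)
buffer.seek(0)

# allow_pickle defaults to False, which is enough for numeric arrays
with np.load(buffer) as npz:
    loaded = {key: npz[key] for key in npz.files}

# Same x/y split as in load_gaze above
norm_pos_x, norm_pos_y = loaded["norm_pos"].T
```

Converting at save time moves the per-record Python work out of every subsequent load, which is consistent with the faster timing reported above.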

KamranBinaee self-assigned this Jul 8, 2021

KamranBinaee added the enhancement New feature or request label Jul 8, 2021

KamranBinaee added this to the Data Loading milestone Jul 8, 2021
