Refactor the analysis result loading and saving mechanism #5

Open
KamranBinaee opened this issue Jul 8, 2021 · 1 comment

@KamranBinaee

The current data load/save mechanism and formats work, but they are very slow and inefficient. For instance, results saved as lists of dicts or dicts of dicts are extremely slow to load and save.

  • Figure out whether the format that Pupil saves to is the way to go at run time, or whether simple pandas pickling is the better choice.

  • Which data formats should be preserved and which should be changed, e.g. the pupil and marker position dictionaries?

  • What about the other test code that I used before, which uses the hdf5 format?

  • [ Evaluate the performance of the different formats, at least the top 2-3 candidates ]

  • [ Implement the saving first ]

  • [ Implement the reading next ]

  • [ Test the failure cases, e.g. a mid-analysis failure, and resume from where the analysis left off ]
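One possible way to run the performance evaluation is sketched below. It is only a rough harness under stated assumptions: the synthetic columns (`timestamp`, `confidence`, `norm_pos_x`, `norm_pos_y`), the file names, and all function names are made up for illustration and are not the pipeline's real schema or API. It times one save and one load per candidate and records the size on disk. An hdf5 candidate could be added as one more entry in `candidates`, which would need PyTables or h5py installed.

```python
import os
import pickle
import tempfile
import time

import numpy as np
import pandas as pd


def make_fake_gaze(n=10_000, seed=0):
    """Build synthetic gaze-like data (hypothetical fields, not the real schema)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "timestamp": np.arange(n) / 120.0,
        "confidence": rng.random(n),
        "norm_pos_x": rng.random(n),
        "norm_pos_y": rng.random(n),
    })


def time_roundtrip(name, save, load, path):
    """Time one save and one load, and report the file size on disk."""
    t0 = time.perf_counter()
    save(path)
    t1 = time.perf_counter()
    loaded = load(path)
    t2 = time.perf_counter()
    return {
        "format": name,
        "save_s": t1 - t0,
        "load_s": t2 - t1,
        "size_bytes": os.path.getsize(path),
        "n_rows": len(loaded),
    }


def benchmark_formats(df, out_dir):
    """Compare a pickled list of dicts, a DataFrame pickle, compressed npz, and csv."""
    records = df.to_dict("records")  # list of dicts, like the current output
    arrays = {col: df[col].to_numpy() for col in df.columns}  # dict of arrays

    def save_dictlist(path):
        with open(path, "wb") as f:
            pickle.dump(records, f)

    def load_dictlist(path):
        with open(path, "rb") as f:
            return pickle.load(f)

    def save_npz(path):
        np.savez_compressed(path, **arrays)

    def load_npz(path):
        with np.load(path) as npz:
            return pd.DataFrame({key: npz[key] for key in npz.files})

    def save_csv(path):
        df.to_csv(path, index=False)

    candidates = [
        ("dictlist_pickle", save_dictlist, load_dictlist, "gaze_dictlist.pkl"),
        ("pandas_pickle", df.to_pickle, pd.read_pickle, "gaze_df.pkl"),
        ("npz_compressed", save_npz, load_npz, "gaze.npz"),
        ("csv", save_csv, pd.read_csv, "gaze.csv"),
    ]
    rows = [
        time_roundtrip(name, save, load, os.path.join(out_dir, fname))
        for name, save, load, fname in candidates
    ]
    return pd.DataFrame(rows)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(benchmark_formats(make_fake_gaze(), tmp))
```

Timings from a single round trip on synthetic data are noisy, so the numbers would need to be repeated on real session files before choosing a format.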

@marklescroart

If loading the outputs of the gaze pipeline is slow, the problem may lie with the loader. The following code should efficiently load the .npz files that are saved by some steps of gaze analysis. It takes ~1.5 seconds on the file I tested it with; it will go faster (~650 ms) if the data are converted to a dict of arrays before being saved with np.savez() as output from the gaze pipeline (the existing function data_analysis.gaze.gaze_utils.dictlist_to_arraydict can be used to convert from lists of dicts to dicts of arrays). Loading from a csv file with pandas takes ~550 ms. That is slightly faster, likely because csv files are not compressed; they are also about 60% bigger on disk (~78 MB vs ~49 MB).

The final decision for format should optimize the utility of the output for whatever analyses come next. DataFrames are a fine format.

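import pandas
import data_analysis
import numpy as np

def load_gaze(gaze_file):
    """Load binocular gaze data from an .npz file into a pandas DataFrame."""
    # allow_pickle=True is required because the list of dicts is stored as an object array
    gaze_data = np.load(gaze_file, allow_pickle=True)
    dict_list = gaze_data['gaze_binocular']
    # Convert the list of per-sample dicts to a dict of arrays, one array per field
    data = data_analysis.gaze.gaze_utils.dictlist_to_arraydict(dict_list)
    # Split the norm_pos array (one x, y pair per sample) into separate columns
    data['norm_pos_x'], data['norm_pos_y'] = data.pop('norm_pos').T
    return pandas.DataFrame(data=data)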
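
The faster route mentioned above (convert to a dict of arrays before saving) could look roughly like the sketch below. `dictlist_to_arraydict_sketch` is a hypothetical stand-in for `data_analysis.gaze.gaze_utils.dictlist_to_arraydict`, whose actual behavior may differ, and `save_gaze_arrays` / `load_gaze_arrays` are illustrative names, not functions in the repository. The sketch assumes every record has the same keys and that all values are numeric (or fixed-length sequences of numbers), so the saved arrays can be loaded without `allow_pickle`.

```python
import numpy as np
import pandas as pd


def dictlist_to_arraydict_sketch(dict_list):
    """Hypothetical stand-in: stack each field across records into one array."""
    keys = dict_list[0].keys()
    return {key: np.array([record[key] for record in dict_list]) for key in keys}


def save_gaze_arrays(dict_list, path):
    """Save one array per field instead of one pickled list of dicts."""
    np.savez(path, **dictlist_to_arraydict_sketch(dict_list))


def load_gaze_arrays(path):
    """Load the per-field arrays and split norm_pos into x and y columns."""
    with np.load(path) as npz:
        data = {key: npz[key] for key in npz.files}
    data["norm_pos_x"], data["norm_pos_y"] = data.pop("norm_pos").T
    return pd.DataFrame(data)
```

With this layout the conversion cost is paid once at save time rather than on every load.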

@KamranBinaee KamranBinaee self-assigned this Jul 8, 2021
@KamranBinaee KamranBinaee added the enhancement New feature or request label Jul 8, 2021
@KamranBinaee KamranBinaee added this to the Data Loading milestone Jul 8, 2021