Define a new file format and implement InputFormat and RecordReader for it #77

Open
beomyeol opened this issue Aug 19, 2015 · 19 comments

Comments

@beomyeol
Contributor

Different datasets store their data in different file formats. For example, the MNIST database uses its own file format, while the CIFAR-10 database is distributed both as Python pickle files and in its own file format. Supporting all of these file formats is too burdensome.
So I suggest defining a new file format that our DNN uses to load data from files. To support a variety of datasets such as MNIST and ImageNet, we can convert these datasets to our file format and provide them to the DNN.
After defining the new file format, we also need an InputFormat and a RecordReader for it in order to run our neural network on REEF.

@beomyeol beomyeol changed the title Define new file format and implement InputFormat and RecordReader for it Define a new file format and implement InputFormat and RecordReader for it Aug 19, 2015
@dongjoon-hyun
Contributor

I agree with your concerns. Dolphin is a framework, not an application. At this point, though, I am concerned about reinventing the wheel. Why not use ND4J serialization? If you agree to use the ND4J format, @swlsw and I will write a converter for it. For simplicity, let's assume the following.

  1. All data is in matrix form, using the INDArray interface.
  2. The training / cross-validation / test data sets are loaded as an m x n matrix.
    • m is the number of instances.
    • n is the number of features.
  3. The label data set is loaded as an m x 1 vector.

By the way, in practice we use neither the MNIST vector format nor the CIFAR-10 pickle files. We mostly use only the original files, such as JPEG.

  • For MNIST, the original data is JPEG files (black and white).
    The folder structure is described in Support Multiple Data Sources #63. (hdfs://data/image/mnist/jpg)
    @jsjason, could you download this from the SKT cluster into your cluster if you need it?
    You can also use the SKT cluster itself.
  • For ImageNet, the original data is also JPEG files (RGB color).
    You can download them from our cluster as well (hdfs://data/image/imagenet/tar).
    You can connect through VPN. Please get one from @jsjason.

@beomyeol
Contributor Author

Thank you for your suggestion, @dongjoon-hyun. I will consider ND4J serialization and discuss this with @jsjason. If it is okay to use it, I will let you know and start implementing it.

I can connect to the SKT cluster through VPN, thanks to @jsjason.

@dongjoon-hyun
Contributor

Thank you for considering it. By the way, I found the following code in DL4J and ND4J. The file is actually a plain text file delimited by three spaces ("   ").

DL4J

        ClassPathResource resource = new ClassPathResource("/mnist2500_X.txt");
        File f = resource.getFile();
        INDArray data = Nd4j.readNumpy(f.getAbsolutePath(),"   ").get(NDArrayIndex.interval(0,100),NDArrayIndex.interval(0,784));

ND4J

    /**
     * Read line via input streams
     *
     * @param filePath the input stream ndarray
     * @param split    the split separator
     * @return the read txt method
     */
    public static INDArray readNumpy(String filePath, String split) throws IOException {
        return readNumpy(new FileInputStream(filePath), split);
    }

    /**
     * Read line via input streams
     *
     * @param filePath the input stream ndarray
     * @return the read txt method
     */
    public static INDArray readNumpy(String filePath) throws IOException {
        return readNumpy(filePath, "\t");
    }

I think we already have a NumPy-compatible read function in ND4J.

@beomyeol
Contributor Author

Thank you for letting me know about readNumpy(). But I have a concern: is it okay to use a plain text file with a delimiter? A plain text file generally needs more space than a binary file.

In addition, I looked at the code of readNumpy() in the ND4J library. It supports NumPy-compatible plain text files, but it does not support NumPy binary files such as .npy or .npz.
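
To make the size concern concrete, here is a minimal sketch that encodes the same row of floats once as delimited decimal text and once as raw 4-byte floats, then compares the byte counts. It uses only the JDK; the class and method names (`EncodingSizeSketch`, `textSize`, `binarySize`) are hypothetical and not part of ND4J or Dolphin.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class EncodingSizeSketch {

    /** Bytes needed when each value is written as decimal text, joined by the delimiter, plus a newline. */
    static int textSize(float[] values, String delimiter) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(delimiter);
            }
            sb.append(Float.toString(values[i]));
        }
        sb.append('\n');
        return sb.toString().getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes needed when each value is written as a raw 4-byte IEEE 754 float. */
    static int binarySize(float[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Float.BYTES);
        for (float v : values) {
            buffer.putFloat(v);
        }
        return buffer.position();
    }

    public static void main(String[] args) {
        // Illustrative values only; real pixel data may have more or fewer digits.
        float[] row = {0.12345678f, 0.5f, 0.98765432f, 0.0f};
        System.out.println("text bytes:   " + textSize(row, " "));
        System.out.println("binary bytes: " + binarySize(row));
    }
}
```

The binary size is always 4 bytes per value, while the text size depends on how many digits each value prints with. For data that is mostly short values such as 0 or 1, the text form can be about as compact; the gap grows as values carry more significant digits.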

@dongjoon-hyun
Contributor

Yep, that is right. But I think we can rely on the ND4J layer for that part.
If we design our architecture with an ND4J layer that handles readNumpy, converting from NumPy is a piece of cake. We can implement the converter as a small Python script that opens a .npy file and stores it as .txt. :)

@dongjoon-hyun
Contributor

By the way, for efficiency, we have to distinguish between the input file format and the internal storage format. My opinions so far are as follows.

  • For the input format, just call readNumpy().
  • For the internal storage format, just use ND4J serialization.

@jsjason
Contributor

jsjason commented Aug 20, 2015

@dongjoon-hyun When you say 'internal storage format', are you referring to the intermediate and final output data?

@jsjason
Contributor

jsjason commented Aug 20, 2015

One thing I am concerned about is our dependency on ND4J. I don't know much about scientific computing libraries, but is it okay to rely on ND4J this much? We could search for and use a library with a larger community.

@beomyeol
Contributor Author

@dongjoon-hyun,
Okay, we can use a plain text format as the input format. I have one more concern about it.
REEF does not currently support multiple data sources, as we discussed in #63. We need to put images and labels into a single text file and define a format for that file. I am thinking of the following format.

(image) (delimiter) (label) (newline)
(image) (delimiter) (label) (newline)
...
(image) (delimiter) (label) (newline)

With readNumpy(), both the image and label data can be loaded, and we can use ',' as the delimiter, for example. Is this format fine to use? If so, I don't think we need a custom InputFormat and RecordReader. We can just use TextInputFormat.
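
A minimal sketch of how one line in the proposed `(image) (delimiter) (label)` layout could be parsed, assuming the image is flattened into delimiter-separated numeric values and the label is the last token. It uses only the JDK rather than readNumpy(); `LabeledLineSketch`, `Record`, and `parseLine` are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LabeledLineSketch {

    /** One parsed line: the feature values followed by a single label. */
    static final class Record {
        final float[] features;
        final float label;

        Record(float[] features, float label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Splits a line on the literal delimiter; the last token is the label, the rest are features. */
    static Record parseLine(String line, String delimiter) {
        String[] tokens = line.trim().split(Pattern.quote(delimiter));
        if (tokens.length < 2) {
            throw new IllegalArgumentException("expected at least one feature and a label: " + line);
        }
        float[] features = new float[tokens.length - 1];
        for (int i = 0; i < features.length; i++) {
            features[i] = Float.parseFloat(tokens[i].trim());
        }
        float label = Float.parseFloat(tokens[tokens.length - 1].trim());
        return new Record(features, label);
    }

    public static void main(String[] args) {
        // Three pixel values followed by the label 7, comma-delimited (made-up sample line).
        Record record = parseLine("0.0,0.5,1.0,7", ",");
        System.out.println("features = " + Arrays.toString(record.features));
        System.out.println("label    = " + record.label);
    }
}
```

Because each record sits on its own line, a line-oriented reader such as TextInputFormat would hand the mapper exactly the string that `parseLine` expects.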

@bgchun
Contributor

bgchun commented Aug 20, 2015

@beomyeol It'd be nice to resolve #63 eventually. But if #63 takes time, we should address it later since there are other more important issues.

@dongjoon-hyun
Contributor

@jsjason, by 'internal storage format' I really meant Dolphin's own internal format, if one is needed. It is not the output format.

As for the dependency, I always welcome further research and proposals for a better BLAS library that supports CPU/GPU. :)

@dongjoon-hyun
Contributor

Er, @beomyeol, I meant a float matrix for Dolphin. Sorry for confusing you. All image/sound/text data will be transformed into NumPy matrices for Dolphin by @swlsw and me, so Dolphin does not need to care about that. What I described above is the real final application goal. For Dolphin's neural network algorithm, you can assume a float matrix as input and perform mathematical operations only.

@dongjoon-hyun
Contributor

For the train/test data and labels, you can read them in a way similar to what you described, i.e., as an m x (n + 1) matrix.

  • m : the number of data instances
  • n : the number of features
  • 1 : the last column is the label column.
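
As a rough sketch of this layout, the following splits an m x (n + 1) matrix into an m x n feature matrix and a label vector of length m. Plain `float[][]` arrays stand in for INDArray here, so this is not ND4J code; `SplitLabelColumnSketch`, `features`, and `labels` are hypothetical names.

```java
import java.util.Arrays;

public class SplitLabelColumnSketch {

    /** Returns the first n columns of every row of an m x (n + 1) matrix. */
    static float[][] features(float[][] data) {
        float[][] x = new float[data.length][];
        for (int i = 0; i < data.length; i++) {
            requireFeatureAndLabel(data[i], i);
            x[i] = Arrays.copyOf(data[i], data[i].length - 1);
        }
        return x;
    }

    /** Returns the last column of every row, i.e. the label of each instance. */
    static float[] labels(float[][] data) {
        float[] y = new float[data.length];
        for (int i = 0; i < data.length; i++) {
            requireFeatureAndLabel(data[i], i);
            y[i] = data[i][data[i].length - 1];
        }
        return y;
    }

    private static void requireFeatureAndLabel(float[] row, int index) {
        if (row.length < 2) {
            throw new IllegalArgumentException("row " + index + " needs at least one feature and a label");
        }
    }

    public static void main(String[] args) {
        // m = 2 instances, n = 3 features, last column is the label (made-up values).
        float[][] data = {
            {0.0f, 0.5f, 1.0f, 7.0f},
            {1.0f, 0.0f, 0.0f, 3.0f},
        };
        System.out.println("features = " + Arrays.deepToString(features(data)));
        System.out.println("labels   = " + Arrays.toString(labels(data)));
    }
}
```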

In addition, Dolphin should be able to load a pre-trained model. This is more important. Do you have any ideas for this?

@jsjason
Contributor

jsjason commented Aug 20, 2015

The pre-trained model equals the initial parameter set for the DNN case, right? Unlike the other algorithms, for DNNs we are trying to provide a ParameterInitializer that generates the initial values for edge weights and biases.

@beomyeol
Contributor Author

@dongjoon-hyun, I am still a little confused. What is the format of the file that Dolphin loads? Is it a NumPy-compatible plain text format like 'mnist2500_X.txt' in DL4J?

In addition, I have not thought about the pre-trained model yet. We may need a feature for taking a snapshot of a neural network and a feature for reconstructing the network from the snapshot. I'd like to discuss this as a separate issue.

@dongjoon-hyun
Contributor

@jsjason, that's right. ParameterInitializer sounds good!

@dongjoon-hyun
Contributor

@beomyeol, for the first question, yes. For the second question, @jsjason answered in the previous comment.

@jsjason
Contributor

jsjason commented Aug 20, 2015

Thanks, @beomyeol and @dongjoon-hyun. Let's keep this issue open, since we'll probably have more discussions when PRs start to come up.

@beomyeol
Contributor Author

Thanks, @dongjoon-hyun, for your comment :)
