TFRecord.md is a guide to TensorFlow Records: what they are and how they can be used.
This document highlights the most important information described in the references. Readers who want more detail are strongly encouraged to read them.

Overview

TensorFlow is Google's second-generation machine learning framework, created for researching and developing AI/ML and DNN applications. It is widely used in both academia and industry. TFRecord is TensorFlow's own binary storage format.

Advantages:
If you are working with large datasets, using a binary format for storing and reading your data can have a significant impact on the performance of your import pipeline and, as a consequence, on the training time of your model.
👍 Binary data takes up less space on disk.
👍 Binary data takes less time to copy and can be read much more efficiently from disk.
👍 It is easy to combine multiple datasets, and the format integrates seamlessly with the data import and preprocessing functionality provided by the library.
👍 For datasets that are too large to fit fully in memory, only the data required at a given time (e.g. a batch) is loaded from disk and then processed.
👍 It is possible to store sequence data.
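
The batch-wise loading mentioned above can be sketched as follows. This is a minimal illustration, assuming TensorFlow 2.x with eager execution; the file name `ages.tfrecord`, the single `Age` feature, and the batch size of 4 are made up for the example, and the tf.train.Example structure it relies on is described in the Methodology section.

```python
import os
import tempfile

import tensorflow as tf

# Write ten tiny records to a temporary file (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "ages.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for age in range(20, 30):
        example = tf.train.Example(features=tf.train.Features(feature={
            "Age": tf.train.Feature(int64_list=tf.train.Int64List(value=[age])),
        }))
        writer.write(example.SerializeToString())

# Schema used to decode each serialized record: one scalar int64 per record.
schema = {"Age": tf.io.FixedLenFeature([], tf.int64)}


def parse(serialized):
    """Decode one serialized tf.train.Example into a dict of tensors."""
    return tf.io.parse_single_example(serialized, schema)


# Records are streamed from disk and grouped into batches of up to 4,
# so the whole file never has to be held in memory at once.
dataset = tf.data.TFRecordDataset(path).map(parse).batch(4)
batch_sizes = [int(batch["Age"].shape[0]) for batch in dataset]
```

Each element of `dataset` is a dictionary holding one batch of `Age` values, and `batch_sizes` records how many examples each batch contained.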


Disadvantages:
👎 You have to convert your data to this format, and there is relatively little documentation that fully describes how to do so.

Methodology

A TFRecord file stores your data as a sequence of binary strings. This means that you need to declare the structure of your data before you can write it to a file.
TensorFlow gives us two message types for this purpose:

  • tf.train.Example
  • tf.train.SequenceExample

You can then use:

  • tf.python_io.TFRecordWriter (tf.io.TFRecordWriter in TensorFlow 2.x)

to write them to disk.
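
To make the idea of a "binary string" concrete, here is a minimal sketch that declares a one-feature tf.train.Example, serializes it, and parses it back. The feature name `Age` and its value are illustrative assumptions, not part of any fixed schema.

```python
import tensorflow as tf

# Declare the structure: a single integer feature named "Age" (hypothetical).
example = tf.train.Example(features=tf.train.Features(feature={
    "Age": tf.train.Feature(int64_list=tf.train.Int64List(value=[29])),
}))

# Serialization turns the structured message into a plain Python bytes object;
# this is what a TFRecord file stores, one record after another.
serialized = example.SerializeToString()

# Parsing the bytes restores an equivalent message.
restored = tf.train.Example.FromString(serialized)
restored_age = restored.features.feature["Age"].int64_list.value[0]
```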

How to use?

tf.train.Example

If your dataset consists of features, where each feature is a list of values of the same type, tf.train.Example is the right structure to use.
Consider the data from a movie recommendation application:

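| Age | Movie | Movie Ratings | Suggestion | Suggestion Purchased | Purchase Price |
|-----|-------|---------------|------------|----------------------|----------------|
| 29 | The Shawshank Redemption | 9.0 | Inception | 1.0 | 9.99 |
| | Fight Club | 9.7 | | | |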

We now have a list of features, and each feature holds values of a single type:

  • feature Age is an integer
  • feature Movie is a string
  • feature Movie Ratings is a real number
  • feature Suggestion is a string
  • feature Suggestion Purchased is a real number
  • feature Purchase Price is a real number

We need to create the lists that constitute the features by using:

  • tf.train.BytesList
  • tf.train.FloatList
  • tf.train.Int64List

movie_name_list = tf.train.BytesList(value=[b'The Shawshank Redemption', b'Fight Club'])

movie_rating_list = tf.train.FloatList(value=[9.0, 9.7])

Python strings need to be converted to bytes before they are stored in

  • tf.train.BytesList

Each list is then wrapped in a tf.train.Feature:

movie_names = tf.train.Feature(bytes_list=movie_name_list)

movie_ratings = tf.train.Feature(float_list=movie_rating_list)
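
The two steps above (build a typed list, then wrap it in a tf.train.Feature) are often folded into small helper functions. The sketch below is one possible way to do that; the helper names `bytes_feature`, `float_feature`, and `int64_feature` are hypothetical and not part of TensorFlow. It also shows the string-to-bytes conversion and the tf.train.Int64List case for the Age column.

```python
import tensorflow as tf


def bytes_feature(values):
    """Wrap a list of str or bytes values; str values are UTF-8 encoded first."""
    encoded = [v.encode("utf-8") if isinstance(v, str) else v for v in values]
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=encoded))


def float_feature(values):
    """Wrap a list of real numbers (stored as 32-bit floats)."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))


def int64_feature(values):
    """Wrap a list of integers."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))


age = int64_feature([29])
movie_names = bytes_feature(["The Shawshank Redemption", "Fight Club"])
movie_ratings = float_feature([9.0, 9.7])
```

Because tf.train.FloatList stores 32-bit floats, a value such as 9.7 is read back only approximately.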

Collect all named features by using

  • tf.train.Features

movie_dict = {'Movie Names': movie_names, 'Movie Ratings': movie_ratings}

movies = tf.train.Features(feature=movie_dict)

Wrap the features in a tf.train.Example and write it to disk by using

  • tf.python_io.TFRecordWriter

example = tf.train.Example(features=movies)

with tf.python_io.TFRecordWriter('movie_ratings.tfrecord') as writer:
    writer.write(example.SerializeToString())
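
Putting the steps of this section together, the following is a minimal end-to-end sketch that writes the movie example and reads it back as a check. It assumes TensorFlow 2.x, where the writer is exposed as tf.io.TFRecordWriter and datasets can be iterated eagerly; writing to a temporary directory is a choice made only for this illustration.

```python
import os
import tempfile

import tensorflow as tf

# Build the named features, then the Example that holds them.
movies = tf.train.Features(feature={
    "Movie Names": tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[b"The Shawshank Redemption", b"Fight Club"])),
    "Movie Ratings": tf.train.Feature(float_list=tf.train.FloatList(
        value=[9.0, 9.7])),
})
example = tf.train.Example(features=movies)

# Write the serialized Example as a single record.
path = os.path.join(tempfile.mkdtemp(), "movie_ratings.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# Read the file back and parse every record into a tf.train.Example.
restored = []
for raw_record in tf.data.TFRecordDataset(path):
    parsed = tf.train.Example()
    parsed.ParseFromString(raw_record.numpy())
    restored.append(parsed)

names = list(restored[0].features.feature["Movie Names"].bytes_list.value)
```

Reading the record back is not required for writing, but it is a cheap way to confirm that the file contains what was intended.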

tf.train.SequenceExample

If your features consist of lists of identically typed data, possibly accompanied by some contextual data, tf.train.SequenceExample is the right structure to use. Consider the following example:

Data

| | Movie 1 | Movie 2 |
|---|---------|---------|
| Movie name | The Shawshank Redemption | Fight Club |
| Movie Rating | 4.5 | 5 |
| Actor | Tim Robbins, Morgan Freeman | Brad Pitt, Edward Norton, Helena Bonham Carter |

Context

| Locale | Age | Favorites |
|--------|-----|-----------|
| "pt_BR" | 19.0 | Majesty Rose, Savannah Outen, One Direction |

In this example we have a number of context features (Locale, Age, Favorites) and a list of movie recommendations, each consisting of Movie name, Movie Rating, and Actor.

The data looks very similar to the previous example, where each feature consisted of a single list and each entry in the list represented the same information for a different movie, for example:

  • Movie Rating

But now we also have Actor, where each movie has its own list of actors, so the feature is a list of lists. This kind of data cannot be stored in a tf.train.Example. We need a different structure, tf.train.SequenceExample, which has two attributes:

  • context of type tf.train.Features
  • feature_lists of type tf.train.FeatureLists

Data from the Context table is stored in tf.train.Features; the Movie name, Movie Rating, and Actor data is stored in tf.train.FeatureLists.
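
A minimal sketch of how the tables above could be expressed as a tf.train.SequenceExample is shown below. The feature names used as dictionary keys are illustrative choices, and storing Age as a float simply follows the 19.0 in the Context table. Each tf.train.FeatureList holds one tf.train.Feature per movie, which is what allows the Actor entry to carry a different number of names for each movie.

```python
import tensorflow as tf


def bytes_feature(values):
    """Hypothetical helper: wrap a list of bytes values in a tf.train.Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))


def float_feature(values):
    """Hypothetical helper: wrap a list of floats in a tf.train.Feature."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))


# Context: information that applies to the whole sequence.
context = tf.train.Features(feature={
    "Locale": bytes_feature([b"pt_BR"]),
    "Age": float_feature([19.0]),
    "Favorites": bytes_feature(
        [b"Majesty Rose", b"Savannah Outen", b"One Direction"]),
})

# Feature lists: one entry per movie, in the same order in every list.
feature_lists = tf.train.FeatureLists(feature_list={
    "Movie Names": tf.train.FeatureList(feature=[
        bytes_feature([b"The Shawshank Redemption"]),
        bytes_feature([b"Fight Club"]),
    ]),
    "Movie Ratings": tf.train.FeatureList(feature=[
        float_feature([4.5]),
        float_feature([5.0]),
    ]),
    "Actors": tf.train.FeatureList(feature=[
        bytes_feature([b"Tim Robbins", b"Morgan Freeman"]),
        bytes_feature([b"Brad Pitt", b"Edward Norton", b"Helena Bonham Carter"]),
    ]),
})

sequence_example = tf.train.SequenceExample(
    context=context, feature_lists=feature_lists)

# The serialized bytes can be written with a TFRecordWriter like any Example.
serialized = sequence_example.SerializeToString()
restored = tf.train.SequenceExample.FromString(serialized)
```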

Practice

Reference

[1] Tensorflow Records? What they are and how to use them
[2] How to use TFRecord with Datasets and Iterators in Tensorflow with code samples
[3] TensorFlow Tutorial For Beginners
[4] Using TFRecords and tf.Example