TrainingLoop: refactor progress printer and add CSVLogger #668

Merged (5 commits) on Sep 24, 2020

Conversation

xihui-wu (Contributor) commented Sep 15, 2020:

This is the first step in addressing #663. Highlights of the change:

  1. Add StatisticsRecorder and ProgressPrinter as pre-registered callbacks to TrainingLoop (as fastai does).
  2. Add a CSVLogger callback (see the usage sketch below).
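
For context, a rough sketch of how the new CSVLogger callback might be attached to a TrainingLoop, modeled on the LeNet-MNIST example this PR touches. The surrounding setup (dataset, model, optimizer, device) and the exact argument labels are assumptions, not necessarily the merged API:

// Hypothetical usage sketch; labels and setup are illustrative, not the merged API.
import TrainingLoop

let csvLogger = CSVLogger(withPath: "run/log.csv")  // initializer label per the diff discussed below

var trainingLoop = TrainingLoop(
  training: dataset.training,
  validation: dataset.validation,
  optimizer: optimizer,
  lossFunction: softmaxCrossEntropy,
  callbacks: [csvLogger.log])  // user callbacks run after the pre-registered defaults

try trainingLoop.fit(&model, epochs: 10, on: device)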

Comment on lines 283 to 290
let statisticsRecorder = StatisticsRecorder(liveStatistics: true, metrics: [.loss] + metrics)
let progressPrinter = ProgressPrinter(liveStatistics: true)
self.statisticsRecorder = statisticsRecorder
self.progressPrinter = progressPrinter
self.callbacks = [
  statisticsRecorder.record,
  progressPrinter.print,
] + callbacks
Contributor commented:
I can imagine cases where we don't want to record per-batch or per-epoch statistics. By building statistics gathering directly into the training loop, rather than having it be a user-provided component, can we account for those cases where we want to strip down the training loop as much as possible?

xihui-wu (Author) replied:

Yes, please see the latest change, where I make both whether to include the default callbacks and when to record stats configurable.
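
For illustration, a minimal self-contained sketch of that opt-out pattern; the flag name includeDefaultCallbacks and the event/callback types are stand-ins, not the actual TrainingLoop API:

// Self-contained sketch of the opt-out pattern; names are assumptions.
enum LoopEvent { case batchEnd, epochEnd }
typealias Callback = (LoopEvent) -> Void

struct SketchTrainingLoop {
  var callbacks: [Callback]

  init(callbacks: [Callback] = [], includeDefaultCallbacks: Bool = true) {
    if includeDefaultCallbacks {
      // Stand-ins for StatisticsRecorder.record and ProgressPrinter.print.
      let recordStatistics: Callback = { _ in /* accumulate per-batch metrics */ }
      let printProgress: Callback = { event in print("event: \(event)") }
      self.callbacks = [recordStatistics, printProgress] + callbacks
    } else {
      // Fully stripped-down loop: no per-batch or per-epoch statistics at all.
      self.callbacks = callbacks
    }
  }
}

// A bare loop with no default bookkeeping:
let bareLoop = SketchTrainingLoop(includeDefaultCallbacks: false)
print(bareLoop.callbacks.count)  // 0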

@dabrahams (Contributor) left a review comment:

Reviewed the first file. Please revise (and apply anything you learned from doing so to the other files) and I can look again. Thanks!

(Review threads on Examples/LeNet-MNIST/main.swift and Support/FoundationFileSystem.swift were resolved or marked outdated.)
import Foundation
import ModelSupport

/// A callback-based handler for logging the statistics to CSV file.
Contributor commented:
nice doc comment. English nit:

Suggested change
/// A callback-based handler for logging the statistics to CSV file.
/// A callback-based handler for logging statistics to a CSV file.

I would consider not leading with “callback-based.” In fact, “logging” and “CSV file” are kind of implied by the name. So the best description would explain what's being logged. “Statistics” is good, but what kind of statistics? Training statistics, maybe? Maybe this should be thought of as “a log file” rather than “a logger”?

xihui-wu (Author) replied:
Updated the doc comment. Any reason why this should be LogFile? (Logger makes sense to me since it's a logger, not a file.)

@dabrahams (Contributor) commented Sep 23, 2020:
This thing can only be constructed to log to a path in the filesystem, and when we invoke its one public method, log, it appends to that file. Properly used, if you have multiple instances, each instance logs to a different file. So instances have a 1-1 correspondence to log files. The word Logger doesn't imply any of those things. A more general Logger might be an interesting abstraction, but it would have a different API.

If I see this code, I know exactly what's happening.

// No argument label needed, arguably, because when you construct a thing called “File” with
// a string, the string is obviously a path.
let eventLog = LogFile(s)

eventLog.append(blah, blah, blah)

With Logger and log, it's less clear.
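
To make the distinction concrete, a small self-contained sketch of the kind of type being described here; the names and the FileHandle-based body are illustrative, not this PR's implementation:

import Foundation

// Illustrative only: an append-only "log file" whose instances correspond 1-1
// with files on disk, matching the naming argument above.
struct LogFile {
  let path: String

  // No argument label: constructing a "File" from a string implies a path.
  init(_ path: String) { self.path = path }

  func append(_ line: String) throws {
    let data = Data((line + "\n").utf8)
    if let handle = FileHandle(forWritingAtPath: path) {
      defer { handle.closeFile() }
      handle.seekToEndOfFile()
      handle.write(data)
    } else {
      // First write: create the file.
      try data.write(to: URL(fileURLWithPath: path))
    }
  }
}

// let eventLog = LogFile("run/events.csv")
// try eventLog.append("epoch, batch, loss")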

(Six further review threads on TrainingLoop/Callbacks/CSVLogger.swift were resolved or marked outdated.)
"\n" + ([epoch, batch] + stats.map { String($0.1) }).joined(separator: ", ")
).data(using: .utf8)!
do {
try foundationFile.append(dataRow)
Contributor commented:
Isn't it important to do this atomically if there are multiple writers to the same log file?

xihui-wu (Author) replied:
Does FileHandle support writing atomically? (Also, I don't think we expect multiple files to be written at the same time.)
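
For what it's worth, FileHandle itself doesn't offer an atomic append. If concurrent writers within one process ever became a concern, one option (purely illustrative, not part of this PR) would be to serialize appends through a single dispatch queue:

import Foundation

// Illustrative sketch: funnel every append through one serial queue so rows
// from concurrent callers in the same process can't interleave in the file.
final class SerializedAppender {
  private let handle: FileHandle
  private let queue = DispatchQueue(label: "csv-logger.append")

  init?(path: String) {
    guard let handle = FileHandle(forWritingAtPath: path) else { return nil }
    self.handle = handle
  }

  deinit { handle.closeFile() }

  func append(_ row: String) {
    queue.sync {
      handle.seekToEndOfFile()
      handle.write(Data((row + "\n").utf8))
    }
  }
}

Writers in separate processes would still need something like file locking, which this sketch doesn't attempt.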

@xihui-wu force-pushed the tl branch 2 times, most recently from 1c043d4 to c3b0082 on September 22, 2020 at 05:51.
(Further review threads on Support/FoundationFileSystem.swift and TrainingLoop/Callbacks/CSVLogger.swift were resolved.)

/// Creates an instance that logs statistics during the training loop.
public init(withPath path: String = "run/log.csv") {
  self.path = path
Contributor commented:
How did you “resolve” this comment?

(Another resolved thread on TrainingLoop/Callbacks/CSVLogger.swift.)
/// The last input fed to the model.
var lastInput: Input? { get set }
var lastStepInput: Input? { get set }
Contributor commented:
When you see repetition like this, the instinct should be to replace it with something like:

var lastStep: Step?
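
One way to read that suggestion (hypothetical, not code from this PR): group the per-step values, whatever they turn out to be, into a single struct so the protocol keeps only one optional property:

// Hypothetical grouping; the field names are assumed, not taken from the PR.
struct Step<Input, Output, Loss> {
  var input: Input
  var output: Output
  var loss: Loss
}

// The separate lastStepInput / lastStepOutput / lastStepLoss requirements
// could then collapse into:
//   var lastStep: Step<Input, Output, Loss>? { get set }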

// MARK: - Epoch-level data

/// The number of epochs we are currently fitting for.
var epochCount: Int? { get set }
Contributor commented:
Why are these optional? Usually, 2-phase initialization indicates a design error.
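
To illustrate the alternative being hinted at (a hypothetical restructuring, not a concrete proposal): epoch bookkeeping can live as locals in fit and be passed to callbacks as context, so nothing has to start out nil:

// Hypothetical sketch: keep epoch bookkeeping local to fit(epochs:) instead of
// storing it in optional properties that are nil until training starts.
struct SketchLoop {
  var callbacks: [(_ epoch: Int, _ epochCount: Int) -> Void] = []

  func fit(epochs epochCount: Int) {
    for epoch in 0..<epochCount {
      // ... run the batches for this epoch ...
      for callback in callbacks { callback(epoch, epochCount) }
    }
  }
}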


// MARK: - Others

/// The log for last statistics
Contributor commented:
last what? We have last epoch, last step, …

// Callbacks
/// The callbacks used to customize the training loop.
public var callbacks: [TrainingLoopCallback<Self>] = []
/// The metrics
Contributor commented:
Do better please


// MARK: - The callbacks used to customize the training loop.

public var callbacks: [TrainingLoopCallback<Self>]
Contributor commented:
Doc comments missing.

@dabrahams (Contributor) left a review:

Approving to unblock progress but there are lots of things to fix, including many that are simply mechanical (missing doc comments).

@xihui-wu (Author) replied:

> Approving to unblock progress but there are lots of things to fix, including many that are simply mechanical (missing doc comments).

Thanks! I will follow up with new PRs to address:

  1. @dabrahams's open comments
  2. An easier interface for specifying when to compute stats (as per @BradLarson)
  3. Potential changes to metrics/loss, based on new findings from converting GPT2 to TrainingLoop. (This is also what I'm blocked on and want to iterate on, on top of this PR's changes.)

0712zwang added a commit to 0712zwang/swift-models that referenced this pull request Nov 30, 2020