Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WordSeg] Add inline documentation #566

Merged
merged 35 commits into from
Jun 12, 2020

Conversation

texasmichelle
Copy link
Member

@texasmichelle texasmichelle commented May 27, 2020

Primarily adds inline documentation to WordSeg components. Also:

  • Moves CharacterErrors into CharacterSequence.swift, where it is primarily used.
  • Conforms naming to Swift conventions
  • Renames WordSegRecord -> Phrase
  • In WordSegDataset
    • Combines several functions
    • Removes a raw loop
    • Replaces optional members with empty arrays
    • Renames DownloadDetails -> DownloadableArchive and convertDataset() -> numericalizeDataset()
  • Adds a test for running with training data only

Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Support/Text/WordSeg/Lexicon.swift Show resolved Hide resolved
@texasmichelle texasmichelle marked this pull request as ready for review May 29, 2020 00:01
Copy link
Contributor

@dabrahams dabrahams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some preliminary comments in preparation for our session today.

Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Rename DownloadDetails to ReferenceArchive
Combine URL with filename and extension
Update main summary in WordSegDataset
Add blank line before doc comments in WordSegDataset
Remove explicit parameter descriptions and add them to summaries.
Handle errors instead of throwing.
Remove CharacterErrors.nonUtf8Data.
Update attribute names in dataset tests.
Update summary in Phrase to include parameter names.
Remove explicit parameter descriptions and add them to summaries.
Conform parameter names to Swift conventions.
public let plainText: String

/// A sequence of text in numeric form, derived from `plainText`.
public let numericalizedText: CharacterSequence
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember wanting to mention this before, but if “Character” doesn't mean what in Swift is called Character, we should look for other names, e.g. “glyph.”

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a start, I created #600 for using Character instead of String.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't exactly address this, but heads in the direction of cleaning up that design overall.

public let numericalizedText: CharacterSequence

/// Creates an instance containing both raw (`plainText`) and processed
/// (`numericalizedText`) forms of a sequence of text.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always thought “numericalized” read terribly awkwardizedly. I get the impression this is a term of art, but we should discuss whether it's the best choice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #598

Copy link
Contributor

@dabrahams dabrahams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, let's sync up via GVC; I didn't get to the end, but there's loads of great material for the seminar on Friday. Let's talk about exactly how much/whether you want to fix things up before then. It might be best to just get your brain going on some of the questions so that the pace of the dialogue Friday is high, but leave all the issues in place.

Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
///
/// - Note: Omits any part of the dataset that cannot be converted to
/// `CharacterSequence`.
private static func convertDataset(_ dataset: [String], alphabet: Alphabet)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, you're right, there's loads to work on here; I have 1000 thoughts about this: is it just numericalizing the strings in dataset, and don't we just want to name the function after that? Should it not be an extension on Array<String>? We should say what "using" means, …

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to numericalizeDataset

texasmichelle and others added 8 commits June 9, 2020 15:50
Remove unnecessary additional `load()`
Remove unnecessary optional from `testingPhrases` and `validationPhrases`
Simplify optional filename handling in init()
Remove extra `)` from training loss output
Add test for loading only training file
Simplify and remove redundant init code
alphabet: alphabet, appendingEoSTo: trimmed))
var phrases = [Phrase]()

for data in dataset {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dabrahams Same question here about how to compose this in a way that removes the raw loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How could I use compactMap here if I need to include the original text? If I want to base my inclusion on whether CharacterSequence() results in nil, how can I include the original text in composing Phrase?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a symptom of awkward design. I believe rethinking CharacterSequence will make these acrobatics unnecessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    // untested of course
    return dataset.compactMap { data in
      let trimmed = data.split(separator: " ", omittingEmptySubsequences: true).joined()
      let numericalizedText = try? CharacterSequence(alphabet: alphabet, appendingEoSTo: trimmed)
      return numericalizedText.map { Phrase(plainText: String(data), numericalizedText: $0 }
    }

@texasmichelle

@texasmichelle texasmichelle requested a review from BradLarson June 11, 2020 18:10
Copy link
Contributor

@xihui-wu xihui-wu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for diligently working on this. Will take as number-one documentation sample in future!

Datasets/WordSeg/WordSegDataset.swift Outdated Show resolved Hide resolved
@texasmichelle texasmichelle merged commit 20fa285 into tensorflow:master Jun 12, 2020
@texasmichelle texasmichelle deleted the wordseg_docs branch June 12, 2020 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants