- Support `param` version 2
- Added `pydrobert.torch.argcheck` and standardized module argument checking a bit.
- Added `environment.yml` for local dev.
- `LookupLanguageModel` has been refactored and reimplemented. It is no longer compatible with previous versions of the model.
- `parse_arpa` has been enhanced: it can handle log-probs in scientific notation (e.g. `1e4`); convert from (implicitly) log-base-10 to log-base-e probabilities; and store log probabilities as NumPy floating-point types.
- `LookupLanguageModel` and `parse_arpa` now have an optional logger argument to log progress building the trie and parsing the file, respectively.
- Added `best_is_train` flag to `TrainingController.update_for_epoch`.
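The base conversion `parse_arpa` now performs is just the identity ln p = log10(p) · ln 10, and scientific notation is already handled by Python's float parsing. A minimal sketch of the math (not the library's implementation):

```python
import math

def log10_to_ln(log10_prob: float) -> float:
    """Convert a base-10 log-probability (the convention in ARPA
    files) to a natural-log probability: ln p = log10(p) * ln 10."""
    return log10_prob * math.log(10.0)

# Scientific-notation log-probs need no special handling:
# Python's float() already parses that form.
scientific = float("-1e-4")
```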
- Refactored `get-torch-spect-data-dir-info` to be faster.
- `subset-torch-spect-data-dir` command has been added.
- `print-torch-{ali,ref}-data-dir-length-moments` commands have been added.
- `LookupLanguageModel.prob_list` has been renamed to `prob_dicts`.
- Added `ShallowFusionLanguageModel`, `ExtractableShallowFusionLanguageModel`, and `MixableShallowFusionLanguageModel`.
- Slicing and chunking modules `SliceSpectData`, `ChunkBySlices`, and `ChunkTokenSequenceBySlices`, as well as the command `chunk-torch-spect-data-dir`, which puts them together.
- Code for handling TextGrid files, including the functions `read_textgrid` and `write_textgrid`, as well as the commands `torch-token-data-dir-to-textgrids` and `textgrids-to-torch-token-data-dir`.
- Commands for switching between ref and ali format: `torch-ali-data-dir-to-torch-token-data-dir` and `torch-token-data-dir-to-torch-ali-data-dir`.
- Added py 3.10 support; removed py 3.6 support.
- Initial (undocumented) support for PyTorch-Lightning in the `pydrobert.torch.lightning` submodule. Will document when I get some time.
- Refactored much of `pydrobert.torch.data`. Best just to look at the API. `ContextWindowEvaluationDataLoader`, `ContextWindowTrainingDataLoader`, `SpectEvaluationDataLoader`, `SpectTrainingDataLoader`, `DataSetParams`, `SpectDataSetParams`, and `ContextWindowDataSetParams` are now deprecated. The data loaders have been simplified to `ContextWindowDataLoader` and `SpectDataLoader`. Keyword arguments (like `shuffle`) now control their behaviour. The `*DataSetParams` have been renamed `*DataLoaderParams`, with some of the parameters moved around. Notably, `LangDataParams` now stores `sos`, `eos`, and `subset_ids` parameters, from which a number of parameter objects inherit. `SpectDataLoaderParams` inherits from `LangDataLoaderParams`, which in turn inherits from `DynamicLengthDataLoaderParams`. The latter allows the loader's batch elements to be bucketed by length using the new `BucketBatchSampler`. It and a number of other samplers inherit from `AbstractEpochSampler` to help facilitate the simplified loaders and better resemble the PyTorch API. Mean-variance normalization of features is possible through the loaders and the new `MeanVarianceNormalization` module. `LangDataSet` and `LangDataLoader` have been introduced to facilitate language model training. Finally, loaders (and samplers) are compatible with `DistributedDataParallel` environments.
- Mean-variance statistics for normalization may be estimated from a data partition using the command `compute-mvn-stats-for-torch-feat-data-dir`.
- Added `torch-spect-data-dir-to-wds` to convert a data dir to a WebDataset.
- Changed method of constructing random state in `EpochRandomSampler`. Rerunning training on this new version with the same seed will likely yield different results from the old version!
- `FeatureDeltas` is now a module, in case you want to compute them online rather than waste disk space.
- Added `PadMaskedSequence`.
- Added `FillAfterEndOfSequence`.
- Added `binomial_coefficient`, `enumerate_binary_sequences`, `enumerate_vocab_sequences`, and `enumerate_binary_sequences_with_cardinality`.
- Docstrings updated to hopefully be clearer. Use "Call Parameters" and "Returns" sections for PyTorch modules.
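The combinatoric helpers added above can be illustrated with plain Python. These are toy stand-ins with hypothetical signatures, not the library's code (which works on tensors), but they show the quantities involved:

```python
import itertools
import math

def binomial_coefficient(n: int, k: int) -> int:
    # number of ways to choose k items from n
    return math.comb(n, k)

def enumerate_binary_sequences(length: int):
    # all binary sequences of a given length: (0, 0), (0, 1), ...
    return itertools.product((0, 1), repeat=length)

def enumerate_binary_sequences_with_cardinality(length: int, card: int):
    # only the sequences containing exactly `card` ones
    return (s for s in enumerate_binary_sequences(length) if sum(s) == card)
```

Note the counts are consistent: there are exactly `binomial_coefficient(n, k)` binary sequences of length `n` with `k` ones.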
- readthedocs updated.
- Fixed up formatting of CLI help documentation.
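The mean-variance normalization mentioned above amounts to estimating per-dimension statistics over a partition and mapping each feature to (x - mean) / std. A framework-free sketch of the arithmetic (the actual command and `MeanVarianceNormalization` module operate on tensors and data dirs):

```python
import math

def mvn_stats(frames):
    """Per-dimension mean and (population) std over a list of
    feature vectors."""
    dim, n = len(frames[0]), len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means, stds

def apply_mvn(frame, means, stds, eps=1e-8):
    # eps guards against division by zero on constant dimensions
    return [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
```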
- Data sets can now initialize some of their parameters with the values in their associated param containers. For example, `sos` and `eos` are now set in `SpectDataSet` by passing an optional `SpectDataParam` instance. The old method (by argument) is now deprecated.
- Renamed `DataSetParams` to `DataLoaderParams` and deprecated the former naming to better mesh with their use in data loaders.
- Moved `pydrobert.torch.util.parse_arpa_lm` to `pydrobert.torch.data`.
- `SimpleRandomSamplingWithoutReplacement` has been added as a new distribution.
- `EnumerateEstimator`, `ImportanceSamplingEstimator`, and `IndependentMetropolisHastingsEstimator` have been added as new estimators.
- `pydrobert.torch.estimators` has been rewritten from the ground up, with old functionality deprecated. Distribution-related functions have been rewritten as `torch.distributions.Distribution` classes implementing a `ConditionalStraightThrough` interface and stored in `pydrobert.torch.distributions`. The REINFORCE and RELAX estimators now have an object-oriented interface subclassing `MonteCarloEstimator` as `DirectEstimator` and `RelaxEstimator`, respectively. The REBAR control variate is now distribution-specific and found in `pydrobert.torch.modules`.
- Bug fixes to `OptimalCompletion` and `HardOptimalCompletionDistillationLoss` involving batch sizes.
- Refactored code to move modules to `pydrobert.torch.modules` and functions to `pydrobert.torch.functional`.
- Deprecated `pydrobert.torch.layers` and `pydrobert.torch.util`.
- Added a number of modules to `pydrobert.torch.modules` to wrap the functional API. Moved docstrings to modules.
- Fixed a problem with `warp_1d_grid`/`SpecAugment` which made it sensitive to the length of other elements in the batch.
- Added compatibility wrappers to avoid warnings across supported PyTorch versions.
- Refactored code and added tests to support JIT tracing and scripting for most functions/modules in PyTorch >= 1.8.1 before the next release. I'll write up documentation shortly.
- Added `pydrobert.torch.config` to store constants used in the module.
- Removed `setup.py`.
- Deleted conda recipe in prep for conda-forge.
- Compatibility/determinism fixes for 1.5.1.
- Bump minimum PyTorch version to 1.5.1. Actually testing this minimum!
- `version.py` -> `_version.py`.
- A number of modifications and additions related to decoding and language models, including:
  - `beam_search_advance` and `random_walk_advance` have been simplified, with much of the end-of-sequence logic punted to their associated modules.
  - Rejigged `SequentialLanguageModel` and `LookupLanguageModel` to be both simpler and compatible with decoder interfaces.
  - `ctc_greedy_search` and `ctc_prefix_search_advance` functions have been added.
  - `ExtractableSequentialLanguageModel`, `MixableSequentialLanguageModel`, `BeamSearch`, `RandomWalk`, and `CTCPrefixSearch` modules have been added.
  - A `SequentialLanguageModelDistribution` wrapping `RandomWalk`, which implements PyTorch's `Distribution` interface. Language models now work with estimators!
  - A new documentation page on how to deal with all of that.
- Fixed bug in controller that always compared thresholds against best, not the last point that reset the countdown (#55)
- Added `pad_variable` and `RandomShift` (#54).
- Modified `error_rate` and `prefix_error_rates` to actually compute error rates when non-default costs are used. Old functionality is now in `edit_distance` and `prefix_edit_distances` (#51).
- Fixed bug in how padding is handled in string matching utilities.
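The distinction drawn above, with default unit costs, can be sketched in plain Python. These are hypothetical toy helpers, not the library's batched implementations: the edit distance counts operations, while the error rate divides by the reference length.

```python
def toy_edit_distance(ref, hyp):
    """Levenshtein distance with unit insert/delete/substitute costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free on match)
            ))
        prev = cur
    return prev[-1]

def toy_error_rate(ref, hyp):
    # normalize by the reference length
    return toy_edit_distance(ref, hyp) / len(ref)
```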
- Fixed logic errors in `compute-torch-token-data-dir-error-rates` (#50).
- Modified frame end in `pydrobert.torch.data.transcript_to_token` and added some notes on the ambiguity of the conversion.
- Added some more checks and a 'fix' flag to `pydrobert.torch.data.validate_spect_data_set`. The `get-torch-spect-data-dir-info` entry point now has a `--fix` flag, too.
A considerable amount of refactoring occurred for this build, chiefly to get
rid of Python 2.7 support. While the functionality did not change much for this
version, we have switched from a `pkgutil`-style `pydrobert` namespace to
PEP-420-style namespaces. As a result, this package is not
backwards-compatible with previous `pydrobert` packages! Make sure that if any
of the following are installed, they exceed the following version thresholds:

- `pydrobert-param >0.2.0`
- `pydrobert-kaldi >0.5.3`
- `pydrobert-speech >0.1.0`
Miscellaneous other stuff:
- Type hints everywhere
- Shifted python source to `src/`
- Black-formatted remaining source
- Removed `future` dependency
- Shifted most of the configuration to `setup.cfg`, leaving only a shell in `setup.py` to remain compatible with Conda builds
- Added `pyproject.toml` for PEP 517
- `tox.ini` for TOX testing
- Switched to AppVeyor for CI
- Added changelog :D