diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json
index 192011c..c57dcde 100644
--- a/dev/.documenter-siteinfo.json
+++ b/dev/.documenter-siteinfo.json
@@ -1 +1 @@
-{"documenter":{"julia_version":"1.11.1","generation_timestamp":"2024-10-28T17:47:15","documenter_version":"1.7.0"}}
\ No newline at end of file
+{"documenter":{"julia_version":"1.11.1","generation_timestamp":"2024-11-04T22:36:07","documenter_version":"1.7.0"}}
\ No newline at end of file
diff --git a/dev/api/extractor/index.html b/dev/api/extractor/index.html
index 43ec0bd..5c37cd7 100644
--- a/dev/api/extractor/index.html
+++ b/dev/api/extractor/index.html
@@ -34,7 +34,7 @@
 ├── b: ArrayExtractor
 │   ╰── StableExtractor(NGramExtractor(n=3, b=256, m=2053))
 ╰── c: DictExtractor
-    ╰── d: StableExtractor(ScalarExtractor(c=1.0, s=1.0))

+           ╰── d: StableExtractor(ScalarExtractor(c=1.0, s=1.0))

See also: extract, stabilizeextractor.

source
JsonGrinder.stabilizeextractorFunction
stabilizeextractor(e::Extractor)

Returns a new extractor with the same structure as e, but with all leaf extractors wrapped in StableExtractor.

Examples

julia> e = (a=ScalarExtractor(), b=CategoricalExtractor(1:5)) |> DictExtractor
 DictExtractor
   ├── a: ScalarExtractor(c=0.0, s=1.0)
   ╰── b: CategoricalExtractor(n=6)
@@ -51,7 +51,7 @@
 julia> e_stable(Dict("a" => 0))
 ProductNode  1 obs, 0 bytes
   ├── a: ArrayNode(1×1 Array with Union{Missing, Float32} elements)  1 obs, 62 bytes
-  ╰── b: ArrayNode(6×1 MaybeHotMatrix with Union{Missing, Bool} elements)  1 obs, 62 bytes

+  ╰── b: ArrayNode(6×1 MaybeHotMatrix with Union{Missing, Bool} elements)  1 obs, 62 bytes

See also: suggestextractor, extract.

source
JsonGrinder.extractFunction
extract(e::Extractor, samples; store_input=Val(false))

Efficient extraction of multiple samples at once.

Note that whereas extract expects samples to be an iterable (of known length) of individual samples, the extractor can also be called directly on a single sample as e(sample). In other words, e(sample) is equivalent to extract(e, [sample]).

See also: suggestextractor, stabilizeextractor, schema.

Examples

julia> sample = Dict("a" => 0, "b" => "foo");
 
 julia> e = suggestextractor(schema([sample]))
 DictExtractor
@@ -64,7 +64,7 @@
   ╰── b: ArrayNode(2×1 OneHotArray with Bool elements)  1 obs, 60 bytes
 
 julia> e(sample) == extract(e, [sample])
-true
+true
source
JsonGrinder.ExtractorType
Extractor

Supertype for all extractor node types.

source
JsonGrinder.LeafExtractorType
LeafExtractor

Supertype for all leaf extractor node types that reside in the leaves of the hierarchy.

source
JsonGrinder.StableExtractorType
struct StableExtractor{T <: LeafExtractor} <: LeafExtractor

Wraps any other LeafExtractor and makes it output stable results w.r.t. missing input values.

See also: stabilizeextractor.
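A minimal sketch of the behavior (constructor usage inferred from the tree displays above; printed outputs are indicated only in comments, since exact REPL rendering may differ): wrapping a leaf extractor makes it tolerate missing inputs.

```julia
# Hedged sketch: wrap a leaf extractor so it accepts missing values.
e = StableExtractor(ScalarExtractor())

e(1.0)      # ArrayNode with Union{Missing, Float32} elements
e(missing)  # also works, producing a missing-valued ArrayNode
```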

source
JsonGrinder.ScalarExtractorType
ScalarExtractor{T} <: Extractor

Extracts a numerical value, centered by subtracting c and scaled by s.

Examples

julia> e = ScalarExtractor(2, 3)
 ScalarExtractor(c=2.0, s=3.0)
 
 julia> e(0)
@@ -73,7 +73,7 @@
 
 julia> e(1)
 1×1 ArrayNode{Matrix{Float32}, Nothing}:
- -3.0
+ -3.0
source
JsonGrinder.CategoricalExtractorType
CategoricalExtractor{V, I} <: Extractor

Extracts a single item interpreted as a categorical variable into a one-hot encoded vector.

There is always an extra category for an unknown value (and hence the displayed n is one more than the number of categories).

Examples

julia> e = CategoricalExtractor(1:3)
 CategoricalExtractor(n=4)
 
 julia> e(2)
@@ -88,13 +88,13 @@
  ⋅
  ⋅
  ⋅
- 1
+ 1
source
JsonGrinder.NGramExtractorType
NGramExtractor{T} <: Extractor

Extracts String as n-grams (Mill.NGramMatrix).

Examples

julia> e = NGramExtractor()
 NGramExtractor(n=3, b=256, m=2053)
 
 julia> e("foo")
 2053×1 ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}:
  "foo"
-
+
source
JsonGrinder.DictExtractorType
DictExtractor{S} <: Extractor

Extracts all items in a Dict and returns them as a Mill.ProductNode.

Examples

julia> e = (a=ScalarExtractor(), b=CategoricalExtractor(1:5)) |> DictExtractor
 DictExtractor
   ├── a: ScalarExtractor(c=0.0, s=1.0)
   ╰── b: CategoricalExtractor(n=6)
@@ -102,13 +102,13 @@
 julia> e(Dict("a" => 1, "b" => 1))
 ProductNode  1 obs, 0 bytes
   ├── a: ArrayNode(1×1 Array with Float32 elements)  1 obs, 60 bytes
-  ╰── b: ArrayNode(6×1 OneHotArray with Bool elements)  1 obs, 60 bytes
+  ╰── b: ArrayNode(6×1 OneHotArray with Bool elements)  1 obs, 60 bytes
source
JsonGrinder.ArrayExtractorType
ArrayExtractor{T}

Extracts all items in an Array and returns them as a Mill.BagNode.

Examples

julia> e = ArrayExtractor(CategoricalExtractor(2:4))
 ArrayExtractor
   ╰── CategoricalExtractor(n=4)
 
 julia> e([2, 3, 1, 4])
 BagNode  1 obs, 64 bytes
-  ╰── ArrayNode(4×4 OneHotArray with Bool elements)  4 obs, 72 bytes
+  ╰── ArrayNode(4×4 OneHotArray with Bool elements)  4 obs, 72 bytes
source
JsonGrinder.PolymorphExtractorType
PolymorphExtractor

Extracts to a Mill.ProductNode where each item is a result of different extractor.

Examples

julia> e = (NGramExtractor(), CategoricalExtractor(["tcp", "udp", "dhcp"])) |> PolymorphExtractor
 PolymorphExtractor
   ├── NGramExtractor(n=3, b=256, m=2053)
   ╰── CategoricalExtractor(n=4)
@@ -121,4 +121,4 @@
 julia> e("http")
 ProductNode  1 obs, 0 bytes
   ├── ArrayNode(2053×1 NGramMatrix with Int64 elements)  1 obs, 92 bytes
-  ╰── ArrayNode(4×1 OneHotArray with Bool elements)  1 obs, 60 bytes
source
+  ╰── ArrayNode(4×1 OneHotArray with Bool elements)  1 obs, 60 bytes
source
diff --git a/dev/api/schema/index.html b/dev/api/schema/index.html
index 9de233e..78f3fe1 100644
--- a/dev/api/schema/index.html
+++ b/dev/api/schema/index.html
@@ -1,2 +1,2 @@
-Schema · JsonGrinder.jl

+Schema · JsonGrinder.jl

Schema API

Index

API

Base.mergeFunction
merge(schemas...)

Merge multiple schemas into one.

Useful, for example, for aggregating partial results when the schema computation is distributed across multiple workers.

See also: merge!, schema.
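For instance, each worker can build a schema over its own shard of the documents and the results can be merged afterwards (a hedged sketch; `parts` is a hypothetical partition of the documents):

```julia
# `parts` is a hypothetical iterable of document batches, one per worker.
partial_schemas = [schema(p) for p in parts]
sch = merge(partial_schemas...)
```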

source
diff --git a/dev/api/utilities/index.html b/dev/api/utilities/index.html
index 4b18682..8d1b45b 100644
--- a/dev/api/utilities/index.html
+++ b/dev/api/utilities/index.html
@@ -1,9 +1,9 @@
-Utilities API · JsonGrinder.jl

+Utilities API · JsonGrinder.jl

Utilities API

Index

API

Mill.reflectinmodelFunction
reflectinmodel(sch::Schema, ex::Extractor, args...; kwargs...)

Using schema sch and extractor ex, first create a representative sample and then call Mill.reflectinmodel.
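A typical call chain might look as follows (a hedged sketch; `samples` stands for a collection of parsed JSON documents):

```julia
# Assuming `samples` is a collection of parsed JSON documents:
sch = schema(samples)
e = suggestextractor(sch)
model = reflectinmodel(sch, e)  # a Mill.jl model matching the extractor's output
```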

source
JsonGrinder.remove_nullsFunction
remove_nulls(js)

Return a new document in which all null values (represented as nothing in Julia) are removed.

Examples

julia> remove_nulls(Dict("a" => 1, "b" => nothing))
 Dict{String, Union{Nothing, Int64}} with 1 entry:
   "a" => 1
 
 julia> [nothing, Dict("a" => 1), nothing, Dict("a" => nothing)] |> remove_nulls
 2-element Vector{Dict{String}}:
  Dict("a" => 1)
- Dict{String, Nothing}()
+ Dict{String, Nothing}()
source
diff --git a/dev/citation/index.html b/dev/citation/index.html index b94cad8..942c644 100644 --- a/dev/citation/index.html +++ b/dev/citation/index.html @@ -21,4 +21,4 @@ title = {JsonGrinder.jl: a flexible library for automated feature engineering and conversion of JSONs to Mill.jl structures}, url = {https://github.com/CTUAvastLab/JsonGrinder.jl}, version = {...}, -} +} diff --git a/dev/examples/mutagenesis/Manifest.toml b/dev/examples/mutagenesis/Manifest.toml index 8ec6ce3..877ff1c 100644 --- a/dev/examples/mutagenesis/Manifest.toml +++ b/dev/examples/mutagenesis/Manifest.toml @@ -42,9 +42,9 @@ version = "0.1.38" [[deps.Adapt]] deps = ["LinearAlgebra", "Requires"] -git-tree-sha1 = "d80af0733c99ea80575f612813fa6aa71022d33a" +git-tree-sha1 = "50c3c56a52972d78e8be9fd135bfb91c9574c140" uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e" -version = "4.1.0" +version = "4.1.1" weakdeps = ["StaticArrays"] [deps.Adapt.extensions] @@ -288,9 +288,9 @@ version = "0.12.32" [[deps.Flux]] deps = ["Adapt", "ChainRulesCore", "Compat", "Functors", "LinearAlgebra", "MLDataDevices", "MLUtils", "MacroTools", "NNlib", "OneHotArrays", "Optimisers", "Preferences", "ProgressLogging", "Random", "Reexport", "Setfield", "SparseArrays", "SpecialFunctions", "Statistics", "Zygote"] -git-tree-sha1 = "37fa32a50c69c10c6ea1465d3054d98c75bd7777" +git-tree-sha1 = "df520a0727f843576801a0294f5be1a94be28e23" uuid = "587475ba-b771-5e3f-ad9e-33799f191a9c" -version = "0.14.22" +version = "0.14.25" [deps.Flux.extensions] FluxAMDGPUExt = "AMDGPU" @@ -299,22 +299,20 @@ version = "0.14.22" FluxEnzymeExt = "Enzyme" FluxMPIExt = "MPI" FluxMPINCCLExt = ["CUDA", "MPI", "NCCL"] - FluxMetalExt = "Metal" [deps.Flux.weakdeps] AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e" CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba" Enzyme = "7da242da-08ed-463a-9acd-ee780be4f1d9" MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195" - Metal = "dde4c033-4e86-420c-a63e-0dd931031962" NCCL = "3fe64909-d7a1-4096-9b7d-7a0f12cf0f6b" cuDNN = 
"02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd" [[deps.ForwardDiff]] deps = ["CommonSubexpressions", "DiffResults", "DiffRules", "LinearAlgebra", "LogExpFunctions", "NaNMath", "Preferences", "Printf", "Random", "SpecialFunctions"] -git-tree-sha1 = "cf0fe81336da9fb90944683b8c41984b08793dad" +git-tree-sha1 = "a9ce73d3c827adab2d70bf168aaece8cce196898" uuid = "f6369f11-7733-5829-9624-2563aa707210" -version = "0.10.36" +version = "0.10.37" weakdeps = ["StaticArrays"] [deps.ForwardDiff.extensions] @@ -417,9 +415,9 @@ version = "0.21.4" [[deps.JsonGrinder]] deps = ["Accessors", "Compat", "HierarchicalUtils", "MacroTools", "Mill", "OneHotArrays", "Preferences", "SHA"] -git-tree-sha1 = "3d8ec35eefee7e027637b88816f7f32f52b81770" +git-tree-sha1 = "d03aa1b2b8cacbd8333017d8c2d26dd8bc7a2793" uuid = "d201646e-a9c0-11e8-1063-23b139159713" -version = "2.5.5" +version = "2.6.0" [[deps.JuliaVariables]] deps = ["MLStyle", "NameResolution"] @@ -600,9 +598,9 @@ version = "0.2.0" [[deps.Mill]] deps = ["Accessors", "ChainRulesCore", "Combinatorics", "Compat", "DataFrames", "DataStructures", "FiniteDifferences", "Flux", "HierarchicalUtils", "LinearAlgebra", "MLUtils", "MacroTools", "OneHotArrays", "PooledArrays", "Preferences", "SparseArrays", "Statistics", "Test"] -git-tree-sha1 = "924d500a23b70bbd647f55dec15c55e3674e80cc" +git-tree-sha1 = "89c6327c121b1141d8da12aa67882cf9bf2a3da5" uuid = "1d0525e4-8992-11e8-313c-e310e1f6ddea" -version = "2.10.6" +version = "2.11.0" [[deps.Missings]] deps = ["DataAPI"] @@ -780,9 +778,9 @@ version = "0.7.0" [[deps.SentinelArrays]] deps = ["Dates", "Random"] -git-tree-sha1 = "305becf8af67eae1dbc912ee9097f00aeeabb8d5" +git-tree-sha1 = "d0553ce4031a081cc42387a9b9c8441b7d99f32d" uuid = "91c51154-3ec4-41a3-a24f-3f23e20d615c" -version = "1.4.6" +version = "1.4.7" [[deps.Serialization]] uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b" diff --git a/dev/examples/mutagenesis/mutagenesis.ipynb b/dev/examples/mutagenesis/mutagenesis.ipynb index 109b503..64239b0 100644 --- 
a/dev/examples/mutagenesis/mutagenesis.ipynb +++ b/dev/examples/mutagenesis/mutagenesis.ipynb @@ -36,11 +36,11 @@ "text": [ " Activating project at `~/work/JsonGrinder.jl/JsonGrinder.jl/docs/src/examples/mutagenesis`\n", "Status `~/work/JsonGrinder.jl/JsonGrinder.jl/docs/src/examples/mutagenesis/Project.toml`\n", - " [587475ba] Flux v0.14.22\n", + " [587475ba] Flux v0.14.25\n", " [682c06a0] JSON v0.21.4\n", - " [d201646e] JsonGrinder v2.5.5\n", + " [d201646e] JsonGrinder v2.6.0\n", " [f1d291b0] MLUtils v0.4.4\n", - " [1d0525e4] Mill v2.10.6\n" + " [1d0525e4] Mill v2.11.0\n" ] } ], diff --git a/dev/examples/mutagenesis/mutagenesis/index.html b/dev/examples/mutagenesis/mutagenesis/index.html index 9fb0654..c3913d0 100644 --- a/dev/examples/mutagenesis/mutagenesis/index.html +++ b/dev/examples/mutagenesis/mutagenesis/index.html @@ -144,4 +144,4 @@ ┌ Info: Epoch 9 └ accuracy = 0.82 ┌ Info: Epoch 10 -└ accuracy = 0.82

+└ accuracy = 0.82

We can compute the accuracy on the testing set now:

accuracy(pred(model, x_test), y_test)
0.8636363636363636
diff --git a/dev/examples/recipes/Manifest.toml b/dev/examples/recipes/Manifest.toml index 18d6269..eca2c27 100644 --- a/dev/examples/recipes/Manifest.toml +++ b/dev/examples/recipes/Manifest.toml @@ -42,9 +42,9 @@ version = "0.1.38" [[deps.Adapt]] deps = ["LinearAlgebra", "Requires"] -git-tree-sha1 = "d80af0733c99ea80575f612813fa6aa71022d33a" +git-tree-sha1 = "50c3c56a52972d78e8be9fd135bfb91c9574c140" uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e" -version = "4.1.0" +version = "4.1.1" weakdeps = ["StaticArrays"] [deps.Adapt.extensions] @@ -288,9 +288,9 @@ version = "0.12.32" [[deps.Flux]] deps = ["Adapt", "ChainRulesCore", "Compat", "Functors", "LinearAlgebra", "MLDataDevices", "MLUtils", "MacroTools", "NNlib", "OneHotArrays", "Optimisers", "Preferences", "ProgressLogging", "Random", "Reexport", "Setfield", "SparseArrays", "SpecialFunctions", "Statistics", "Zygote"] -git-tree-sha1 = "37fa32a50c69c10c6ea1465d3054d98c75bd7777" +git-tree-sha1 = "df520a0727f843576801a0294f5be1a94be28e23" uuid = "587475ba-b771-5e3f-ad9e-33799f191a9c" -version = "0.14.22" +version = "0.14.25" [deps.Flux.extensions] FluxAMDGPUExt = "AMDGPU" @@ -299,22 +299,20 @@ version = "0.14.22" FluxEnzymeExt = "Enzyme" FluxMPIExt = "MPI" FluxMPINCCLExt = ["CUDA", "MPI", "NCCL"] - FluxMetalExt = "Metal" [deps.Flux.weakdeps] AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e" CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba" Enzyme = "7da242da-08ed-463a-9acd-ee780be4f1d9" MPI = "da04e1cc-30fd-572f-bb4f-1f8673147195" - Metal = "dde4c033-4e86-420c-a63e-0dd931031962" NCCL = "3fe64909-d7a1-4096-9b7d-7a0f12cf0f6b" cuDNN = "02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd" [[deps.ForwardDiff]] deps = ["CommonSubexpressions", "DiffResults", "DiffRules", "LinearAlgebra", "LogExpFunctions", "NaNMath", "Preferences", "Printf", "Random", "SpecialFunctions"] -git-tree-sha1 = "cf0fe81336da9fb90944683b8c41984b08793dad" +git-tree-sha1 = "a9ce73d3c827adab2d70bf168aaece8cce196898" uuid = "f6369f11-7733-5829-9624-2563aa707210" -version = 
"0.10.36" +version = "0.10.37" weakdeps = ["StaticArrays"] [deps.ForwardDiff.extensions] @@ -423,9 +421,9 @@ version = "1.14.1" [[deps.JsonGrinder]] deps = ["Accessors", "Compat", "HierarchicalUtils", "MacroTools", "Mill", "OneHotArrays", "Preferences", "SHA"] -git-tree-sha1 = "3d8ec35eefee7e027637b88816f7f32f52b81770" +git-tree-sha1 = "d03aa1b2b8cacbd8333017d8c2d26dd8bc7a2793" uuid = "d201646e-a9c0-11e8-1063-23b139159713" -version = "2.5.5" +version = "2.6.0" [[deps.JuliaVariables]] deps = ["MLStyle", "NameResolution"] @@ -606,9 +604,9 @@ version = "0.2.0" [[deps.Mill]] deps = ["Accessors", "ChainRulesCore", "Combinatorics", "Compat", "DataFrames", "DataStructures", "FiniteDifferences", "Flux", "HierarchicalUtils", "LinearAlgebra", "MLUtils", "MacroTools", "OneHotArrays", "PooledArrays", "Preferences", "SparseArrays", "Statistics", "Test"] -git-tree-sha1 = "924d500a23b70bbd647f55dec15c55e3674e80cc" +git-tree-sha1 = "89c6327c121b1141d8da12aa67882cf9bf2a3da5" uuid = "1d0525e4-8992-11e8-313c-e310e1f6ddea" -version = "2.10.6" +version = "2.11.0" [[deps.Missings]] deps = ["DataAPI"] @@ -786,9 +784,9 @@ version = "0.7.0" [[deps.SentinelArrays]] deps = ["Dates", "Random"] -git-tree-sha1 = "305becf8af67eae1dbc912ee9097f00aeeabb8d5" +git-tree-sha1 = "d0553ce4031a081cc42387a9b9c8441b7d99f32d" uuid = "91c51154-3ec4-41a3-a24f-3f23e20d615c" -version = "1.4.6" +version = "1.4.7" [[deps.Serialization]] uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b" diff --git a/dev/examples/recipes/recipes.ipynb b/dev/examples/recipes/recipes.ipynb index bcce73c..84dfb92 100644 --- a/dev/examples/recipes/recipes.ipynb +++ b/dev/examples/recipes/recipes.ipynb @@ -37,11 +37,11 @@ "text": [ " Activating project at `~/work/JsonGrinder.jl/JsonGrinder.jl/docs/src/examples/recipes`\n", "Status `~/work/JsonGrinder.jl/JsonGrinder.jl/docs/src/examples/recipes/Project.toml`\n", - " [587475ba] Flux v0.14.22\n", + " [587475ba] Flux v0.14.25\n", " [0f8b85d8] JSON3 v1.14.1\n", - " [d201646e] JsonGrinder 
v2.5.5\n", + " [d201646e] JsonGrinder v2.6.0\n", " [f1d291b0] MLUtils v0.4.4\n", - " [1d0525e4] Mill v2.10.6\n", + " [1d0525e4] Mill v2.11.0\n", " [0b1bfda6] OneHotArrays v0.2.5\n" ] } diff --git a/dev/examples/recipes/recipes/index.html b/dev/examples/recipes/recipes/index.html index c960ce9..66ebb01 100644 --- a/dev/examples/recipes/recipes/index.html +++ b/dev/examples/recipes/recipes/index.html @@ -147,4 +147,4 @@ ┌ Info: Epoch 19 └ accuracy = 0.995 ┌ Info: Epoch 20 -└ accuracy = 0.9965

+└ accuracy = 0.9965

Finally, let's measure the testing accuracy. In this case, the classifier is overfitted:

accuracy(model(extract(e, jss_test)), y_test)
0.66
diff --git a/dev/index.html b/dev/index.html
index dc4fca5..042bd3f 100644
--- a/dev/index.html
+++ b/dev/index.html
@@ -1,3 +1,3 @@
 Home · JsonGrinder.jl
 JsonGrinder.jl logo
-JsonGrinder.jl logo

+JsonGrinder.jl logo

JsonGrinder.jl is a library that facilitates processing of JSON documents into Mill.jl structures for machine learning. It provides functionality for JSON schema inference, extraction of JSON documents to a suitable representation for machine learning, and constructing a model operating on this data.

Watch our introductory talk from JuliaCon 2021.

Installation

Run the following in the Julia REPL:

] add JsonGrinder

Julia v1.9 or later is required.

Getting started

For the quickest start, see the Mutagenesis example.

diff --git a/dev/manual/extraction/index.html b/dev/manual/extraction/index.html
index 7f701ff..24756e8 100644
--- a/dev/manual/extraction/index.html
+++ b/dev/manual/extraction/index.html
@@ -74,4 +74,4 @@
   ├── a: StableExtractor(CategoricalExtractor(n=2))
   ╰── b: StableExtractor(CategoricalExtractor(n=2))
julia> e_stable(jss[2])
ProductNode  1 obs, 0 bytes
  ├── a: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements)  1 o
-  ╰── b: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements)  1 o
+  ╰── b: ArrayNode(2×1 MaybeHotMatrix with Union{Missing, Bool} elements)  1 o
diff --git a/dev/manual/schema_inference/index.html b/dev/manual/schema_inference/index.html
index f4368eb..aa4d048 100644
--- a/dev/manual/schema_inference/index.html
+++ b/dev/manual/schema_inference/index.html
@@ -89,4 +89,4 @@
 ╰── LeafEntry (2 unique `Real` values) 2x updated
julia> schema(remove_nulls ∘ JSON.parse, [
            """ {"a": {"b": null} } """
        ])DictEntry 1x updated
-  ╰── a: DictEntry 1x updated
+  ╰── a: DictEntry 1x updated
diff --git a/dev/motivation/index.html b/dev/motivation/index.html
index cb7beab..16d821d 100644
--- a/dev/motivation/index.html
+++ b/dev/motivation/index.html
@@ -27,4 +27,4 @@
 }
 ]
 }

We would like to predict the mutagenicity on Salmonella typhimurium of such molecules (the molecule above, for instance, is mutagenic).

The majority of machine learning libraries assume that data comes in the form of fixed-dimension tensors (such as vectors or images) or sequences of such tensors.

In contrast, JsonGrinder.jl only requires your data to be stored in the flexible JSON format, and tries to automate most of the labor using reasonable defaults, while still giving you the option to control and tweak almost everything. JsonGrinder.jl is built on top of Mill.jl, which itself is built on top of Flux.jl.

Other formats

Although JsonGrinder was designed for JSON files, it can easily be adapted for XML, Protocol Buffers, MessagePack, and other similar formats.
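This works because schema and the extractors operate on parsed Julia structures (Dicts, Arrays, and leaf values) rather than on raw JSON text, so any parser producing such structures can feed the pipeline. A hedged sketch using MsgPack.jl (chosen here purely for illustration, not a JsonGrinder dependency; `msgpack_blobs` is a placeholder):

```julia
using MsgPack  # assumption: MsgPack.unpack yields Dict/Array structures like JSON.parse

docs = [MsgPack.unpack(blob) for blob in msgpack_blobs]  # `msgpack_blobs`: raw payloads
sch = schema(docs)
```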

Pipeline structure

A standard JsonGrinder.jl pipeline usually consists of five steps:

  1. Create a schema of JSON documents (using schema).
  2. From this schema, create an extractor converting JSONs to Mill.jl structures (e.g. using suggestextractor).
  3. Extract your JSON documents into Mill.jl structures with the extractor (e.g. with extract). If all data fits into memory, extract everything at once; otherwise, extract on demand during training.
  4. Define a suitable model (e.g. using Mill.reflectinmodel).
  5. Train the model; the library is 100% compatible with Flux.jl tooling.

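The steps above can be sketched end to end (a hedged sketch; `jsons` and the training specifics are placeholders):

```julia
using JsonGrinder, Mill, Flux

sch = schema(jsons)             # 1. infer the schema (`jsons`: parsed documents)
e = suggestextractor(sch)       # 2. derive an extractor from the schema
x = extract(e, jsons)           # 3. extract into Mill.jl structures
model = reflectinmodel(sch, e)  # 4. build a matching model
# 5. train `model` on `x` with standard Flux.jl tooling (loss and labels are task-specific)
```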
The basic workflow can be visualized as follows:

JsonGrinder workflow -JsonGrinder workflow

+JsonGrinder workflow

The framework is able to process hierarchical JSON documents of any schema, embedding the documents into vectors. The embeddings can be used for classification, regression, and other ML tasks. Thanks to Mill.jl, models can handle missing values at all levels.

See Mutagenesis for a complete example of processing JSONs like the one above, including code.

diff --git a/dev/tools/hierarchical/index.html b/dev/tools/hierarchical/index.html
index adcc194..2d3e9d3 100644
--- a/dev/tools/hierarchical/index.html
+++ b/dev/tools/hierarchical/index.html
@@ -35,4 +35,4 @@
 "M"
 "O"
 "k"

We can even get Accessors.jl optics:

julia> optic = code2lens(sch, "M") |> only
(@o _.children[:a].children[:c])

which can be used to access the nodes as well (among many other operations):

using Accessors
julia> getall(sch, optic) |> only
ArrayEntry 2x updated
-  ╰── LeafEntry (2 unique `Real` values) 2x updated
+ ╰── LeafEntry (2 unique `Real` values) 2x updated
Further reading

For the complete showcase of possibilities, refer to the HierarchicalUtils.jl manual.

diff --git a/dev/tools/hyperopt/index.html b/dev/tools/hyperopt/index.html index 4a025d2..64d7615 100644 --- a/dev/tools/hyperopt/index.html +++ b/dev/tools/hyperopt/index.html @@ -47,7 +47,7 @@ activation = [identity, relu, tanh] model = train_model(epochs, batchsize, d, layers, activation) accuracy(pred(model, x_val), y_val) - end Hyperoptimizing 4%|█▌ | ETA: 0:01:31 Hyperoptimizing 6%|██▏ | ETA: 0:02:09 Hyperoptimizing 8%|██▉ | ETA: 0:01:39 Hyperoptimizing 10%|███▋ | ETA: 0:01:23 Hyperoptimizing 12%|████▍ | ETA: 0:01:13 Hyperoptimizing 14%|█████ | ETA: 0:01:23 Hyperoptimizing 16%|█████▊ | ETA: 0:01:12 Hyperoptimizing 18%|██████▌ | ETA: 0:01:04 Hyperoptimizing 20%|███████▎ | ETA: 0:00:57 Hyperoptimizing 22%|███████▉ | ETA: 0:01:09 Hyperoptimizing 24%|████████▋ | ETA: 0:01:11 Hyperoptimizing 26%|█████████▍ | ETA: 0:01:04 Hyperoptimizing 28%|██████████▏ | ETA: 0:00:59 Hyperoptimizing 30%|██████████▊ | ETA: 0:00:54 Hyperoptimizing 32%|███████████▌ | ETA: 0:00:50 Hyperoptimizing 34%|████████████▎ | ETA: 0:00:47 Hyperoptimizing 36%|█████████████ | ETA: 0:00:43 Hyperoptimizing 38%|█████████████▋ | ETA: 0:00:45 Hyperoptimizing 40%|██████████████▍ | ETA: 0:00:42 Hyperoptimizing 42%|███████████████▏ | ETA: 0:00:43 Hyperoptimizing 44%|███████████████▉ | ETA: 0:00:41 Hyperoptimizing 46%|████████████████▌ | ETA: 0:00:38 Hyperoptimizing 48%|█████████████████▎ | ETA: 0:00:38 Hyperoptimizing 50%|██████████████████ | ETA: 0:00:36 Hyperoptimizing 52%|██████████████████▊ | ETA: 0:00:34 Hyperoptimizing 54%|███████████████████▌ | ETA: 0:00:31 Hyperoptimizing 56%|████████████████████▏ | ETA: 0:00:29 Hyperoptimizing 58%|████████████████████▉ | ETA: 0:00:27 Hyperoptimizing 60%|█████████████████████▋ | ETA: 0:00:25 Hyperoptimizing 62%|██████████████████████▍ | ETA: 0:00:24 Hyperoptimizing 64%|███████████████████████ | ETA: 0:00:22 Hyperoptimizing 66%|███████████████████████▊ | ETA: 0:00:20 Hyperoptimizing 68%|████████████████████████▌ | ETA: 0:00:18 Hyperoptimizing 
70%|█████████████████████████▎ | ETA: 0:00:17 Hyperoptimizing 72%|█████████████████████████▉ | ETA: 0:00:16 Hyperoptimizing 74%|██████████████████████████▋ | ETA: 0:00:14 Hyperoptimizing 76%|███████████████████████████▍ | ETA: 0:00:13 Hyperoptimizing 78%|████████████████████████████▏ | ETA: 0:00:12 Hyperoptimizing 80%|████████████████████████████▊ | ETA: 0:00:11 Hyperoptimizing 82%|█████████████████████████████▌ | ETA: 0:00:09 Hyperoptimizing 84%|██████████████████████████████▎ | ETA: 0:00:08 Hyperoptimizing 86%|███████████████████████████████ | ETA: 0:00:07 Hyperoptimizing 88%|███████████████████████████████▋ | ETA: 0:00:06 Hyperoptimizing 90%|████████████████████████████████▍ | ETA: 0:00:05 Hyperoptimizing 92%|█████████████████████████████████▏ | ETA: 0:00:04 Hyperoptimizing 94%|█████████████████████████████████▉ | ETA: 0:00:03 Hyperoptimizing 96%|██████████████████████████████████▌ | ETA: 0:00:02 Hyperoptimizing 98%|███████████████████████████████████▎| ETA: 0:00:01 Hyperoptimizing 100%|████████████████████████████████████| Time: 0:00:46 + end Hyperoptimizing 4%|█▌ | ETA: 0:01:30 Hyperoptimizing 6%|██▏ | ETA: 0:02:08 Hyperoptimizing 8%|██▉ | ETA: 0:01:38 Hyperoptimizing 10%|███▋ | ETA: 0:01:22 Hyperoptimizing 12%|████▍ | ETA: 0:01:13 Hyperoptimizing 14%|█████ | ETA: 0:01:22 Hyperoptimizing 16%|█████▊ | ETA: 0:01:11 Hyperoptimizing 18%|██████▌ | ETA: 0:01:03 Hyperoptimizing 20%|███████▎ | ETA: 0:00:57 Hyperoptimizing 22%|███████▉ | ETA: 0:01:08 Hyperoptimizing 24%|████████▋ | ETA: 0:01:10 Hyperoptimizing 26%|█████████▍ | ETA: 0:01:03 Hyperoptimizing 28%|██████████▏ | ETA: 0:00:58 Hyperoptimizing 30%|██████████▊ | ETA: 0:00:53 Hyperoptimizing 32%|███████████▌ | ETA: 0:00:49 Hyperoptimizing 34%|████████████▎ | ETA: 0:00:46 Hyperoptimizing 36%|█████████████ | ETA: 0:00:43 Hyperoptimizing 38%|█████████████▋ | ETA: 0:00:45 Hyperoptimizing 40%|██████████████▍ | ETA: 0:00:41 Hyperoptimizing 42%|███████████████▏ | ETA: 0:00:42 Hyperoptimizing 44%|███████████████▉ | ETA: 
0:00:39 Hyperoptimizing 46%|████████████████▌ | ETA: 0:00:37 Hyperoptimizing 48%|█████████████████▎ | ETA: 0:00:37 Hyperoptimizing 50%|██████████████████ | ETA: 0:00:35 Hyperoptimizing 52%|██████████████████▊ | ETA: 0:00:33 Hyperoptimizing 54%|███████████████████▌ | ETA: 0:00:31 Hyperoptimizing 56%|████████████████████▏ | ETA: 0:00:29 Hyperoptimizing 58%|████████████████████▉ | ETA: 0:00:27 Hyperoptimizing 60%|█████████████████████▋ | ETA: 0:00:25 Hyperoptimizing 62%|██████████████████████▍ | ETA: 0:00:23 Hyperoptimizing 64%|███████████████████████ | ETA: 0:00:21 Hyperoptimizing 66%|███████████████████████▊ | ETA: 0:00:20 Hyperoptimizing 68%|████████████████████████▌ | ETA: 0:00:18 Hyperoptimizing 70%|█████████████████████████▎ | ETA: 0:00:17 Hyperoptimizing 72%|█████████████████████████▉ | ETA: 0:00:16 Hyperoptimizing 74%|██████████████████████████▋ | ETA: 0:00:14 Hyperoptimizing 76%|███████████████████████████▍ | ETA: 0:00:13 Hyperoptimizing 78%|████████████████████████████▏ | ETA: 0:00:12 Hyperoptimizing 80%|████████████████████████████▊ | ETA: 0:00:10 Hyperoptimizing 82%|█████████████████████████████▌ | ETA: 0:00:09 Hyperoptimizing 84%|██████████████████████████████▎ | ETA: 0:00:08 Hyperoptimizing 86%|███████████████████████████████ | ETA: 0:00:07 Hyperoptimizing 88%|███████████████████████████████▋ | ETA: 0:00:06 Hyperoptimizing 90%|████████████████████████████████▍ | ETA: 0:00:05 Hyperoptimizing 92%|█████████████████████████████████▏ | ETA: 0:00:04 Hyperoptimizing 94%|█████████████████████████████████▉ | ETA: 0:00:03 Hyperoptimizing 96%|██████████████████████████████████▌ | ETA: 0:00:02 Hyperoptimizing 98%|███████████████████████████████████▎| ETA: 0:00:01 Hyperoptimizing 100%|████████████████████████████████████| Time: 0:00:46 Hyperoptimizer with 1 length: [3, 5, 10] 2 length: [16, 32, 64] @@ -61,4 +61,4 @@ batchsize = 64 d = 64 layers = 3 -activation = identity

+activation = identity

Finally, we test the solution on the testing data:

julia> final_model = train_model(ho.maximizer...);
julia> accuracy(pred(final_model, x_test), y_test)
0.628

This concludes a very simple example of how to integrate JsonGrinder.jl with Hyperopt.jl. Note that we could, and should, go further and experiment not only with the hyperparameters presented here, but also with the definition of the schema and/or the extractor, which can also have a significant impact on the results.