add fast path for very small Dicts? #10907

jhlq · 2015-04-20T11:11:43Z

Was at first surprised that unique doesn't filter custom types even if they evaluate to equal by == but the helpdoc does say they should be equivalent and \equiv evaluates to is, however have noticed several times on both 0.3 and 0.4 that functions filling up an array with objects created through permutations and then return unique(result) will produce an array of variable length from a deterministic algorithm. The most recent encounter was with findpows (in this commit) before unique was replaced with uniquefilter. Have not had the handpower to unwrap a case into a neat isolated example and would not have opened this if not for the following unearthed feature of uniquefilter{T}(a::Array{T})=pushallunique!(T[],a):

julia> r=rand(Int,1000000)%3;

julia> @time for i in 1:1000;unique(r);end
elapsed time: 23.492532326 seconds (560000 bytes allocated)

julia> @time for i in 1:1000;Equations.uniquefilter(r);end
elapsed time: 11.805153784 seconds (160000 bytes allocated)

The text was updated successfully, but these errors were encountered:

nalimilan · 2015-04-20T12:01:13Z

The code used by unique is very close to what you use in uniquefilter. Just use e.g. @edit unique([1, 2]). An interesting difference is that unique uses an array for the return result, and a set to check whether the value has been encountered before. I guess the idea is that checking for the presence of an element in a set should be faster when there are many unique values, as sets are based on dicts (thus a lookup is O(1)) while a linear search in an array is O(n). When the number of unique values is low, using a dict is not efficient. But if I do r=rand(Int,1000000)%50, then unique is faster than uniquefilter. One could imagine giving the choice of two different algorithms, just like sort.

As regards unique being undeterministic, I'm afraid not much can be done unless you can provide an example (at least a complex one). It's worth noting that the docs could be improved to mention that "unique" and "equivalent" follow the definition of in, and thus of isequal, which is different from == and is.

toivoh · 2015-04-20T13:07:35Z

Have you defined a hash function that is consistent with your equality predicate? Otherwise sets and dicts won't work with the type.

jhlq · 2015-04-20T14:32:51Z

@nalimilan
Had a feeling it was like that, is there any way to make a fast check to see which version is likely to be faster without taking more time than is gained? Like check how many unique values are in a small random subset.

For a code example clone Equations.jl and:

quin@Nicol:~/uniquetest/Equations.jl/src$ git checkout b12345f10e311ed2d738e2b3ad5b9018a88b94e0
quin@Nicol:~/uniquetest/Equations.jl/src$ julia -q -L Equations -P 'using Equations;ex=:x*:y*3*:y*:x;term=addparse(ex)[1];sort!(term)'
julia> for i in 1:15;print(length(findpows(term)),' ');end
22 21 22 23 22 22 19 23 20 21 22 24 20 17 20

@toivoh
This is the defined equality: ==(p1::Pow,p2::Pow)=p1.x==p2.x&&p1.y==p2.y
What would be an appropriate hash function?

PythonNut · 2015-04-20T14:48:45Z

@jhlq

Does

hash(p::Pow) = hash(p.x) + hash(p.y)

work?

jhlq · 2015-04-20T15:03:36Z

Yep, thank you.

nalimilan · 2015-04-20T15:05:09Z

@jhlq You could check a subset manually, but I don't think that's a reasonable thing to do in the standard library function. For a particular application, you may be able to know in advance what the typical number of unique values will be.

Regarding hash, see its documentation. You should rather define Base.hash(x::Pow, h::UInt) = hash(hash(x.x, hash(x.y)), h) (or just define Pow as immutable if possible).

jhlq · 2015-04-20T15:44:04Z

Yeah previously there has been a lot of duplicates due to using permutations to match everything, have implemented expression sorting now so the location of numbers symbols and components are known which will make it possible to phase out permutations.

Will make a branch with immutables and see how that works out.

JeffBezanson · 2015-04-20T16:49:45Z

Maybe Dicts would benefit from a fast path that uses linear search for < 10 elements or so.

jhlq · 2015-04-20T18:16:03Z

Maybe Sets would benefit from not using Dicts?

julia> set.dict
Dict{Any,Void} with 1 entry:
  x => nothing

julia> @time a=nothing
elapsed time: 4.61e-6 seconds (80 bytes allocated)

Or are those 80 bytes already cared for?

simonster · 2015-04-20T18:27:08Z

There is only one nothing object, and referencing it does not allocate. The 80 bytes above are REPL/global scope overhead.

julia> f() = @time a = nothing;

julia> f()
elapsed time: 7.68e-7 seconds (0 bytes allocated)

jhlq · 2015-04-20T18:48:35Z

Ah, good. Still seems funny though that sets carry around a bunch of arrows connected to their contents all pointing at a location to signify that they don't point anywhere.

simonster · 2015-04-20T18:51:19Z

I don't think we do that, either:

julia> sizeof(Array(Void, 1000000))
0

JeffBezanson · 2015-04-20T18:59:33Z

Yes, Void has size zero, so using it as the value type effectively makes the value array go away completely.

jhlq · 2015-04-20T19:03:34Z

We most definately do.

julia> set.dict
Dict{Any,Void} with 1 entry:
  x => nothing

There it is, right between the x and the nothing. It may be a weightless arrow but it is not invisible and artistically isn't less redundancy more beautiful?

JeffBezanson · 2015-04-20T19:07:40Z

I agree that less redundancy is more beautiful, which is why Sets reuse Dicts instead of re-implementing the same functionality.

jhlq · 2015-04-20T19:14:15Z

How about if Dicts are split into layers of arrowless keys, arrowadders and values?

PythonNut · 2015-04-20T19:14:31Z

Maybe Dicts would benefit from a fast path that uses linear search for < 10 elements or so.

Would #8674 apply?

JeffBezanson · 2015-04-20T19:18:18Z

I think it would be an interesting experiment to implement a general set-of-tuples data structure, sorted and/or indexed, queryable in various ways, and see if it can be as fast as Set and Dict. I bet it's possible.

jhlq · 2015-04-20T19:26:05Z

Such an incitement to dig into the source code cannot go unheeded, the todolist has been expanded

mdcfrancis · 2015-05-06T12:54:52Z

I've just published (it is not in METADATA yet)

https://github.com/blackrock/NamedTuples.jl

This uses immutable types in place of small immutable dicts. This was written for a different purpose, but might help in this case. If you only use field accessors on the NamedTuples the performance is the same as regular field access.

KristofferC · 2015-05-06T13:40:18Z

@mdcfrancis That looks great!

lostanlen · 2015-11-11T23:00:57Z

NamedTuples are interesting because of immutability and property-based access. Yet, since they implement a tuple-like notion of equality (where order matters), they cannot be a replacement for immutable dicts.
Indeed:
@NT(a => 1, b => 2) == @NT(b => 2, a => 1)
returns false whereas
Dict(:a => 1, :b => 2) == Dict(:b => 2, :a => 1)
returns true.

My use case would be to use this data structure as key in a dictionary. Also, the fields are not symbols but double-ended queues, so there is no natural ordering to them.

JeffBezanson changed the title ~~Indeterministic unique(a) results and possible efficiency improvement~~ add fast path for very small Dicts? Apr 25, 2015

JeffBezanson added speculative Whether the change will be implemented is speculative performance Must go faster help wanted Indicates that a maintainer wants help on an issue or pull request labels Apr 25, 2015

StefanKarpinski added this to the 0.6.0 milestone Sep 13, 2016

JeffBezanson modified the milestones: 1.0, 0.6.0 Dec 28, 2016

JeffBezanson added this to the 1.x milestone May 2, 2017

JeffBezanson removed this from the 1.0 milestone May 2, 2017

DilumAluthge removed this from the 1.x milestone Mar 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add fast path for very small Dicts? #10907

add fast path for very small Dicts? #10907

jhlq commented Apr 20, 2015

nalimilan commented Apr 20, 2015

toivoh commented Apr 20, 2015

jhlq commented Apr 20, 2015

PythonNut commented Apr 20, 2015

jhlq commented Apr 20, 2015

nalimilan commented Apr 20, 2015

jhlq commented Apr 20, 2015

JeffBezanson commented Apr 20, 2015

jhlq commented Apr 20, 2015

simonster commented Apr 20, 2015

jhlq commented Apr 20, 2015

simonster commented Apr 20, 2015

JeffBezanson commented Apr 20, 2015

jhlq commented Apr 20, 2015

JeffBezanson commented Apr 20, 2015

jhlq commented Apr 20, 2015

PythonNut commented Apr 20, 2015

JeffBezanson commented Apr 20, 2015

jhlq commented Apr 20, 2015

mdcfrancis commented May 6, 2015

KristofferC commented May 6, 2015

lostanlen commented Nov 11, 2015

add fast path for very small Dicts? #10907

add fast path for very small Dicts? #10907

Comments

jhlq commented Apr 20, 2015

nalimilan commented Apr 20, 2015

toivoh commented Apr 20, 2015

jhlq commented Apr 20, 2015

PythonNut commented Apr 20, 2015

jhlq commented Apr 20, 2015

nalimilan commented Apr 20, 2015

jhlq commented Apr 20, 2015

JeffBezanson commented Apr 20, 2015

jhlq commented Apr 20, 2015

simonster commented Apr 20, 2015

jhlq commented Apr 20, 2015

simonster commented Apr 20, 2015

JeffBezanson commented Apr 20, 2015

jhlq commented Apr 20, 2015

JeffBezanson commented Apr 20, 2015

jhlq commented Apr 20, 2015

PythonNut commented Apr 20, 2015

JeffBezanson commented Apr 20, 2015

jhlq commented Apr 20, 2015

mdcfrancis commented May 6, 2015

KristofferC commented May 6, 2015

lostanlen commented Nov 11, 2015