Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add fast path for very small Dicts? #10907

Open
jhlq opened this issue Apr 20, 2015 · 22 comments
Open

add fast path for very small Dicts? #10907

jhlq opened this issue Apr 20, 2015 · 22 comments
Labels
help wanted Indicates that a maintainer wants help on an issue or pull request performance Must go faster speculative Whether the change will be implemented is speculative

Comments

@jhlq
Copy link

jhlq commented Apr 20, 2015

Was at first surprised that unique doesn't filter custom types even if they evaluate to equal by == but the helpdoc does say they should be equivalent and \equiv evaluates to is, however have noticed several times on both 0.3 and 0.4 that functions filling up an array with objects created through permutations and then return unique(result) will produce an array of variable length from a deterministic algorithm. The most recent encounter was with findpows (in this commit) before unique was replaced with uniquefilter. Have not had the handpower to unwrap a case into a neat isolated example and would not have opened this if not for the following unearthed feature of uniquefilter{T}(a::Array{T})=pushallunique!(T[],a):

julia> r=rand(Int,1000000)%3;

julia> @time for i in 1:1000;unique(r);end
elapsed time: 23.492532326 seconds (560000 bytes allocated)

julia> @time for i in 1:1000;Equations.uniquefilter(r);end
elapsed time: 11.805153784 seconds (160000 bytes allocated)
@nalimilan
Copy link
Member

The code used by unique is very close to what you use in uniquefilter. Just use e.g. @edit unique([1, 2]). An interesting difference is that unique uses an array for the return result, and a set to check whether the value has been encountered before. I guess the idea is that checking for the presence of an element in a set should be faster when there are many unique values, as sets are based on dicts (thus a lookup is O(1)) while a linear search in an array is O(n). When the number of unique values is low, using a dict is not efficient. But if I do r=rand(Int,1000000)%50, then unique is faster than uniquefilter. One could imagine giving the choice of two different algorithms, just like sort.

As regards unique being undeterministic, I'm afraid not much can be done unless you can provide an example (at least a complex one). It's worth noting that the docs could be improved to mention that "unique" and "equivalent" follow the definition of in, and thus of isequal, which is different from == and is.

@toivoh
Copy link
Contributor

toivoh commented Apr 20, 2015

Have you defined a hash function that is consistent with your equality predicate? Otherwise sets and dicts won't work with the type.

@jhlq
Copy link
Author

jhlq commented Apr 20, 2015

@nalimilan
Had a feeling it was like that, is there any way to make a fast check to see which version is likely to be faster without taking more time than is gained? Like check how many unique values are in a small random subset.

For a code example clone Equations.jl and:

quin@Nicol:~/uniquetest/Equations.jl/src$ git checkout b12345f10e311ed2d738e2b3ad5b9018a88b94e0
quin@Nicol:~/uniquetest/Equations.jl/src$ julia -q -L Equations -P 'using Equations;ex=:x*:y*3*:y*:x;term=addparse(ex)[1];sort!(term)'
julia> for i in 1:15;print(length(findpows(term)),' ');end
22 21 22 23 22 22 19 23 20 21 22 24 20 17 20

@toivoh
This is the defined equality: ==(p1::Pow,p2::Pow)=p1.x==p2.x&&p1.y==p2.y
What would be an appropriate hash function?

@PythonNut
Copy link
Contributor

@jhlq

Does

hash(p::Pow) = hash(p.x) + hash(p.y)

work?

@jhlq
Copy link
Author

jhlq commented Apr 20, 2015

Yep, thank you.

@nalimilan
Copy link
Member

@jhlq You could check a subset manually, but I don't think that's a reasonable thing to do in the standard library function. For a particular application, you may be able to know in advance what the typical number of unique values will be.

Regarding hash, see its documentation. You should rather define Base.hash(x::Pow, h::UInt) = hash(hash(x.x, hash(x.y)), h) (or just define Pow as immutable if possible).

@jhlq
Copy link
Author

jhlq commented Apr 20, 2015

Yeah previously there has been a lot of duplicates due to using permutations to match everything, have implemented expression sorting now so the location of numbers symbols and components are known which will make it possible to phase out permutations.

Will make a branch with immutables and see how that works out.

@JeffBezanson
Copy link
Member

Maybe Dicts would benefit from a fast path that uses linear search for < 10 elements or so.

@jhlq
Copy link
Author

jhlq commented Apr 20, 2015

Maybe Sets would benefit from not using Dicts?

julia> set.dict
Dict{Any,Void} with 1 entry:
  x => nothing

julia> @time a=nothing
elapsed time: 4.61e-6 seconds (80 bytes allocated)

Or are those 80 bytes already cared for?

@simonster
Copy link
Member

There is only one nothing object, and referencing it does not allocate. The 80 bytes above are REPL/global scope overhead.

julia> f() = @time a = nothing;

julia> f()
elapsed time: 7.68e-7 seconds (0 bytes allocated)

@jhlq
Copy link
Author

jhlq commented Apr 20, 2015

Ah, good. Still seems funny though that sets carry around a bunch of arrows connected to their contents all pointing at a location to signify that they don't point anywhere.

@simonster
Copy link
Member

I don't think we do that, either:

julia> sizeof(Array(Void, 1000000))
0

@JeffBezanson
Copy link
Member

Yes, Void has size zero, so using it as the value type effectively makes the value array go away completely.

@jhlq
Copy link
Author

jhlq commented Apr 20, 2015

We most definately do.

julia> set.dict
Dict{Any,Void} with 1 entry:
  x => nothing

There it is, right between the x and the nothing. It may be a weightless arrow but it is not invisible and artistically isn't less redundancy more beautiful?

@JeffBezanson
Copy link
Member

I agree that less redundancy is more beautiful, which is why Sets reuse Dicts instead of re-implementing the same functionality.

@jhlq
Copy link
Author

jhlq commented Apr 20, 2015

How about if Dicts are split into layers of arrowless keys, arrowadders and values?

@PythonNut
Copy link
Contributor

Maybe Dicts would benefit from a fast path that uses linear search for < 10 elements or so.

Would #8674 apply?

@JeffBezanson
Copy link
Member

I think it would be an interesting experiment to implement a general set-of-tuples data structure, sorted and/or indexed, queryable in various ways, and see if it can be as fast as Set and Dict. I bet it's possible.

@jhlq
Copy link
Author

jhlq commented Apr 20, 2015

Such an incitement to dig into the source code cannot go unheeded, the todolist has been expanded

@JeffBezanson JeffBezanson changed the title Indeterministic unique(a) results and possible efficiency improvement add fast path for very small Dicts? Apr 25, 2015
@JeffBezanson JeffBezanson added speculative Whether the change will be implemented is speculative performance Must go faster help wanted Indicates that a maintainer wants help on an issue or pull request labels Apr 25, 2015
@mdcfrancis
Copy link

I've just published (it is not in METADATA yet)

https://github.com/blackrock/NamedTuples.jl

This uses immutable types in place of small immutable dicts. This was written for a different purpose, but might help in this case. If you only use field accessors on the NamedTuples the performance is the same as regular field access.

@KristofferC
Copy link
Member

@mdcfrancis That looks great!

@lostanlen
Copy link

NamedTuples are interesting because of immutability and property-based access. Yet, since they implement a tuple-like notion of equality (where order matters), they cannot be a replacement for immutable dicts.
Indeed:
@NT(a => 1, b => 2) == @NT(b => 2, a => 1)
returns false whereas
Dict(:a => 1, :b => 2) == Dict(:b => 2, :a => 1)
returns true.

My use case would be to use this data structure as key in a dictionary. Also, the fields are not symbols but double-ended queues, so there is no natural ordering to them.

@StefanKarpinski StefanKarpinski added this to the 0.6.0 milestone Sep 13, 2016
@JeffBezanson JeffBezanson modified the milestones: 1.0, 0.6.0 Dec 28, 2016
@JeffBezanson JeffBezanson added this to the 1.x milestone May 2, 2017
@JeffBezanson JeffBezanson removed this from the 1.0 milestone May 2, 2017
@DilumAluthge DilumAluthge removed this from the 1.x milestone Mar 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Indicates that a maintainer wants help on an issue or pull request performance Must go faster speculative Whether the change will be implemented is speculative
Projects
None yet
Development

No branches or pull requests