Needs a lot more documentation! (some example .cpp files would help a lot)
High priority: systematic unit test of every kernel and inline operator!
I haven't tested on old machines in a long time!
Phase out smask_t in favor of simd_t<T,S>::iscalar_type. (Also smask_ntuple.)
Phase out blendv() in favor of simd_if().
Phase out align() in favor of simd_align().
Double <-> int64_t conversions silently fail for magnitudes beyond ~2^52 or so, and it's not obvious how to fix this (see FIXME in convert.hpp).
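For context on where a limit near 2^52 can come from: SSE/AVX2 have no packed double <-> int64 conversion instruction, and a common workaround is the "magic number" trick. The scalar sketch below illustrates that trick and its range limit. It is only an assumption that convert.hpp does something along these lines, and the names `magic_double_to_int64`, `magic_int64_to_double`, `kMagic` and `kMagicBits` are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// 1.5 * 2^52. Adding this to a double d with |d| < 2^51 forces the sum into
// [2^52, 2^53), where consecutive doubles are spaced exactly 1 apart.
static const double  kMagic     = 6755399441055744.0;
static const int64_t kMagicBits = 0x4338000000000000LL;   // bit pattern of kMagic

// Valid only for |d| < 2^51 (rounds to nearest in the default rounding mode).
// Outside that range the result is silently wrong.
inline int64_t magic_double_to_int64(double d)
{
    double shifted = d + kMagic;
    int64_t bits;
    std::memcpy(&bits, &shifted, sizeof(bits));   // reinterpret the bits, no conversion
    return bits - kMagicBits;
}

// Valid only for -2^51 <= n < 2^51.
inline double magic_int64_to_double(int64_t n)
{
    int64_t bits = n + kMagicBits;
    double shifted;
    std::memcpy(&shifted, &bits, sizeof(shifted));
    return shifted - kMagic;
}
```

In range, `magic_double_to_int64(-7.0)` gives -7. Passing 2^53 returns a different integer without any error, which is the "silent failure" flavor described above.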
Unit test simd_t<T,S>::round(). Should also understand x86 "rounding mode".
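To make the "rounding mode" point concrete, here is a small scalar sketch using the standard `<cfenv>` interface. It shows that rounding to an integer with the current mode gives different answers depending on that mode. Whether simd_t<T,S>::round() follows the current mode or a fixed one is not something I know, and the helper name `round_with_mode` is hypothetical.

```cpp
#include <cfenv>
#include <cmath>

// Rounds x to an integer value under the given rounding mode
// (FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO),
// then restores whatever mode was active before the call.
inline double round_with_mode(double x, int mode)
{
    const int saved = std::fegetround();
    std::fesetround(mode);

    // volatile discourages the compiler from folding or moving the rounding
    // across the fesetround() calls.
    volatile double in = x;
    volatile double out = std::nearbyint(in);

    std::fesetround(saved);
    return out;
}
```

A unit test for round() could loop over the four modes and compare each SIMD lane against a scalar reference like this one.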
Aligned/streaming load/store flags will be implemented soon! It might be useful to move my memory bandwidth profiling code to this github repo.
Fused multiply/adds would be good to implement soon. (I'd like to play with these and see if they can improve some of my existing kernels!)
Half-float loads/stores. (Needed soon for bonsai)
Align operations. (Needed soon for bonsai)
Horizontal reducing min/max. (Needed soon for bonsai)
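A minimal sketch of a horizontal max for four packed floats, using only SSE intrinsics. This is one standard shuffle-and-max pattern, not necessarily how it would end up being wrapped in simd_t; the function name `horizontal_max_ps` is hypothetical.

```cpp
#include <xmmintrin.h>   // SSE

// Returns the largest of the four floats in x.
inline float horizontal_max_ps(__m128 x)
{
    // Swap the two 64-bit halves: [x2, x3, x0, x1]
    __m128 y = _mm_shuffle_ps(x, x, _MM_SHUFFLE(1, 0, 3, 2));
    x = _mm_max_ps(x, y);

    // Swap adjacent elements: [x1, x0, x3, x2]
    y = _mm_shuffle_ps(x, x, _MM_SHUFFLE(2, 3, 0, 1));
    x = _mm_max_ps(x, y);

    // Every lane now holds the maximum; return lane 0.
    return _mm_cvtss_f32(x);
}
```

A horizontal min is the same code with `_mm_min_ps` in place of `_mm_max_ps`.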
Is there a way to use Intel SVML? (https://software.intel.com/en-us/node/524289)
Most exp/log type operations are still unimplemented.
Hmm, I think my two versions of operator>> are slightly inconsistent... what a pain!
I think more syntactic sugar would be nice. Random example: simd_min(x,y) can be a synonym for x.min(y)
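A sketch of what that sugar could look like, as a free function template that forwards to the member. `toy_vec2` is a hypothetical stand-in used only so the example compiles on its own; the real version would presumably be written against simd_t<T,S>.

```cpp
// Hypothetical stand-in for a SIMD vector type that exposes a member min().
struct toy_vec2 {
    float a, b;

    toy_vec2 min(const toy_vec2 &y) const
    {
        return { a < y.a ? a : y.a, b < y.b ? b : y.b };
    }
};

// Free-function synonym: simd_min(x,y) just forwards to x.min(y).
template<typename V>
inline V simd_min(const V &x, const V &y)
{
    return x.min(y);
}
```

As written the template accepts any type with a suitable min() member. Restricting it to simd_t would avoid accidental matches.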
The _vertical_dot() type routines are confusing and could be improved.
Low-priority: define 'struct simd_ntuple_align_helper'.
Not all upsampling/downsampling kernels have been implemented. So far we only have:
- upsampling: float32, int32
- downsampling: float32
Currently there are many downsampling-type kernels which are nearly cut-and-paste equivalent. E.g., downsample(), downsample_max(), downsample_bitwise_or(). Should clean up by using template magic to eliminate redundancy!
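One possible shape for that cleanup, shown as a scalar sketch: a single kernel templated on the reduction operation. This assumes plain downsample() sums its inputs, which is a guess. The names `downsample_generic`, `op_add`, `op_max` and `op_bitwise_or` are hypothetical, and a real version would operate on simd_t rather than std::vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Each policy struct supplies the binary reduction for one kernel flavor.
struct op_add        { template<typename T> static T apply(T a, T b) { return a + b; } };
struct op_max        { template<typename T> static T apply(T a, T b) { return std::max(a, b); } };
struct op_bitwise_or { template<typename T> static T apply(T a, T b) { return a | b; } };

// Reduces each group of 'factor' consecutive elements to one output element.
// Assumes factor > 0; trailing elements that do not fill a group are dropped.
template<typename Op, typename T>
std::vector<T> downsample_generic(const std::vector<T> &src, std::size_t factor)
{
    std::vector<T> dst(src.size() / factor);

    for (std::size_t i = 0; i < dst.size(); i++) {
        T acc = src[i * factor];
        for (std::size_t j = 1; j < factor; j++)
            acc = Op::apply(acc, src[i * factor + j]);
        dst[i] = acc;
    }

    return dst;
}
```

With this, the three near-duplicate kernels become `downsample_generic<op_add>`, `downsample_generic<op_max>` and `downsample_generic<op_bitwise_or>`.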
For an integer type T, simd_t<T,S>::operator*() wraps the simplest possible multiplication intrinsic, but there are other possibilities. (E.g. _mm_mul_epi32() or _mm_mul_epu32() in addition to _mm_mullo_epi32() which corresponds to operator*()) These should be wrapped somehow as well.
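To illustrate how the widening variant differs from operator*(): `_mm_mul_epu32()` multiplies only the even-indexed 32-bit lanes (0 and 2), treats them as unsigned, and produces two full 64-bit products. The sketch below uses SSE2 only; the wrapper name `widening_mul_even_u32` and the struct `u64x2` are hypothetical.

```cpp
#include <cstdint>
#include <emmintrin.h>   // SSE2

struct u64x2 { uint64_t lo, hi; };

// Returns the full 64-bit products a[0]*b[0] and a[2]*b[2], with the 32-bit
// inputs treated as unsigned. Lanes 1 and 3 of a and b are ignored.
inline u64x2 widening_mul_even_u32(__m128i a, __m128i b)
{
    __m128i prod = _mm_mul_epu32(a, b);

    uint64_t out[2];
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out), prod);
    return { out[0], out[1] };
}
```

Because the output has half as many lanes as the input, it does not fit the usual simd_t<T,S> operator signature, which is part of why wrapping it takes some thought.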
Does it make sense to implement something for integer division? There are no "real" simd instructions for integer division (at least in AVX2). Maybe the best option is to extract every element of the simd_t, and do a scalar integer division? (x86 does have scalar integer division instructions: div and idiv.)
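The extract-and-divide fallback could look roughly like this for four packed int32 values, using SSE2 loads/stores. This is a sketch of the idea rather than a tuned kernel, and the name `scalar_fallback_div_epi32` is hypothetical.

```cpp
#include <cstdint>
#include <emmintrin.h>   // SSE2

// Elementwise signed division num[i] / den[i], done with scalar divides.
// As with scalar C++, a zero divisor or INT32_MIN / -1 is undefined behavior.
inline __m128i scalar_fallback_div_epi32(__m128i num, __m128i den)
{
    int32_t n[4], d[4], q[4];
    _mm_storeu_si128(reinterpret_cast<__m128i *>(n), num);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(d), den);

    for (int i = 0; i < 4; i++)
        q[i] = n[i] / d[i];   // truncates toward zero

    return _mm_loadu_si128(reinterpret_cast<const __m128i *>(q));
}
```

The store/divide/load round trip is slow compared to other simd_t operators, so this would mainly be a convenience for code where division is not in the inner loop.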
Comparison operators could have boolean template arguments to override the "quiet ordered" default.
Could write a 'make testx' target which runs the unit tests with multiple combinations of cpu flags (downgraded from -march=native), e.g. to test non-AVX2 kernels on an AVX2 machine.
Generally speaking, the non-AVX2 kernels are not very well-optimized, but I'm not sure how much of a priority this is. When documentation exists, it should say somewhere that 256-bit integer types are dubious without AVX2, and may be slower than 128-bit types.
Lots more integer types are possible (int8, uint8, int16, uint16, uint32, uint64)
In spite of the number of lines of boilerplate here, there is a lot missing when compared to the Intel manuals!
