Investigate compiling for several CPU architectures and selecting at load time
It should be possible to compile the C++ code for several x86 instruction-set levels, such as SSE2, AVX, AVX2, FMA3, FMA4, and AVX512, and to load the appropriate shared object when DUCC is imported.
For this purpose we need to check the CPU features from within Python. This could be done with https://pypi.org/project/py-cpuinfo/ or https://pypi.org/project/cpufeature/ (suggestions welcome if you know other/better packages).
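To make the idea concrete, here is a minimal sketch of what the selection step might look like, assuming py-cpuinfo is used. The module names (`_ducc_avx512` and so on), the helper names, and the exact flag sets required per build are hypothetical placeholders, not anything that exists in DUCC today. The flag spellings follow the lowercase names Linux reports in /proc/cpuinfo (`fma` meaning FMA3), which is what py-cpuinfo exposes in its `flags` list.

```python
import importlib

# Candidate builds, most demanding first. Module names are hypothetical;
# the required flag sets are an assumption and would have to match the
# compiler options actually used for each build.
VARIANTS = [
    ("_ducc_avx512", {"avx512f", "avx2", "fma"}),
    ("_ducc_avx2", {"avx2", "fma"}),
    ("_ducc_avx", {"avx"}),
    ("_ducc_sse2", set()),  # baseline, assumed usable on every x86-64 CPU
]


def detect_flags():
    """Return the CPU feature flags as a set, or an empty set if unknown."""
    try:
        import cpuinfo  # provided by the py-cpuinfo package
    except ImportError:
        return set()
    return set(cpuinfo.get_cpu_info().get("flags", []))


def pick_variant(flags, variants=VARIANTS):
    """Return the name of the first variant whose required flags are all present."""
    for name, required in variants:
        if required <= flags:
            return name
    raise RuntimeError("no compatible build found")


def load_backend(package):
    """Import the best matching extension module from `package` (hypothetical layout)."""
    name = pick_variant(detect_flags())
    return importlib.import_module(f"{package}.{name}")
```

With this layout, a missing or failing feature query falls back to the baseline build instead of raising. One thing worth measuring before committing to py-cpuinfo is how much time `get_cpu_info()` adds to the import.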
Once this works, we can probably provide binary wheels. Each shared library will be about 2.5MB, so a wheel grows by roughly that much for every architecture variant it bundles.