Shamelessly cross-posting this …

  • They could probably have gotten similar results with a combination of numpy and numba. They could also have just written a C extension, which is basically what they did. The key is getting the final code to both run in parallel and vectorize on your exact hardware, so compiler flag choices matter too if you're using C. Nice though.
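
    For anyone curious what the numpy + numba route looks like, here is a rough sketch, not the code from the post. The function name `sum_of_squares` and the toy workload are made up for illustration. It assumes numba is installed; if it isn't, the fallback just runs the plain Python loop with no speedup.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback (assumption for portability): without numba the loop
    # still runs, just as ordinary, slow Python.
    prange = range

    def njit(*args, **kwargs):
        def wrap(fn):
            return fn
        return wrap


# parallel=True lets numba split the prange loop across cores;
# fastmath=True relaxes floating-point ordering so the compiler
# is freer to emit SIMD instructions for the reduction.
@njit(parallel=True, fastmath=True)
def sum_of_squares(x):
    total = 0.0
    for i in prange(x.shape[0]):
        total += x[i] * x[i]
    return total


x = np.arange(1000, dtype=np.float64)
result = sum_of_squares(x)

# Pure-numpy reference for comparison.
reference = float(np.dot(x, x))
```

    Whether this actually matches a hand-written C extension depends on the workload and the machine. On the C side the rough equivalent would be OpenMP plus optimization flags tuned to the target CPU.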