I'm using OpenBLAS 0.12 on Linux, built with NO_AFFINITY=1 TARGET=SANDYBRIDGE.
My application makes many BLAS calls on many small matrices (smaller than 100x100). It is multi-threaded, and each thread manages a large number of matrices. Because the matrices are independent, I don't need any thread synchronization, which makes the application fast. It works well with reference BLAS and MKL, but it is slow with OpenBLAS. On one thread OpenBLAS is much faster than reference BLAS, but on 12 threads my speedup is 2, instead of 11 with the reference BLAS.
Profiling shows that the time is spent in blas_memory_alloc (the loop at line 993) and blas_memory_free. blas_lock is probably the culprit. I tried both USE_THREAD=0 and USE_THREAD=1, but it did not change anything.
My understanding is that blas_memory_alloc is designed for multi-threading on a few large matrices. So I tried to replace it with a simple wrapper around malloc/free, like this:
void *blas_memory_alloc(int procpos) {
    return malloc(BUFFER_SIZE);
}
but it crashes. It seems that I'm misunderstanding how blas_memory_alloc is really supposed to work.
Would it be possible to make blas_memory_alloc efficient for my use case? If not, could I get some guidelines on writing a dedicated memory allocator?
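To make the question more concrete, here is a minimal sketch of the kind of lock-free allocator I have in mind: each thread keeps one cached, aligned scratch buffer, so acquiring and releasing it never touches a shared lock. All names and constants here (scratch_acquire, scratch_release, SCRATCH_SIZE, SCRATCH_ALIGN) are hypothetical and are not OpenBLAS's API, and I am only assuming that one buffer per thread and page alignment would be enough for the kernels.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sizes, NOT taken from OpenBLAS. */
#define SCRATCH_SIZE  ((size_t)1 << 20)  /* assumed 1 MiB per thread */
#define SCRATCH_ALIGN ((size_t)4096)     /* assumed page alignment   */

/* One cached buffer per thread: no shared state, hence no lock. */
static _Thread_local void *tls_scratch = NULL;
static _Thread_local int   tls_in_use  = 0;

/* Return this thread's scratch buffer, allocating it on first use.
 * Returns NULL if the buffer is already held (nested use is not
 * supported in this sketch) or if the allocation fails. */
void *scratch_acquire(void) {
    if (tls_in_use) {
        return NULL;
    }
    if (tls_scratch == NULL) {
        /* C11 aligned_alloc: size must be a multiple of the alignment. */
        tls_scratch = aligned_alloc(SCRATCH_ALIGN, SCRATCH_SIZE);
        if (tls_scratch == NULL) {
            return NULL;
        }
    }
    tls_in_use = 1;
    return tls_scratch;
}

/* Mark the buffer as free again. The memory stays cached for the next
 * call on the same thread; this sketch never returns it to the system. */
void scratch_release(void *p) {
    if (p != NULL && p == tls_scratch) {
        tls_in_use = 0;
    }
}
```

Each worker thread would call scratch_acquire() before a BLAS kernel needs its work buffer and scratch_release() afterwards. Whether the OpenBLAS kernels can actually work with a buffer obtained this way is exactly what I am unsure about.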