
Very slow when having many small matrices and many threads #478

@jeromerobert

I'm using OpenBLAS 0.12 on Linux, built with NO_AFFINITY=1 TARGET=SANDYBRIDGE.

My application makes many BLAS calls on many small matrices (smaller than 100x100). It is multi-threaded, and each thread manages a large number of matrices. Because the matrices are independent, I don't need any thread synchronization, which makes the application fast. It works well with reference BLAS and MKL, but it is slow with OpenBLAS. On one thread OpenBLAS is much faster than reference BLAS, but on 12 threads its speedup is only 2, compared with 11 for reference BLAS.
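For illustration, the access pattern is roughly the one sketched below. This is a minimal sketch assuming pthreads, not the real application: `small_gemm` is a hypothetical naive stand-in for the actual BLAS call, and `run_workers`, `batch_t`, `N`, `MATS_PER_THREAD` and `MAX_THREADS` are names and sizes made up for the example. The point is only that each thread touches its own matrices and shares nothing with the others.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

enum { N = 32, MATS_PER_THREAD = 16, MAX_THREADS = 64 };  /* illustrative sizes */

/* Hypothetical stand-in for a BLAS call: C = A * B, row-major, n x n. */
static void small_gemm(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}

/* One batch per thread: MATS_PER_THREAD matrices in each of a, b and c. */
typedef struct { double *a, *b, *c; } batch_t;

static void *worker(void *arg) {
    batch_t *w = arg;
    assert(w != NULL);
    /* Each thread only touches its own batch, so no synchronization is needed. */
    for (int m = 0; m < MATS_PER_THREAD; m++) {
        size_t off = (size_t)m * N * N;
        small_gemm(N, w->a + off, w->b + off, w->c + off);
    }
    return NULL;
}

/* Runs nthreads independent workers. Returns 0 on success, -1 on failure.
   On success, *sample receives one entry of the last computed product. */
static int run_workers(int nthreads, double *sample) {
    if (nthreads < 1 || nthreads > MAX_THREADS) return -1;
    const size_t count = (size_t)MATS_PER_THREAD * N * N;
    pthread_t tid[MAX_THREADS];
    batch_t batch[MAX_THREADS] = {0};
    int started = 0, rc = 0;

    for (int t = 0; t < nthreads; t++) {
        batch_t *w = &batch[t];
        w->a = malloc(count * sizeof(double));
        w->b = malloc(count * sizeof(double));
        w->c = malloc(count * sizeof(double));
        if (!w->a || !w->b || !w->c) { rc = -1; break; }
        for (size_t i = 0; i < count; i++) { w->a[i] = 1.0; w->b[i] = 2.0; }
        if (pthread_create(&tid[t], NULL, worker, w) != 0) { rc = -1; break; }
        started++;
    }
    for (int t = 0; t < started; t++) pthread_join(tid[t], NULL);
    if (rc == 0) *sample = batch[nthreads - 1].c[count - 1];
    for (int t = 0; t < nthreads; t++) {
        free(batch[t].a); free(batch[t].b); free(batch[t].c);
    }
    return rc;
}
```

Calling `run_workers(12, &x)` mimics the 12-thread case. With a BLAS that takes no shared lock per call, the threads in this pattern never wait on each other.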

Profiling shows that the time is spent in blas_memory_alloc (the loop at line 993) and blas_memory_free. blas_lock is probably the culprit. I tried both USE_THREAD=0 and USE_THREAD=1, but it did not change anything.
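The kind of contention suspected here can be illustrated with a simplified model of a buffer table guarded by a spinlock. This is a sketch under stated assumptions, not OpenBLAS's code: `model_alloc`, `model_free`, `slot_t`, `NUM_SLOTS` and `SLOT_BYTES` are hypothetical names and sizes. It only shows why, in such a design, every call has to pass through the same lock no matter how small the matrix is.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

enum { NUM_SLOTS = 4, SLOT_BYTES = 1 << 16 };  /* hypothetical sizes */

typedef struct {
    void *addr;  /* scratch buffer, allocated on first use and then reused */
    int used;    /* 1 while a caller holds this slot */
} slot_t;

static slot_t slots[NUM_SLOTS];
static atomic_flag table_lock = ATOMIC_FLAG_INIT;

/* Spin until the flag is acquired: all threads serialize here. */
static void lock_table(void) {
    while (atomic_flag_test_and_set_explicit(&table_lock, memory_order_acquire)) {
        /* busy-wait */
    }
}

static void unlock_table(void) {
    atomic_flag_clear_explicit(&table_lock, memory_order_release);
}

/* Returns a scratch buffer, or NULL if every slot is busy or malloc fails. */
static void *model_alloc(void) {
    void *p = NULL;
    lock_table();
    for (int i = 0; i < NUM_SLOTS; i++) {
        if (!slots[i].used) {
            if (!slots[i].addr) slots[i].addr = malloc(SLOT_BYTES);
            if (slots[i].addr) { slots[i].used = 1; p = slots[i].addr; }
            break;
        }
    }
    unlock_table();
    return p;
}

/* Releases a buffer. Returns 0 on success, -1 if p is not a held slot. */
static int model_free(void *p) {
    int rc = -1;
    assert(p != NULL);
    lock_table();
    for (int i = 0; i < NUM_SLOTS; i++) {
        if (slots[i].used && slots[i].addr == p) {
            slots[i].used = 0;
            rc = 0;
            break;
        }
    }
    unlock_table();
    return rc;
}
```

In this model a call on a 10x10 matrix pays the same lock and table scan as a call on a 10000x10000 one, so with many threads issuing many short calls the lock can dominate the run time.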

My understanding is that blas_memory_alloc is designed for multi-threading on a few large matrices. So I tried to replace it with a simple wrapper around malloc/free, like this:

void *blas_memory_alloc(int procpos) {
    return malloc(BUFFER_SIZE);
}

but it crashes. It seems that I'm misunderstanding how blas_memory_alloc really works.

Would it be possible to make blas_memory_alloc efficient in my use case? If not, could I get some guidelines on writing a dedicated memory allocator?
