Connection Pooling Deadlocks #6029

@ecrocombe

We are experiencing application lock-ups, which appear to be caused by thread pool exhaustion.

Analysing the memory dump shows approximately 200 tasks waiting for a database connection, while the other ~20 tasks are writing to error logs:
Image

Within the last 6 hours, in an attempt to rule out networking issues, we migrated the database from a separate host onto the same host as the application (both now run within Docker Swarm), to no avail.

Traces:

[Error] An exception occurred while iterating over the results of a query for context type '"Rust.Domain.Infrastructure.Uow.RustDbContext"'."
""System.InvalidOperationException: An exception has been raised that is likely due to a transient failure.
 ---> Npgsql.NpgsqlException (0x80004005): Exception while reading from stream
 ---> System.TimeoutException: Timeout during reading attempt
   at Npgsql.Internal.NpgsqlReadBuffer.<Ensure>g__EnsureLong|55_0(NpgsqlReadBuffer buffer, Int32 count, Boolean async, Boolean readingNotifications)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at Npgsql.Internal.NpgsqlConnector.ReadMessageLong(Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at Npgsql.Internal.NpgsqlConnector.AuthenticateSASL(List`1 mechanisms, String username, Boolean async, CancellationToken cancellationToken)
   at Npgsql.Internal.NpgsqlConnector.Authenticate(String username, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.Internal.NpgsqlConnector.<Open>g__OpenCore|214_1(NpgsqlConnector conn, SslMode sslMode, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.Internal.NpgsqlConnector.Open(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.PoolingDataSource.OpenNewConnector(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.PoolingDataSource.<Get>g__RentAsync|33_0(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlConnection.<Open>g__OpenAsync|42_0(Boolean async, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(Boolean errorsExpected, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenAsync(CancellationToken cancellationToken, Boolean errorsExpected)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
   at Npgsql.EntityFrameworkCore.PostgreSQL.Storage.Internal.NpgsqlExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at Npgsql.EntityFrameworkCore.PostgreSQL.Storage.Internal.NpgsqlExecutionStrategy.ExecuteAsync[TState,TResult](TState state, Func`4 operation, Func`4 verifySucceeded, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable`1.AsyncEnumerator.MoveNextAsync()"
System.InvalidOperationException: An exception has been raised that is likely due to a transient failure.
   at async Task<TResult> Npgsql.EntityFrameworkCore.PostgreSQL.Storage.Internal.NpgsqlExecutionStrategy.ExecuteAsync<TState, TResult>(TState state, Func<DbContext, TState, CancellationToken, Task<TResult>> operation, Func<DbContext, TState, CancellationToken, Task<ExecutionResult<TResult>>> verifySucceeded, CancellationToken cancellationToken)
   at async ValueTask<bool> Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable<T>+AsyncEnumerator.MoveNextAsync() ---> Npgsql.NpgsqlException: Exception while reading from stream
   at async void Npgsql.Internal.NpgsqlReadBuffer.Ensure(int count)+EnsureLong(?)
   at async ValueTask<IBackendMessage> Npgsql.Internal.NpgsqlConnector.ReadMessageLong(bool async, DataRowLoadingMode dataRowLoadingMode, bool readingNotifications, bool isReadingPrependedMessage)
   at async Task Npgsql.Internal.NpgsqlConnector.AuthenticateSASL(List<string> mechanisms, string username, bool async, CancellationToken cancellationToken)
   at async Task Npgsql.Internal.NpgsqlConnector.Authenticate(string username, NpgsqlTimeout timeout, bool async, CancellationToken cancellationToken)
   at async Task Npgsql.Internal.NpgsqlConnector.Open(NpgsqlTimeout timeout, bool async, CancellationToken cancellationToken)+OpenCore(?)
   at async Task Npgsql.Internal.NpgsqlConnector.Open(NpgsqlTimeout timeout, bool async, CancellationToken cancellationToken)
   at async ValueTask<NpgsqlConnector> Npgsql.PoolingDataSource.OpenNewConnector(NpgsqlConnection conn, NpgsqlTimeout timeout, bool async, CancellationToken cancellationToken)
   at async ValueTask<NpgsqlConnector> Npgsql.PoolingDataSource.Get(NpgsqlConnection conn, NpgsqlTimeout timeout, bool async, CancellationToken cancellationToken)+RentAsync(?)
   at async void Npgsql.NpgsqlConnection.Open()+OpenAsync(?)
   at async Task Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenInternalAsync(bool errorsExpected, CancellationToken cancellationToken) x 2
   at async Task<bool> Microsoft.EntityFrameworkCore.Storage.RelationalConnection.OpenAsync(CancellationToken cancellationToken, bool errorsExpected)
   at async Task<RelationalDataReader> Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteReaderAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at async Task<bool> Microsoft.EntityFrameworkCore.Query.Internal.SingleQueryingEnumerable<T>+AsyncEnumerator.InitializeReaderAsync(AsyncEnumerator enumerator, CancellationToken cancellationToken)
   at async Task<TResult> Npgsql.EntityFrameworkCore.PostgreSQL.Storage.Internal.NpgsqlExecutionStrategy.ExecuteAsync<TState, TResult>(TState state, Func<DbContext, TState, CancellationToken, Task<TResult>> operation, Func<DbContext, TState, CancellationToken, Task<ExecutionResult<TResult>>> verifySucceeded, CancellationToken cancellationToken) ---> System.TimeoutException: Timeout during reading attempt

   --- End of inner exception stack trace ---
   --- End of inner exception stack trace ---

These are the thread pool metrics (one sample every 5 seconds) leading up to and including the event, until the application was manually restarted at 04:44 am:
Image
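
For anyone trying to reproduce this, metrics of this kind can be collected with the built-in `System.Threading.ThreadPool` counters. The following is only a minimal sketch, not the collector used for the graph above: `ThreadPoolSample`, `ThreadPoolSampler`, `Capture` and `Start` are hypothetical names introduced here, and the output goes to the console rather than to a metrics backend. It samples from a dedicated thread, on the assumption that a sampler running on the pool itself could stall when the pool is starved.

```csharp
using System;
using System.Threading;

// Hypothetical helper types, not part of the application described in this issue.
public readonly record struct ThreadPoolSample(
    DateTime TimestampUtc,
    int ThreadCount,          // threads currently in the pool
    long PendingWorkItems,    // work items queued but not yet started
    int BusyWorkerThreads,    // max worker threads minus available worker threads
    long CompletedWorkItems); // work items completed so far

public static class ThreadPoolSampler
{
    // Takes one snapshot of the runtime's thread pool counters.
    public static ThreadPoolSample Capture()
    {
        ThreadPool.GetMaxThreads(out int maxWorkers, out _);
        ThreadPool.GetAvailableThreads(out int freeWorkers, out _);
        return new ThreadPoolSample(
            DateTime.UtcNow,
            ThreadPool.ThreadCount,
            ThreadPool.PendingWorkItemCount,
            maxWorkers - freeWorkers,
            ThreadPool.CompletedWorkItemCount);
    }

    // Prints one sample per interval from a dedicated background thread,
    // so sampling does not depend on a free thread pool thread.
    public static Thread Start(TimeSpan interval, CancellationToken ct)
    {
        var thread = new Thread(() =>
        {
            // WaitOne returns true once cancellation is requested, ending the loop.
            while (!ct.WaitHandle.WaitOne(interval))
            {
                Console.WriteLine(Capture());
            }
        })
        {
            IsBackground = true,
            Name = "threadpool-sampler"
        };
        thread.Start();
        return thread;
    }
}
```

With a 5 second interval this would be started once at application startup, e.g. `ThreadPoolSampler.Start(TimeSpan.FromSeconds(5), cts.Token)`. A steadily growing `PendingWorkItems` alongside a climbing `ThreadCount` would be consistent with the exhaustion described here, though it does not by itself identify the cause.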

Connection String

NpgsqlConnectionStringBuilder builder = new()
{
    Database = configuration.Database.Database,
    Pooling = true,
    MaxPoolSize = configuration.Database.MaxPoolSize, // 2000
    MinPoolSize = configuration.Database.MinPoolSize, // 1000
    Password = configuration.Database.Password,
    Port = configuration.Database.Port,
    Host = configuration.Database.Host,
    Username = configuration.Database.Username,
    Timeout = 15,
    KeepAlive = 30,
    ApplicationName = nameof(Program),
    CommandTimeout = 120
};

Database configuration:
Apart from some thread and memory adjustments, we have explicitly disabled SSL; everything else is left at its default.

Packages:
Npgsql 9.0.2
Npgsql.EntityFrameworkCore.PostgreSQL 9.0.2

Environment:
Docker Swarm
.NET 9.0.2
Unix 6.8.0.51 x64
