Debugging deadlocks in JPA and PostgreSQL
In this article I describe a recent problem I had with PostgreSQL, where a deadlock caused my application to crash for unclear reasons. Naturally I also describe how to debug such problems and the reason for the deadlock.
Our application
We have a large transaction doing batch uploads which inserts rows of
item_component
. Every 100 entities inserted it will flush the session
to the DB in order to save memory. That does not commit the transaction,
it just sends the SQL commands to the database.
For each item inserted we create a separate transaction which:
-
reads a
person
row, -
locks it with
LockMode.UPGRADE
to prevent concurrent modification (this is implemented withSELECT FOR UPDATE
, -
increments a counter in one column,
-
commits, thus dropping back the lock.
Because our batch insertion transaction does not need to lock the
person
table, this should work fine.
Finding the deadlock
Our code does not work: the first 100 items are created fine, the
person
objects are locked and incremented 100 times, but when I flush
the first 100 inserted items to the DB (which works OK), the next
person
lock/increment hangs forever. Deadlock.
After ruling out anything that is triggered by our JPA code (disable Hibernate Search, entity listeners, caching), I have to conclude this could be a database mystery.
The first step is to find out the PostgreSQL transaction IDs of my JPA transactions. This can be done using the code shown below. Our batch insertion has TX ID 3/1501 while our blocking lock/increment TX ID is 2/4942. If we list all lock and transaction information using the query shown below (after filtering out locks on indexes, views and sequences and removing columns we do not care about) we get the following results:
# | Type | TX ID | VTX ID | Mode | G | Name | K | SQL | W |
---|---|---|---|---|---|---|---|---|---|
TX = Transaction; VTX = Virtual TX; G = Granted; K = Kind; W = Waiting |
|||||||||
1 |
tuple |
2/4941 |
ExclusiveLock |
t |
person |
r |
select person0_.pers…rson_id=$1 for update |
t |
|
2 |
relation |
2/4941 |
RowShareLock |
t |
person |
r |
select person0_.pers…rson_id=$1 for update |
t |
|
3 |
virtualxid |
2/4941 |
ExclusiveLock |
t |
select person0_.pers…rson_id=$1 for update |
t |
|||
4 |
transactionid |
774990 |
2/4941 |
ShareLock |
f |
select person0_.pers…rson_id=$1 for update |
t |
||
5 |
relation |
3/1501 |
RowShareLock |
t |
person |
r |
<IDLE> in transaction |
f |
|
6 |
relation |
3/1501 |
RowExclusiveLock |
t |
item_component |
r |
<IDLE> in transaction |
f |
|
7 |
relation |
3/1501 |
AccessShareLock |
t |
person |
r |
<IDLE> in transaction |
f |
|
8 |
virtualxid |
3/1501 |
ExclusiveLock |
t |
<IDLE> in transaction |
f |
|||
9 |
transactionid |
774990 |
3/1501 |
ExclusiveLock |
t |
<IDLE> in transaction |
f |
Let’s first describe what each line is about:
-
This is an exclusive lock on the
person
row we are locking -
The PostgreSQL manual describes this lock as being taken for
FOR UPDATE
-
The PostgreSQL manual explains that Every transaction holds an exclusive lock on its virtual transaction ID for its entire duration.
-
The PostgreSQL manual explains that When one transaction finds it necessary to wait specifically for another transaction, it does so by attempting to acquire share lock on the other transaction ID. So here we have the lock/increment 2/4941 transaction waiting for our batch insertion transaction 3/1501 to complete.
-
Why this transaction holds this lock is beyond me. The PostgreSQL manual explains that this lock is obtained for
SELECT FOR UPDATE|SHARE
which is never called by this transaction, so go figure… -
This is a normal lock obtained automatically when inserting an item in a table. The table here is
item_component
. -
This is a normal lock obtained automatically when selecting an item from a table. This lock on
person
cannot harm and is entirely normal. -
The PostgreSQL manual explains that Every transaction holds an exclusive lock on its virtual transaction ID for its entire duration.
-
The PostgreSQL manual explains If a permanent ID is assigned to the transaction (which normally happens only if the transaction changes the state of the database), it also holds an exclusive lock on its permanent transaction ID until it ends
The diagnosis
My lock/increment transaction is waiting for the batch insertion
transaction to release its lock on the person
row it is trying to
lock. We now have to find out why it acquired this lock since I did not
do it manually.
PostgreSQL implicit locks
After trying it myself in a PostgreSQL console, the following SQL causes
a ROW SHARE
lock to be obtained on the person
row referenced by
foreign key 2:
begin;
insert into item_component (item_component_id, creator_id) values (nextval('hibernate_sequence'), 2);
Of course this ROW SHARE
lock is held until my batch insert
transaction is committed, and prevents my lock/increment transactions
from doing their job.
After a friendly chat on the PostgreSQL IRC channel I was able to get a confirmation that this behaviour is normal and a result of the RI (Referential Integrity) triggers. I have sent a mail to the documentation mailing list asking for better documentation of such implicit side-effect locks that will help others like me not waste days finding this out.
Now I need to find a better strategy for my code.
Appendix
[DebuggingdeadlocksinJPAandPostgreSQL-tid]#
=== Display Virtual Transaction ID in JPA
Use this code to find out your PostgreSQL virtual transaction ID:
public Object getTXID() {
Query query =
entityManager.createNativeQuery("select virtualtransaction from pg_locks where pid = pg_backend_pid()");
return query.getSingleResult();
}
[DebuggingdeadlocksinJPAandPostgreSQL-query]#
Listing all locks and their transactions
Use this SQL to list all locks and their transactions:
select
locktype, database, relation, page, tuple, transactionid, virtualtransaction, mode, granted,
relname, relkind,
datname, usename,
case
when length(current_query) > 40
then substring(current_query for 20) || '...'
|| substring(current_query from (length(current_query) - 19) for 20)
else current_query
end as query,
waiting, query_start, client_addr, client_port
from pg_locks
left outer join pg_class on oid = relation
left outer join pg_stat_activity on pid = procpid
order by pid;