Deadlock Busters at DNB

DNB has many systems that are business critical and important for the financial infrastructure. These high-transaction, high-volume systems of record, with strict demands on performance and availability, pose a tuning and bug-fixing challenge.

By
  • Audun Faaberg
Feb. 15, 2023 · 14 min. read time

All major companies with large and complex IT systems have deadlocks hidden deep inside their core systems. While these rarely cause major problems, they do result in negative user experiences and disruptions, and they can be an indication of poor system quality.

Deadlock busters

Inside one of DNB’s core systems, the number of deadlocks exceeded what we deemed acceptable. As part of revamping our core systems, we decided to explore how this problem could be eliminated. So, in late 2019, the Operations Lead set up a team of experienced DNB’ers who became deadlock busters.

The team employed a methodology which enabled us to handle the problem step by step. Over the course of 2.5 years, this team discovered four different classes of deadlocks in that system. As of this past September we have removed 90% of all deadlocks, and continue to work on the remaining 10%. Please join us in the story of how we removed deadlocks and optimised the database in one of our core systems.

Deadlocks, Explained

The technical definition of a deadlock is: two processes, each waiting for an event that only the other process can cause. In a database system, the processes are typically waiting for the release of a database page or row. Neither of the processes can run, release resources, or end.

Figure 1: A typical deadlock, in which two processes each wait for a resource held by the other.

Deadlocks pose a potential problem in all multiprocessing systems in which several transactions or programs run concurrently.
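
To make the pattern in Figure 1 concrete, here is a minimal sketch in plain SQL. The ACCOUNT table and its columns are made up for illustration; what matters is the interleaving across two concurrent sessions, with neither session having committed yet.

-- Session 1:
UPDATE ACCOUNT SET BALANCE = BALANCE - 100 WHERE ACCOUNT_ID = 'A';   -- locks the row/page holding account A

-- Session 2:
UPDATE ACCOUNT SET BALANCE = BALANCE + 50 WHERE ACCOUNT_ID = 'B';    -- locks the row/page holding account B

-- Session 1:
UPDATE ACCOUNT SET BALANCE = BALANCE + 100 WHERE ACCOUNT_ID = 'B';   -- must wait for Session 2's lock

-- Session 2:
UPDATE ACCOUNT SET BALANCE = BALANCE - 50 WHERE ACCOUNT_ID = 'A';    -- must wait for Session 1's lock

-- Deadlock: each session waits for a lock the other holds. After the detection
-- interval, the database management system rolls one transaction back (the victim)
-- so that the other (the holder) can finish.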

Concurrency leads to another issue: one reason it is hard to analyse and solve deadlocks is, in fact, our own minds. When coding, a person typically believes their program is running all by itself, with access to database tables and no interference from other programs (see the left side of Figure 2).

Figure 2: The mental model of a program running alone against the database (left), versus the reality of many programs running concurrently (right).

What The User Experiences

This typically results in a screen failure, sometimes along with a message asking the user to retry, and sometimes without any such message. The user retries the action – and it generally works fine the second time around – meaning such an error does not hinder work from being completed. It does, however, result in a negative user experience and may leave the user questioning the quality of the system. This is not something we want internal or external users to be exposed to.

How we Got Rid of Deadlocks

First – our team attempted to get an overview of the problem.

  • How many deadlocks are there?
  • Are some situations more common than others?
  • Are there different categories of deadlocks?
  • Is there a pattern in time of day when the deadlocks occur?
  • Are there other patterns we can trace?

The team scanned the database logs and created a spreadsheet of all deadlocks, specified by date, time, transaction, program and SQL for both the winning (holder) and the losing (victim) transaction. It covered four weeks of production, and in all there were approximately 6000 deadlocks – around 1500 per week. This data provided a good basis for statistical analysis. Our team sorted and grouped it in various ways, and found that grouping by program and SQL gave the best basis for prioritisation.
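
A sketch of the kind of grouping we did, assuming for illustration that the log extract had been loaded into a table called DEADLOCK_LOG rather than a spreadsheet; the table and its columns are hypothetical:

-- Count deadlocks per program/SQL combination over the last four weeks,
-- for both the victim and the holder side, to build a prioritised top list.
SELECT VICTIM_PROGRAM, VICTIM_SQL, HOLDER_PROGRAM, HOLDER_SQL,
       COUNT(*) AS DEADLOCK_COUNT
  FROM DEADLOCK_LOG
 WHERE DEADLOCK_DATE >= CURRENT DATE - 28 DAYS
 GROUP BY VICTIM_PROGRAM, VICTIM_SQL, HOLDER_PROGRAM, HOLDER_SQL
 ORDER BY DEADLOCK_COUNT DESC;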

Figure 3: The deadlocks grouped and prioritised by program and SQL.

Error Type 1: Synchronisation

Once the data was sorted, it became apparent that one deadlock type accounted for 25% of them all. This was by far the most common and was selected for the team’s first deep analysis.

When working with complex problems with multiple root causes, we often single out the most common error and focus on solving that first. We do not try to find the explanation for all problems at once. With luck, the solution for that single error is applicable to many of the other errors. And even if it is not, one has at the very least removed the most common error.

Our team established a working routine in which a group of six to eight experts spent two to three hours in a meeting room focused on this one specific deadlock error. For a long time, we got nowhere at all. Using several different tools to enhance the data on hand, our team finally succeeded in getting the extra bit of information needed.

We observed that an INSERT went into a deadlock with an UPDATE from an asynchronous process, and understood this was related to the synchronisation with another system. Our team had meetings with architects from “SystemA” and “SystemB”. We discovered there had been several prior discussions between these two teams, as the synchronisation had failed in some extremely rare cases. This caused the systems to hold slightly different customer data – which is not something you want.

Together with both teams we studied a detailed trace of what happened in these systems, step-by-step, millisecond-by-millisecond.

Figure 4: A step-by-step trace of the interaction between SystemA and SystemB leading to the deadlock.

This was not a typical deadlock situation, such as the one depicted in Figure 1. It was a very specific deadlock caused by the cooperation between two systems.

In earlier discussions, each coding team had pointed out that their respective system worked, and that the discrepancy in data was caused by the other system. Now, TeamA asked TeamB not to synchronise changes back from SystemA, as it was not necessary in this case. TeamB showed that there is a parameter in the call by which the calling system (SystemA) can specify that changes should not be synchronised back.

Our team changed the code in SystemA to set the parameter to “do not synchronise back”. The change was implemented, and the number of deadlocks immediately dropped by 33%, to approximately 1000 per week.

Now, for those of you who remember the numbers: this particular error caused 25% of all deadlocks, yet once the fix entered production, the number of deadlocks dropped by 33%. How can that be? The point is that deadlocks are potential errors, waiting for a “deadlock window” to open. A deadlock is active for 10 seconds in our database before the database management system decides to stop one of the transactions, roll it back and let the other transaction through. In those 10 seconds, other transactions may be waiting for some of the locks held by the two deadlocked transactions, while holding many locks of their own – locks that yet more transactions may be waiting for. This 10-second wait increases the possibility that other transactions end up in deadlocks of their own with the waiting transactions.

With this in production in February 2020, our team began looking into the next deadlocks. These were different. In early March 2020, the whole world locked down (as a result of COVID-19, not deadlocks), and put an end to our three-hour meetings of eight to ten people. At the time, we believed the lockdown would last a couple of months. Our method of having everyone in the same room had been very successful, and we decided to continue our work once lockdown had ended.

Error Type 2: Isolation Levels

By early 2021, it was clear the lockdowns would continue for some time, and we began holding one-hour virtual meetings to analyse the next deadlocks on the list. We were somewhat sceptical as to how efficiently this kind of analysis could be conducted in a virtual setting, but it soon turned out fine.

Once our work had resumed, our team discovered something unexpected. Two transactions had only done SELECTs before UPDATEs, and the first UPDATEs in each transaction deadlocked. This seemed counterintuitive. How could that happen?

The team could see that the Isolation Level of the modules was ‘Read Stability,’ a stricter level than one would normally expect. Most transaction systems run with Isolation Level ‘Cursor Stability’.

Note from the author: Isolation Levels are the closest thing to philosophy in IT, and you can live happily as a programmer for your entire career without a full and detailed understanding of the concept.

Our gut feeling was to change the Isolation Level to Cursor Stability, which is normal in high-traffic transactional systems like this. Keep in mind – this is a critical system in production, and it did, after all, work fine for 99.999% of the traffic: some sporadic deadlocks, but nothing causing severe errors. Our gut feeling was thus not considered a valid argument for changing the Isolation Level. We had to show the exact cause and effect, and explain how this would change with Cursor Stability. Every member of the team had worked in IT for decades with an approximate understanding of Isolation Levels. Now we had to get a detailed understanding.

We searched through literature and blogs, but the concept of Isolation Levels is not well explained in available documentation and tech blogs. An explanation presented itself in an IDUG session with Steve Thomas, a DB2 guru. He states that the difference lies in the locking done by the SELECTs. All Isolation Levels lock all rows or pages that are UPDATED, DELETED or INSERTED until COMMIT. The difference is whether they lock rows and pages that are SELECTED. Read Stability locks data rows and pages that are SELECTED; Cursor Stability does not.

Now, we could explain in minute detail how the error occurs.

Figure 5: How two transactions running with Isolation Level Read Stability deadlock on their first UPDATEs, because the preceding SELECTs hold locks.

The main challenge in understanding this is that we all work in applications where the Isolation Level is Cursor Stability – that is the normal setup. Hence, we are not used to considering the consequences of applying another Isolation Level.
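
To make Figure 5 concrete, here is one possible variant of the scenario as plain SQL, using DB2’s statement-level isolation clause (WITH RS / WITH CS) and a hypothetical CUSTOMER table:

-- Session 1:
SELECT STATUS FROM CUSTOMER WHERE CUSTOMER_ID = 42 WITH RS;   -- Read Stability: a share lock on the row/page is kept until COMMIT

-- Session 2:
SELECT STATUS FROM CUSTOMER WHERE CUSTOMER_ID = 42 WITH RS;   -- also keeps a share lock on the same row/page

-- Session 1:
UPDATE CUSTOMER SET STATUS = 'ACTIVE' WHERE CUSTOMER_ID = 42; -- needs an exclusive lock, waits for Session 2's share lock

-- Session 2:
UPDATE CUSTOMER SET STATUS = 'CLOSED' WHERE CUSTOMER_ID = 42; -- needs an exclusive lock, waits for Session 1's share lock

-- Deadlock, even though both transactions only did SELECTs before their UPDATEs.
-- With Cursor Stability (WITH CS), the SELECTs would not keep their locks after
-- reading the row, and the UPDATEs would simply queue behind each other instead.

In the real system, the Isolation Level came from the modules rather than from a WITH clause on each statement, but the locking behaviour is the same.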

Using a DB tool, the team checked how many programs were running with Read Stability. There were several hundred. As this was a critical system with high transaction volumes, we could not simply change the Isolation Level for all programs. We decided to take a case-by-case approach: we found deadlocks where one or both programs had Isolation Level Read Stability, and evaluated the code. The team discussed the changes with the DBA and Application Architects from the original development team. As changes in Isolation Level can have functional consequences, we conducted thorough tests and changed a maximum of five programs per go-live. This kept us busy for half a year, and brought gradual changes throughout the system. The number of deadlocks decreased slowly from around 1000 down to roughly 700 per week.
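
One way to do such a check, assuming DB2 for z/OS, is a catalog query along these lines; treat the catalog code values and the rebind example as assumptions to verify against your own environment and change process:

-- List packages bound with Read Stability. In SYSIBM.SYSPACKAGE the ISOLATION
-- column holds a one-letter code ('T' denotes Read Stability, 'S' Cursor Stability).
SELECT COLLID, NAME, ISOLATION
  FROM SYSIBM.SYSPACKAGE
 WHERE ISOLATION = 'T'
 ORDER BY COLLID, NAME;

-- The Isolation Level itself is changed at bind time, for example:
-- REBIND PACKAGE(collection.program) ISOLATION(CS)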

Error Type 3: Modularisation

First – modularisation is not an error. We strive to modularise our systems; this core system is highly modularised, with many programs implementing complex functionality that we want to reuse in many settings.

Of the 700 deadlocks per week, 35% were in two programs, used in two different transactions accessing the same table. The challenge here was that in both transactions, the deadlock occurred at the end of a very long chain of programs calling other programs. To analyse a deadlock, we often look at all SQLs executed in sequence, from the start of the transaction to the actual deadlock. It was difficult, though, to see all the functional variants that gave different sequences of SQLs. As an approach to the analysis, it was quite simply overwhelming.

At this point, we switched strategy and scrutinised the functionality of the two transactions. Having highlighted where we found the deadlocks, we approached one of the architects. One transaction is used by an operator to INSERT a mortgage sale to a customer in the SALES table, while the other transaction synchronises the customer to another database. In that process, it opens a cursor and FETCHES all SALES on the person in question. The architect explained that the SALES information was irrelevant for this specific synchronisation process. It was fetched only because we reused a standard module that is called in many different functional circumstances, some of which do need the FETCH.

The module that did the FETCHES was modified to evaluate where the call originated. If the starting point was this synchronisation transaction, the program would skip fetching information from the SALES table.

The core problem here was that the module ran functionality that was not strictly necessary for the synchronisation. That is simply a fact of life when using large functional building blocks like these modularised programs. But once we had pinpointed the problem, the change could be handled like a normal functional change.

Figure 6: The modified module skips the FETCH from the SALES table when the call originates from the synchronisation transaction.

The number of deadlocks was expected to fall by 35% after go-live, but instead dropped by 75%, to approximately 150 per week. As in the situation described earlier, we had removed a deadlock which often held up other transactions and thereby increased the odds of deadlocks in yet more transactions.

Error Type 4: Data Distribution

The team finally began to find a number of “classic” deadlocks. Transaction 1 UPDATES Page A and Page B (in that sequence), and Transaction 2 UPDATES Page B and Page A (in that sequence). See Figure 1. This is the textbook example – but what to do about it?


Figure 7: The “classical” deadlock, in which two processes UPDATE data in different sequences. Note that we lock full pages, not rows, so even if the two processes work on different rows, they compete for the same pages.

One simple way to solve this is to switch from page locks to row locks. In our clustered database setup, however, the cluster nodes will still claim exclusive ownership of the pages, thus presenting other problems.
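
For reference, the row-lock alternative we decided against is a small change at the tablespace level. The sketch below assumes DB2 for z/OS syntax, and the database and tablespace names are hypothetical:

-- Switch the locking granularity from pages to rows for one tablespace.
-- We did not go this route: in our clustered setup the nodes still claim
-- exclusive ownership of the data pages, so the contention largely remains.
ALTER TABLESPACE CASEDB.CASETS LOCKSIZE ROW;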

In the program containing most of these “classical” deadlocks, they were distributed among four tables. All tables were clustered on CASEID. Clustering in a database table refers to how the table is physically sorted. Here the CASEID always starts with the date the case is entered – in a YYMMDD format – meaning all cases from today are clustered into the last couple of database pages.

Another attribute of this core system: it is a case management system, and most cases are closed within three to five days. When we combined this insight with the clustering, we understood that all transactions fight for the last 50-100 database pages in the tables, even though the tables are massive and distributed across several hundred thousand pages.


Figure 8: How clustering affects a system running many processes concurrently. When clustered on CASEID, all six processes fight for pages 7 and 8. When we distribute on customer – which is mostly random in this example – each process updates a different data page. This dramatically reduces the potential for a deadlock.

We decided to cluster on another column. CUSTOMERID would distribute the traffic randomly throughout the table.
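
Conceptually, the change amounts to moving the CLUSTER attribute from the CASEID index to a CUSTOMERID index and then reorganising the data. A sketch assuming DB2 for z/OS syntax, with hypothetical index, database and tablespace names:

-- Stop clustering on CASEID and cluster on CUSTOMERID instead (shown for one of the four tables).
ALTER INDEX SALES_IX_CASEID NOT CLUSTER;
ALTER INDEX SALES_IX_CUSTID CLUSTER;

-- A reorganisation is then needed to physically re-sequence the existing rows:
-- REORG TABLESPACE CASEDB.SALESTS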

Up to this point, everything was fine. Keep in mind – we were at the centre of a critical core system in a large bank. Deadlocks and the effects of clustering are primarily visible in production, where traffic volumes are high.

Here, things began to get difficult. Our team set up a performance test that produced a considerable number of deadlocks – around 200 – in a test run of 30 minutes. This was a test we could reproduce. We changed the clustering of the four tables and ran the same test again for 30 minutes. The result was three or four deadlocks, and since the processes were no longer competing for the same database pages, the traffic through the database in those 30 minutes doubled. We had demonstrated, in a test environment, that this specific change of clustering reduced the number of deadlocks by 99%.

Essentially, we had employed a natural-science testing methodology:

  • Conduct observations: the number of deadlocks, the clustering of the tables.
  • Set up a hypothesis: the current clustering leads to the processes competing for data, ending in deadlocks. Another clustering will remove that.
  • Set up a test – one that is reproducible – to verify or falsify your hypothesis.
  • Test BEFORE and AFTER the change. Compare the results.

Note from the author: It was not easy to set up a deadlock test – material for my next blog post.

We discussed the findings with the DBA of the original development team. They revealed they had designed the clustering to keep the most current cases in a few pages, hence having all relevant data in the buffer. This made the system faster and reduced CPU usage. While well intended, this case illustrates the importance of balancing many interests and effects when tuning an IT system.

As the original clustering was intended to improve performance, the team had to run a large performance test before and after in order to evaluate how much the new clustering increased response times and CPU consumption.

Some may think this was overkill. The fact of the matter is that if we had got it wrong, we would have risked our core system running very slowly. That is fodder for the media, and we wish to avoid such exposure.

Performance testing showed a penalty in both response time and CPU consumption, having lost the advantage of keeping all the relevant data in the buffer. The increase for the different SQLs varied in the range of 30% to 300%. The total increase, however, was deemed so small in absolute values that the change was moved into production anyway. There we saw that response times in general decreased, the explanation being that there is less waiting for data pages due to the drop in contention.

OPSYS

The entire process was well suited for OPSYS working. There was a list of problems, and our team limited its focus to the three to five most pressing ones. When changes were implemented in production to solve, say, the top two problems, the list of remaining problems – and thereby the priorities – also changed.

By keeping an OPSYS attitude and not following a strict and detailed work plan, our team was easily able to move on to address the top three to five problems on the new list.

If this had been run as a more traditional project (i.e., basing the work on the initial top list, prioritised and estimated on that basis), the team would have missed the target, because the target changed along the way.

Success Factors

We think six factors resulted in the progress of this effort:

  • We began by gathering a number of technical experts for several hours at a time. This forced us to make progress – after all, we wanted to get out of the meeting room 😊
  • We worked from a strict top list of the deadlock types, always working on the most prevalent, and always letting the numbers in the spreadsheet guide us.
  • We contacted functional experts and discussed how several systems work together. We saw that we could not solve everything from the technical side.
  • We went to new depths in our understanding of technical issues like Isolation Levels, Clustering and Locking – issues that team members had a superficial understanding of, yet had somehow managed to navigate around for decades.
  • We analysed and tested carefully before moving to production. We strived to avoid production problems caused by our changes; such problems would likely have ended the team’s work, seeing as deadlocks were not critical to begin with.
  • Endless patience. This was not a quick-and-easy task, and given the non-critical nature of the work, we often lost out when priorities were set.

Tools We Used:

  • Self-made tools to scan the two database logs, find information on all deadlocks, and match this to the actual SQLs running – presented in a spreadsheet.
  • An SQL monitor, showing the time of the deadlocks, the program, the SQL and, in most cases, the parameters to the SQL at the time of deadlocking.
  • A data dictionary tool, to see table and program definitions in the database (including the Isolation Level for programs).
  • A system-specific functional monitor, listing time, program and parameters, response times, and whether a deadlock occurred.
  • A cross-system monitoring tool (Dynatrace), giving response times and deadlocks, and the full sequence of SQLs run in a transaction.

Deadlock busters:

The Tech Nerds: Audun Faaberg, Bjørn Bøen, Gyana Samal, Akash Sharma

The Operation leads: Arne Von Essen, Adam El Deen, Bjørn Valaker, Bindu Sri Kandipati

The Old Guards: Steen Raabjerg, Jan Hansen, Jostein Sønnesyn

The Database Administrators: Jahangir Mohebbi, Marit Olufsen

The Performance Test Nerds: Arne Lundal, Priyanka Patil

Nerdy Analytics and Programmers: Someshwar Hiremath, Nitin Gangwal, Sharat Markandan, Basudev Singh, Jørgen Malmstrøm, Terje Bredesen, Trine Haugland Bakke

Thank you to all deadlock busters and to all other colleagues contributing, supporting and cheering us on.

Disclaimer: The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy or position of DNB.
