My Approach To Solving Software Engineering Problems

Software engineering problem solving is a key skill for software engineers, from analysing what went wrong to developing a solution.

Software engineering problems usually occur in production environments, where developers have little to no direct access. We can read the logs, but often we need more. You can never log too much information.
The first step in analysing the problem is determining what went wrong. Often, the customer will report that they did x, y, and z, and the system did a, b, and c. Our company has a support team who will talk through the problem and test it themselves. As developers, when we get a support ticket, it usually comes with steps to reproduce an issue.

First, we can fire up our local dev environment and see if we can reproduce the problem locally. This is an ideal scenario. If not, we try the QA server, and if we still can't reproduce, it's time to get hold of a sanitised copy of the production database. Our DBAs will remove all traces of customer data and ensure that the remaining data has been randomised and complies with HIPAA regulations. We can then rerun our tests and replicate the issue, especially if some bad data got into the database somehow. In rare cases, there will be an intermittent error, which is the worst-case scenario.
Assuming we can replicate the problem, we can set breakpoints in the code and step through the relevant areas to narrow down the code causing the problem. If there is bad data, we have to find out where it came from and devise a script to fix it. It could be a key that was never set or a value that is NULL when we expect it to be populated. Sometimes, the database's foreign keys have become corrupt, so a key on one table may not exist in the linked table.
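As a rough illustration of hunting for the orphaned keys described above, here is a minimal Python sketch using an in-memory SQLite database. The customers and orders tables, their columns, and the find_orphans helper are all hypothetical and made up for this example; a real schema and database engine will differ.

```python
import sqlite3


def find_orphans(conn, child, fk_column, parent, pk_column="id"):
    """Return child rows whose foreign key has no matching parent row.

    Hypothetical helper: identifiers are interpolated directly into the SQL,
    so only pass trusted table and column names.
    """
    query = (
        f"SELECT c.* FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{fk_column} = p.{pk_column} "
        f"WHERE c.{fk_column} IS NOT NULL AND p.{pk_column} IS NULL"
    )
    return conn.execute(query).fetchall()


# Made-up sample data: order 11 points at a customer that does not exist,
# and order 12 has no customer set at all (a separate kind of bad data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Alice');
    INSERT INTO orders VALUES (10, 1), (11, 99), (12, NULL);
    """
)

orphans = find_orphans(conn, "orders", "customer_id", "customers")
print(orphans)
```

The LEFT JOIN keeps every child row and leaves the parent side NULL where no match exists, which is what the WHERE clause filters on. Rows with a NULL key are excluded here on purpose, since a missing key and a dangling key usually need different fixes.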
Some problems may be caused by the logic code itself; a complex if statement could be misbehaving, or a value could be mapped incorrectly.

Intermittent issues are challenging to narrow down. I've seen issues in the past where async/await requests were processed out of order under heavy server load, deadlocks in the database when a WITH (NOLOCK) hint was omitted, and even SQL explosions when there are large numbers of records. These are really difficult to pin down locally with only limited test data. They take time and require analysing every line, working out what could happen, and using SQL Profiler to analyse the SQL queries generated by LINQ.
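To make the out-of-order behaviour concrete, here is a small Python asyncio sketch. It is not the code from the systems I worked on; the handle function and the fixed delays are assumptions that stand in for requests whose latency varies under load.

```python
import asyncio


async def handle(request_id, delay, completed):
    # The sleep stands in for I/O whose latency varies under load.
    await asyncio.sleep(delay)
    completed.append(request_id)


async def run_concurrently(delays):
    # All requests are in flight at once, so they finish in latency order,
    # not in the order they were submitted.
    completed = []
    await asyncio.gather(*(handle(i, d, completed) for i, d in enumerate(delays)))
    return completed


async def run_sequentially(delays):
    # Awaiting each request before starting the next preserves submission order.
    completed = []
    for i, d in enumerate(delays):
        await handle(i, d, completed)
    return completed


delays = [0.15, 0.01, 0.08]  # request 0 is the slowest
concurrent_order = asyncio.run(run_concurrently(delays))
sequential_order = asyncio.run(run_sequentially(delays))
print(concurrent_order, sequential_order)
```

With small, uniform test data every request takes about the same time, so the concurrent version often appears to preserve order locally. Code that depends on completion order only breaks once latencies start to diverge, which is why this kind of bug tends to show up in production first.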
Once we find the cause of the problem, we then have to find a reliable way for our QA department to replicate it in the test environment or set up the data so that it will quickly fail. They can then go through and write up test cases and plan their tests while we work on a solution. The solution could be a one-line change or a single character (I've seen significant problems caused because a > should have been a >=). Other solutions could be more complex, involving multiple lines of code and/or a code rewrite. They can include scripts to fix data in the database.
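The single-character case is easiest to see at the boundary value. The sketch below uses a made-up discount rule (the function names and the threshold of 10 are hypothetical) to show how a > that should have been a >= only misbehaves for one exact input.

```python
def qualifies_for_discount_buggy(quantity, threshold=10):
    # Bug: an order of exactly `threshold` items is wrongly excluded.
    return quantity > threshold


def qualifies_for_discount(quantity, threshold=10):
    # Fix: the boundary value itself qualifies.
    return quantity >= threshold


# Check just below, exactly on, and just above the boundary.
boundary_cases = [9, 10, 11]
buggy = [qualifies_for_discount_buggy(q) for q in boundary_cases]
fixed = [qualifies_for_discount(q) for q in boundary_cases]
print(buggy, fixed)
```

The two versions agree on every input except the threshold itself, so a test that never uses that exact value will pass against the buggy code. Giving QA the boundary value is usually the quickest way to make the failure reproducible.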

Once we have a solution, depending on how complex it is, we usually talk through the problem and solution with the rest of the team and QA and analyse any risks, how we test and verify the fix and how quickly it needs to be deployed. We will then get the fix code reviewed by peers, checked in, merged to master and deployed to the test environment where our QA team will do the first of their tests. Once it passes the QA environment, the fix is rolled out to the stage environment, where it is tested again, and then to production, where it is tested yet again before notifying the customer.
As developers, we also look at how this error came to be missed. Do we lack unit test coverage, need more automation testing, or was something missed during regular QA testing? We would look at the unit tests and add some if we think it'll help prevent issues in the future, or add some automation tests if the problem was caused by doing things outside the typical workflow, for example, adding an item to a sales order before a location has been entered.
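A regression test for the sales order example might look something like the Python sketch below. The SalesOrder class, LocationRequiredError, and the rule that a location must be set first are all assumptions made up for illustration; the real domain model would have its own names and rules.

```python
import unittest


class LocationRequiredError(Exception):
    """Raised when an item is added before the order has a location."""


class SalesOrder:
    # Hypothetical, stripped-down model of a sales order.
    def __init__(self):
        self.location = None
        self.items = []

    def add_item(self, sku, quantity):
        # Guard against the out-of-sequence workflow instead of failing later.
        if self.location is None:
            raise LocationRequiredError("Set a location before adding items")
        self.items.append((sku, quantity))


class SalesOrderWorkflowTests(unittest.TestCase):
    def test_add_item_before_location_is_rejected(self):
        order = SalesOrder()
        with self.assertRaises(LocationRequiredError):
            order.add_item("SKU-1", 2)
        self.assertEqual(order.items, [])

    def test_add_item_after_location_succeeds(self):
        order = SalesOrder()
        order.location = "Warehouse A"
        order.add_item("SKU-1", 2)
        self.assertEqual(order.items, [("SKU-1", 2)])


def run_tests():
    # Load and run only the tests defined above, returning the result object.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SalesOrderWorkflowTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_tests() runs both cases and returns a result whose wasSuccessful() method reports whether they passed. The first test pins down the unusual workflow that caused the bug, and the second confirms the fix did not break the normal path.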
Once the fix is deployed to production and we are happy the issue is resolved, we close the ticket, remove the production database copy if we had one and take a well-earned breather!