I recently had the ‘pleasure’ of dealing with an unresponsive VM that had basically turned into a ZOMBIE!
It could not be restarted, edited or snapshotted and was, of course, unresponsive from an OS perspective (no RDP, ping or comms, just totally off the air).
Any attempted operation from the vSphere client returned “Another task is already in progress”
Similar results from remote PowerCLI;
Active tasks pane from the vSphere client;
I wanted to see if anything looked unusual, so I logged onto the ESXi host this guest was running on and tailed the vmware.log in the volume/folder where the guest's files live.
There was nothing that would explain what task was allegedly still running, and nothing really sinister to point to the root cause.
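If you are not sure which datastore folder the guest lives in, something like the sketch below may help. It assumes that `esxcli vm process list` (the same command used further down to find the World ID) prints an indented `Config File:` line for each running guest, and that vmware.log sits in its default location next to the .vmx file. The function name `vm_log_path` is made up for this example.

```shell
# Print the likely path of a running guest's vmware.log.
# Reads `esxcli vm process list` output on stdin; $1 is the guest's display name.
# Assumptions: each entry starts with the display name on an unindented line,
# includes an indented "Config File: /path/to/<vm>.vmx" line, and the log
# lives in the same folder as the .vmx file (the default).
vm_log_path() {
    awk -v name="$1" '
        $0 == name { found = 1; next }          # start of the entry we want
        found && /Config File:/ {
            sub(/^.*Config File: */, "")        # keep only the .vmx path
            sub(/\/[^\/]*$/, "/vmware.log")     # swap the file name for vmware.log
            print
            exit
        }
        /^[^ \t]/ { found = 0 }                 # a different entry has started
    '
}

# Usage on the host (SHELLY-PC is the guest from this post):
#   tail -n 50 "$(esxcli vm process list | vm_log_path SHELLY-PC)"
```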
Not cool, but it was clear we were going to have to get brutal and force kill it. In this case, we were fortunate that this guest was not particularly important.
The on-host esxcli tools are helpful in this type of scenario and basically give you three types of guest ‘kill’, in order of increasing brutality (this can also be done from the vMA):
1. soft (the most graceful and first option to try)
2. hard (immediate shutdown)
3. force (forced shutdown, power cut)
Begin by finding the World ID for the affected guest using the esxcli vm process list command, piped through grep so that only the line with the guest's name and the 2 lines after it (which include the World ID) are shown:
esxcli vm process list | grep -A 2 SHELLY-PC
The ‘-A’ option in grep specifies the number of lines to print after each matching line. You can also use ‘-B’ to print lines before the match, but that’s not necessary here.
So now we know the World ID for our zombie guest is 36218516, and we’ll first attempt a soft kill using:
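If you would rather capture just the number (say, to feed it into a script), a small awk filter can stand in for the grep. This is only a sketch: the function name `world_id_for` is made up, and it assumes each entry in the list starts with the guest's display name on an unindented line, followed by an indented `World ID:` line.

```shell
# Print only the World ID of the guest whose display name matches exactly.
# Reads `esxcli vm process list` output on stdin; $1 is the display name.
# Assumption: the name sits on an unindented line and the World ID on an
# indented "World ID: <number>" line within the same entry.
world_id_for() {
    awk -v name="$1" '
        $0 == name            { found = 1; next }   # start of the entry we want
        found && /World ID:/  { print $NF; exit }   # last field is the ID
        /^[^ \t]/             { found = 0 }         # a different entry has started
    '
}

# Usage on the host (SHELLY-PC is the guest from this post):
#   esxcli vm process list | world_id_for SHELLY-PC
```

Unlike a plain grep, the exact match on the name means a guest called SHELLY-PC-OLD would not be picked up by mistake.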
esxcli vm process kill --type=soft --world-id=36218516
This fails, so let’s use the bigger hammer, the hard type.
esxcli vm process kill --type=hard --world-id=36218516
This fails too, so it’s time to break out the big guns, the force type.
esxcli vm process kill --type=force --world-id=36218516
This works, and the guest is now showing as shut down.
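The soft, hard, force escalation above could be wrapped in a small script along these lines. Treat it as a rough sketch rather than a tested tool: the function names, the `ESXCLI` and `WAIT_SECS` variables and the 30-second pause are all made up for this example, and deciding that the guest is dead because its World ID has dropped out of `esxcli vm process list` is my assumption about how to check, not something esxcli reports for you.

```shell
# Escalate through the three kill types until the guest's world is gone.
# ESXCLI and WAIT_SECS are hypothetical knobs for this sketch; the 30 second
# default pause is an arbitrary guess, not a VMware recommendation.
ESXCLI="${ESXCLI:-esxcli}"
WAIT_SECS="${WAIT_SECS:-30}"

# True while the World ID still appears in the process list.
# Assumption: the list shows it as an indented "World ID: <number>" line.
vm_is_running() {
    "$ESXCLI" vm process list | grep -q "World ID: $1\$"
}

kill_vm_escalating() {
    world_id="$1"
    for kill_type in soft hard force; do
        echo "Trying --type=$kill_type on world $world_id"
        "$ESXCLI" vm process kill --type="$kill_type" --world-id="$world_id"
        sleep "$WAIT_SECS"            # give the host a moment before checking
        if ! vm_is_running "$world_id"; then
            echo "World $world_id stopped after $kill_type kill"
            return 0
        fi
    done
    echo "World $world_id is still running after force kill" >&2
    return 1
}

# Usage on the host:
#   kill_vm_escalating 36218516
```

The exit status of the kill command itself is deliberately ignored; the script only trusts the process list, which is the same check done by eye in the steps above.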
Power it back on to verify;
Looks a bit happier! But at this point, consider running filesystem-level check tools (chkdsk, fsck etc). Any databases should also be verified for consistency.
Unfortunately for this VM, its time was up and it was deleted anyway 🙁
While this will not show you the root cause, it allows you to get your VM back online quickly so the RCA investigation can begin in earnest.