PGTS Humble Blog
Thread: Mac OS X, Apple
|Open the pod bay doors, please HAL|
Apple Blue Screen Interminable Loop (BSIL)
Chronogical Blog Entries:
Date: Tue, 14 Jun 2011 00:42:10 +1000
Any computer jockey born before 1990 would have heard of the infamous Blue Screen Of Death (BSOD), a problem which was widespread during the early days of Windows NT and apparently continues even to this very day. Although Microsoft is famous for pioneering the BSOD and taking it to unparalleled levels of pulchritude at the turn of the century ... But these days, it seems Apple have developed their own Blue Screen of Infamy.
In fact the there is not only just one Apple Blue Screen, there are several. The example that I have personally encountered could be more correctly labelled the Blue Screen Interminable Loop (BSIL).
My first encounter occurred about five months ago ... A Mac Pro workstation would not start after it had been powered down. When the power button was switched on, the Apple "Chime" would sound, the white screen would show the Apple logo and then a blue screen would appear with a small turning circle of dots in it. In better times (a few hours before this first incident) this would have disappeared and the login menu window would have appeared. But on this occasion the blue screen would disappear then reappear ... And the whole process would repeat every seven seconds or so ... Interminably!
I was to discover later that the program that starts the login window (/usr/libexec/UserEventAgent) was causing this by crashing and then restarting ... And repeating it over and over ... It was a perpetual loop, from which there was no escape ... I have found reference to the problem in several places on the Internet, so it is a known "issue". Although there doesn't seem to be a comprehensive solution for it.
It seems that whenever a machine gets locked into one of these interminable loop death spirals it is still possible to use telnet or ssh (if you've enabled them) to connect to the stricken machine. (Note: Assuming of course that your machine is behind a firewall that blocks external IP addresses from using either of those services).
And so that's what I did ... I started a telnet session ... And had a look around for symptoms. I discovered some messages in /var/log/system.log which indicated that the problem was with the "login agent". I also discovered that I could fix the problem with a "reboot" command. For some reason if the machine was started by "power-up", it would fall into the login agent death spiral, but a "sudo reboot" command issued from a telnet session after the cold-start would break the spiral and it would recover. Although, as I was to discover, this behaviour was not consistent.
For a while, I just rebooted the workstation with the "reboot" command whenever I needed to. Eventually after one of Apple's software updates, the problem "went away".
It's always good when problems "just go away". And generally when they do I am happy to let them "just go" ... If only they'd stay away!
But when it came back it was much more serious ... It began when a user got "stuck" while trying to launch his browser. I had a look at the machine, could not launch anything ... And decided to "reboot" the machine ... But it wouldn't start right away ... I rebooted again ... (Note to self: Next time connect to the machine with telnet and try to figure out what is going on!)
Eventually the machine started up ... And back came the dreaded BSIL (Blue Screen Interminable Loop) ... And this time, the "reboot" command didn't get me out of trouble!
Oh Dear! -- Time to RTFM! (Read the Fabulous Manpage)
After connecting to the crippled machine with telnet, I inspected the log /var/log/system.log ... It looked as though, the user had tried to send a print to an old CUPS destination and CUPS had got itself in a tangle. And then with my great heavy thumping fingers I had intervened and rebooted it in the middle of something complicated ... Somehow the launchd daemon had got itself in a knot. Towards the end of the logfile I discovered the following scary message:
Jun 12 18:46:57 pgts07 com.apple.launchd (com.apple.UserEventAgent-LoginWindow): Job appears to have crashed: Segmentation fault
After this message there would be another messages about "Throttling respawn" ... The job would be restarted in seven seconds ... And it would crash again ... And so it went ... An interminable death spiral. There was a crash log. But it was singularly unhelpful. (Note: A segmentation fault usually means that a program, written in C, has been given a "NULL" pointer ... And unless you were the developer who wrote the program, it doesn't tell you much about what actually went wrong).
So as a result of these difficulties it seemed that the BSIL problem would occur whenever launchd tried starting the login window.
I tried experimenting with the launchctl command and inspecting the settings of the Login Window. This is controlled by the launch agent in /System/Library/LaunchAgents/com.apple.UserEventAgent-LoginWindow.plist. I read the man pages, examined settings with launchctl, read settings with the defaults command, dug into arcane and inscrutable logfiles and did some searching on the Internet ... And still I was none the wiser ...
So, grasping at straws, I decided to try running software update ... Since it got me out of trouble last time. But how to do it with only a telnet login? Surely there must be a "command" to carry out software updates? It turns out there is ... And it is called (pedagogically) softwareupdate (read about it in the man pages!). I tried running the following command line:
This informed me that there was one recommended software update (iTunesX-10.3.1). I then tried to install it with the command:
softwareupdate -i -r
But the update froze ... Ominously with another segmentation fault.
Note: This was the first time I had ever run softwareupdate from the command line.
Eventually I discovered that I could get out of the BSIL spiral by using "Safe Boot Mode". This is activated by holding down the "Shift" key when the machine is powered on. It is much slower, but in this case, seemed to get things back to the normal GUI screen. This raises the question of whether my earlier problem could have been solved by using "Safe Boot Mode"?
But then I discovered something even more peculiar.
After logging in and starting the GUI version of software update ... The machine almost froze again ... Eventually it returned with the message that there was a "network problem and I should try again later". I tried running the Apple diagnostic tool which returned a diagnoses that that I should "restart the device between the computer and the Internet" (the firewall) ... It seemed that the software update was not getting through my firewall ... (??)
This is where it gets rather strange! A couple of years ago I wrote "sinbin" script which runs on the firewall. It monitors client hosts for any signs of impolite behaviour (especially events which might trigger repeated malformed DNS queries and/or port scanning). Offenders are black-banned with IPTABLES ... And the IPTABLES rules on the modem (which uses Linux firmware) are simultaneously updated with the same black list. I wrote this script after an unfortunate incident involving an attack by Microsoft zombies.
I tried to discover where (on the Internet) the software updates were coming from. This proved very difficult. There is not much information about it. Eventually I found a file which specified the software updates preference in /Library/Preferences/com.apple.SoftwareUpdate.plist. It is stored as a binary object. In order to read it you can use the defaults command. According to the man page for defaults you need to specify the domain. However you can specify a full filename of a plist file (but don't include the .plist extension) ... For example:
defaults read /Library/Preferences/com.apple.SoftwareUpdate
Inside this file, I found the address of the Apple software update server. It would be one of several constantly changing Akamai (media server) addresses. I searched for them in my firewall logs to see if they had been clobbered by the sinbin script ... But I couldn't find any of them.
However, I did find mention of an address 126.96.36.199.
Around about the same time that I first attempted the "Software Update", a client at 188.8.131.52 commenced sending packets to ports 49317, 49324, 49325 and 49354. And it was black-banned. For a first offence the penalty is small, but the length of time in the "sinbin" doubles for each subsequent offence. According to the log, the same IP address came back for a second bite at the apple, and tried scanning ports in the same range ... And was banned again (for double the amount of time).
The times of these incidents matched the times I tried running software updates ... The first time using the softwareupdate command, the second time from the menu in the GUI.
Furthermore the address 184.108.40.206 seemed to be registered to Apple. It had a DNS entry identifying it as swscan.apple.com.
The sinbin script has a list of exemptions ... These are IP addresses that will not be black-banned for policy violations. In fact now that the zombie attacks seem to have ceased I may not require such a "junk-yard dog" approach to misbehaviour around the firewall ... For the time being I add individuals to the exemption list as required. I added the swscan address to the exemptions list and started the update again ... And it completed (rather slowly). I later removed swscan.apple.com from the ip exemptions list ... And since then updates have been working.
Update 2011-06-19: A subsequent attempt to run softwareudpate from the command line has worked flawlessly, and without any unusual traffic coming from the host swscan.apple.com.