Sooner or later, a system administrator has to restore a system after a complete crash or another serious failure, for example when the root password is lost or a filesystem is destroyed. The best advice in this situation is: don't panic. Anyone can make a mistake, even a very stupid one. Making mistakes is the best way to learn the art of system administration, although it is also a hard one.
Linux is a reasonably stable UNIX-like system. The author of this manual has had fewer problems with the system freezing under Linux than under proprietary versions of UNIX on various platforms. Another advantage of Linux is that many experts are willing to help by sharing their knowledge over various computer networks.
The first step towards solving any problem is to determine its cause. Before crying out for help, you should examine the problem as a whole and work out which parts of the system are still functioning. It sometimes turns out to be quite easy to fix the problem without anyone's help. Besides, this is the way to gain further knowledge and improve your skills.
The whole system rarely needs to be reinstalled “from scratch”. Many inexperienced users start reinstalling the entire system as soon as they delete some important system file. This approach is not recommended. Before resorting to such radical measures, you should investigate the problem and ask for help. In many cases the system can be restored using the special rescue mode, available when booting from the distribution CD, or using a maintenance diskette created during system installation.
If errors are reported during system startup (for instance, system services display the message FAILED), this may indicate an emergency situation in Linux. If this happens, the computer should not be used in its regular working mode (accessing WWW servers and network disks, establishing terminal sessions, etc.) until the errors are corrected or it is verified that the failed service does not affect the behavior of the other parts of the system.
First, you should carefully examine the messages displayed on the first virtual console during startup (switch to it with the Alt-F1 or Ctrl-Alt-F1 key combination). Press the key combination during startup, before the earliest error messages are replaced by later ones. The console contents can be scrolled with Shift-PageUp and Shift-PageDown. Kernel diagnostics, which are displayed before anything else, can also be viewed after startup using the dmesg(8) command.
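As an illustration of reviewing kernel diagnostics after startup, the following minimal Python sketch filters dmesg output for lines that may deserve attention. The keyword list and the function names suspicious_lines and read_kernel_log are assumptions made for this example, not part of any standard tool, and on some systems dmesg may require superuser rights.

```python
import re
import subprocess

# Illustrative keywords only; real kernel messages use many other wordings.
SUSPICIOUS = re.compile(r"error|fail|fsck|i/o|panic", re.IGNORECASE)


def suspicious_lines(log_text):
    """Return the lines of log_text that contain any of the keywords above."""
    return [line for line in log_text.splitlines() if SUSPICIOUS.search(line)]


def read_kernel_log():
    """Return the output of dmesg, or an empty string if it cannot be run."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        return ""
    return result.stdout
```

Calling suspicious_lines(read_kernel_log()) gives a short list to read first; it is a starting point and does not replace reading the full log.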
By no means all errors are fatal for the system. Often only one part of the system fails (for instance, the WWW server), and even that may be caused by a simple misconfiguration. It sometimes happens that the system was set up manually in the course of some experiments, so that its current state is only partially reflected in the configuration files. In this case, it is recommended to start the system and try to reconfigure the failing service. Remember to look at the twelfth virtual console and at the system logs.
Strange behavior of a Linux system may be caused by a critical filesystem filling up. The most common reason is that some system log is not rotated automatically, so it grows without limit. To check for this condition, look in /var/log for a large file that has no series of older, rotated copies next to it. Solution: cut the file down to an acceptable size (if nothing else, with a text editor) and set up log rotation for it using logrotate.
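The search for a large log without rotated copies can be sketched in Python as follows. This is a heuristic under stated assumptions: the function name unrotated_large_logs and the 100 MB threshold are invented for the example, rotated copies are assumed to be named after the original file with an added suffix (such as messages.1 or messages.1.gz), and subdirectories are not scanned.

```python
from pathlib import Path


def unrotated_large_logs(log_dir, min_bytes=100 * 1024 * 1024):
    """Return (path, size) pairs for large files that have no rotated copies.

    Assumption: a rotated copy of "name" is any file in the same directory
    whose name starts with "name." (for example name.1 or name.1.gz).
    """
    log_dir = Path(log_dir)
    names = [p.name for p in log_dir.iterdir() if p.is_file()]
    found = []
    for name in names:
        # Skip files that are themselves rotated copies of another log.
        if any(name.startswith(other + ".") for other in names if other != name):
            continue
        path = log_dir / name
        size = path.stat().st_size
        if size < min_bytes:
            continue
        # A large log with no "name.*" sibling was probably never rotated.
        if not any(other.startswith(name + ".") for other in names):
            found.append((path, size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

For example, unrotated_large_logs("/var/log") returns the candidates with the largest first. Reading some logs may require superuser rights, and the result still needs a human look before anything is truncated.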
Reason number two: the administrator does not read system mail, which is constantly sent to the superuser (root user) of any Linux system. This causes uncontrolled growth of the mailbox. This condition can be checked by searching for a large mailbox in /var/spool/mail (it does not always have to be the root mailbox, since such mail is usually forwarded to one of the real users of the system). Solution: read (or at least look through) the root mail and delete old messages.
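The mailbox check can be sketched in the same way. The function name mailboxes_by_size is hypothetical, and the sketch assumes one file per user in the spool directory, as with the traditional mbox layout.

```python
import os


def mailboxes_by_size(spool_dir="/var/spool/mail"):
    """Return (size_in_bytes, file_name) pairs for spool_dir, largest first."""
    entries = []
    with os.scandir(spool_dir) as it:
        for entry in it:
            # Only regular files are counted; symlinks are not followed.
            if entry.is_file(follow_symlinks=False):
                size = entry.stat(follow_symlinks=False).st_size
                entries.append((size, entry.name))
    return sorted(entries, reverse=True)
```

The first few entries of the result show whose mail is piling up, which, as noted above, is not necessarily the root mailbox.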
Reason number three: some naive user is given write access to the root filesystem, and uses it to write a lot of large files there. This happens when the administrator himself (or herself) is naive, or when the users' home directories (/home/) are not in a separate filesystem (and therefore, they are in the root filesystem). Solution: you can set disk quotas for users (for detailed information, see the mount and edquota manpages). While a naive user can do some reparable damage, no protection exists against a naive administrator.
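Whether the home directories share a filesystem with the root directory, and how full that filesystem is, can be checked with a short Python sketch. The function names same_filesystem and used_fraction are assumptions made for this example. Comparing device IDs is a simplification: bind mounts and some filesystems with subvolumes may report differently.

```python
import os
import shutil


def same_filesystem(path_a, path_b):
    """Return True if both paths report the same device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def used_fraction(path):
    """Return the used share (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a total size of zero.
        return 0.0
    return usage.used / usage.total
```

If same_filesystem("/", "/home") is true and used_fraction("/") is close to 1.0, user files are a likely reason for the root filesystem running out of space.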
If the system fails to start correctly, but neither the consoles nor the system logs contain any messages about hardware errors, this means that one of the vital parts of the system has failed. If the diagnostic messages contain the text RUN fsck MANUALLY, then possibly nothing fatal has happened. During system startup, Linux checks the state of the filesystems (for instance, after an emergency shutdown, or just to be on the safe side). If a filesystem has not been unmounted correctly, garbage may remain in it in the form of unclosed files that occupy disk space and have no names. Such garbage can be cleaned up automatically, and this is done by the fsck program.
If the computer's power cable is unplugged at an especially unfortunate moment, erroneous metadata, i.e. incorrect data about files, may be left in the filesystem. Correcting metadata is always somewhat risky: existing files can be lost, their sizes can change, and so on. For this reason the fsck command does not correct such metadata automatically during startup, but leaves the decision to the system administrator. He or she should run the fsck command, answer y (meaning “yes”) to all questions about deleting files or changing their sizes, and then examine the contents of the /lost+found/ directory in each filesystem (lost files should be found there). Remember that the cures offered by the fsck command may in some really bad cases lead to the death of the patient (i.e. the filesystem). It is therefore desirable, if possible, to mount the failing filesystem in read-only mode before doing anything else, and to make a backup copy of the data that can still be read.
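The order of operations described above can be written down as a small Python sketch that only builds the command lines and executes nothing. The function name recovery_plan and the example device, mount point and backup directory are hypothetical; the commands assume the standard mount, cp, umount and fsck utilities and a filesystem type that keeps recovered files in lost+found (such as ext2 or ext3).

```python
def recovery_plan(device, mount_point, backup_dir):
    """Return the recovery commands in order, as argument lists.

    Nothing is executed here; the caller decides what to run and when.
    """
    return [
        # Mount read-only so that nothing is written to the damaged filesystem.
        ["mount", "-o", "ro", device, mount_point],
        # Copy everything that can still be read.
        ["cp", "-a", mount_point + "/.", backup_dir],
        # The filesystem should not be mounted while fsck repairs it.
        ["umount", mount_point],
        # -y answers "yes" to the repair questions.
        ["fsck", "-y", device],
        # Mount again and look at what fsck has recovered.
        ["mount", device, mount_point],
        ["ls", "-l", mount_point + "/lost+found"],
    ]
```

Printing the plan, for example with recovery_plan("/dev/sdb1", "/mnt/rescue", "/backup"), gives a checklist to follow by hand. Running such commands unattended on a damaged disk is not advisable.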
If serious problems occur during startup, it is recommended to analyze the displayed messages carefully, draw the appropriate conclusions, and boot the system from another device. After that you can try to cure the system. The best way to do this is to boot from a bootable disk (the distribution CD, for example) in rescue mode. In this mode a minimal version of Linux is usually available, which does not use the hard drives (its filesystem resides in the computer's memory) but gives access to all the utilities needed to correct errors. In ALT Linux, the rescue system even tries to mount the working filesystems of the faulty computer, but never writes to them.
Apart from the bootable distribution CD, you may use a bootable floppy disk created during the installation of Linux. A floppy disk has a much smaller capacity, and its purpose is more modest: to store the kernel and the set of drivers required to start your computer properly. After booting from a floppy disk you may need to mount either the root filesystem (if it is functional) or the rescue system mentioned above.
If the console displays messages about hardware errors, you are facing a serious problem. If the errors are related to the hard drive, first check how the failing drive behaves in other computers. If it fails there as well, back up everything that can still be read as soon as possible and replace the drive. You should also consider why it could have failed. The most probable causes are overheating (this is especially true of high-speed drives) and electric power surges.
If disk errors occur on one computer only, the problem lies not in the drive itself but in its compatibility with the other hardware components, or in defects in those components. Most hardware problems cannot be fixed using Linux tools, but in some critical cases (for instance, when you need to access the system logs), the failsafe mode may be used. In failsafe mode, the system uses only those drivers and kernel settings that are necessary for it to function, while service functions and everything that merely improves performance are disabled. As a rule, the bootloader offers the failsafe mode as an alternative to the normal mode. It can also be entered by passing boot parameters (bootparam), for example noacpi and noapic.
Finally, if the computer is in an awful state but the system disk has survived, the system can be started easily enough on another computer. You may only need to redefine some drivers (if the hardware configuration of the new computer differs from that of the previous one), and, if the hard drive gets a different name in the new environment, /etc/fstab and the bootloader configuration also need to be corrected. In most cases, however, you will get a working system right away, although the network and/or graphical subsystems may remain unconfigured.
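To see which /etc/fstab entries need correcting after the disk has been moved, the file can be checked against the devices that actually exist on the new computer. The Python sketch below is a minimal illustration: the function names parse_fstab and missing_devices are invented for this example, and only entries given as /dev/ paths are checked (entries that use labels or UUIDs are left alone).

```python
import os


def parse_fstab(text):
    """Return (device, mount_point) pairs from fstab-formatted text."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1]))
    return entries


def missing_devices(entries):
    """Return the entries whose /dev/ path does not exist on this computer."""
    return [
        (device, mount_point)
        for device, mount_point in entries
        if device.startswith("/dev/") and not os.path.exists(device)
    ]
```

A typical use is missing_devices(parse_fstab(open("/etc/fstab").read())): every pair it returns names a device that has to be renamed in /etc/fstab, and possibly in the bootloader configuration as well.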