Community Services Infrastructure Standards

Legal Notice

Copyright (c) 2009 by Fedora Project. This material may be distributed only subject to the terms and conditions set forth in the CC BY-SA 3.0 Unported license, available at http://creativecommons.org/licenses/by-sa/3.0/.

Community Services Infrastructure Standards
1. CSI Introduction
1.1. Introduction
1.2. What to do
1.3. External Sources and References
2. Language and Terms
2.1. Introduction
2.2. External Sources and References
Host Lifecycle
1. Host Lifecycle Standard
1.1. Introduction
1.2. Host Deployment
1.2.1. Physical Deployments
1.2.1.1. Physical Inventory
1.2.1.2. Firmware Updates
1.2.1.3. Physical Racking
1.2.1.4. Post Racking Checklist
1.2.2. Host Installation
1.2.2.1. Host Preinstallation
1.2.2.2. Host Kickstart
1.2.2.3. Post Kickstart
1.2.2.4. Host Certification
1.2.2.5. Host Re-certification
1.3. Maintenance
1.3.1. Monthly Updates
1.3.2. Package Integrity
1.3.3. Security Updates
1.4. Host Decommission
1.5. External Sources and References
2. Host Lifecycle Rationale
2.1. Introduction
2.2. Target
2.3. Details of the Standard
2.3.1. Deployment
2.3.2. Maintenance
2.3.3. Decommission
2.3.4. Closing

Community Services Infrastructure Standards



Chapter 1. CSI Introduction

1.1. Introduction

The Community Services Infrastructure (CSI) standards are a group of documents designed to be implemented by the largest set of technology users in the world. They focus entirely on Red Hat compatible operating systems (Red Hat Enterprise Linux, Fedora, and CentOS, among others). Unlike most documentation, CSI aims to be standards-based. Often checklists can be printed and checked off one by one to ensure compliance. There are three typical target audiences for CSI: system administrators; end users; and management, legal, and other oversight entities.
CSI is developed using open source methods. Any changes or suggestions should be directed to the CSI mailing list. Any questions regarding how or why to implement these changes should be directed to fedora-infrastructure-list@redhat.com.

1.2. What to do

You have been directed to read this document because it contains instructions or procedures you are expected to follow. Read through the items below and make note of any discrepancies or confusing items. It is important that you read, understand, and follow each step carefully. Do not make assumptions about whether a procedure has been completed. Check with the appropriate contacts or responsible parties to ensure proper compliance. It is recommended that you print a copy of these checklists and mark off each step as it is completed. In the event that a response to an item indicates non-compliance, make note of it and take appropriate steps to return to compliance.

1.3. External Sources and References

Chapter 2. Language and Terms

2.1. Introduction

In order to avoid ambiguity, this document has been put together to explicitly define common words found throughout CSI.

2.2. External Sources and References

Host Lifecycle



Chapter 1. Host Lifecycle Standard

Mike McGrath

Fedora Infrastructure Lead
Fedora Project
The host lifecycle standard focuses specifically on host management, including provisioning, updates, security, and finally decommissioning. For specific services, please use the service lifecycle standard. The target audience for this document is system administrators, system engineers and system architects.

1.1. Introduction

The host lifecycle standard contains steps for performing specific actions on hosts. This includes initially setting hosts up, keeping them up to date and removing hosts when they are no longer needed. This document is divided into individual sections. It is good to read through the entire document once so admins can understand each task and how it fits into the greater picture. Some sections contain checklists that are to be followed every time that task is done. It might be best to print out several copies of these checklists and check off each item as the task is completed.
Each section below is generally written with the assumption that an actual human being will be completing those tasks. This does not scale to large numbers of systems, though. Any section that can be automated via a script or some other process may be automated. It is important, however, to keep those scripts in compliance with the standard.

1.2. Host Deployment

The following steps should be followed whenever deploying a new system. This includes systems that are being re-purposed or rebuilt.

1.2.1. Physical Deployments

When deploying a new physical host, the following requirements must be met. These are in addition to the normal host deployment steps below.

Leased or non-self hosted hardware.

Section 1.2.1 and its subsections are specifically targeted at environments where a machine is physically managed by The Fedora Project. If the machine is managed by a third party, some of these steps are likely delegated to that provider. In that event, exclude any steps that don't apply, but be sure to verify things like the manifest comparison below.

1.2.1.1. Physical Inventory

Each of the below items is to be inventoried.
Inventory Items
Complete | Part | Additional
[ ] | Shipping Integrity | Ensure packaging that the server was shipped in is not damaged and that everything seems sane.
[ ] | Chassis | Include serial number, part numbers, manufacturer, etc.
[ ] | Drive Inventory | Document serial number, part number, size, speed, make and model of each drive in the system.
[ ] | Manifest Comparison | Verify the manifest matches what the host actually contains and what was actually ordered. This includes number and speed of processors, RAM, disks, etc.

1.2.1.2. Firmware Updates

Ensure all firmware in the system is up to date prior to use.
Firmware Updates
Complete | Part | Description
[ ] | BIOS | Ensure the BIOS is at the latest available version.
[ ] | Drives | Ensure the firmware of each physical drive is up to date.
[ ] | Remote Access | Ensure any remote access modules are updated, including BMC, RSA-II, DRAC, iLO, etc.

1.2.1.3. Physical Racking

While racking the server, ensure each requirement is met.
Inventory Items
Complete | Step | Requirement | Description
[ ] | Rail Kit | Must | Install the rail kit in the rack and on the server.
[ ] | Mounting | Must | Mount the server in the newly installed rail kit.
[ ] | Vendor Cable Management | Should | Any vendor provided cable management should be installed with the server (usually as an arm on the back of the server). It is acceptable to exclude this if your organization's cable management policies are in conflict with these devices or if they, for whatever reason, are physically unable to fit in the rack.
[ ] | Network Configuration | Must | Using cable management policies, run cable from your switch or drop points to the ports on the server. Ensure the cable has enough length to remain connected while the server's rail kit is in the fully extended position. Cables should be no more than 3 inches longer than this required length. Follow a common cable coloring standard (not presently covered by CSI).
[ ] | KVM | May | Those wishing to use a KVM (keyboard, video and mouse) may hook it up. Use the provided cable management and ensure the cables are no more than 3 inches longer than required to have the server in its fully extended position.
[ ] | Serial/Remote | Should | In addition to the remote access modules, servers should have an additional remote management method. Serial consoles are popular for this. Ensure the cabling required for this uses the provided cable management and is no more than 3 inches longer than required to have the server in its fully extended position.
[ ] | Power | Must | Once the host is installed, ensure power is properly supplied via the provided cable management. Cables should be no more than 3 inches longer than required for the server to be in its fully extended position.
[ ] | POST Check | Must | Once fully installed, power the machine on and verify it will POST without any errors.

1.2.1.4. Post Racking Checklist

Once the server is fully racked and installed, the following checklist must be completed.
Post Install Checklist
Complete | Task | Description
[ ] | PDU Port Config | Ensure each PDU port is inventoried and configured. If the PDU is manageable, ensure the ports are properly labeled.
[ ] | Serial Console | If using a serial console, ensure it is working (check the BIOS baud rate and the Cyclades baud rate).

1.2.2. Host Installation

Host installation covers all aspects of deploying an operating system and preparing it for use. It is to be combined with the services development lifecycle document before a host is placed into production. There is some overlap, though this host deployment section attempts to focus on the dependencies of the service.

1.2.2.1. Host Preinstallation

The idea here is that running a service such as a website has certain prerequisites, an operating system for example. This document describes the steps required before a host is ready to run those services.
Prekickstart Checklist
Complete | Part | Additional
[ ] | RAID Configuration | If applicable, set up software RAID on the host.
[ ] | Storage Preparation | Ensure proper storage has been allocated for the host.
[ ] | Memory Check | Ensure proper memory has been allocated for this host.
[ ] | CPU Check | Ensure processors are present and properly allocated.

1.2.2.2. Host Kickstart

The idea here is that running a service such as a website has certain prerequisites, an operating system for example. This document describes the steps required before a host is ready to run those services. There is a preference for using Kickstart scripts for all installations, though manual installation is acceptable when an automated installation is inappropriate.
Kickstart Procedure
Complete | Part | Additional
[ ] | Network Configuration | Determine IP address (if not using DHCP), network, gateway, resolver.
[ ] | Source Verification | Verify the new host has access to the Kickstart scripts and installation media.
[ ] | Method Passing | The installation method must be passed to the kernel using the method command. It should not be included in the Kickstart file. [1]
# CSI Docs -  http://fedorahosted.org/csi/
# Kickstart Template
# 
# This copyrighted material is made available to anyone wishing to use, modify,
# copy, or redistribute it subject to the terms and conditions of the GNU
# General Public License v.2.  This program is distributed in the hope that it
# will be useful, but WITHOUT ANY WARRANTY expressed or implied, including the
# implied warranties of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# See the GNU General Public License for more details.  You should have
# received a copy of the GNU General Public License along with this program;
# if not, write to the Free Software Foundation, Inc., 51 Franklin Street,
# Fifth Floor, Boston, MA 02110-1301, USA.

install
vnc --password VNC_PASSWORD
key --skip
lang en_US.UTF-8
rootpw Your_Temporary_Password_Here
firewall --disabled
authconfig --enableshadow --enablemd5
timezone --utc UTC

# Left commented for data protection
# Normal fresh installs should have this section uncommented for a fully automated install
#clearpart --linux --drives=xvda --initlabel
#part /boot --fstype ext3 --size=256 --ondisk=xvda
#part swap --fstype swap --size=2048 --ondisk=xvda
#part pv.01 --size=100 --grow --ondisk=xvda
#volgroup GuestVolGroup00 pv.01
#logvol / --fstype ext3 --size=100 --vgname=GuestVolGroup00 --name=root --grow
#bootloader
# end disk bits

keyboard us
selinux --permissive
mouse none
skipx
reboot

%packages --resolvedeps --nobase


%post
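
One way to keep the install source out of the Kickstart file is to assemble the installer's boot arguments separately. The following is a minimal sketch, not part of CSI itself: the function name build_boot_args and the example URLs are hypothetical, and it assumes the installer accepts the ks= option to locate the Kickstart file alongside the method= option described in [1].

```shell
# build_boot_args KS_URL METHOD_URL
# Prints the kernel arguments for an installer boot. The Kickstart file
# location and the install source are passed separately, so the same
# Kickstart file can be reused with a different install source.
# (Hypothetical helper, for illustration only.)
build_boot_args() {
    ks_url=$1
    method_url=$2
    if [ -z "$ks_url" ] || [ -z "$method_url" ]; then
        echo "usage: build_boot_args KS_URL METHOD_URL" >&2
        return 1
    fi
    printf 'ks=%s method=%s\n' "$ks_url" "$method_url"
}
```

The resulting string could then be appended to the kernel line of a PXE configuration or handed to virt-install through its --extra-args option.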

1.2.2.3. Post Kickstart

After the host has finished kickstarting, ensure the following steps check out.
Post Kickstart Checklist
Complete | Part | Additional
[ ] | Root Password | Change the temporary root password to a standard one. This step can be skipped if root passwords are managed another way.
[ ] | Memory Check | Verify the host has all the memory it should have using the free command.
[ ] | CPU Check | Verify the host has all the CPUs it should have using cat /proc/cpuinfo.
[ ] | Disk Check | Verify the host has all of the disk space it should have using the fdisk, df and lvs commands.
[ ] | OS Check | Verify the host has the correct operating system installed.
[ ] | Resolver Configuration | Verify /etc/resolv.conf has proper search domains and resolver configuration. At least two resolvers must be present.
[ ] | Puppet Configuration | Verify the puppet master has a node configuration file for this host. If it does not, create one using common puppet standards. [2]
[ ] | Puppet Certificate Request | Run puppetd -t to generate a new certificate request.
[ ] | Sign Request | Sign the new puppet request on the puppet server using puppet-ca --sign fqdn.com. Users who do not have access to sign this request should contact
[ ] | Puppet Initialization | Once the puppet certificate request has been signed, run puppetd -t. This should not take more than one run. Realistically, though, it may take multiple runs (see "Puppet Fixing" below).
[ ] | Puppet Fixing | Fix or report any errors or warnings that come up during puppet initialization. Puppet must run without any errors before moving on.
[ ] | Reboot Verification | Once puppet completes without error, do a final reboot and verify all services and functions are running properly.
[ ] | Backups | If this host stores information that needs to be backed up, contact and open a ticket at https://fedorahosted.org/fedora-infrastructure. Include all information about paths and why it needs to be backed up.
[ ] | Monitoring | Ensure this host is properly monitored. Contact and open a ticket at https://fedorahosted.org/fedora-infrastructure. New host types may require additional information as to what needs to be monitored.
[ ] | Host Certification | Once the host has been built and is ready, contact a host certifier at . If you are a host certifier and have built this host, please select another member of the certification team.
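
Some of the checks above lend themselves to scripting. The following is a minimal sketch of the CPU and resolver checks only. The function names are hypothetical, and it assumes /proc/cpuinfo lists one "processor" line per logical CPU (true on x86, not on every architecture) and that resolvers appear as "nameserver" lines in /etc/resolv.conf. Each function accepts an alternate file path so it can be tried against sample data.

```shell
# count_cpus [CPUINFO_FILE]
# Prints the number of logical processors listed in the file.
count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}" || true
}

# count_resolvers [RESOLV_CONF]
# Prints the number of nameserver entries in the file.
count_resolvers() {
    grep -c '^nameserver[[:space:]]' "${1:-/etc/resolv.conf}" || true
}

# check_resolvers [RESOLV_CONF]
# Succeeds only if at least two resolvers are configured, as the
# checklist requires.
check_resolvers() {
    n=$(count_resolvers "$1")
    n=${n:-0}
    if [ "$n" -ge 2 ]; then
        echo "OK: $n resolvers configured"
    else
        echo "FAIL: $n resolver(s) configured, at least 2 required"
        return 1
    fi
}
```

The expected CPU count is not encoded here; compare the output of count_cpus against what was ordered or allocated for the host.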

1.2.2.4. Host Certification

Host certifiers must double check every step in the "Post Kickstart" section for correctness. This can be an automated process.

Automation

This step can (and should) be automated. This is especially true for large install bases. Just make sure the scripts are maintained and kept up to date.
Post Kickstart Checklist
Signoff | Part | Additional
[ ] | Host Verified | Every part of the Post Kickstart checklist has been completed correctly.
[ ] | Sign Off | Create /root/certified with the date and your name in it. Ensure your username (not root) is listed as the certifier. For example: echo "$(date) - $(whoami)" | sudo tee -a /root/certified > /dev/null
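
The sign-off can also be wrapped in a small script that refuses to record root as the certifier. This is only a sketch: the function name certify_host is hypothetical, and it assumes the script is run through sudo, which sets the SUDO_USER variable to the invoking user's name.

```shell
# certify_host [CERT_FILE]
# Appends "<date> - <certifier>" to the certification file
# (default: /root/certified). Hypothetical helper, for illustration.
certify_host() {
    cert_file=${1:-/root/certified}
    # Under sudo, SUDO_USER holds the name of the user who ran sudo;
    # otherwise fall back to the current user name.
    certifier=${SUDO_USER:-$(id -un)}
    if [ "$certifier" = "root" ]; then
        echo "refusing to certify as root; run as your own user via sudo" >&2
        return 1
    fi
    printf '%s - %s\n' "$(date)" "$certifier" >> "$cert_file"
}
```

Saved as a script that ends by calling certify_host, it would be run as "sudo ./certify.sh" by the certifier from their own account.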

1.2.2.5. Host Re-certification

At times a host may require re-certification, for example if it is behaving improperly or if massive changes have happened to it.
Post Kickstart Checklist
Signoff | Part | Additional
[ ] | Host Verified | Every part of the Post Kickstart checklist has been completed correctly.
[ ] | Sign Off | Create /root/certified with the date and your name in it. Ensure your username (not root) is listed as the certifier. For example: echo "$(date) - $(whoami)" | sudo tee -a /root/certified > /dev/null

1.3. Maintenance

This section describes time-based events that trigger regular maintenance on your hosts. The triggers for each event are defined as well as the corresponding action.

1.3.1. Monthly Updates

While upstream providers (like Red Hat and Fedora) have their own update schedules, it's generally considered best practice to have one independent of them. With the exception of ongoing bugs reported by The Fedora Project and security updates, follow the table below. On the first of every month, begin the task list below. It is recommended this be automated with your current tools.
Monthly Updates
Complete | Step | Requirement | Description
[ ] | Check Notification | May | Test to see if servers require an update via cron with this command: /usr/bin/yum -d0 -e0 check-update
[ ] | Pre-Production Test | Must | Test updates in pre-production environments prior to updating production.
[ ] | Kernel Reboot | Must | If the kernel has been updated, you must reboot the host to bring all updates into effect. [3]
[ ] | Errata Reboot | Must | If an erratum or update suggests a reboot, you must do the reboot.
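
When automating the notification check, the exit status of yum check-update carries the result: 0 means no updates are pending, 100 means updates are available, and any other value indicates an error. A minimal sketch follows; the function names are hypothetical.

```shell
# describe_check_update STATUS
# Translates the exit status of "yum check-update" into a message.
# Returns non-zero for anything other than the two expected codes.
describe_check_update() {
    case "$1" in
        0)   echo "up to date" ;;
        100) echo "updates available" ;;
        *)   echo "check failed (exit status $1)"; return 1 ;;
    esac
}

# check_updates
# Runs the command from the table above and reports the result.
check_updates() {
    status=0
    /usr/bin/yum -d0 -e0 check-update >/dev/null 2>&1 || status=$?
    describe_check_update "$status"
}
```

Calling check_updates from a monthly cron job is one way to produce the notification; what to do with the message is left to your current tools.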

1.3.2. Package Integrity

This section describes a few commands that should be run regularly after updates (or via cron) to verify your system is in a valid and good state.
Package Integrity Check
Complete | Step | Requirement | Description
[ ] | Dependency Verification | Should | Verify all package dependencies have been met using this command: /bin/rpm -Va --nofiles --nomd5. Fix any errors printed by that command. [4]
[ ] | Source Verification | Should | Verify all packages installed on this system are from an upstream repository and still valid: /usr/bin/package-cleanup --orphans [5]
[ ] | Service Restarts | Should | Make sure older versions of applications are no longer running in memory and have all been restarted: python needs-restarting.py [6]
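
Because cron mails any output a job produces, a check that stays silent when all is well doubles as an alert. The wrapper below is a sketch of that idea; the function name report_if_output and the label text are hypothetical.

```shell
# report_if_output LABEL COMMAND [ARGS...]
# Runs COMMAND and, only if it printed anything, prints that output
# under LABEL and returns non-zero. Stays silent otherwise.
report_if_output() {
    label=$1
    shift
    output=$("$@" 2>&1) || true
    if [ -n "$output" ]; then
        printf '== %s ==\n%s\n' "$label" "$output"
        return 1
    fi
}

# Example cron usage (left commented out):
# report_if_output "unmet dependencies" /bin/rpm -Va --nofiles --nomd5
```

The other commands in the table may print informational lines even when nothing is wrong, so check their output on your own systems before wrapping them this way.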

1.3.3. Security Updates

As soon as a security update is released, it should be installed on all hosts that would need it.
Security Updates
Complete | Step | Requirement | Description
[ ] | Check Notification | May | Test to see if servers require an update via cron with this command: yum list-security --security

1.4. Host Decommission

When removing a host, ensure the following steps are followed.
Decommission
Complete | Step | Requirement | Description
[ ] | Out Of Use | Must | Ensure the host and all services on the host are no longer required for any purpose. This can be accomplished by shutting the host down for a period of several days or just turning all services off.
[ ] | Backups | Must | Ensure any information that might be needed has been backed up. [7]
[ ] | Backup Cleanup | Should | If a backup was created, open a ticket specifying the location of the backup and the date until which it should be kept. Once that date has passed, remove the backup and close the ticket.
[ ] | Shutdown | Must | Shut down the host.
[ ] | Drive Zeroing | Should | Once everything is done, wipe the drives by overwriting them using dd if=/dev/urandom of=/dev/your_disk bs=4096 (this writes random data rather than zeros). With virtual machines this can be done directly on the LVM image. For physical hosts you will likely need to boot into a rescue image.
[ ] | VM Config | Must | If this host is a virtual machine, remove the virtual configuration using virsh undefine hostname. This step is not needed for physical hosts.
[ ] | Unrack | Must | Unrack the server. Keep the rail kits with the server unless there is an immediate need or re-use for them. Bundle each cable individually and place them in their proper bin.
1.5. External Sources and References



[1] The method command allows you to specify the install source. For example: method=http://download.fedora.redhat.com/pub/fedora/linux/development/x86_64/os/. Leaving this information out of the Kickstart files makes them easier to reuse. The method parameter can also easily be placed into tftp commands and virt-install commands.

[2] The CSI puppet standards are still being written at this time.

[3] Doing regular reboots is a good thing. Rebooting as often as you get a new kernel is sufficient.

[4] This may also be part of a regularly run cron job instead of running it after every update.

[5] This command will go through all rpms on your host and make sure they are available from a valid yum repository. This is important for two reasons. First, it will alert admins to any rpms that have been installed manually via rpm -i. Second, it will alert admins about packages that may now be obsolete or otherwise unsupported and possibly out of date and insecure.

[6] This command goes through all programs running in memory and looks to see whether they still hold open files that have since been replaced or deleted. This is important for updates that may not have restarted their applications. For example, if a security update comes out for apache but no one restarts httpd, it's possible that even though the package has been updated, the older insecure version is still running in memory and serving your customers. Newer versions of yum-utils may have this installed already. If not, you can get it from here: http://yum.baseurl.org/gitweb?p=yum-utils.git;a=blob;f=needs-restarting.py

[7] If using virtual hosts with LVM, this can be done by renaming the logical volume to something like "hostname.bak".

Chapter 2. Host Lifecycle Rationale

Mike McGrath

Fedora Infrastructure Lead
Fedora Project

2.1. Introduction

The host lifecycle standard aims to provide a complete policy on host management. It includes deployment, maintenance and removal of a host. Much of the standard is written in a simple checklist form and is divided into sections. Each section can be taken individually depending on the task at hand and can easily be referenced when coordinating efforts with others or ensuring each step has been completed successfully.

2.2. Target

The host lifecycle rationale chapter is intended to be read by managers, architects and other decision makers who are looking to adopt the host lifecycle standard. The standard itself is written for system administrators, engineers and architects. Some of the topics in the standard are reasonably advanced, but sample commands are provided for reference.

2.3. Details of the Standard

The standard is written in three sections: deployment, maintenance and decommission. Generally the deployment and decommission sections are only going to be used once during a host's life. The maintenance section, however, is written to be cyclical and contains tasks that are to be regularly performed. Most organizations rely on specific services (like email or calendar). These services all run on hosts. The host lifecycle focuses on keeping a clean and stable environment on which to run the critical services, but it does not cover the services themselves. The services lifecycle (not yet written) focuses on proper deployment and preparation of specific services.

2.3.1. Deployment

The first sections of deployment cover how to purchase, rack and cable a physical host. This involves verifying the shipping manifest to ensure what was shipped is what you got, as well as matching the individual specs of a machine to what was ordered. Admins are then directed to document serial numbers, model numbers, sizes, etc. of a host prior to its actual use. Remotely managed machines can make it difficult to get this information without someone on site.
Once a host is racked and a power-on self-test (POST) has completed successfully, the admin is directed to install an operating system on the host. The standard is aware of virtualized and non-virtualized environments and treats installing a physical machine the same as a virtual one. The preinstallation checklist provides a quick sanity check of what is about to be done. The kickstart section provides the basic steps to actually install the operating system and includes a sample kickstart file. Wherever possible, automation is preferred to manual steps.
Once an operating system has been started, the post kickstart checklist has a basic task list to verify the host is ready to be used. Again, focus on automation is important. This standard calls for a very simple kickstart script that then uses puppet for configuration management. Once the installer has verified everything is working, they call on a certifier to certify the install is as it should be. This is a quality check in the chain, but it also ensures that multiple people are involved with the installation process, keeping more people in the loop than just letting one person go off and do as they wish.

2.3.2. Maintenance

After a host has been installed it needs to be kept up to date. This standard lists a monthly update cycle with regular package integrity checks. These checks make sure the packages installed on the system come from a trusted source and alert administrators to obsolete or unsupported packages so they can be handled. There are also checks in place to make sure services are restarted after an update to protect against potential service violations. Running many of these checks from cron is encouraged so the admins don't have to do additional work. Once these checks are in place they will alert the admin on their own when actions need to take place.

2.3.3. Decommission

Once a host has reached the end of its life, it is time to decommission it. This section of the standard mostly focuses on tracking old hardware until it can be recycled or disposed of. It also has steps to ensure whatever data was on the host is cleanly wiped so anyone who happens upon the old hardware will not be able to get to the data. Additional steps for backups and verification help ensure old hosts are no longer being used.

2.3.4. Closing

If you think the host lifecycle standard is right for your organization, please talk to your senior technical staff about its implementation. The CSI standards are openly developed, and challenges and questions are openly discussed. The aim is to ensure common best practices are followed. The host lifecycle can help ensure your staff isn't re-inventing a wheel that others have already invented and are using. It also means that when questions or other technical limitations are met, administrators have a pool of resources to turn to that is external to your organization, thereby increasing the knowledge available to your organization and lowering costs.