Notes for September 19, 2011 meeting

Igal Koshevoy

Sep 20, 2011, 1:01:54 PM

to pdxdevops

We had a great meeting last night with lots of discussion. Below is a
list of some topics, products, and tools we discussed.

Also, pdxdevops is now a year old (actually, 1 year and 2 months). Yay!

Server deployment and provisioning tools:
* Cobbler, a Linux installation server -- https://fedorahosted.org/cobbler/
* Foreman, a node provisioner plus a Puppet external node classifier
and reporting tool -- http://www.theforeman.org/
* Katello, a new alternative to the Red Hat Satellite Server
management system, combining Foreman for provisioning/reporting, Pulp
for repository management, and Candlepin for subscription management --
http://www.katello.org/
* Clonezilla, a disk image-based cloning system -- http://clonezilla.org/

Hypervisors and virtualization containers:
* KVM, a full virtualization hypervisor for Linux that was generally
liked -- http://www.linux-kvm.org/page/Main_Page
* LXC, a lightweight virtualization system that runs user-space Linux
containers, described as "chroot on steroids". Many are excited about
it -- http://lxc.sourceforge.net/
* VMware ESXi, a bare-metal full virtualization hypervisor, which has
great features, but the latest version's licensing changes can make it
cost 8x more, making it prohibitively expensive --
http://www.vmware.com/products/vsphere/esxi-and-esx/overview.html
* Xen, still a major player, but many expressed frustration with how
difficult it is to get a fast, stable version running --
http://xen.org/

Private clouds, grids, and batch processing systems:
* Condor, a full-featured batch processing system started in 1988 that
inspired much of the later cloud movement --
http://www.cs.wisc.edu/condor/
* Eucalyptus, an EC2-like stack that works well now --
http://www.eucalyptus.com/
* OpenStack, a new computing and storage stack that shows great
promise but is rapidly evolving and has stability issues --
http://www.openstack.org/
* CloudStack, a feature-rich stack with many unique features, but a
rapidly changing API -- http://cloudstack.org/
* VMware vCloud, a closed-source offering based on VMware's proven
products -- http://www.vmware.com/products/vcloud/overview.html
* OpenNebula, a stable platform funded by various EU international
scientific projects -- http://opennebula.org/
* Abiquo, a new stack with broad hypervisor and VMware support --
http://www.abiquo.com
* VMware Cloud Foundry, a new open source platform-as-a-service for
running virtualized applications that shows great promise --
http://www.cloudfoundry.com/
* CycleCloud, a stack for managing Condor more easily --
http://www.cyclecomputing.com/cyclecloud/overview
* Globus, a toolkit for building grids -- http://www.globus.org/toolkit/

Sysadmin tools:
* Task Spooler ("ts"), a simple, no-configuration batch processor that
lets you easily queue and manage jobs from the command-line on a
single system, like a better "at" -- http://goo.gl/mV2vI
* GNU Screen ("screen"), an awesome text-based window manager --
http://www.gnu.org/s/screen/
* Byobu ("byobu"), an improved GNU Screen with detailed status
information and sensible keyboard shortcuts, used by default on the
latest Ubuntu -- https://launchpad.net/byobu && http://goo.gl/GC9S
* Ruby Version Manager ("rvm"), a way to install and switch between
multiple versions of Ruby --
http://beginrescueend.com/rvm/install/
* Midnight Commander ("mc"), a text-based file manager inspired by
Norton Commander. We discussed how to run "mc" in a terminal session,
which is challenging because it relies on keys like "F1" that some
terminal emulators bind for their own use. The suggestion was to either
run it within a minimalistic terminal emulator like "xterm" that
doesn't bind these keys, or rebind all the interfering keys in the
terminal emulator (e.g. GNOME Terminal makes this relatively easy) --
https://www.midnight-commander.org/
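
For anyone who hasn't used Task Spooler, the core idea (queue up
commands, run them one at a time, keep their exit codes) can be
sketched in a few lines of Python. This is only an illustration of the
concept, not how "ts" itself is implemented; the SerialJobQueue class
and its method names are made up for this sketch.

```python
import subprocess
import sys
from collections import deque


class SerialJobQueue:
    """Hypothetical helper: run queued commands one at a time, in order."""

    def __init__(self):
        self._pending = deque()
        self._next_id = 0
        self.results = {}  # job id -> exit code

    def submit(self, argv):
        """Queue a command (a list of arguments) and return its job id."""
        job_id = self._next_id
        self._next_id += 1
        self._pending.append((job_id, argv))
        return job_id

    def run_all(self):
        """Run every pending job sequentially, recording each exit code."""
        while self._pending:
            job_id, argv = self._pending.popleft()
            completed = subprocess.run(argv, capture_output=True, text=True)
            self.results[job_id] = completed.returncode


if __name__ == "__main__":
    queue = SerialJobQueue()
    queue.submit([sys.executable, "-c", "print('first job')"])
    queue.submit([sys.executable, "-c", "raise SystemExit(3)"])
    queue.run_all()
    print(queue.results)
```

A failing job does not stop the queue here; its exit code is simply
recorded so it can be inspected afterwards.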

Hardware platforms and operating systems:
* Cisco UCS (Unified Computing System), a platform used for Cisco's
blade servers that provides very fast, easy and consistent setups, but
at a premium cost that some said was worth it over Dell's offerings --
http://www.cisco.com/en/US/products/ps10265/index.html
* SPARC and Solaris, which everyone was fleeing. Most migrated to
Linux, claiming that commercial vendors have ported the key Solaris
apps. A few people who needed fancier hardware chose IBM's AIX on the
Power architecture.

Monitoring:
* SLAC Network Monitoring Tools, a huge list of monitoring tools --
http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
* Ganglia, a tool for capturing and graphing metrics -- http://ganglia.info/
* Nagios, still the most common choice for monitoring -- http://www.nagios.org/
* Icinga, a Nagios fork with a modern desktop and mobile web UI,
broader database support, and full backwards compatibility with Nagios
plugins -- https://www.icinga.org/
* Zenoss, a compelling monitoring platform with a nice UI and API, but
one that has performance problems -- http://www.zenoss.com/
* OpenNMS, a network management system -- http://www.opennms.org/
* Nimsoft, a monitoring system that was much disliked --
http://www.nimsoft.com/solutions/nimsoft-monitor
* Graylog2 and GELF, a fast, modern logging system and structured log
format -- http://www.graylog2.org/ &&
http://thechangelog.com/post/3504643018/graylog2-server-java-ruby-mongodb-log-management
&& http://graylog2.org/about/gelf
* Scout, a problematic hosted monitoring system, e.g. it'd take nearly
an hour for it to report problems -- https://scoutapp.com/
* ServerDensity, a nifty hosted monitoring system that lets you write
plugins in Python or C# -- http://www.serverdensity.com/
* LogRhythm, a hardware appliance or software-based monitoring system
that's particularly good for monitoring Windows and its apps --
http://www.logrhythm.com/
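
To give a rough idea of what a "structured log format" like GELF looks
like, here is a minimal Python sketch that builds a GELF-style message
and compresses it for sending. The field names ("version", "host",
"short_message", "timestamp", "level", and "_"-prefixed custom fields)
follow the GELF spec as I understand it, so treat them as assumptions
and check the spec linked above. The function names are made up for
this sketch.

```python
import json
import time
import zlib


def build_gelf_message(host, short_message, level=6, **extra):
    """Return a GELF-style dict; custom fields get a '_' prefix.

    The field names are assumed from the GELF spec; verify before use.
    """
    message = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,  # syslog severity, 6 = informational
    }
    for key, value in extra.items():
        message["_" + key] = value
    return message


def encode_gelf(message):
    """Serialize to JSON and zlib-compress the result."""
    return zlib.compress(json.dumps(message).encode("utf-8"))
```

For example, build_gelf_message("web1", "disk full", level=3,
service="nginx") yields a dict with a "_service" field alongside the
standard ones, so the server can filter on it without parsing text.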

Scalable, distributed search engines:
* Lucene, a search engine library -- http://lucene.apache.org/java/docs/index.html
* Solr, a popular search engine platform built on Lucene --
http://lucene.apache.org/solr/
* Elasticsearch, a newer search engine platform built on Lucene with
emphasis on scalability -- http://www.elasticsearch.org/
* Katta, another search engine platform built on Lucene with emphasis
on Hadoop-interoperability -- http://katta.sourceforge.net/

Speedy memory and disk-based data stores:
* Terracotta, libraries for scaling Java applications --
http://www.terracotta.org/
* MongoDB, a document store with many features -- http://www.mongodb.org/
* Riak, a key-value store designed for scaling -- http://www.basho.com/

Fast RAM and flash-based disks:
* RamSan: http://www.ramsan.com/
* Fusion-io: http://www.fusionio.com/

Miscellaneous non-devops topics discussed as examples of tangentially
related problems or solutions:
* Notification error cascades at Three Mile Island: "What happened at
Unit 2 was a little more complex. A cascading series of events caused
the computer to notice SEVEN HUNDRED things wrong in the first few
minutes of the accident. The ONE audible alarm started ringing and
stayed ringing continuously until someone turned it off as useless.
The ONE visual alarm was activated and blinked for days, indicating
nothing useful at all. The line printer queue quickly contained 700
error reports followed by several thousand error report updates and
corrections. The printer queue was almost instantly hours behind, so
the operators knew they had a problem (700 problems actually, though
they couldn’t know that) but had no idea what the problem was."
http://www.cringely.com/2009/03/three-mile-island-memories/

* Cascading failures, bad design, poor maintenance and mismanagement
in 2009 Sayano–Shushenskaya hydroelectric power station accident:
"There was a loud bang from turbine 2. The turbine cover shot up and
the 920-tonne rotor also shot out of its seat. [...] The turbine hall
and engine room were flooded, the ceiling of the turbine hall
collapsed, 9 of 10 turbines were damaged or destroyed, and 75 people
were killed. [...] there was no power, none of the protection systems
had worked [...] turbine vibrations which led to the fatigue damage of
the mountings [...] at least six nuts were missing from the bolts
securing the turbine cover. After the accident 49 found bolts were
investigated from which 41 had fatigue cracks. [...] the turbine
blades were welded, because, after a long period of operation, cracks
and cavities had appeared; however, the turbine wheel was not properly
rebalanced after these repairs had been completed [...] none of the
workers present wanted to make or had no authority to make decisions
about further actions regarding the turbine [which had been operating
in a dangerous state for six months, and was noticeably failing for
two months]." http://en.wikipedia.org/wiki/2009_Sayano-Shushenskaya_hydro_accident

* Vacuum tubes in Soviet aircraft: "The majority of the on-board
avionics were based on vacuum-tube technology, not solid-state
electronics. Although they represented aging technology, vacuum tubes
were more tolerant of temperature extremes, thereby removing the need
for providing complex environmental controls inside the avionics bays.
In addition, the vacuum tubes were easy to replace in remote northern
airfields where sophisticated transistor parts might not have been
readily available. With the use of vacuum tubes, the [...] radar had
enormous power – about 600 kilowatts. As with most Soviet aircraft,
the MiG-25 was designed to be as rugged as possible. The use of vacuum
tubes also makes the aircraft's systems resistant to an
electromagnetic pulse, for example after a nuclear blast."
http://en.wikipedia.org/wiki/Mikoyan-Gurevich_MiG-25