If you are operating the network for tens of thousands of demanding gamers, you need to really know what is going on inside your network. Oh, and everything needs to be built from scratch in just five days.
If you have never heard about DreamHack before, here is the pitch: Bring 20,000 people together and have the majority of them bring their own computer. Mix in professional gaming (eSports), programming contests, and live music concerts. The result is the world's largest festival dedicated solely to everything digital.
To make such an event possible, a lot of infrastructure needs to be in place. Infrastructure of this size ordinarily takes months to build, but the crew at DreamHack builds everything from scratch in just five days. This of course includes configuring the network switches, but also building the electricity distribution, setting up stores for food and drinks, and even building the actual tables.
The team that builds and operates everything related to the network is officially called the Network team, but we usually refer to ourselves as tech or dhtech. This post is going to focus on the work of dhtech and how we used Prometheus during DreamHack Summer 2015 to try to kick our monitoring up another notch.
Obviously, just connecting all these computers to a switch is not enough. That switch needs to be connected to the other switches as well. This is where the distribution switches (or dist switches) come into play. These are switches that take the hundreds of links from all access switches and aggregate them into more manageable 10-Gbit/s high-capacity fibre. The dist switches are then further aggregated into our core, where the traffic is routed to its destination.
Since the network needs to be built in five days, it's essential that the monitoring systems are easy to set up and keep in sync if we need to make last-minute infrastructural changes (like adding or removing devices). When we start to build the network, we need monitoring as soon as possible to be able to discover any problems with the equipment or other issues we hadn't foreseen.
In the past we have tried a mix of commonly available software such as Cacti, SNMPc, and Opsview, among others. While these worked, they are closed systems that provide only the bare minimum. A few years back, a few people from the team said "Enough, we can do better ourselves!" and started writing a custom monitoring solution.
At the time, the options were limited. Over the years the system went from using Graphite (scalability issues) to a custom Cassandra store (high complexity) to InfluxDB (immature software), before finally landing on Prometheus. I first learned about Prometheus back in 2014 when I met Julius Volz, and I had been eager to try it ever since. This summer we finally replaced the custom InfluxDB-based metrics store that we had written with Prometheus. Spoiler: we're not going back.
The monitoring solution consists of three layers: collection, storage, and presentation. Our most critical collectors are snmpcollector (SNMP) and ipplan-pinger (ICMP), closely followed by dhcpinfo (DHCP lease stats). We also have some scripts that dump stats about other systems into node_exporter's textfile collector.
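To give a feel for how low-tech that last part is: a script only has to write a small file in the Prometheus text exposition format into the directory that node_exporter's textfile collector watches. The path, metric names, and labels below are invented for the sake of the example:

```
# Example file, e.g. /var/lib/node_exporter/textfile/dhcp.prom
# (path and metric names are made up for illustration).
# HELP dhcp_leases_active Number of active DHCP leases per scope.
# TYPE dhcp_leases_active gauge
dhcp_leases_active{scope="hall-d"} 1423
dhcp_leases_active{scope="hall-e"} 987
```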
We use Prometheus as the central time-series storage and querying engine, but we also use Redis and memcached to export snapshot views of binary information that we collect but cannot store in Prometheus in any sensible way, or when we need to access very fresh data.
We continued to use memcached this year for our low-latency data, while using Prometheus for everything that's historical or less latency-sensitive. This decision was made simply because we were unsure how Prometheus would perform at very short sampling intervals. In the end, we found no reason why we couldn't use Prometheus for this data as well; we will definitely try to replace our memcached with Prometheus at the next DreamHack.
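Should we try that, a shorter sampling interval is just a per-job setting in prometheus.yml. A minimal sketch in today's configuration syntax (the job name and target are invented, and the configuration format we ran in 2015 differed slightly):

```yaml
scrape_configs:
  - job_name: 'snmp'
    # Sample every 10 seconds instead of the global default; this is the
    # kind of interval our "fresh" data would need.
    scrape_interval: 10s
    static_configs:
      - targets: ['snmpcollector.dh.example:8080']  # hypothetical target
```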
The block that has so far been referred to as Prometheus really consists of three products: Prometheus, PromDash, and Alertmanager. The setup is fairly basic and all three components run on the same host. Everything is served by an Apache web server that just acts as a reverse proxy.
So at this stage we had something we could query for the state of the network. Since we are humans, we don't want to spend our time running queries all day to check that things are still working as they should, so obviously we need alerting.
For example: we know that all our access switches use GigabitEthernet0/2 as an uplink. Sometimes, when the network cables have been in storage for too long, they oxidize and are not able to negotiate the full 1000 Mbps that we want.
The negotiated speed of a network port can be found in the SNMP OID IF-MIB::ifHighSpeed. People familiar with SNMP will however recognize that this OID is indexed by an arbitrary interface index. To make any sense of this index, we need to cross-reference it with data from the SNMP OID IF-MIB::ifDescr to retrieve the actual interface name.
Fortunately, our snmpcollector supports this kind of cross-referencing while generating Prometheus metrics. This allows us not only to query data in a simple way, but also to define useful alerts. In our setup we configured the SNMP collection to annotate any metric under the IF-MIB::ifTable and IF-MIB::ifXTable OIDs with ifDescr. This comes in handy when we need to specify that we are only interested in the GigabitEthernet0/2 port and no other interface.
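To illustrate, a rule along the following lines does the job. It is written in the alerting-rule syntax Prometheus used at the time, and the metric name, job label, and threshold are assumptions about how the snmpcollector exposes the MIB data rather than our exact configuration:

```
# Fires when an access-switch uplink negotiates less than 1000 Mbps
# for more than two minutes (names are illustrative).
ALERT BadUplinkOnAccessSwitch
  IF ifHighSpeed{job="snmp", ifDescr="GigabitEthernet0/2"} < 1000
  FOR 2m
  SUMMARY "Access switch uplink linking below 1000 Mbps"
  DESCRIPTION "{{$labels.instance}} negotiated only {{$value}} Mbps on {{$labels.ifDescr}}"
```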
While alerting is an essential part of monitoring, sometimes you just want to have a good overview of the health of your network. To achieve this we used PromDash. Every time someone asked us something about the network, we crafted a query to get the answer and saved it as a dashboard widget. The most interesting ones were then added to an overview dashboard that we proudly displayed.
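Most widgets were just a single expression. For example, a question like "how much traffic is the event pushing right now?" could be answered with a query roughly like the one below, assuming the snmpcollector exports IF-MIB::ifHCInOctets under that name:

```
# Total ingress traffic across all SNMP-polled interfaces, in bits per second.
sum(rate(ifHCInOctets{job="snmp"}[5m]) * 8)
```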
Changing an integral part of any system is a complex job, and while we're happy that we managed to integrate Prometheus in just one event, there are without a doubt a lot of areas to improve. Some areas are pretty basic: using more precomputed metrics to improve performance, adding more alerts, and tuning the ones we have. Another area is making things easier for operators: creating an alert dashboard suitable for our network operations center (NOC), and figuring out whether we want to page the people on call or just let the NOC escalate alerts.
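The precomputed metrics mentioned above would simply be recording rules. A sketch in the rule syntax of the era, again with an assumed metric name, that pre-aggregates the per-device traffic rate our dashboards keep asking for:

```
# Evaluate the expensive rate() once per rule interval instead of on
# every dashboard refresh.
instance:ifHCInOctets_bps:rate5m = sum by (instance) (rate(ifHCInOctets{job="snmp"}[5m]) * 8)
```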
Some bigger features we're planning on adding: syslog analysis (we have a lot of syslog!), alerts from our intrusion detection systems, integrating with our Puppet setup, and also integrating more across the different teams at DreamHack. We managed to create a proof of concept where we got data from one of the electrical current sensors into our monitoring, making it easy to see if a device is faulty or if it simply doesn't have any electricity anymore. We're also working on integrating with the point-of-sale systems that are used in the stores at the event. Who doesn't want to graph the sales of ice cream?
Finally, not all services that the team operates are on-site, and some even run 24/7 after the event. We want to monitor these services with Prometheus as well, and in the long run, when Prometheus gets support for federation, utilize the off-site Prometheus to replicate the metrics from the event Prometheus.
A huge shout-out to everyone that helped us in #prometheus on FreeNode during the event. Special thanks to Brian Brazil, Fabian Reinartz, and Julius Volz. Thanks for helping us even in the cases where it was obvious that we hadn't read the documentation thoroughly enough.
Finally, dhmon is all open-source, so head over to have a look if you're interested. If you feel like you would like to be a part of this, just head over to #dreamhack on QuakeNet and have a chat with us. Who knows, maybe you will help us build the next DreamHack?
DreamHack is today a 23-year-long tradition that began in the school cafeteria of Malung's primary school. Since then, the LAN party has only grown. This year DreamHack had a total of a quarter of a million visitors and streamed a combined 80 million hours to 375 million viewers. It is not for nothing that it calls itself the world's largest digital festival.
For 2018, an at least equally big year awaits, with events in eight cities across America and Europe. First up is Leipzig, Germany, in January, followed by Tours, France, in May. For a detailed list and links, see the schedule below.
Games, games, games. They are always at hand when you need a dose of entertainment, escapism, or engagement in something to relax with or to forget your worries. Games are also a very young form of entertainment compared to their closest relative, the film industry, and therefore have a long way to go before all of their potential has been fully explored. We know the history of gaming through the ages, we have grown up with it and watched the game companies grow with us, but now it is time to look ahead towards new goals and new ways of examining gaming. A strong interest in games is only the beginning.
Twice a year, Jönköping ends up in the media spotlight and is transformed into a Mecca for computer-game enthusiasts from around the world. The reason is Dreamhack, which with its roughly 21,000 participants can call itself the world's largest computer-game festival, and nowadays also a cultural event.
When my friend and I step out of the taxi on Thursday evening and stagger towards the main entrance, I am full of anticipation and feel like an impatient kid on Christmas Eve. It is hard not to remember the time when you were yourself a die-hard gamer who lived for computer games and LANs. Not that many years ago, you sat in that same crowd in the Elmia hall, with your computer on a tiny patch of chipboard, competing in Counter-Strike tournaments with the goal of taking home as many prize products as possible. Staying awake for several days straight was no great feat, and you completely ignored your health. The most important thing was to pop caffeine pills, eat sloppily heated Billys Pan Pizzas, and drink crates of Jolt Cola. Luckily, you matured a little with each visit and finally realized that a stroke at the age of 22 was not desirable. The diet became more varied, the gaming hours fewer, and sleep a must. The interest in gaming and in esports as a form of culture has remained, though.