
Review article

Network and server resource management strategies for data centre infrastructures: A survey

Fung Po Tso (a), Simon Jouet (b), Dimitrios P. Pezaros (b)

(a) Department of Computer Science, Liverpool John Moores University, L3 3AF, UK
(b) School of Computing Science, University of Glasgow, G12 8RZ, UK

Article info

Article history: Received 8 March 2016; Revised 28 June 2016; Accepted 4 July 2016; Available online 5 July 2016

Keywords: Cloud data centre; Virtualisation management; Network management; Admission control; Resource provisioning

Abstract

The advent of virtualisation and the increasing demand for outsourced, elastic compute charged on a pay-as-you-use basis has stimulated the development of large-scale Cloud Data Centres (DCs) housing tens of thousands of computer clusters. Of the significant capital outlay required for building and operating such infrastructures, server and network equipment account for 45% and 15% of the total cost, respectively, making resource utilisation efficiency paramount in order to increase the operators' Return-on-Investment (RoI).

In this paper, we present an extensive survey on the management of server and network resources over virtualised Cloud DC infrastructures, highlighting key concepts and results, and critically discussing their limitations and implications for future research opportunities. We highlight the need for and benefits of adaptive resource provisioning that alleviates reliance on static utilisation prediction models and exploits direct measurement of resource utilisation on servers and network nodes. Coupling such distributed measurement with logically centralised Software Defined Networking (SDN) principles, we subsequently discuss the challenges and opportunities for converged resource management over converged ICT environments, through unifying control loops to globally orchestrate adaptive and load-sensitive resource provisioning.

© 2016 The Authors. Published by Elsevier B.V.

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Cloud computing is an important IT paradigm where enterprises outsource ICT infrastructure and resources based on a pay-as-you-use service model. This model relieves enterprises from significant capital expenditure (CAPEX) costs for purchasing and maintaining in-house permanent hardware and software assets. Instead, they use operating expense budgets (OPEX) to fund their ICT infrastructure and eliminate maintenance expenses, allowing them to focus on core business innovation. One of the most immediate benefits of using Cloud services is the ability to increase infrastructural capacity swiftly and at lower costs, therefore being able to adapt to changes in the market without complex procurement processes, and respond flexibly to unexpected demand. Recent years have witnessed a significant growth in the adoption of Cloud Computing. The public Cloud computing market expanded by 14% in 2015 to total US$175 billion, according to Gartner Inc. [1], whilst total spending worldwide is anticipated to continue flourishing at a Compound Annual Growth Rate (CAGR) of 17.7% until 2016 [2].

Corresponding author.

E-mail address: p.tso@ljmu.ac.uk (F.P. Tso).

Underpinning Cloud Computing are virtualised infrastructures hosted over Data Centres (DCs) which are in turn maintained and managed at scale by national or global operators, such as Amazon, Rackspace, Microsoft, and Google. These implement different variations of the -as-a-Service (aaS) paradigm, including Infrastructure-as-a-Service (IaaS, e.g., Amazon's EC2 and Google's Compute Engine), Platform-as-a-Service (PaaS, e.g., Microsoft's Azure and Rackspace's Cloud Sites) and Software-as-a-Service (SaaS, e.g., Facebook and Google Docs).

It was anecdotally reported that the number of servers owned by some major Cloud service providers and operators could be more than a million [3]. They are hosted in Cloud DCs, each typically housing tens of thousands of servers [4]. In order to be sustainable, the significant capital outlay required for building a DC makes maximisation of Return on Investment (RoI) crucial, which in turn necessitates efficient and adaptive resource usage of the virtualised physical infrastructure [5–7]. With the advent of virtualisation and multi-tenancy, computing resources are shared amongst multiple tenants, avoiding hard resource commitment and therefore preventing servers from sitting idle. However, this soft resource allocation results in significant load fluctuation over short timescales due to the ebb and flow of user demand.



DC infrastructures are therefore vulnerable to performance degradation from factors such as network congestion and contention on shared resources. Managing such dynamism in short timescales is particularly challenging. Many Cloud applications, such as Hadoop running over Cloud DCs, exploit a fine-grained hierarchical task decomposition into stages, each of which can involve multiple instances running in parallel on different physical hosts and communicating between them. In some intermediate stages, computation results yielded from subtasks are gathered onto a smaller number of servers to produce input for subsequent and final compute stages. In this partition/aggregate work pattern, it is essential for all subtasks to complete in time in order for a job to complete, since any failed subtask will have to be re-executed and keep others "waiting" for it to complete. As a result, the completion time of each single subtask ultimately determines the overall job completion time. This is wasteful both in terms of CPU cycles and network bandwidth, and can have a knock-on effect on the response time of different services and different tenants.

Consequently, DC resource management has become a complex problem due to the inability to gather accurate infrastructure-wide resource usage information in short timescales and, in turn, to forecast resource availability. Recent research has revealed that DC workload patterns at coarse timescales (i.e., hours) exhibit weekend/weekday variations [8], but at finer-grained timescales the workload patterns are bursty and unpredictable [6,9,10]. The measurement results indicate that, in order to adapt to transient load fluctuation, a fine-grained temporal and spatial approach is needed. For fine temporal granularity, control loops are needed to obtain levels of resource utilisation in short timescales for better characterising workload patterns. For spatial granularity, individual flow sizes, server availability, network link utilisation, etc., need to be measured and used as additional input to resource provisioning algorithms [10].

Currently, to cope with performance variability and unpredictability, DCs are engineered to tolerate a certain degree of demand fluctuation by over-provisioning, or by holding certain resources as a reserve [11,12]. However, over-provisioning within Cloud DCs is expensive [4]. Alternatively, adaptive provisioning policies can be implemented to ensure that Cloud providers adhere to their Service Level Agreements (SLAs) while maximising the utilisation of the underlying infrastructure.

Cloud resource management requires complex instrumentation mechanisms and algorithms for multi-objective optimisation to measure and account for, e.g., server, network, and power usage efficiency. In this paper, we provide a critical survey of resource management strategies for virtualised Data Centre infrastructures.

We focus on two key infrastructural aspects: the servers and the underlying network. These two pillars not only represent the most costly infrastructural elements, up to 60% of the cost of a data centre [4], that need to be managed and provisioned in an efficient and effective manner, they also adequately capture the level of granularity of resource contention and multiplexing over Cloud DCs [13]. In virtualised DCs, Virtual Machines (VMs) are the fundamental entities used by users over both public and private infrastructure Clouds, while traffic is multiplexed and controlled at the level of individual flows over the DC network. We discuss management strategies for the static and dynamic allocation of virtual resources over physical servers to improve response times, power and energy consumption, and network bandwidth utilisation. We present the network-wide characteristics of typical DC workloads, and we review the most prominent work on traffic engineering strategies to achieve different network-wide objectives. In addition, we discuss developments on traffic flow admission and congestion control for Cloud DCs that primarily seek to harness the underlying redundancy in network bandwidth to maximise intra-DC pairwise application throughput. In each of these areas, we highlight the limitations of the state-of-the-art and we discuss efforts towards more adaptive and more dynamic, closed-loop resource management and control.

The remainder of this paper is structured as follows:

Section 2 presents the dominant DC network architectures and management topologies used to leverage network and server resource redundancy and enable horizontal (rather than vertical) infrastructure expansion. We then critically discuss the most important and influential developments on server, network, and flow control management over Cloud DCs, in Sections 3–5, respectively. Within each category, we present the main optimisation objectives and the main techniques for achieving them. We identify areas for future development and open research issues that are yet to be addressed. In Section 6, we raise the issue of the current disjoint management and control of diverse physical and virtual resources in the DC, and we discuss the inefficiencies this lack of synergy between control mechanisms can lead to. We then describe research challenges and opportunities for converged control and resource management for virtualised Cloud DCs. Finally, Section 7 concludes the paper.

2. Data centre topologies

In this section, we provide a critical review of the dominant Cloud Data Centre (DC) network architectures, outlining their operational characteristics, limitations, and expansion strategies.

2.1. Conventional DC architecture

Conventional DC architectures are built on a tree-like hierarchy with high-density, high-cost hardware [14], as depicted in Fig. 1a.

The network is a tree containing a layer of servers arranged in racks at the bottom. Each server rack typically hosts 20–40 servers connected to a Top of Rack (ToR) switch with a 1 Gb/s link.

Each ToR switch connects to two aggregation switches for redundancy. For the same reason, aggregation switches connect to core switches/routers that manage traffic in and out of the DC. The hierarchical configuration of the network topology means that traffic destined to servers in different racks must go through the aggregation or the core switches of the network. Therefore, aggregation switches usually have larger buffers as well as higher throughput and port density, and are significantly more expensive than ToR switches. To make the network fabric cost-effective, higher-layer links are typically oversubscribed by factors of 10:1 to 80:1, limiting the bandwidth between servers in different branches [4]. Links in the same rack are not oversubscribed and thus collocated servers can operate at full link rate. Cross-rack communication is routed through the higher layers of the topology and therefore, in case of persistent and high-load communication between racks, the aggregation and core switches can become congested, resulting in high latency and packet loss. To increase capacity, network operators must resort to vertical expansion, in which operators replace overloaded switches with higher-cost, higher-capacity ones [15].

2.2. Clos/Fat-tree architecture

Modern data centre architectures [6,16,17] have been proposed to reduce or even remove oversubscription altogether. Representative work, such as, e.g., Clos-Tree [6] and Fat-Tree [16], promotes horizontal rather than the traditional vertical expansion. Instead of replacing costly higher-layer switches, network operators can add inexpensive commodity switches to expand their network horizontally using a fat-tree topology (a Clos topology for VL2). The dense interconnect in these new fabrics provides a larger number of redundant paths between any given source and destination edge switch (i.e., rich equal-cost path redundancy, in contrast to only two equal-cost paths in the conventional DC architecture), meaning that oversubscription of the higher-layer links can be significantly mitigated.

Fig. 1. Switch-centric ((a) & (b)) and server-centric ((c) & (d)) Cloud data centre topologies.

In the Clos-tree topology, as shown in Fig. 1b, links between the core-layer switches and the aggregation-layer switches form a complete bipartite graph [6]. However, ToRs only connect to two aggregation switches, as in the conventional tree architecture. The limiting factor for the size of a fat-tree fabric is the number of ports on the switches. A fat-tree uniformly uses k-port commodity switches at all layers. All switches at the edge and aggregation layers are clustered into k pods, each containing k switches (i.e., k/2 switches at each layer). In other words, edge-layer switches have k/2 remaining ports to connect to k/2 hosts. Similarly, aggregation-layer switches use the remaining k/2 ports to connect to (k/2)^2 k-port core switches. Eventually, a k-ary fat-tree can support up to k^3/4 hosts.
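As a quick illustration of these counting rules, the sketch below (ours, not from the surveyed work; the function name and return structure are purely illustrative) computes how many pods, switches and hosts a k-ary fat-tree built from k-port switches can accommodate.

```python
def fat_tree_capacity(k: int) -> dict:
    """Element counts for a k-ary fat-tree built from k-port switches.

    Derived from the relationships in the text: k pods, each with k/2 edge
    and k/2 aggregation switches, (k/2)^2 core switches, and each edge
    switch serving k/2 hosts, giving k^3/4 hosts in total.
    """
    assert k % 2 == 0, "k must be even"
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,          # k/2 per pod
        "aggregation_switches": k * half,   # k/2 per pod
        "core_switches": half ** 2,         # (k/2)^2
        "hosts": (k ** 3) // 4,             # k/2 hosts per edge switch
    }

if __name__ == "__main__":
    # e.g., commodity 48-port switches yield 27,648 hosts
    print(fat_tree_capacity(48))
```

For example, with commodity 48-port switches the formula gives 27,648 hosts, which is why fat-trees are attractive for horizontal expansion using cheap hardware.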

Clos/Fat-tree architectures have seen increasing popularity in modern data centres for achieving high performance and resiliency through their ability to provide better scalability and path diversity than conventional DC topologies [18,19]. However, these architectures require homogeneous switches and large numbers of links.

Upgrading to these architectures in a legacy DC usually requires replacing most existing switches and cables. Such radical upgrades are typically prohibitively expensive and time-consuming [20].

2.3. Server-centric architecture

In server-centric DC architectures, servers are both end-hosts and relaying nodes for multi-hop communications. The most representative fabrics are BCube [21] and DCell [22]. Both BCube and DCell come with custom routing protocols to take advantage of topological properties [23]. As shown in Fig. 1c, in DCell, a server is connected to a number of servers in other cells and to a switch in its own cell. According to [22], a high-level DCell is constructed from low-level DCells (DCell_k, k >= 0) in a recursive manner. DCell_0, as shown in Fig. 1c, is the building block used to construct larger DCells.

It has n servers and a mini-switch (n = 4 for DCell_0 in Fig. 1c), and all servers in a DCell_0 are connected to the mini-switch. A level-1 DCell_1 is then constructed using n+1 DCell_0s. In a DCell_1, each DCell_0 is connected to all the other DCell_0s with one link, and this procedure is repeated to create higher-level DCells.

In comparison, a BCube_0 is n servers connected to an n-port switch. A BCube_1 is comprised of n BCube_0s and n n-port switches [21]. In BCube, as illustrated in Fig. 1d, two servers are neighbours if they connect to the same switch. BCube names a server in a BCube_k using an address array a_k a_{k-1} ... a_0 (a_i in [0, n-1], i in [0, k]). Two servers are neighbours if and only if their address arrays differ in a single digit; that is, two neighbouring servers that connect to the same level-i switch differ at the i-th digit [21].
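The addressing rule above translates directly into a neighbour test. The following sketch (illustrative only; representing addresses as Python tuples is our assumption) checks whether two BCube_k server addresses are neighbours and, if so, at which level they share a switch.

```python
from typing import Optional, Sequence

def bcube_neighbour_level(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the level at which two BCube_k servers are neighbours, or None.

    Addresses are given as (a_k, ..., a_1, a_0), with digits in [0, n-1].
    Two servers are neighbours iff their addresses differ in exactly one
    digit; the position of that digit (counted from a_0) is the level of
    the switch they share.
    """
    if len(a) != len(b):
        raise ValueError("addresses must have the same length (same k)")
    k = len(a) - 1
    diff = [k - i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diff[0] if len(diff) == 1 else None

# Example: in a BCube_1 with n = 4, servers (0, 2) and (3, 2) share a
# level-1 switch, while (0, 2) and (0, 3) share a level-0 switch.
assert bcube_neighbour_level((0, 2), (3, 2)) == 1
assert bcube_neighbour_level((0, 2), (0, 3)) == 0
assert bcube_neighbour_level((0, 2), (3, 3)) is None
```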

A prominent competitive advantage of the server-centric architecture is its manageability. Since the entire DC fabric is built from servers and a minimal set of network switches, only a single team of engineers is required to maintain and manage the whole architecture.

In contrast, multiple (internal and external) professional teams are needed to manage the various switches, produced by different manufacturers, in switch-centric fabrics. Also, in a server-centric architecture, intelligence can be placed on servers for implementing in-network services such as traffic aggregation, caching, and deep-packet inspection. However, a server-centric architecture is fundamentally different from traditional network designs and has been seen as an untrusted and complex-to-update option. In order to promote server-centric architectures, they should offer more significant competitive advantages, including a remarkable reduction in overall deployment cost and improvements in security and resilience [24].

2.4. Management topology

DC network nodes exchange considerable management data in order to configure and maintain the network-wide topology, and consequently, management data intensifies as the topology becomes denser. [25] reports that management traffic can account for approximately 5–10% of the bandwidth during normal operating conditions. With the recent advent of Software Defined Networking (SDN) [26], a paradigm that logically centralises and separates the network control from the data plane, the management network is tightly coupled with the control plane as control decisions must be transmitted between the switches and a central controller. The requirements of a management network are different from those of the data-carrying network: management traffic is sparse and maintaining high throughput is not critical; however, it is latency-sensitive and failures can be critical to the production network behaviour. Three different types of management networks have been covered in the literature. The simplest is to manage the network in-band (IB). In this configuration, both management and production traffic share the same network. This allows management to be cost-effective, but in case of over-utilised production networks, the management network is also hindered. It is possible to mitigate hindrance to management traffic from production traffic through Quality of Service (QoS) enforcement to prioritise management traffic over data traffic [27]. The other approach is to have a logical or physical Out-of-Band (OOB) network. In a logical OOB network, the core of the network is still shared between management and production traffic, but logical isolation is achieved using VLANs or dedicated OpenFlow forwarding rules [28]. OpenFlow is a communication protocol and API providing access to the forwarding plane of network devices; it is the most widely deployed implementation of the SDN paradigm. In this case, each switch must have a dedicated port for management, increasing the cost as production traffic has fewer dedicated ports, but limiting possible interference between production and management traffic. However, such a setup is still vulnerable to device failure or misconfiguration. Finally, a physical OOB network can be used in environments where the management network operation is critical, such as an SDN environment without graceful fallback to learning switches or other distributed mechanisms when the controller is unreachable. In such environments, a different physical network is dedicated to management operations [25,29].

3. Server resource management

The cost of servers in data centres can account for up to 45% of the running cost per year [4]. It is apparent that achieving high server utilisation is of paramount importance in order to increase Return-on-Investment (RoI). However, server utilisation in DCs can be as low as 10% [30] due to over-provisioning as a result of the desire to provision for peak demand [31].

Achieving high server utilisation is challenging. First, it is difficult for DC operators and customers to plan in advance for "diurnal usage patterns, unpredictable spikes in user and traffic demand, and evolving workloads" [9]. Second, it is very expensive, if not impossible, for both providers and consumers (who have little control and choice [32]) to configure individual servers so that fine-grained resources, such as, e.g., CPU, memory, storage and network, perfectly match temporal application requirements, due to the heterogeneity of servers (i.e., servers have different CPU, RAM and other resource capacities) and the complexity associated with calculating individual resource requirements for different services [33]. Third, increasing server utilisation by scheduling multiple services on one physical host can cause severe performance degradation due to resource (e.g., CPU, memory, storage and I/O peripheral) contention [34,35]. Last, but not least, Cloud DC operators and service providers often need to meet strict QoS guarantees through Service Level Agreements (SLAs). Meeting SLAs is crucial, since it gives customers the confidence to move their ICT infrastructure into Cloud environments, and heavy penalties are paid by the provider if an SLA is not met. Typically, in order to meet SLA requirements, resources are over-provisioned to meet worst-case demand [36].

Server consolidation is the activity of clustering or reassigning several virtual machines (VMs) running on under-utilised physical servers onto fewer hosts, and is used in DCs to improve resource utilisation and reduce operational expenditure (OPEX). VM consolidation has been employed by DC operators to optimise diverse objectives, such as, e.g., server resource (CPU, RAM, net I/O) usage [37–39] and energy efficiency [40–42], or to meet SLA requirements which are often expressed as CPU or response time guarantees [36,43]. Most server management works take one resource as the optimisation objective and treat the other resources as constraints, or jointly consider multiple resource and SLA constraints. Hence, for ease of discussion, the research works are broadly categorised based on their main optimisation objectives in the following discussions.

In fact, some production software such as VMware vSphere Distributed Resource Scheduler (DRS) [44], Microsoft System Center Virtual Machine Manager (VMM) [45], and Citrix XenServer [46] offer VM consolidation as one of their major features.

3.1. Types of VM consolidation

Server consolidation can be broadly classified as static or dynamic. In static consolidation, or initial placement, consolidation algorithms take historical resource utilisation as input to predict future resource use trends, based on which VMs are mapped to physical hosts [47,48]. Once initial static consolidation has taken place, VM assignments usually remain unchanged for extended periods of time (e.g., months or even years). It is also done off-line due to the high complexity of consolidation algorithms. Static consolidation is ideal for static workloads as it can achieve optimality. On the contrary, dynamic allocation is implemented over short timescales in response to changes in resource demand by leveraging the ability to do live migration of VMs.

Dynamic server consolidation is particularly useful for unpredictable workloads for which prediction-based mechanisms fail to work. Dynamic consolidation is carried out periodically, on shorter timescales than static consolidation, to adapt to changes in work demand [39,40,49–52].

3.2. Server resource-aware consolidation schemes

When the users' demand changes, VMs can start competing for physical resources, resulting in computation hotspots. Sandpiper [39] is a tool that detects and mitigates hotspots based on physical machine resources such as CPU, network and memory. In order to detect hotspots, Sandpiper implements a monitoring and profiling engine that collects CPU, network and memory usage statistics on VMs, and a time-series prediction technique (which relies on the auto-regressive family of predictors) to predict the likelihood of hotspots. The monitoring can be either unobtrusive black-box monitoring, which infers the CPU, network and memory usage of each VM from external observation (at the host), or a more aggressive gray-box monitoring that explicitly places a daemon inside each VM to monitor/measure the resources consumed by individual VMs. For both monitoring approaches, techniques are employed to estimate the peak resource needs. Upon detection of a hotspot, it is the migration manager's responsibility to carry out hotspot mitigation. Since optimally deciding which VMs to migrate, and where, is an NP-hard multi-dimensional bin-packing problem, the migration manager employs a heuristic to migrate VMs from overloaded servers to underloaded servers such that the migration overhead (i.e., the amount of data transferred) is minimised. The main drawback of Sandpiper is that it only reactively triggers migration upon detection of a hotspot within the infrastructure and does not consider the migration overhead.
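To make the control loop concrete, the sketch below shows one plausible shape of threshold-based hotspot mitigation with a greedy, migration-cost-aware heuristic. It is a simplified stand-in, not Sandpiper's actual algorithm: the utilisation threshold, the data structures and the use of memory footprint as the migration-cost proxy are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

HOTSPOT_THRESHOLD = 0.85  # assumed CPU utilisation threshold

@dataclass
class VM:
    name: str
    cpu: float      # fraction of a host's CPU the VM consumes
    mem_gb: float   # memory footprint, used as a proxy for migration cost

@dataclass
class Host:
    name: str
    vms: List[VM] = field(default_factory=list)

    @property
    def cpu_load(self) -> float:
        return sum(vm.cpu for vm in self.vms)

def mitigate_hotspots(hosts: List[Host]) -> List[Tuple[str, str, str]]:
    """Greedy mitigation: on each overloaded host, repeatedly move the VM
    that is cheapest to migrate to the least-loaded host that can absorb it
    without itself becoming a hotspot."""
    migrations = []
    for host in hosts:
        while host.cpu_load > HOTSPOT_THRESHOLD and host.vms:
            others = [h for h in hosts if h is not host]
            if not others:
                break
            vm = min(host.vms, key=lambda v: v.mem_gb)        # cheapest migration
            target = min(others, key=lambda h: h.cpu_load)    # least-loaded host
            if target.cpu_load + vm.cpu > HOTSPOT_THRESHOLD:
                break  # no host can absorb the VM without creating a new hotspot
            host.vms.remove(vm)
            target.vms.append(vm)
            migrations.append((vm.name, host.name, target.name))
    return migrations
```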

Entropy [53] achieves an optimal VM configuration while also ensuring that every VM has access to sufficient memory and its allocated CPU share. Entropy has a sensor that periodically probes the VMs' CPU usage and working status. Any change triggers a reconfiguration process via migration, which consists of the virtual machine packing problem (VMPP) and the virtual machine re-placement problem (VMRP). Constraint programming is then employed to solve the VMPP and VMRP problems. However, in order to reduce migration decision time, when an optimal solution cannot be computed within 1 min, the computation is aborted and the current best result is used. Entropy takes migration overhead into consideration when making migration decisions; however, it assumes that the resource demand is known and static over time.

Similar to Sandpiper, ReCon [51] exploits servers' historical resource usage to discover applications that can be consolidated and recommends a dynamic consolidation plan that can be deployed in a multi-cluster data centre. VM consolidation is formulated as an optimisation problem with multiple constraints. The cost function is defined as the running cost of a physical server, dominated by power consumption, which is translated into CPU utilisation. Hence, the objective is to minimise the cost given a set of VMs and constraints. However, prediction-based schemes can be sub-optimal when resource requirements are dynamic.

In contrast to optimising resources as complete units as in the aforementioned works, the multi-resource schedulable unit (MRSU) approach [54] breaks CPU, memory, storage and network into small chunks to tackle the resource over-allocation problem at fine granularity. MRSU first determines the schedulable unit in each resource dimension and then computes the number of MRSUs needed for particular instances. MRSU allocation is a min-max problem and hence a weighted fair heuristic is proposed to solve it.

3.3. Energy-aware consolidation schemes

VirtualPower [55] is a pioneering work looking into server power management in virtualised environments. When a server is virtualised and shared among guest VMs, its hardware power management cannot function properly due to the diverse and inconsistent activities of the virtual servers, unless all virtual servers agree on the same limitation (e.g., reducing memory bandwidth) concurrently. On the other hand, guest VMs see themselves as independent servers and proactively try to manage 'their power states'. Instead of ignoring these built-in power management policies, as done by the hypervisor, VirtualPower exploits the policies as effective hints of each individual VM's power state. Therefore, VirtualPower can provide a rich set of 'soft' VirtualPower Management (VPM) states to VMs and then use the VMs' state-change requests as inputs to manage power locally on individual physical servers and globally (considering maximum power consumption across all applications in a cluster or rack, or even at the entire data centre level). The VPM states may or may not be actually supported by hardware, but are a set of performance states for use by the VMs' application-specific management policies. The actual power management actions carried out by the infrastructure are defined as VPM rules and are realised by VPM mechanisms, which include hardware scaling, soft scaling, and consolidation. Differently from other related work that manages power at the level of physical hosts, VirtualPower enables fine-grained power control at the level of individual VMs. The biggest limitation of VirtualPower is that significant modifications to existing hypervisors must be performed, preventing its large-scale deployment in existing infrastructures.

While VirtualPower only optimises energy efficiency on individual hosts, pMapper is a controller that places an application onto the most appropriate physical server in order to optimise energy and migration costs, while still meeting some performance guarantees [41]. pMapper employs the First Fit Decreasing (FFD) bin-packing algorithm to select an optimal server for any application being migrated in order to minimise power consumption. The algorithm optimises one major resource, such as CPU utilisation, and treats other resources, such as memory and I/O, as constraints. pMapper has three main modules: the Monitoring engine monitors all VMs and physical servers and collects their performance (different workloads contribute to the overall CPU and memory utilisation) and power (overall usage) characteristics; the Performance Manager examines performance statistics collected from the monitoring engine, produces a set of VM sizes that suit the current loading, and estimates the potential benefits should any VM resizing be required; and the Power Manager keeps track of current power consumption and optimises it through CPU throttling. In addition, pMapper employs an Arbitrator to ensure consistency between the three modules. Subsequent works [33,56] focused on analysing real data centre application traces, and revealed that there are sufficient variations and correlations amongst applications to be exploited for improving power saving. Hence, pMapper has been extended to include some application-awareness features.
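First Fit Decreasing itself is simple to state; the sketch below shows the generic algorithm on one-dimensional CPU demands. It is a textbook illustration rather than pMapper's implementation, which additionally weighs power models and treats memory and I/O as constraints.

```python
def first_fit_decreasing(vm_cpu_demands, host_capacity=1.0):
    """First Fit Decreasing bin packing: sort VMs by demand (largest first)
    and place each on the first host with enough residual capacity,
    opening a new host when none fits.  Returns a list of hosts, each a
    list of the demands packed onto it."""
    hosts = []
    for demand in sorted(vm_cpu_demands, reverse=True):
        for host in hosts:
            if sum(host) + demand <= host_capacity:
                host.append(demand)
                break
        else:
            hosts.append([demand])   # open a new (powered-on) host
    return hosts

# Example: six VMs packed onto as few unit-capacity hosts as possible
print(first_fit_decreasing([0.5, 0.7, 0.3, 0.2, 0.4, 0.6]))
# -> [[0.7, 0.3], [0.6, 0.4], [0.5, 0.2]]
```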

In contrast to pMapper, Mistral [42] is a system that emphasises the optimisation of transient power/performance costs. In Mistral, application performance is reflected in the application response time, which is modelled as a layered queuing network (LQN). The power consumption of a configuration is based on an empirical non-linear model that considers CPU utilisation (e.g., power consumption at idle and busy states). Differently from pMapper, Mistral takes the cost of six transient adaptation actions into account, including: changes of a VM's CPU capacity, addition and removal of a VM, live migration of a VM, shutting down and restarting a host, change in response time for the applications, and change in power consumption during the adaptation. It uses a workload predictor to predict the stability intervals of the next adaptation based on historical average resource usage.

3.4. SLA-aware consolidation schemes

A Service Level Agreement (SLA) is a contract that sets the expectations, usually in measurable terms, between the consumer and the service provider [57]. In a Cloud computing context, there are infrastructure SLAs and service SLAs. An infrastructure SLA is established between infrastructure (IaaS) providers and service providers to guarantee sufficient resources and uptime, whilst a service SLA is established between service providers and their customers and is typically measured in QoS metrics such as application response time: for example, a maximum response time of 100 ms with a minimal throughput of 100 transactions per second [57,58]. Since an SLA is the cornerstone of how the service provider sets and maintains commitments to the service consumer, optimising resource utilisation while not violating SLAs is also crucially important for operators.

Bobroff et al. [36] propose a dynamic resource allocation algorithm for virtualised server environments to maximise the global utilisation of the data centre, while not violating SLAs (i.e., a VM's CPU time guarantee) as a result of performance degradation due to overloading. Similar to other VM consolidation schemes, the algorithm collects and analyses historical usage data on resource utilisation and service quality and predicts the future demand, based on which a sequence of migrations is computed. The algorithm (bin packing) is invoked periodically and thus forms a measure-forecast-remap (MFR) optimisation loop. For placement, the algorithm derives the minimum set of servers required to accommodate all VMs while not overloading the servers with respect to the SLA constraints, i.e., CPU time guarantees. This work, however, shares the limitation faced by prediction-based consolidation algorithms, in which resource demand and future server resources have to be deterministic.
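The measure-forecast-remap structure can be summarised as a small control loop. The sketch below is a skeleton under our own assumptions: the moving-average forecast stands in for the statistical predictors used in [36], and the measure/remap callables are placeholders.

```python
import time
from statistics import mean

def measure_forecast_remap(measure, remap, window=12, period_s=300):
    """Skeleton of a measure-forecast-remap (MFR) control loop.

    measure():       returns {vm_id: current demand} sampled from the DC
    remap(forecast): recomputes placement (e.g., by bin packing) from the
                     forecast demand and triggers the necessary migrations
    The forecast here is a simple per-VM moving average over `window`
    samples; a production system would use a proper statistical predictor.
    """
    history = {}
    while True:
        sample = measure()                                   # measure
        for vm, demand in sample.items():
            history.setdefault(vm, []).append(demand)
            history[vm] = history[vm][-window:]
        forecast = {vm: mean(vals) for vm, vals in history.items()}  # forecast
        remap(forecast)                                      # remap
        time.sleep(period_s)
```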

Breitgand et al. [43] present the Elastic Service Placement Problem (ESPP). Since the SLA is defined as meeting the requirements of VM sizing, ESPP aims to optimally allocate variable-sized VMs to physical hosts. In ESPP, each service is modelled as an application that spans a set of VMs. Hence, ESPP's goal is to maximise the overall profit, which is measured using the resulting VM size and the probability of VM violation. However, this forms a generalised assignment problem (GAP) which is hard to solve. The authors relate ESPP to multi-unit combinatorial auctions and hence provide an integer program formulation of ESPP that is amenable to column generation. While column generation is efficient for VM placement in small data centres, it does not scale to large mega data centres.

3.5. Network-aware consolidation schemes

Meng et al. [48] propose a pioneering work in network-aware initial VM placement. The authors first studied two sets of real traffic traces from operational data centres and observed three key traffic patterns that can be exploited for VM placement: (a) uneven distribution of traffic volumes from VMs; (b) stable per-VM traffic at large timescales; and (c) weak correlation between traffic rate and latency. Hence, VMs can be placed in a way that traffic is localised and managed pairwise. In order to achieve these objectives, they formulated a minimisation problem based on a fixed cost of required bandwidth for each VM, and the cost of communication between the VMs. CPU and memory resources are not used in the algorithm since it is assumed that each VM has the same size and that each server supports a fixed number of VMs. They also showed that VM placement as an optimisation problem is NP-hard. To reduce the complexity of the proposed algorithm, the authors use a scenario with constant traffic of grouped VMs to simplify the problem space. The VM placement problem is then solved using min-cut and clustering algorithms to find divisions for VMs.

Wang et al. [50], as an extension of [48] with dynamic traffic, propose a VM consolidation scheme to match known bandwidth demands to servers' capacity limits. In contrast to classical bin packing optimisation, in which network bandwidth demand is assumed to be static, the authors formulated an NP-hard stochastic bin-packing problem which models the bandwidth demands of VMs as probabilistic distributions, and then solve it using an online packing heuristic that assumes each bin has unit capacity. They also assume that network bandwidth is only limited by host network devices rather than topological oversubscription. As only the host bandwidth limit is considered, network capacity violations within the DC are possible because aggregation and core layer links are often oversubscribed.

Ballani et al. [5] tackled the unpredictability of network performance (network-awareness) with novel virtual network abstractions through which tenant virtual networks are mapped to the operator's physical networks using an online algorithm. The virtual network abstractions include a virtual cluster, representing a topology comprised of a number of VMs and a non-oversubscribed virtual switch, and a virtual oversubscribed cluster that reflects today's typically oversubscribed two-tier clusters. Once the mapping is done, it is enforced through work (VM) placement with a fast allocation heuristic: given a set of VMs (with bandwidth requirements) that can be placed in any sub-tree, the algorithm finds the smallest sub-tree that can fit all tenant VMs.

Unlike [5,48,50], which only consider the network bandwidth constraint, Shrivastava et al. [59] proposed a framework which jointly considers inter-VM dependencies and the underlying network topology for VM migration decisions. The objective of the optimisation framework is to minimise the overhead (latency, delay, or number of hops) of migration by placing dependent VMs in close topological proximity. However, the problem is a variant of the multiple knapsack problem and is thus NP-complete. An approximate algorithm, AppAware, is therefore proposed in the paper.

Biran et al. [47] described a minimisation problem to determine the location of VMs in a data centre based on the network bandwidth, CPU, and memory resources. Their formulation is complex and does not scale to the size of data centres, so they also created two heuristics based on the minimisation algorithm. The calculation is made off-line, and it needs to be re-executed at every change. They assume that each user has a specific number of VMs that can only talk to each other; therefore all the VMs are already clustered in their approach.

As computation continues to shift from on-premises IT infrastructure into the Cloud, the computing platform now resides in a warehouse hosting (hundreds of) thousands of physical hosts. Today's Cloud DCs are no longer simply places that house a large number of co-located servers; rather, they can be seen as a massive warehouse-scale computer [30]. The underpinning DC infrastructures are still a large-scale distributed system and therefore, one should consider converged, DC-wide resource optimisation rather than per-node-based optimisation. In particular, as data are constantly shuffled across the network, performance is ultimately limited by the network's aggregate capacity, since bandwidth is an expensive resource and is highly oversubscribed. It is therefore crucial that any resource management and optimisation scheme takes network performance into consideration [13,47,48,50,59–61].

Most optimisation frameworks either rely on the prediction of future trends based on historical data and proactively allocate resources based on predicted demand, or employ directly measured metrics of interest, such as [60,61], and dynamically adapt to changes according to measurement results. Due to fluctuations in user demands [6,33,36], direct measurement of temporal resource usage seems more appropriate than prediction models in terms of exploiting resource availability in short timescales.

3.6. Open research issues

To demonstrate the flexibility of direct measurement, we have developed S-CORE [60–62], a distributed communication cost reduction VM migration approach which takes network cost into account. As opposed to the aforementioned works on network-aware server management, S-CORE employs a distributed algorithm based on information available locally, through direct measurement of the bytes exchanged at flow level at each VM, to perform migration decisions, rather than using in-network or global statistics. This property allows the algorithm to scale and be realistically implementable over large-scale DC infrastructures. It iteratively localises pairwise VM traffic to lower-layer links, where bandwidth is not as oversubscribed as it is in the core and where interconnection switches are cheaper to upgrade.

Experimental results show that, by directly measuring traffic demand between VMs, S-CORE can achieve significant (up to 87%) communication cost reduction, as shown in Fig. 2. The figure also highlights that S-CORE, when orchestrated with topology awareness (VMs whose traffic load is routed through the highest-layer links of the network topology are prioritised), converges significantly faster than when a topology-agnostic round-robin orchestration scheme is used. This demonstrates that spatial granularity, i.e., flow-level direct traffic measurement, provides useful instantaneous network knowledge that helps improve decision making (reflected in the convergence time).
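As a rough illustration of the quantity such a scheme tries to minimise, the sketch below computes a weighted communication cost from locally measured pairwise VM traffic; the data structures and the level/weight functions are illustrative assumptions, not S-CORE's actual implementation.

```python
def communication_cost(traffic, placement, level, weight):
    """Weighted communication cost of a VM placement.

    traffic:   {(vm_a, vm_b): bytes exchanged}, measured locally per VM
    placement: {vm: host}
    level(h1, h2): topological layer of the highest switch level traversed
                   between two hosts (0 = same host or same ToR)
    weight:    {level: cost per byte}; higher layers cost more since they
               are more oversubscribed and more expensive to upgrade
    """
    return sum(nbytes * weight[level(placement[a], placement[b])]
               for (a, b), nbytes in traffic.items())

# A migration is worthwhile when it lowers this cost by more than the
# (measured) cost of moving the VM's memory across the network.
```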


Fig. 2. Granularity of spatial traffic demand measurement in S-CORE [60] has a large impact on performance: communication cost ratio over time (s) for Highest Level First versus Round Robin orchestration.

4. Network resource management

For the majority of applications hosted over Cloud environments (e.g., web indexing, distributed data analysis, video processing, scientific computing), data is continuously transmitted over the network to support distributed processing and storage as well as server-to-server communication [63]. These data-intensive or latency-sensitive applications are particularly vulnerable to volatile throughput and packet loss. Yet, the increased oversubscription ratios from the bottom to the top of prominent multi-root tree network architectures can result in poor server-to-server connectivity, hindering application performance [6,16].

Research has demonstrated that supporting protocols have failed to leverage the topological advantages of new "scale-out" architectures [64,65]. Most notably, recent measurement work [8,9,66] suggests that current DC networks are largely under-utilised and that there is therefore significant room (i.e., up to 20% of network capacity [67]) for operators to improve performance before considering expanding their network infrastructure or upgrading to new fabrics, provided that provisioning is reinforced with a finer-grained control loop. Resource fragmentation can become a performance barrier in DCs, resulting in low server utilisation and therefore lower RoI [4,6,9,16].

Fine-grained network resource provisioning requires knowledge of the instantaneous traffic demands, and the subsequent harnessing of intelligent resource admission control as well as exploitation of the rich path redundancy of the underlying DC network. However, achieving such provisioning using existing legacy mechanisms faces two fundamental challenges. First, estimating network load based on historical traffic demands (i.e., predictions) is dubious, since these change rapidly in DC environments and different patterns emerge over diverse timescales [9]. Second, existing routing protocols such as ECMP fail to support dynamic applications since they are load-agnostic and operate solely on packet header contents [8,15].

4.1. DC traffic characteristics

Having a better understanding of traffic patterns can help in devising more intelligent traffic management schemes that improve network performance. A number of studies such as [8,9,68,69] have looked into Cloud DC traffic patterns, revealing some unique insights.

In a DC network, ToR Traffic Matrices (TMs) are sparse with significant locality characteristics, since a few ToRs exchange most of their data with just a few other ToRs [68]. Although a significant fraction of traffic appears to be localised inside a rack, congestion does occur in various layers of the infrastructure despite sufficient capacity being available elsewhere that could be used to alleviate hotspots [8]. Congestion, when it happens, has been shown to deteriorate application performance by reducing server-to-server I/O throughput [9]. In terms of flow distribution characteristics, data mining and web service DCs mostly accommodate small (mice) flows typically completed within 1 s. Flow inter-arrival times vary from 1 flow per 15 ms to 100 flows per millisecond at servers and Top-of-Rack switches, respectively, while on average there are 10 concurrent flows per server active at any given time [6,9,69]. Finally, DC traffic patterns change rapidly and maintain their unpredictability over multiple timescales (as opposed to legacy Internet workloads), mainly due to the unpredictable dynamics of external user requests as a result of resource sharing, and the multiplexing of traffic at the level of individual flows, as opposed to large traffic aggregates [6].

Fig. 3. Traffic engineering is a procedure that optimises network resource utilisation through reshaping of network traffic.

Many Cloud applications follow partition/aggregate design patterns in which application requests are divided into a number of smaller tasks which are then distributed to a set of workers (servers). The intermediate results yielded by these workers are aggregated to produce a final result. As a result, DCs mainly host applications with a multi-layer partition/aggregate workflow, which exhibits pronounced and bursty partition/aggregate traffic patterns, resulting in throughput incast collapse [70–73].

4.2. DC traffic engineering

Traffic engineering (TE) is a technique used by ISPs to select routes that make efficient use of network resources. More specifically, TE is a procedure that optimises network resource utilisation through reshaping of network traffic. Fig. 3 illustrates a typical traffic engineering procedure and commonly used objectives. TE consists of a control loop that continuously monitors and evaluates metrics of interest, based on which optimal resource scheduling is computed and deployed. TE techniques can be broadly classified as online and offline, the main distinction between the two being the timescales at which objective values, such as, e.g., link weights and the scheduling of traffic flows, are adjusted.
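The control loop just described can be summarised in a few lines. The sketch below is a generic skeleton (monitor, compute_schedule and deploy are placeholder callables and the interval is an assumption), not any specific TE system.

```python
import time

def te_control_loop(monitor, compute_schedule, deploy, interval_s=1.0):
    """Minimal monitor-compute-deploy TE control loop.

    monitor():           returns current link utilisation / traffic matrix
    compute_schedule(m): computes flow routes or link weights from the metrics
    deploy(schedule):    pushes the schedule to the switches (e.g., via SDN)
    interval_s:          the control timescale; online TE uses short intervals,
                         offline TE recomputes far less frequently
    """
    while True:
        metrics = monitor()
        schedule = compute_schedule(metrics)
        deploy(schedule)
        time.sleep(interval_s)
```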

Equal Cost Multipath (ECMP): In today's data centres, Equal Cost Multipath (ECMP) is the most commonly used routing scheme for spreading traffic flows across redundant shortest paths using hashing on flow tuples (i.e., attributes of packet headers).

ECMP is easy to implement as it statically hashes one or more tuples of packet headers and subsequently schedules flows based on their hashed values, ensuring that packets of the same flow are all scheduled over the same path. A commonly used 5-tuple hashing is based on the protocol identifier, source/destination network address, and source/destination port.
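A toy version of such static hashing is shown below. It is only illustrative: real switches use hardware hash functions rather than SHA-256, but the key property is the same, i.e., the path choice depends solely on header fields and is oblivious to flow size and link load.

```python
import hashlib

def ecmp_path(src_ip, dst_ip, proto, src_port, dst_port, num_paths):
    """Pick one of num_paths equal-cost paths from a 5-tuple hash.

    All packets of a flow share the 5-tuple, so they hash to the same path;
    the choice ignores current utilisation, which is exactly the limitation
    discussed in the text.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# Example: every packet of this flow maps to the same path index
print(ecmp_path("10.0.1.5", "10.0.2.7", 6, 40312, 80, num_paths=4))
```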

ECMP challenges: Recent research has shown that ECMP fails to efficiently leverage path redundancy in DC networks. Studies have demonstrated that network redundancy cannot completely mask all failures, implicitly pointing to the inefficiency of ECMP [66]. Similarly, it has been shown that ECMP's static hashing takes neither current network utilisation nor flow size into consideration. Such hashing causes flow collisions that saturate switch buffers and deteriorate overall switch utilisation, resulting in a reduction of the network's bandwidth [15]. Moreover, MicroTE [10] has tested ECMP with real DC traffic traces and found that ECMP achieves only 80–85% of the optimal performance that can be obtained by solving a linear program with the objective of minimising the Maximum Link Utilisation (MLU), assuming full prior knowledge of the traffic matrix every second. The implication of such inefficiency is that, while most of the links in measured DC networks have relatively low utilisation, a small but significant fraction of links appear to be persistently congested [8,9]. As a result, operators will need to upgrade their networks even if they are generally under-utilised.

4.3. Utilisation-aware traffic engineering

Hedera [15] is a centralised TE mechanism aiming to resolve ECMP's inability to fully utilise network bandwidth. Hedera is comprised of three steps. First, detection and scheduling of large flows (those exceeding 10% of the host NIC bandwidth) is carried out at the edge switches; mice flows are still admitted using ECMP. Next, it estimates the natural demand of large TCP flows, defined as the rate a flow would grow to in a fully non-blocking network. Based on the demand matrix, Hedera uses either global fit or simulated annealing heuristic placement algorithms to find the most appropriate paths for different flows. Eventually, these computed paths are pushed onto the switches. In contrast to only scheduling large flows, VL2 [6] uses Valiant Load Balancing to randomise packet forwarding on a per-flow basis. In VL2, two types of IP addresses are employed: all switches and interfaces are assigned location-specific addresses, while applications use application-specific IP addresses, which remain unchanged regardless of server location as a result of VM migrations. Since each server randomly selects a path for each of its flows through the network, VL2 shares the intrinsic traffic-agnostic nature of ECMP.
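The sketch below illustrates only Hedera's elephant-flow detection rule (flows exceeding 10% of the host NIC bandwidth are scheduled centrally, the rest left to ECMP); the NIC capacity constant is an assumption, and Hedera's demand estimation and placement heuristics are not shown.

```python
NIC_CAPACITY_BPS = 10e9          # assumed 10 Gb/s host NICs
ELEPHANT_THRESHOLD = 0.10        # large flow: >10% of host NIC bandwidth

def classify_flows(flow_rates_bps):
    """Split flows into elephants (candidates for central scheduling) and
    mice (left to ECMP), following the 10%-of-NIC-bandwidth rule.

    flow_rates_bps: {flow_id: measured rate in bits per second}
    """
    elephants, mice = {}, {}
    for flow, rate in flow_rates_bps.items():
        if rate > ELEPHANT_THRESHOLD * NIC_CAPACITY_BPS:
            elephants[flow] = rate
        else:
            mice[flow] = rate
    return elephants, mice
```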

MicroTE [10] is a fine-grained TE approach for DCs that achieves traffic adaptation by exploiting the short-term and partial predictability of DC traffic demands, to attain overall better link utilisation than ECMP. MicroTE has a centralised controller to gather demands from the network and maintain a global view of network conditions. A bin-packing heuristic is then employed to find minimum-cost paths for a given set of (stable) traffic demands. However, the unpredictable nature of traffic patterns in production DCs [6] puts MicroTE's usability into question.

The Modified Penalizing Exponential Flow-spliTing (MPEFT) [67] implemented and evaluated an online version of PEFT [74] to provide close-to-optimal TE for a variety of DC topologies using both shortest and non-shortest paths with exponential penalisation. MPEFT implements a hardware component in a switch to actively gather traffic statistics and link utilisation, which are then aggregated by a traffic optimiser. Similar to MicroTE, MPEFT is an online TE scheme that optimises network resource utilisation in short timescales. Differently from other schemes, MPEFT does not rely on static predictions of traffic demands, which have proven to be unreliable [9]. Rather, MPEFT monitors traffic demands in order to capture temporal traffic variability and then recomputes and schedules traffic to adapt to such variance. However, near-optimality is achieved only when per-packet-based scheduling is used.

4.4. Energy-aware traffic engineering

Cloud DCs are amongst the major consumers of electricity, and the trend is set for this to rise even higher. It is estimated that, of each Watt consumed, IT equipment takes about 59% of the share, 33% is attributed to cooling, and 8% is due to power distribution loss [4]. In order to reduce the energy consumed by network equipment, energy-aware routing has been proposed, using path diversity to conserve energy. For example, some schemes use as few network devices as possible to provide the routing service without compromising network performance [75]. Once the minimum required set of networking nodes has been established, the remaining idle ones can be shut down or put into sleep mode to save energy. However, if fault-tolerance is not considered, this approach can decrease the resiliency of the network under failure.

ElasticTree [76] is such an optimiser. It continuously monitors the DC's traffic conditions and then determines a set of network elements that must be powered on to meet performance and fault-tolerance requirements; switches or individual ports/links that are not needed can be shut down. ElasticTree consists of three logical modules: optimiser, routing, and power control. The optimiser takes the topology, traffic matrix and a power model, as well as the fault-tolerance properties (e.g., spare capacity), as inputs to find the minimum subset of the network that meets current traffic conditions. The output of the optimiser is a set of active components, which is passed to both the power control and routing modules. The power control is responsible for toggling the power states of switches, ports and line cards. The routing module is responsible for flow admission and installs the computed routes into the network.
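The hand-off from the optimiser to power control can be pictured as concentrating traffic onto as few links as possible and sleeping the rest. The greedy sketch below is purely illustrative and far simpler than ElasticTree's formal optimiser (it ignores topology, the power model and spare capacity), but it conveys the shape of the output.

```python
def consolidate_onto_fewest_links(flow_demands_bps, parallel_links, link_capacity_bps):
    """Greedy left-most fill over a set of parallel equal-cost links.

    Packs flow demands onto as few links as possible so that the remaining
    links carry no traffic and can be put to sleep by the power-control
    module.  Flows that fit on no link are left unassigned in this sketch.
    """
    load = {link: 0.0 for link in parallel_links}
    assignment = {}
    for flow, demand in sorted(flow_demands_bps.items(),
                               key=lambda kv: kv[1], reverse=True):
        for link in parallel_links:                      # left-most link first
            if load[link] + demand <= link_capacity_bps:
                load[link] += demand
                assignment[flow] = link
                break
    asleep = [link for link in parallel_links if load[link] == 0.0]
    return assignment, asleep
```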

4.5. Latency-aware traffic engineering

High-bandwidth Ultra-Low Latency (HULL) [77] is an architecture designed to deliver predictable ultra-low latency and high bandwidth utilisation in a DC environment. In order to achieve this goal, HULL uses a combination of three techniques: Phantom Queues, which simulate the occupancy of a queue that drains at less than the maximum link rate; adaptive response to ECN marks using DCTCP [71]; and packet pacing to smooth out bursts of packet arrivals. From both testbed and simulation experiments, it is reported that HULL mitigates tail latency by a factor of up to 10–40% through trading off network throughput [77]. In other words, HULL does not eliminate queuing delay, but prevents it from building up.

Preemptive Distributed Quick (PDQ) [78] flow scheduling is a network protocol designed to improve flow completion time in order to meet deadlines. PDQ borrows some key ideas from legacy real-time scheduling: it uses Earliest Deadline First to schedule tasks if they need to meet deadlines, or Shortest Job First if flow completion time is of higher priority. PDQ consists of a PDQ sender and a PDQ receiver. A PDQ sender sends a SYN packet to initialise a new flow and a TERM packet to terminate a flow; it is also responsible for retransmitting a packet if a timeout occurs. A PDQ receiver extracts the PDQ scheduling header from each data packet and copies it to ACK packets. However, since PDQ scheduling is a protocol that is fundamentally different from the standard protocols existing in production switches, it can only work with custom-made PDQ switches. The PDQ switches share a common flow comparator, which assesses flow criticality in order to approximate a range of scheduling disciplines [78]. PDQ requires switches to perform explicit rate control and flow state maintenance, and hence is complex to implement.

DeTail [69] is a cross-layer scheme for cutting the tail of flow completion times faced by DC network traffic flows. At the link layer, DeTail employs flow control to manage port buffer occupancies and create a loss-less fabric. Each switch in the network individually detects congestion by monitoring ingress queue occupancy, which is represented with drained-byte counters. When the counter exceeds a pre-defined threshold, the switch informs the previous hop to pause its transmission by sending a Pause message with the specified priorities. Similarly, when the drained-byte counter falls below the pre-defined threshold, the switch resumes transmission by sending an Unpause message to the previous hop. DeTail employs congestion-based load balancing at the network layer by admitting flows onto the least congested shortest paths. In comparison to PDQ, DeTail only cuts the tail of flow completion time rather than the mean completion time.
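A minimal version of the per-port pause/unpause behaviour might look as follows; the threshold value, message names and class structure are our assumptions, not DeTail's implementation.

```python
PAUSE_THRESHOLD_BYTES = 64 * 1024   # assumed per-port occupancy threshold

class IngressPort:
    """Toy per-port pause/unpause logic in the spirit of DeTail's link layer.

    Tracks ingress queue occupancy and tells the upstream hop to pause when
    occupancy crosses a threshold, and to resume once the queue drains
    below it again.
    """
    def __init__(self, notify_upstream):
        self.occupancy = 0
        self.paused = False
        self.notify_upstream = notify_upstream   # callable taking "PAUSE"/"UNPAUSE"

    def enqueue(self, nbytes):
        self.occupancy += nbytes
        if not self.paused and self.occupancy > PAUSE_THRESHOLD_BYTES:
            self.paused = True
            self.notify_upstream("PAUSE")

    def dequeue(self, nbytes):
        self.occupancy = max(0, self.occupancy - nbytes)
        if self.paused and self.occupancy <= PAUSE_THRESHOLD_BYTES:
            self.paused = False
            self.notify_upstream("UNPAUSE")
```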

Fastpass [79] is a logically centralised arbiter which allows end hosts to send at line rate while eliminating congestion at switches. This is achieved by taking the packet forwarding decision out of the end hosts and carefully scheduling all flows in a time-sharing fashion, such that each host gets a small fraction of time to use the network exclusively. The centralised arbiter also comprises a path selection component that scatters packets across all available links so that queues do not build up. In comparison with PDQ and DeTail, Fastpass does not require hardware modification, but it needs high-precision clock synchronisation and increases the mean flow completion time due to the communication delay with the controller for every packet in a flow.

Silo [80] provides cloud applications with guaranteed network bandwidth, guaranteed packet delay and a guaranteed burst allowance in order to ensure predictable network latency for their messages. Silo employs network calculus to map such multi-dimensional network guarantees to queuing constraints on network switches. Compared with other systems, Silo does not require substantial changes to hosts or network switches, and hence is readily deployable. However, Silo still relies on the predictability of future demand and makes a static allocation of bandwidth shares.

4.6. Policy-aware traffic engineering

All networks, including data centre networks, are governed by network policies. Network policy management research to date has either focused on devising new policy-based routing/switching mechanisms or on leveraging Software-Defined Networking (SDN) to manage network policies and guarantee their correctness. Joseph et al. [81] proposed PLayer, a policy-aware switching layer for DCs consisting of inter-connected policy-aware switches (pswitches). Middleboxes are placed off the network path by plugging them into pswitches in PLayer. Based on policies specified by administrators, pswitches can explicitly forward different types of traffic through different sequences of middleboxes. PLayer is efficient in enforcing network policies, but it does not consider load balancing, which is widely used in today's data centres.

Vyas et al. [82] proposed a middlebox architecture, CoMb, to actively consolidate middlebox features and improve middlebox utilisation, reducing the number of middleboxes required in operational environments. Policy-Aware Application Cloud Embedding (PACE) [83] is a framework to support application-wide, in-network policies and other realistic requirements such as bandwidth and reliability. However, these proposals are not fully designed with VM migration in consideration, and may put migrated VMs at risk of policy violation and performance degradation.

Recent developments in SDN enable more flexible middlebox deployments over the network while still ensuring that specific subsets of traffic traverse the desired set of middleboxes [84]. Kazemian et al. [85] presented NetPlumber, a real-time policy-checking tool with sub-millisecond average run-time per rule update, and evaluated it on three production networks including Google's SDN, the Stanford backbone and Internet2. Zafar et al. [86] proposed SIMPLE, an SDN-based policy enforcement scheme to steer DC traffic in accordance with policy requirements. Similarly, Fayazbakhsh et al. presented FlowTags [87] to leverage SDN's global network visibility and guarantee the correctness of policy enforcement. While these proposals consider policy enforcement as well as traffic dynamism, they require significant network status updates when VM migrations happen.

SYNC [88] and PLAN [89] study the impact of correct policy implementation in the dynamic VM migration environment, where a change of end-point could imply a violation of network policies. Both schemes overcome the difficulty by jointly considering the network demand of VMs and the policy chaining requirements which demand specific network paths. The problem was modelled as an NP-hard stable matching problem. Scalable and fast online heuristic algorithms have been proposed to approximate the optimal solution.

4.7. Open research issues

Most TE approaches and schemes discussed in this section share a common overall objective: to provide a predictable, high-bandwidth network under highly variable traffic demands while also meeting other criteria such as, e.g., energy consumption minimisation. The common underlying control loop includes monitoring, detecting, and adapting promptly to problematic link load, providing a model that reacts to adverse conditions such as congestion.

The transient load imbalance induced by load-agnostic flow admission can significantly affect other flows using a heavily-utilised link that is common to both routes. Flows contending for the bandwidth of the shared link are more likely to create congestion, which in turn causes packet drops for flows sharing the same bottleneck link. In most TCP implementations, packet loss will trigger packet retransmission when the retransmission timer expires or when fast-retransmit conditions are met. This additional latency can be a primary contributor to the degradation of network performance, since the retransmission timeout is a factor of 100 or more larger than the round-trip time over a DC network environment.

Traffic flows are usually shuffled over shortest paths between communicating hosts. In some cases, however, selecting a non-shortest path can be advantageous for avoiding congestion or routing around a faulty path/node [67]. The surveyed proposals in this section only use multiple equal-cost paths in the DC environment. In comparison, Baatdaat [90] and MPEFT [67] opportunistically include non-shortest paths for packet forwarding. However, finding flow routes in a general network while not exceeding the capacity of any link is the multi-commodity flow problem, which is NP-complete for integer numbers of flows. Hence, in practice, the routing algorithm may consider non-shortest paths constrained to no more than n hops longer than the shortest path, since this does not significantly increase computational complexity [90].

The performance of current DC networks can be significantly improved if traffic flows are adequately managed to avoid congestion on bottleneck links. This can be achieved by employing more elegant TE to offload traffic from congested links onto spare ones, alleviating the need for topological upgrades.

Measurement-based traffic engineering techniques such as Baatdaat [90] and MPEFT [67] can play an essential role in responding to immediate load fluctuations. In contrast to reactive traffic engineering such as MicroTE [10], Baatdaat employs a measure-avoid-adapt proactive control scheme based on network programmability.

Baatdaat uses direct measurement of link utilisation through dedicated switch-local hardware modules, and constructs a network-wide view of temporal bandwidth utilisation over short timescales through centralised SDN controllers. Subsequently, it schedules flows over both shortest and non-shortest paths to further exploit path redundancy and spare network capacity in the DC. It is shown in [90] that direct measurement of link utilisation can help make better flow scheduling decisions, which result in considerable improvement in maximum link utilisation. We reproduce in Fig. 4 the experimental results under different settings in Baatdaat. It can be seen that different measurement (and control) intervals result in distinctly different performance: Baatdaat's performance gain over ECMP varies with the measurement timescale, and finer granularity yields better improvement. Even though the improvement is not uniform, in some regions it can reach 20% over ECMP, while the practical measurement overhead is very low, especially if a dedicated management topology is used.

Fig. 4. Granularity of temporal link load measurement (extracted from [90]) has a large impact on performance.

5. End-to-end flow control and management

TCP is currently the most widely-used transport protocol, carrying about 85% of the traffic on the Internet [91] and over Cloud DCs. Originally, TCP was designed for long-distance, Wide Area Network (WAN) communication with relatively long latencies and low bandwidth; DC characteristics are significantly different, with Round Trip Times (RTT) below 250 μs, high throughput and a single administrative authority. Under these characteristics, TCP is known to under-utilise network bandwidth, leading in some cases to low throughput and high latency [92,93]. To improve throughput, large buffers have been used throughout the network to reduce the number of retransmissions; however, large buffers cause side effects such as long latencies and traffic synchronisation, and prevent congestion avoidance algorithms from reacting promptly to congestion events, leading to buffer-bloat [94]. TCP variants have been proposed to enhance network utilisation over DC environments; however, supporting existing applications and workloads while keeping deployment complexity low proves to be a challenging task.

5.1. Transport protocols for data centre networks

Most typical DC workloads, such as search engines, data mining or distributed file systems, follow the partition-aggregate paradigm, where work is distributed amongst multiple machines and, once each machine has computed a partial result, it is aggregated back into a single point [8]. DC traffic generates two types of flows: mice flows, which represent 99% of the flows, are small in size (less than a megabyte) and delay-sensitive; and elephant flows of aggregate data, which carry most of the bytes over the DC network. These large flows are throughput-sensitive and bound by overall long completion times [8,71]. While mice flows are created by the query-response mechanism of the partition-aggregate paradigm, elephant flows come from synchronisation mechanisms such as distributed file system replication, database updates, and VM image migration.

One of the issues is that TCP's conservative nature requires a constant value for the Initial Window (IW) that cannot match the different environment requirements from WAN to DC. If the IW is smaller than the congestion window reached during the congestion avoidance phase, new flows will under-utilise the link until enough RTTs have elapsed and the bottleneck capacity has been reached, or the flow will terminate before exiting the slow-start phase. Over a fast DC network with low latencies, the IW can overshoot the bottleneck capacity, triggering packet loss and unfairness to other flows.

Partition-aggregate patterns generate bursts of ON-OFF traffic that can cause packets to be dropped or delayed [8]. With conservative TCP parameters, the sender waits some time before a packet is retransmitted; however, this retransmission delay can be too long for a packet to meet its deadline.

TCP has been shown to have significant issues in DCs, mostly because of its congestion avoidance mechanism. The low throughput under bursts of flows and many-to-one communication is referred to as Throughput Incast Collapse [70,72,92]. Due to this issue, Facebook reportedly switched to UDP in order to have tighter control over the congestion mechanisms and avoid the adverse impact of TCP on achievable throughput [71]. Facebook implemented a UDP sliding window mechanism, with a window size inversely proportional to the number of concurrent flows, to solve incast collapse when receiving memcache responses, halving the request time [95]. In order to fully utilise the DC network infrastructure, new protocols have been designed to provide better performance, maximise throughput, minimise latency, or reduce queue build-up.
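The reported sliding-window idea can be sketched as follows; the total window budget and its division are illustrative assumptions rather than Facebook's actual implementation:

```python
TOTAL_WINDOW = 64  # aggregate outstanding requests the receiver can absorb (assumed)

def per_flow_window(concurrent_flows):
    """Window size inversely proportional to the number of concurrent
    flows, so that aggregate in-flight data stays roughly constant."""
    return max(1, TOTAL_WINDOW // max(1, concurrent_flows))

for flows in (1, 4, 16, 64):
    print(flows, per_flow_window(flows))
# 1 -> 64, 4 -> 16, 16 -> 4, 64 -> 1
```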

A TCP enhancement, GIP [72], has recently been proposed to remedy TCP incast throughput collapse. It has been identified that two types of time-outs (TOs), termed Full window Loss TimeOut (FLoss-TO) and Lack of ACKs (LAck-TO), are the major TOs that cause the TCP incast problem. To avoid these TOs, GIP reduces the congestion window at the start of each traffic burst (stripe unit) and retransmits the last packet of every stripe unit up to three times. However, GIP does not deal with TOs due to packet loss, and in high-speed environments with small buffers, the extra retransmission of the last packet can needlessly increase buffer occupancy.

Data Center TCP (DCTCP) [71] leverages ECN in modern DCs to derive multi-bit feedback from a single-bit stream. Instead of treating each ECN-marked packet as a congestion event, it uses the fraction of marked packets to pace the sending rate. In doing so, the presence and extent of congestion can be estimated. The main concept of this approach is to keep the buffer occupancy of each switch as low as possible to prevent new packets from being delayed. DCTCP requires support from the kernel in both the sending and receiving hosts, as well as Active Queue Management (AQM) with ECN support in the switches. DCTCP has been shown to be a significant step forward in preventing throughput collapse; however, it still reacts to an occurring congestion event instead of preventing such congestion from happening. DCTCP has now been included in Windows Server 2012 and Linux 3.18 as an alternative congestion control algorithm.
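DCTCP's pacing rule can be summarised by its two update equations from [71]: the congestion estimate alpha <- (1 - g) * alpha + g * F, where F is the fraction of ECN-marked packets in the last window of data, and the window reduction cwnd <- cwnd * (1 - alpha / 2) when marks are present. A minimal sender-side sketch follows; the initial window and the usage values are illustrative:

```python
class DCTCPState:
    """Simplified DCTCP sender-side congestion estimate (per [71]).
    g is the EWMA gain; the value 1/16 is the one used in the paper."""
    def __init__(self, cwnd=10.0, g=1.0 / 16):
        self.cwnd = cwnd
        self.alpha = 0.0
        self.g = g

    def on_window_acked(self, marked, total):
        # F: fraction of packets ECN-marked in the last window of data.
        f = marked / total if total else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * f
        if marked:
            # Scale back proportionally to the extent of congestion,
            # instead of halving the window as standard TCP would.
            self.cwnd *= (1 - self.alpha / 2)
        else:
            self.cwnd += 1  # standard additive increase per window

s = DCTCPState()
s.on_window_acked(marked=3, total=10)   # mild congestion -> small decrease
```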

D3 attempts to treat a DC as a soft real-time system, with each flow having deadline requirements and incurring a revenue loss if they are not met. It requires a new protocol that uses explicit rate control to apportion bandwidth according to flow deadlines. To calculate the transmission rate, D3 measures the number of flows traversing the interface using flow initiation and termination packets (SYN and FIN) [96]. D3 is built on top of DCTCP. However, it has been shown in [97] that D3 can make unfair bandwidth allocations.

TCP variants have been proposed to avoid queue build-up and therefore prevent high latencies. The Rate Control Protocol (RCP) [98] and the Variable-structure Congestion Control Protocol (VCP) [99] aim to estimate link congestion and avoid queue build-up, minimising flow completion time while remaining TCP-friendly. However, both of these protocols require end-host and switch support. For long-distance, high-latency environments, a significant number of protocols such as STCP, FAST TCP and XCP have been proposed; these protocols have requirements opposite to those of a DC. XCP uses a generalisation of ECN to provide explicit feedback instead of relying on packet drops or the binary mechanism of ECN. FAST TCP estimates the base RTT of the network and uses this value, together with the current RTT, to estimate the current length of the buffers; the sending rate is then adjusted with respect to the number of packets in the queues. These protocols have been optimised to achieve high throughput for long-lived flows over Long Fat Pipes (LFP).
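The buffer estimate FAST TCP relies on follows from the relation between the measured RTT and the propagation-only baseRTT: with a window of w packets in flight, roughly w * (RTT - baseRTT) / RTT of them are queued in buffers. A hedged sketch of this estimate (not the full FAST TCP window update) is shown below; the numbers are illustrative:

```python
def packets_in_queue(window, rtt, base_rtt):
    """Estimate how many of the in-flight packets are queued in buffers,
    given the measured RTT and the propagation-only baseRTT."""
    if rtt <= 0:
        return 0.0
    return window * (rtt - base_rtt) / rtt

# Illustrative numbers: 100 packets in flight, 500 us RTT vs 250 us baseRTT
# -> roughly half of the window is sitting in buffers along the path.
print(packets_in_queue(window=100, rtt=500e-6, base_rtt=250e-6))  # 50.0
```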

One recent proposal to tackle latency is TIMELY [100], which reconsiders the applicability of the Round Trip Time (RTT) for estimating queue occupancy. RTT had not been considered in previous proposals because it is prone to noise, such as system interrupts and processing in the OS stack, given the small (tens of microseconds) end-to-end delays in data centre environments. TIMELY overcomes this limitation by using newly available hardware-assisted NIC timestamps to bypass the OS stack. Once the RTT is measured, TIMELY computes the delay gradient, which reflects how quickly the queue is building or draining, and uses it to compute the target sending rate. TIMELY does, however, require fine-grained and high-precision RTT estimation.
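The control signal TIMELY acts on is therefore the RTT gradient rather than the RTT itself. The sketch below illustrates the gradient computation and the additive-increase/multiplicative-decrease style rate adjustment it drives; the constants and the smoothing step are illustrative assumptions rather than the exact values and full algorithm from [100]:

```python
class TimelySketch:
    """Gradient-based rate control in the spirit of TIMELY [100].
    All constants (alpha, delta, beta, min_rtt) are illustrative only."""
    def __init__(self, rate=10e9, min_rtt=20e-6):
        self.rate = rate          # current sending rate (bit/s)
        self.prev_rtt = None
        self.gradient = 0.0
        self.min_rtt = min_rtt    # used to normalise the gradient
        self.alpha = 0.125        # EWMA gain for gradient smoothing
        self.delta = 10e6         # additive increase step (bit/s)
        self.beta = 0.8           # multiplicative decrease factor

    def on_completion(self, rtt):
        if self.prev_rtt is not None:
            raw = (rtt - self.prev_rtt) / self.min_rtt
            self.gradient = (1 - self.alpha) * self.gradient + self.alpha * raw
            if self.gradient <= 0:      # queues draining: probe for bandwidth
                self.rate += self.delta
            else:                       # queues building: back off proportionally
                self.rate *= max(0.0, 1 - self.beta * self.gradient)
        self.prev_rtt = rtt

t = TimelySketch()
for rtt in (20e-6, 22e-6, 26e-6):   # rising RTT -> positive gradient -> rate decreases
    t.on_completion(rtt)
```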

The issue with TCP is that it is complicated to keep throughput high while keeping buffer occupancy low [71]. Without bloating in-network buffers, the end hosts must be able to pace the delivery of packets to match the characteristics of the link. However, such characteristics are commonly unknown to the end hosts and vary depending on the number of connected hosts and active concurrent flows. In order to establish that a packet has been dropped or lost, algorithms such as F-RTO can be used, but the last resort is to rely on timeouts. Such timers must be long enough not to worsen congestion by duplicating packets, but also short enough to avoid long delays between transmissions and therefore low throughput.

The current trend of using commodity rather than DC-specific hardware suggests that application-specific hardware is unlikely to be deployed in such infrastructures; hence, algorithms requiring topology or hardware changes are unlikely to be deployed in production environments.

5.2. Open research issues

DC providers have full control over their infrastructure, allowing full network-wide knowledge of the topology, bandwidth, latency, and network element properties (e.g., switch buffer sizes).

Therefore, the default conservative values used to cope with the unknown characteristics of the Internet can be altered to match the network properties. Recently, with the wide deployment of Software Defined Networking (SDN), especially within DC networks [101], the current state of the network can be aggregated at a single controller or a hierarchy of controllers, and subsequently used to distribute network knowledge to the end hosts in short timescales [28]. Amazon, Google, and Microsoft all reported a loss in revenue when response time increased by 100 ms, creating a soft real-time constraint on mice flows [102]. Because mice flows are delay-sensitive, it is necessary to prevent the buffer occupancy of the switches from growing too large under traffic bursts, as new flows would otherwise be delayed significantly. Due to the inefficiency of TCP in DC networks, the surveyed proposals concentrated on designing alternative congestion control algorithms for TCP. In comparison, Omniscient TCP (OTCP) [73,103] tackles this by exploiting SDN to tune TCP parameters for the operating environment.

With SDN, the TCP parameters can be tuned in real time with respect to the current network state, preventing buffers from queuing up too much data. If the intra-DC Bandwidth Delay Product (BDP) is known alongside the number of flows on each link, the initial congestion window (IW) can be accurately calculated to match the temporal network properties and increase network-wide throughput. Each flow's IW then corresponds to a fair slice of the network capacity, and the total number of in-flight packets matches the BDP of the network with no buffer occupancy. By tuning the IW based on the end-to-end intra-DC BDP, the amount of in-flight packets can be reduced to match the link properties and hence reduce buffering. Such an approach also prevents undershoot or overshoot of the IW size, which can in turn lead to a long slow-start phase or to packet loss in the first transmission, respectively.
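This IW calculation amounts to dividing the path's bandwidth-delay product by the number of flows sharing the bottleneck. A minimal sketch follows, with illustrative link speed, RTT and flow count, and an assumed 1500-byte segment size:

```python
def initial_window_segments(link_bps, rtt_s, concurrent_flows, mss_bytes=1500):
    """Initial congestion window (in segments) sized so that the aggregate
    in-flight data of all flows matches the path BDP with no buffering."""
    bdp_bytes = link_bps * rtt_s / 8
    per_flow_bytes = bdp_bytes / max(1, concurrent_flows)
    return max(1, int(per_flow_bytes // mss_bytes))

# 10 Gb/s link, 250 us RTT, 8 concurrent flows:
# BDP = 312,500 bytes -> ~39,000 bytes per flow -> IW of 26 segments.
print(initial_window_segments(10e9, 250e-6, 8))  # 26
```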

Reducing in-network buffering and shortening slow-start will decrease the overall latency and improve network utilisation.

Buffering can be physically reduced by decreasing the size of SRAM in switches, which also reduces hardware cost. However, with shallow buffers, throughput can be significantly lower under bursty traffic due to a high number of packet drops and synchronised retransmissions (Incast Collapse). Carefully tuning the Minimum Retransmission Timeout (minRTO) allows high throughput to be achieved while keeping the latency low [92,93].

Omniscient TCP (OTCP) [73,103] uses an SDN controller to keep the global state of the network and tune the minRTO and IW while a new route is being set up. This work shows that the bursty nature of DC traffic, combined with large buffers and statically assigned congestion control parameters, can significantly delay and slow down the transfer of new incoming flows. However, solely reducing the buffer sizes can prevent high throughput from being achieved if the default value of minRTO is used. Overall, the measurement-based IW estimation allows for reduced buffer occupancy and consequently bounds the latency, and a smaller minRTO allows throughput to be increased by pushing the congestion control logic back to the end-hosts.
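Combining the two knobs, an OTCP-style controller could derive both the IW and the minRTO from its global view when installing a new route. The sketch below is a hedged illustration of that control step; the function, the returned parameter names and the RTO safety factor are hypothetical rather than OTCP's actual interface:

```python
def tcp_params_for_route(link_bps, rtt_s, concurrent_flows,
                         mss_bytes=1500, rto_factor=4):
    """Per-route TCP parameters an SDN controller could hand to end hosts:
    an IW matching this flow's share of the path BDP, and a minRTO set to
    a small multiple of the intra-DC RTT (rto_factor is an assumed margin)."""
    bdp_bytes = link_bps * rtt_s / 8
    iw = max(1, int(bdp_bytes / max(1, concurrent_flows) // mss_bytes))
    min_rto = rto_factor * rtt_s
    return {"initial_window": iw, "min_rto_seconds": min_rto}

# On route setup for a 10 Gb/s path with a 250 us RTT shared by 8 flows:
print(tcp_params_for_route(10e9, 250e-6, 8))
# {'initial_window': 26, 'min_rto_seconds': 0.001}
```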

6. Research challenges and opportunities

DCs are built on top of legacy hardware and software technologies currently deployed within ISP networks. Cloud operators often assume high similarity between the two environments and hence employ similar resource management principles: static resource admission and over-provisioning [15]. However, there are fundamental differences that are becoming apparent relatively early in the Cloud DCs' lifetime and will only intensify as their utilisation and commoditisation increase. The main ones relate to the level of aggregation at which resources are provisioned and managed, and to the provisioning timescales. Over the Internet, ISPs operate a relatively limited set of functions on traffic aggregates over long timescales. They can therefore rely on over-provisioning to accommodate short-term fluctuations in load, so long as aggregate demand is predictable over long timescales. On the contrary, Cloud DC operators manage a converged ICT environment where a plethora of diverse resources need to be provisioned over short timescales, and at a much finer granularity, at the level of individual flows, links, virtual machine images, etc. The consequent demand is therefore highly unpredictable over both short and long timescales, and DC operators need to respond to rapidly-changing usage patterns, as has been demonstrated in a number of Cloud DC measurement studies [10,90].

At the same time, the collocation and central ownership of compute and network resources by a single Cloud service provider offers a unique opportunity for DC infrastructures to be provisioned in an adaptive, load-sensitive and converged manner, so that their usable capacity headroom and return on investment are increased, making Cloud computing infrastructures sustainable in the long term.
