Mosharaf Chowdhury (https://mosharaf.com)

Jie Graduates. Congrats Dr. You!
https://mosharaf.com/blog/2022/08/05/jie-graduates-congrats-dr-you/ (Fri, 05 Aug 2022)

Jie (Jimmy) has recently become my third PhD student to graduate, with a dissertation titled "Toward Practical Application-Aware Big Data Systems." Over the course of his PhD, Jimmy ended up contributing to most of our group's major projects, and we will miss him going forward. He is joining Meta.

Jimmy officially started his PhD at Michigan in Fall 2016, but he began working with me in Summer 2017. He started on one of SymbioticLab's first projects, on geo-distributed analytics, which took many names over the years (Gaia/Terra/System H). While the core project didn't go as expected, it created tools and artifacts that fed into other projects Jimmy contributed to, published at SPAA and NSDI. After that, Jimmy learned that sometimes it's better to come up with your own idea, and he successfully led, end to end, the Kayak project, which provided a middle ground between KV- and RPC-based systems. This was done in collaboration with Xin and his group. After moving from high-latency to low-latency networks, Jimmy's final project went in yet another direction. In Zeus, Jimmy and Jae-Won collaborated on understanding and optimizing the energy consumption of DNN training. I think Zeus is the best of his works and will have a lasting impact. While it was stressful as an advisor to see him change course so many times during his PhD, it was also fun to see him eventually find his footing on his own terms.

Jimmy is very inquisitive, which led to him exploring many different things during his PhD. He is also very good at taking feedback and improving himself, which he’s clearly demonstrated over the past five years. I’m sure he’ll ask many questions and explore many new things in the next chapter(s) of his career.

ModelKeeper and Zeus Accepted to Appear at NSDI'2023
https://mosharaf.com/blog/2022/07/12/modelkeeper-and-zeus-accepted-to-appear-at-nsdi2023/ (Tue, 12 Jul 2022)

Deep learning, and machine learning in general, is taking over the world. It is, however, quite expensive to tune, train, and serve deep learning models. Naturally, improving the efficiency and performance of deep learning workflows has received significant attention (Salus, Tiresias, and Fluid, to name a few). Most existing works, including our own prior ones, focus on two primary ways of improving efficiency, and resource efficiency at that: packing work as tightly as possible (placement) and scheduling over time; some apply both together. None focus on energy efficiency. ModelKeeper improves resource efficiency by avoiding work altogether, while Zeus improves energy efficiency instead of focusing solely on resource usage.

ModelKeeper

We know scheduling and placement can improve the efficiency of resource usage, but even with optimal algorithms one cannot, in the general case, reduce the amount of work that needs to be done. This simple observation led us to explore how we can reduce the amount of work needed to train DNN models. It turns out that instead of starting from random weights and training until they converge to their final values, one can better initialize a model before training starts and short-circuit the process! By identifying similar models that have already been trained in the past, one can reduce the number of iterations a new model needs to converge.

With the growing deployment of machine learning (ML) models, ML developers are training or re-training an increasing number of deep neural networks (DNNs). They do so to find the most suitable model that meets their accuracy requirements while satisfying the resource and timeliness constraints of the target environment. In large shared clusters, the growing number of neural architecture search (NAS) and training jobs often results in models sharing architectural similarities with others from the same or a different ML developer. However, existing solutions do not provide a systematic mechanism to identify and leverage such similarities.

We present ModelKeeper, the first automated training warmup system that accelerates DNN training by repurposing previously-trained models in a shared cluster. Our key insight is that initializing a training job’s model by transforming an already-trained model’s weights can jump-start it and reduce the total amount of training needed. However, models submitted over time can differ in their architectures and accuracy. Given a new model to train, ModelKeeper scalably identifies its architectural similarity with previously trained models, selects a parent model with high similarity and good model accuracy, and performs structure-aware transformation of weights to preserve maximal information from the parent model during the warmup of new model weights. Our evaluations across thousands of CV and NLP models show that ModelKeeper achieves 1.3×–4.3× faster training completion with little overhead and no reduction in model accuracy.
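As a rough illustration of the warm-start idea (a deliberately simplified sketch, not ModelKeeper's actual matching algorithm or API), matching layers by name and shape and inheriting the parent model's trained weights might look like this:

```python
def warm_start(child_arch, parent_arch, parent_weights):
    """Initialize a child model from a previously trained parent.

    child_arch / parent_arch: dict of layer name -> weight shape (tuple).
    parent_weights: dict of layer name -> trained weights (any object).
    Returns (init_weights, matched_layers): layers whose name and shape
    match the parent inherit its trained weights; the rest are left for
    fresh random initialization (marked None here).
    """
    init, matched = {}, []
    for name, shape in child_arch.items():
        if parent_arch.get(name) == shape:
            init[name] = parent_weights[name]  # inherit trained weights
            matched.append(name)
        else:
            init[name] = None  # fall back to fresh initialization
    return init, matched
```

In the real system, similarity is computed over architectures rather than layer names, and the weight transformations are structure-aware so that even non-identical layers can preserve information from the parent; this sketch only conveys the overall flow.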

Fan started the ModelKeeper project with Yinwei in late 2020, while Oort was making the rounds and FedScale was in its infancy. With his internship at Meta in the middle and many other projects he has been working on, the ModelKeeper submission was pushed back a couple of times. In hindsight, the extra time significantly improved the quality of the work. While the setting considered in this paper is cloud computing, ModelKeeper is likely to become an integral part of the greater FedScale project as well, to speed up federated learning.

ModelKeeper is yet another collaboration between Harsha and me. Hopefully, we will continue to collaborate even after Harsha moves to USC in Winter 2023.

Zeus

With ever-increasing model sizes, the cost of DNN training is rising rapidly. While the monetary cost is discussed often, DNN training has an implicit energy cost as well. For example, training the GPT-3 model consumed 1,287 megawatt-hours (MWh), equivalent to 120 years of electricity consumption for an average U.S. household. In this pioneering work, we take the first step toward better understanding and then optimizing the energy consumption of DNN training. Specifically, we optimize the batch size and GPU power cap of recurring training jobs to provide a better tradeoff between the energy consumed and the accuracy attained.

Training deep neural networks (DNNs) is becoming increasingly more resource- and energy-intensive every year. Unfortunately, existing works primarily focus on optimizing DNN training for faster completion, often without considering the impact on energy efficiency.

In this paper, we observe that common practices to improve training performance can often lead to inefficient energy usage. More importantly, we demonstrate that there is a tradeoff between energy consumption and performance optimization. To this end, we propose an optimization framework, Zeus, to navigate this tradeoff by automatically configuring job- and GPU-level configurations of recurring DNN training jobs. Zeus uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements, while adapting to workload changes and data drifts over time. Our evaluation shows that Zeus can improve the energy efficiency of DNN training by 18.7%-72.8% for diverse workloads.
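To make the tradeoff concrete, here is a minimal sketch (with illustrative names and numbers, not Zeus's actual implementation) of picking a (batch size, power limit) configuration that minimizes a blended energy-time cost, where a knob `eta` sweeps between optimizing purely for energy and purely for time:

```python
def best_config(measurements, eta=0.5, max_power=300.0):
    """Pick the (batch_size, power_limit) minimizing a blended cost.

    measurements: dict mapping (batch_size, power_limit) ->
                  (energy_joules, time_seconds) for one recurring job.
    The cost blends energy and time:
        eta * energy + (1 - eta) * max_power * time
    so eta = 1 optimizes purely for energy and eta = 0 purely for time;
    max_power scales the time term into energy units (watts * seconds).
    """
    def cost(et):
        energy, time = et
        return eta * energy + (1 - eta) * max_power * time
    return min(measurements, key=lambda k: cost(measurements[k]))
```

A usage example: with one configuration that is fast but power-hungry and another that is slow but frugal, `eta=1.0` selects the frugal one and `eta=0.0` the fast one; intermediate values trade the two off.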

Zeus started sometime around Fall 2020/Winter 2021 with Jimmy. At the end of Winter, when Jimmy left for his internship, we had a basic idea of a problem and one motivating plot that would eventually drive our efforts. With the arrival of Jae-Won in Fall 2021 as a first-year student and Jimmy back from Meta, we picked up the pace, which eventually led to its submission. Zeus is the first Treehouse project and my first foray into anything energy-related. We had a lot to learn, but I was in the capable hands of Jimmy and Jae-Won, who learned much and taught me much. And we haven't even scratched the surface!

To work on many more exciting projects like these, join SymbioticLab!

Peifeng has Phinished. Congrats Dr. Yu!
https://mosharaf.com/blog/2022/06/02/peifeng-has-phinished-congrats-dr-yu/ (Thu, 02 Jun 2022)

Peifeng just became my second student to finish his PhD a few days ago after successfully defending his dissertation "Application-Aware Scheduling in Deep Learning Software Stacks." This is a big loss for the SymbioticLab, as we will miss his presence and deep technical insights. Peifeng is joining Google to continue working on resource management systems for AI/ML.

Peifeng officially started his PhD in Fall 2017, but he had been working with me on and off since the Fall before, when he took EECS 582 with me as a master's student at UM. Peifeng and his friend Linh were working on a term project on video captioning for that course, but Peifeng was more interested in designing better systems for AI/ML than in simply applying existing ML techniques to different use cases. Although I did not know anything about systems for AI/ML at the time, Peifeng pulled me into this world. Since then, he has worked on several groundbreaking projects, including Salus and Fluid, with Orloj, an even more exciting project, in the pipeline to be published. Salus was the first software GPU sharing solution to provide significantly higher utilization than NVIDIA MPS; Fluid was the first to leverage the collective nature of jobs in hyperparameter tuning to improve GPU- and cluster-level utilization; and Orloj is the first inference system to provide predictable performance for dynamic DNNs while maintaining best-in-class performance for traditional static DNNs. I enjoyed this journey thoroughly, learned a lot in the process, and am really proud to be called his advisor.

Peifeng is one of the best (ML) systems developers I have ever seen (and I have seen many luminaries over the years). He cares more about doing his work than hyping it up. He is also unbothered by the publications rat race, to the point of causing advisor anxiety.

I have no doubt he will be extremely successful in whatever he sets his mind to.

Promoted to Associate Professor With Tenure!
https://mosharaf.com/blog/2022/05/20/promoted-to-associate-professor-with-tenure/ (Sat, 21 May 2022)

All credit to my students and collaborators who helped realize many groundbreaking ideas; the hundreds of students I taught at the undergraduate and graduate levels; my mentors who continue to advise me through good times and bad; everyone who wrote letters for me; the countless reviewers and panelists who improved our papers and proposals; the funding entities (especially NSF) that support our large research program; those who invited me to be on and learn from program committees, panels, meetings, and talks; my research community as a whole; everyone in CSE at Michigan; and countless others I'm sure I'm forgetting. Thanks to everyone from the depths of my heart.

Extra special thanks to my family who made countless sacrifices in unimaginable ways.

Well, it’s done now, but I don’t feel any different; only more energized to do bigger, better things. If you want to work on challenging problems, join SymbioticLab. We’re only getting started!

FedScale Accepted to Appear at ICML'2022
https://mosharaf.com/blog/2022/05/16/fedscale-accepted-to-appear-at-icml2022/ (Mon, 16 May 2022)

Although theoretical federated learning (FL) research is growing exponentially, we are far from putting those theories into practice. Over the course of the last few years, SymbioticLab has made significant progress in building deployable FL systems, with Oort being the most prominent example. As I discussed in the past, while evaluating Oort we observed the weaknesses of existing FL workloads/benchmarks: they are too small and sometimes too homogeneous to highlight the uncertainties FL deployments would face in the real world. FedScale was born out of the necessity to evaluate Oort. As we worked on it, we added more and more datasets to create a diverse benchmark that contains not only workloads to evaluate FL but also traces to emulate real-world end-device characteristics. Eventually, we also started building a runtime that one can use to implement any FL algorithm within FedScale. For example, Oort can be implemented in a few lines in FedScale, as can PyramidFL (MobiCom'22), a more recent work based on Oort. This ICML paper gives an overview of the benchmarking aspects of FedScale for ML/FL researchers, while providing a quick intro to the systems runtime that we are continuously working on and plan to publish later this year.

We present FedScale, a diverse set of challenging and realistic benchmark datasets to facilitate scalable, comprehensive, and reproducible federated learning (FL) research. FedScale datasets are large-scale, encompassing a wide range of important FL tasks, such as image classification, object detection, word prediction, speech recognition, and sequence prediction in video streaming. For each dataset, we provide a unified evaluation protocol using realistic data splits and evaluation metrics. To meet the pressing need for reproducing realistic FL at scale, we build an efficient evaluation platform to simplify and standardize the process of FL experimental setup and model evaluation. Our evaluation platform provides flexible APIs to implement new FL algorithms, and includes new execution backends (e.g., mobile backends) with minimal developer effort. Finally, we perform systematic benchmark experiments on these datasets. Our experiments suggest fruitful opportunities in heterogeneity-aware co-optimizations of system and statistical efficiency under realistic FL characteristics. FedScale will be open-source and actively maintained, and we welcome feedback and contributions from the community.

Fan and Yinwei had been working on FedScale for more than two years, with some help from Xiangfeng toward the end of Oort. During this time, Jiachen and Sanjay joined, first as users of FedScale and later as contributors. Of course, Harsha is with us, as in all our past FL projects. Including this summer, close to 20 undergrads and master's students have worked on/with/around it. At this point, FedScale has become the largest project in the SymbioticLab, with interest from academic and industry users within and outside Michigan, and there is an active Slack channel where users from many different institutions collaborate. We are also organizing the first FedScale Summer School this year. Overall, FedScale reminds me of another small project called Spark that I was part of many years ago!

This is my/our first paper at ICML, or any ML conference for that matter, even though it's not necessarily a core ML paper. This year, ICML received 5,630 submissions. Among these, 1,117 were accepted for short presentations and 118 for long presentations, for a 21.94% acceptance rate; FedScale is one of the former. These numbers are mind-boggling for someone from the systems community!

Join us in making FedScale even bigger, better, and more useful, as a member of SymbioticLab or as a FedScale user/contributor. Now that we have the research vehicle, the possibilities are limitless. We are exploring maybe fewer than 10 such ideas, but hundreds are waiting for you.

Visit http://fedscale.ai/ to learn more.

Apache Spark Receives the 2022 ACM SIGMOD Systems Award
https://mosharaf.com/blog/2022/05/13/apache-spark-receives-the-2022-acm-sigmod-systems-award/ (Sat, 14 May 2022)

Congratulations to the whole Spark community, its 1,800+ contributors, and its innumerable users on this prestigious award. As I look back, memories from years ago continue to remind me how fortunate I was to be able to make small contributions at the infancy of this juggernaut.

Aequitas Accepted to Appear at SIGCOMM'2022
https://mosharaf.com/blog/2022/05/07/aequitas-accepted-to-appear-at-sigcomm2022/ (Sun, 08 May 2022)

Although Remote Procedure Calls (RPCs) generate the bulk of the network traffic in modern disaggregated datacenters, network-level optimizations still focus on low-level metrics at the granularity of packets, flowlets, and flows. The lack of application-level semantics leads to suboptimal performance, as there is often a mismatch between network- and application-level metrics. Indeed, my works on coflows relied on a similar observation for distributed data-parallel jobs. In this work, we focus on leveraging application- and user-level semantics of RPCs to improve application- and user-perceived performance. Specifically, we implement quality-of-service (QoS) primitives at the RPC level by extending classic works on service curves and network calculus. We show that this can be achieved using an edge-based solution in conjunction with traditional QoS classes in legacy/non-programmable switches.

With the increasing popularity of disaggregated storage and microservice architectures, high fan-out and fan-in Remote Procedure Calls (RPCs) now generate most of the traffic in modern datacenters. While the network plays a crucial role in RPC performance, traditional traffic classification categories cannot sufficiently capture their importance due to wide variations in RPC characteristics. As a result, meeting service-level objectives (SLOs), especially for performance-critical (PC) RPCs, remains challenging.

We present Aequitas, a distributed sender-driven admission control scheme that uses commodity Weighted-Fair Queuing (WFQ) to guarantee RPC-level SLOs. Aequitas maps PC RPCs to higher weight queues. In the presence of network overloads, it enforces cluster-wide RPC latency SLOs by limiting the amount of traffic admitted into any given QoS and downgrading the rest. We show analytically and empirically that this simple scheme works well. When the network demand spikes beyond provisioned capacity, Aequitas achieves a latency SLO that is 3.8× lower than the state-of-the-art congestion control at the 99.9th-percentile and admits up to 2× more PC RPCs meeting SLO when compared with pFabric, Qjump, D3, PDQ, and Homa. Results in a fleet-wide production deployment at a large cloud provider show a 10% latency improvement.
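The admission-control idea can be sketched with a toy sender-side classifier (the class names and windowing below are illustrative assumptions, not Aequitas's actual design): performance-critical RPCs are admitted into the high-priority QoS class only up to a target fraction, and the excess is downgraded to best-effort so that admitted traffic keeps meeting its latency SLO:

```python
class AdmissionControl:
    """Toy sender-side admission control in the spirit of Aequitas.

    Performance-critical (PC) RPCs are admitted into the high-priority
    QoS class only while the running admitted fraction stays under a
    target probability; excess PC RPCs are downgraded to best-effort.
    """
    def __init__(self, admit_prob):
        self.admit_prob = admit_prob  # target fraction of PC RPCs to admit
        self.seen = 0                 # PC RPCs observed so far
        self.admitted = 0             # PC RPCs admitted to high QoS

    def classify(self, is_pc):
        if not is_pc:
            return "best_effort"      # non-PC traffic never contends for high QoS
        self.seen += 1
        # Admit while the running admitted fraction is under the target.
        if self.admitted < self.admit_prob * self.seen:
            self.admitted += 1
            return "high_qos"
        return "best_effort"          # downgrade the excess
```

In the real system the admission probability itself is computed from cluster-wide SLO measurements rather than fixed; this sketch only shows the admit-or-downgrade mechanism at the sender.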

The inception of this project can be traced back to the spring of 2019, when I visited Google to attend a networking summit. Nandita and I had been trying to collaborate on something since 2016(!), and the time seemed right with Yiwen ready to go for an internship. Over the course of two summers, in 2019 and 2020, and many months before and after, Yiwen managed to build a theoretical framework that added QoS support to the classic network calculus framework, in collaboration with Gautam. Yiwen also had a lot of support from Xian, PR, and unnamed others in getting the simulator and the actual code implemented and deployed in Google datacenters. They implemented a public-facing/shareable version of the simulator as well! According to Amin and Nandita, this is one of the "big" problems, and I'm happy that we managed to pull it off.

This was my first time working with Google. Honestly, I was quite impressed by the rigor of the internal review process (even though the paper still got rejected a couple of times). It's also my first paper with Gautam since 2012! He was great then, and he's gotten even better over the past decade. Yiwen is also having a great year, with a SIGCOMM paper right after his NSDI one.

This year, the SIGCOMM PC accepted 55 out of 281 submissions after a couple of years of record-breaking acceptance rates.

Thanks Cisco for Sponsoring Our FL Research
https://mosharaf.com/blog/2022/05/01/thanks-cisco-for-sponsoring-our-fl-research/ (Sun, 01 May 2022)

We look forward to even more work on federated learning and edge AI/ML, building on top of FedScale.

Join SymbioticLab if you are excited about building practical federated learning and analytics systems that can be deployed in the wild!

Treehouse White Paper Released
https://mosharaf.com/blog/2022/01/07/treehouse-white-paper-released/ (Fri, 07 Jan 2022)

We started the Treehouse project at the beginning of last summer with a mini workshop. It took a while to distill our core ideas, but we finally have a white paper that attempts to answer some frequently asked questions: Why now? What does it mean for cloud software to be carbon-aware? What are (some of) the ways to get there? Why these? And so on.

It's only the beginning, and as we go deeper into our explorations, I'm sure the details will evolve and become richer. Nonetheless, I do believe we have a compelling vision of a future where we can, among other things, enable reasonable tradeoffs between the performance and carbon consumption of cloud software stacks, accurately understand how much carbon our software systems are consuming, and optimally choose consumption sites, sources, and timing to best address the energy challenges we are all facing.

We look forward to hearing from the community. Watch out for exciting announcements coming soon from the Treehouse project!

Hydra Accepted to Appear at FAST'2022
https://mosharaf.com/blog/2021/12/11/hydra-accepted-to-appear-at-fast2022/ (Sun, 12 Dec 2021)

Almost five years after Infiniswap, memory disaggregation is now a mainstream research topic. It goes by many names, but the idea of accessing (unused/stranded) memory over high-speed networks is now close to reality. Despite many works in this line of research, a key remaining problem is ensuring resilience: how can applications recover quickly if remote memory fails, becomes corrupted, or is inaccessible? Keeping a copy on disk, as Infiniswap does, causes a performance bottleneck under failure, while keeping in-memory replicas doubles the memory overhead. We started Hydra to hit a sweet spot between these two extremes by applying lessons learned from our EC-Cache project to extremely small objects. While EC-Cache explicitly focused on very large objects in the MB range, Hydra aims to perform erasure coding at the 4KB page granularity within the microsecond timescales common for RDMA. Additionally, we extended Asaf's CopySets idea to erasure coding to tolerate concurrent failures with low overhead.

We present Hydra, a low-latency, low-overhead, and highly available resilience mechanism for remote memory. Hydra can access erasure-coded remote memory within single-digit μs read/write latency, significantly improving the performance-efficiency tradeoff over the state-of-the-art: it performs similar to in-memory replication with 1.6× lower memory overhead. We also propose CodingSets, a novel coding group placement algorithm for erasure-coded data that provides load balancing while reducing the probability of data loss under correlated failures by an order of magnitude. With Hydra, even when only 50% of memory is local, unmodified memory-intensive applications achieve performance close to that of the fully in-memory case in the presence of remote failures and outperform state-of-the-art remote-memory solutions by up to 4.35×.
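For intuition, here is a minimal sketch of page-granularity erasure coding with a single XOR parity split (the r = 1 special case; Hydra's actual coding scheme, split layout, and RDMA data path are more involved than this illustration):

```python
def encode_page(page, k):
    """Erasure-code a page into k data splits plus one XOR parity split.

    Hydra codes each 4KB page into k data splits and r parity splits;
    this sketch shows only the r = 1 case, where the parity is the
    bytewise XOR of all data splits, so any single lost split can be
    rebuilt from the remaining splits and the parity.
    """
    assert len(page) % k == 0, "page size must be divisible by k"
    n = len(page) // k
    splits = [page[i * n:(i + 1) * n] for i in range(k)]
    parity = bytes(n)  # all-zero accumulator
    for s in splits:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    return splits, parity


def recover_split(splits, parity, lost):
    """Rebuild the `lost` data split by XOR-ing parity with the survivors."""
    acc = parity
    for i, s in enumerate(splits):
        if i != lost:
            acc = bytes(a ^ b for a, b in zip(acc, s))
    return acc
```

With general (k, r) codes such as Reed-Solomon, the same page tolerates r concurrent split losses at an r/k storage overhead instead of the 100% overhead of replication.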

Youngmoon started working on Hydra right after we presented Infiniswap in early 2017! As Youngmoon graduated, Hasan started leading the project in early 2019, and Asaf joined us later that year. Together, they significantly improved the paper over its early drafts. Even then, Hydra faced immense challenges. In the process, Hydra has taken over from Justitia the notorious distinction of being my current record-holder for accepted-after-N-submissions.

This was my first time submitting to FAST.

Presented a Keynote Talk on FL Systems at DistributedML'21
https://mosharaf.com/blog/2021/12/08/presented-a-keynote-talk-on-fl-systems-at-distributedml21/ (Wed, 08 Dec 2021)

This week, I presented a keynote talk at the DistributedML'21 workshop on our recent works on building software systems support for practical federated computation (Sol, Oort, and FedScale). This is a longer version of my talk at the Google FL workshop last month, and the updated slides go into more details of our research on cross-silo and cross-device federated learning and analytics.

Presented a Keynote at 2021 Workshop on Federated Learning and Analytics
https://mosharaf.com/blog/2021/11/13/presented-a-keynote-at-2021-workshop-on-federated-learning-and-analytics/ (Sat, 13 Nov 2021)

I recently presented a keynote talk at the 2021 federated learning and analytics workshop organized by Google on our recent works on building software systems support for practical federated computation (Sol, Oort, and FedScale).

I based my talk on the similarities and differences between software stacks for cloud systems and federated systems for learning and analytics. While we still want to perform similar computations (to some extent), the underlying network emerges as one of the biggest challenges for the latter. Because the wide-area network (WAN) has significantly lower bandwidth and higher latency than a datacenter network, federated software stacks have to be rethought with those constraints in mind. Small tweaks here and there are not enough.

Federated learning and analytics systems also come in two broad flavors: cross-silo and cross-device. In the former, a few computationally powerful and reliable facilities are connected by the WAN, each with several or many powerful computation devices. In the latter, a massive number of weak and unreliable devices (e.g., smartphones) take part in the computation. Naturally, cross-device solutions have to deal with additional challenges beyond the network: the devices have resource and battery constraints, and their owners may not always be connected or may have unique behavioral/charging patterns. How do we reason about learning and analytics under such uncertainty?
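To make the cross-device setting concrete, here is a minimal FedAvg-style round (an illustrative sketch, not any particular system's API), where devices that drop out mid-round simply do not contribute to that round's average:

```python
import random

def fed_avg_round(global_model, devices, sample_size, seed=None):
    """One round of FedAvg-style aggregation over unreliable devices.

    global_model: list of floats (flattened weights).
    devices: list of callables; each takes the global model and returns
             a local update of the same shape, or None if it dropped out.
    Each round samples `sample_size` devices; only those that finish
    contribute to the (unweighted) average.
    """
    rng = random.Random(seed)
    sampled = rng.sample(devices, sample_size)
    updates = [u for u in (d(global_model) for d in sampled) if u is not None]
    if not updates:
        return global_model  # everyone dropped out; keep the old model
    return [sum(ws) / len(updates) for ws in zip(*updates)]
```

Real deployments weight updates by local data size and must also handle stragglers, partial work, and privacy constraints; the point here is only that the server never waits on the devices that vanish.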

While the former two topics focused on systems research, the third piece of my talk was about providing a service to my machine learning (ML) colleagues so that they can easily implement and evaluate large-scale federated systems. I believe that systems researchers will fail their ML counterparts if an ML person has to spend considerable time building and tinkering with systems instead of spending that time developing new ideas and algorithms. To this end, I talked about the challenges in building such a benchmarking dataset and experimental harness.

I want to thank Peter Kairouz and Marco Gruteser from Google for inviting me to the workshop. My slides are available here and have more details.

FedScale Wins the Best Paper Award at ResilientFL'2021
https://mosharaf.com/blog/2021/10/25/fedscale-wins-the-best-paper-award-at-resilientfl2021/ (Mon, 25 Oct 2021)

Many congratulations to Fan, Yinwei, and Xiangfeng!

Check out FedScale at fedscale.ai

Juncheng Levels Up. Congrats Dr. Gu!
https://mosharaf.com/blog/2021/08/27/juncheng-levels-up-congrats-dr-gu/ (Sat, 28 Aug 2021)

My first Ph.D. student, Juncheng Gu, graduated earlier this month after successfully defending his dissertation titled "Efficient Resource Management for Deep Learning Clusters." This is a bittersweet moment. While I am extremely proud of everything he has done, I will miss having him around. I do know that a bigger stage awaits him; Juncheng is joining the ByteDance AI Lab to build practical systems for AI and machine learning!

Juncheng started his Ph.D. in the Fall of 2015, right before I started at Michigan. I joined his then-advisor Kang Shin to co-advise him as he started working on a precursor to Infiniswap as a term project for the EECS 582 course I was teaching. Since then, Juncheng has worked on many projects ranging across hardware, systems, and machine learning/computer vision, with varying levels of luck and success, but they were all meaningful works. I consider him a generalist in his research taste. Infiniswap and Tiresias stand out the most among his projects. Infiniswap heralded the rise of the many followups we see today on the topic of memory disaggregation; it was the first of its kind and introduced many around the world to this new area of research. Tiresias was one of the earliest works on GPU cluster management and certainly the first that did not require any prior knowledge of deep learning jobs' characteristics to effectively allocate GPUs for them and schedule them. To this day, it is the best of its kind for distributed deep learning training. I am honored to have had the opportunity to advise Juncheng.

Juncheng is a great researcher, but he is an even better person. He is very down-to-earth and tries his best to help others out whenever possible. He also understates and underestimates what he can do and has achieved, often to a fault.

I wish him a fruitful career and a prosperous life!

Oort Wins the Distinguished Artifact Award at OSDI'2021. Congrats Fan and Xiangfeng!
https://mosharaf.com/blog/2021/07/14/oort-wins-the-distinguished-artifact-award-at-osdi2021-congrats-fan-and-xiangfeng/ (Wed, 14 Jul 2021)

Oort, our federated learning system for scalable machine learning over millions of edge devices, has received the Distinguished Artifact Award at this year's USENIX OSDI conference!

This is a testament to a lot of hard work put in by Fan and Xiangfeng over the course of the last couple of years. Oort is our first foray into federated learning, but it certainly is not the last.

Oort and its workloads (FedScale) are both open-source at https://github.com/symbioticlab.

NSF Award to Expand Our Federated Learning Research!
https://mosharaf.com/blog/2021/07/09/nsf-award-to-expand-our-federated-learning-research/ (Fri, 09 Jul 2021)

This collaborative project with Harsha Madhyastha (Michigan) and Aditya Akella (UT Austin) aims to extend and expand our recent forays into federated learning and analytics. Join us in this adventure.

Thanks NSF for supporting our research!

Justitia Accepted to Appear at NSDI'2022
https://mosharaf.com/blog/2021/06/13/justitia-accepted-to-appear-at-nsdi2022/ (Sun, 13 Jun 2021)

The need for higher throughput and lower latency is driving kernel-bypass networking (KBN) in datacenters. Of the two related trends in KBN, hardware-based KBN is especially challenging because, unlike software KBN such as DPDK, it does not provide any control once a request is posted to the hardware. RDMA, the most prevalent form of hardware KBN in practice, is completely opaque. RDMA NICs (RNICs), whether they use InfiniBand, RoCE, or iWARP, have fixed-function scheduling algorithms programmed into them. Like any other networking component, they suffer from performance isolation issues when multiple applications compete for RNIC resources. Justitia is our attempt at introducing software control into hardware KBN.

Kernel-bypass networking (KBN) is becoming the new norm in modern datacenters. While hardware-based KBN offloads all dataplane tasks to specialized NICs to achieve better latency and CPU efficiency than software-based KBN, it also takes away the operator’s control over network sharing policies. Providing policy support in multi-tenant hardware KBN brings unique challenges — namely, preserving ultra-low latency and low CPU cost, finding a well-defined point of mediation, and rethinking traffic shapers. We present Justitia to address these challenges with three key design aspects: (i) Split Connection with message-level shaping, (ii) sender-based resource mediation together with receiver-side updates, and (iii) passive latency monitoring. Using a latency target as its knob, Justitia enables multi-tenancy policies such as predictable latencies and fair/weighted resource sharing. Our evaluation shows Justitia can effectively isolate latency-sensitive applications at the cost of slightly decreased utilization and ensure that throughput and bandwidth of the rest are not unfairly penalized.
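To make the design aspects above a bit more concrete, here is a toy Python sketch of how sender-side message splitting and passive latency monitoring might fit together. This is purely illustrative: the class name `JustitiaShaper`, the parameter names, and the multiplicative-decrease policy are my own simplifications, not the real RDMA-level implementation.

```python
class JustitiaShaper:
    """Toy sketch of Justitia-style sender-side control (illustrative only)."""

    MAX_RATE_BPS = 10e9  # assumed line rate

    def __init__(self, chunk_bytes=1_000_000, latency_target_us=50.0):
        self.chunk_bytes = chunk_bytes              # split unit for large messages
        self.latency_target_us = latency_target_us  # the single tuning knob
        self.rate_limit_bps = self.MAX_RATE_BPS     # current budget for bulky flows

    def split(self, msg_bytes):
        # Split Connection with message-level shaping: break a large transfer
        # into chunks so latency-sensitive messages can interleave between
        # them instead of waiting behind one huge message.
        full, rem = divmod(msg_bytes, self.chunk_bytes)
        return [self.chunk_bytes] * full + ([rem] if rem else [])

    def on_probe(self, measured_latency_us):
        # Passive latency monitoring: when reference probes exceed the
        # latency target, throttle bandwidth-hungry flows; otherwise
        # cautiously recover toward full rate.
        if measured_latency_us > self.latency_target_us:
            self.rate_limit_bps /= 2
        else:
            self.rate_limit_bps = min(self.rate_limit_bps * 1.1, self.MAX_RATE_BPS)
```

The real system mediates at the RDMA verbs layer; the sketch only shows the control-loop shape: split big messages, and trade bandwidth for latency when probes miss the target.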

Yiwen started working on this problem when we first observed RDMA isolation issues in Infiniswap. He even wrote a short paper in KBNets 2017 based on his early findings. Yue worked on it for quite a few months before she went to Princeton for her Ph.D. Brent has been helping us get this work into shape since the beginning. It’s been a long and arduous road; every time we fixed something, new reviewers didn’t like something else. Finally, an NSDI revision allowed us to directly address the most pressing concerns. Without commenting on how much the paper has improved after all these iterations, I can say that adding revisions to NSDI has saved us, especially Yiwen, a lot of frustration. For what it’s worth, Justitia now holds the notorious distinction of being my current record for accepted-after-N-submissions; it’s been so long that I’ve lost track of the exact value of N!

FedScale Released on GitHub https://mosharaf.com/blog/2021/05/30/fedscale-released-on-github/ https://mosharaf.com/blog/2021/05/30/fedscale-released-on-github/#comments Sun, 30 May 2021 15:48:38 +0000 https://mosharaf.com/?p=4475 Anyone working on federated learning (FL) has faced this problem at least once: you are reading two papers, and they use very different datasets for performance evaluation, are unclear about their experimental assumptions about the runtime environment, or both. They often deal with very small datasets as well. There have been attempts at solutions too, resulting in many FL benchmarks. In the process of working on Oort, we faced the same problem(s). Unfortunately, none of the existing benchmarks fit our requirements. We had to create one of our own.

We present FedScale, a diverse set of challenging and realistic benchmark datasets to facilitate scalable, comprehensive, and reproducible federated learning (FL) research. FedScale datasets are large-scale, encompassing a diverse range of important FL tasks, such as image classification, object detection, language modeling, speech recognition, and reinforcement learning. For each dataset, we provide a unified evaluation protocol using realistic data splits and evaluation metrics. To meet the pressing need for reproducing realistic FL at scale, we have also built an efficient evaluation platform to simplify and standardize the process of FL experimental setup and model evaluation. Our evaluation platform provides flexible APIs to implement new FL algorithms and include new execution backends with minimal developer effort. Finally, we perform in-depth benchmark experiments on these datasets. Our experiments suggest that FedScale presents significant challenges in heterogeneity-aware co-optimization of system and statistical efficiency under realistic FL characteristics, indicating fruitful opportunities for future research. FedScale is open-source with permissive licenses and actively maintained, and we welcome feedback and contributions from the community.

You can read up on the details in our paper and check it out on GitHub. Do check it out and contribute so that we can together build a large-scale benchmark that considers both data and system heterogeneity across a variety of application domains.

Fan, Yinwei, and Xiangfeng have put in a tremendous amount of work over almost two years to get to this point, and I’m super excited about its future.

AIFO Accepted to Appear at SIGCOMM’2021 https://mosharaf.com/blog/2021/05/04/aifo-accepted-to-appear-at-sigcomm2021/ Tue, 04 May 2021 15:12:14 +0000 https://mosharaf.com/?p=4447 Packet scheduling is a classic problem in networking. In recent years, however, the focus on packet scheduling has somewhat shifted from designing new scheduling algorithms to designing generalized frameworks that can be programmed to approximate a variety of scheduling disciplines. Push-In First-Out (PIFO) from SIGCOMM 2016 is such a framework that has been shown to be quite expressive. Following that, a variety of solutions have attempted to make implementing PIFO more practical, with the key problem being minimizing the number of priority levels needed. SP-PIFO is a recent take that shows that a handful of queues is enough. This leaves us with an obvious question: what’s the minimum number of queues one needs to approximate PIFO? We show that the answer is just one.

Programmable packet scheduling enables scheduling algorithms to be programmed into the data plane without changing the hardware. Existing proposals either have no hardware implementations or require multiple strict-priority queues.

We present Admission-In First-Out (AIFO) queues, a new solution for programmable packet scheduling that uses only a single first-in first-out queue. AIFO is motivated by the confluence of two recent trends: shallow buffers in switches and fast-converging congestion control in end hosts, which together lead to a simple observation: the decisive factor in a flow’s completion time (FCT) in modern datacenter networks is often which packets are enqueued or dropped, not the order in which they leave the switch. The core idea of AIFO is to maintain a sliding window to track the ranks of recent packets and to compute the relative rank of an arriving packet in the window for admission control. Theoretically, we prove that AIFO provides bounded performance relative to Push-In First-Out (PIFO). Empirically, we fully implement AIFO and evaluate it with a range of real workloads, demonstrating that AIFO closely approximates PIFO. Importantly, unlike PIFO, AIFO can run at line rate on existing hardware and uses minimal switch resources: as few as a single queue.
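As a rough illustration of the admission-control idea, here is a simplified software sketch in Python. It is not the paper’s exact data-plane algorithm: the headroom-based threshold, the `k` burst-tolerance parameter, and all names are simplifications chosen for readability.

```python
from collections import deque


class AIFOQueue:
    """Simplified sketch of AIFO-style admission control over one FIFO queue.

    Illustrative only: lower rank = higher priority; a packet is admitted
    when its relative rank among recent arrivals fits within the queue's
    remaining (slightly scaled) headroom.
    """

    def __init__(self, capacity=100, window_size=50, k=0.1):
        self.capacity = capacity                 # FIFO queue capacity (packets)
        self.k = k                               # burst-tolerance parameter
        self.queue = deque()                     # the single FIFO queue
        self.window = deque(maxlen=window_size)  # ranks of recent arrivals

    def relative_rank(self, rank):
        # Fraction of recently seen packets with a strictly smaller rank.
        if not self.window:
            return 0.0
        return sum(r < rank for r in self.window) / len(self.window)

    def enqueue(self, packet, rank):
        self.window.append(rank)
        headroom = (self.capacity - len(self.queue)) / self.capacity
        # Admit if the packet's relative rank fits within the scaled headroom;
        # as the queue fills, only increasingly high-priority packets get in.
        if len(self.queue) < self.capacity and \
           self.relative_rank(rank) <= headroom / (1 - self.k):
            self.queue.append(packet)
            return True
        return False  # drop: low-priority packet arriving at a busy queue

    def dequeue(self):
        return self.queue.popleft() if self.queue else None
```

The point of the sketch is the observation from the abstract: scheduling happens at admission time (which packets get in), while departures stay strictly FIFO.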

Although programmable packet scheduling has been quite popular for more than five years, I started paying careful attention only after the SP-PIFO presentation at NSDI 2020. I felt that we should be able to approximate something like that with even fewer priority classes, especially by using something similar to Foreground-Background scheduling that needs only two priorities. Xin had been thinking about the problem even longer given his vast experience in programmable switches and approached me after submitting Kayak to NSDI 2021. Xin pointed out that two priorities need only one queue with an admission control mechanism in front! I’m glad he roped me in, as it’s always a pleasure working with him and Zhuolong. It seems unbelievable even to me that this is my first packet scheduling paper!

This year SIGCOMM has broken the acceptance record once again by accepting 55 out of 241 submissions into the program!

Treehouse Funded. Thanks NSF and VMware! https://mosharaf.com/blog/2021/04/10/treehouse-funded-thanks-nsf-and-vmware/ https://mosharaf.com/blog/2021/04/10/treehouse-funded-thanks-nsf-and-vmware/#comments Sat, 10 Apr 2021 19:23:43 +0000 https://mosharaf.com/?p=4416 Treehouse, our proposal on energy-first software infrastructure designs for sustainable cloud computing, has recently won three million dollars in joint funding from NSF and VMware! Here is a link to the VMware press release. SymbioticLab now has a chance to collaborate with a stellar team consisting of Tom Anderson, Adam Belay, Asaf Cidon, and Irene Zhang, and I’m super excited about the changes Treehouse will bring to how we do sustainable cloud computing in the near future.

Our team started with Asaf and me looking to go beyond memory disaggregation in early Fall 2020. Soon we managed to convince Adam to join us, and he introduced Irene to the idea; both of them are experts in efficient software designs. Finally, together we managed to get Tom to lead our team, and many ideas about an energy-first redesign of cloud computing software stacks started to sprout.

The early steps in building Treehouse are already under way. Last month, we had our inaugural mini-workshop with more than 20 attendees, and I expect it to only grow bigger in the near future. Stay tuned!
