Jonathan Balkind Receives NSF Early CAREER Award to Go Faster in the Cloud

March 20, 2023

This article was originally published by James Badham for the College of Engineering Department.

----------------------------------------------

Jonathan Balkind, assistant professor in the UC Santa Barbara Computer Science Department, has received a five-year, $630,000 NSF Early CAREER award that will allow him to develop a new application for cloud computing.

“It's really an honor to receive the CAREER award!” Balkind said. “This is my first funded NSF proposal and was the first time I made a submission for the CAREER. I've had to pinch myself at least once to believe that it really happened. I'm really looking forward to driving this project over the next five years.”

Balkind will use a technique called microarchitectural checkpointing to redesign computer processors for cloud-based serverless computing, which is a new paradigm favored by cloud developers.

The Long and Short of Application Run Times

Applications used to run for minutes, hours, days, or even weeks at a time, while the new serverless apps run for only tens to hundreds of milliseconds at a time. Balkind explains: “We have spent several decades optimizing processors for long-lived applications, so that the processor could learn their behavior over time in order to predict future behavior and, thus, operate more efficiently. With today’s very short-running applications, like those in serverless, there simply isn't enough time for our processors to learn the behavior. This makes it inefficient to run serverless applications on existing servers. But with microarchitectural checkpointing, you save what you learn each time the application runs, and then when you run it again later, you retrieve what you saved and then save at the end of that step, and so on. The result is that the processor learns only what it needs to, across instances of the application over time. The checkpointed information from each application is siloed, so you avoid polluting the information from one application with that of another. We will use microarchitectural checkpointing to improve the efficiency of serverless apps.”
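The save-and-restore cycle Balkind describes can be illustrated with a small software toy. The sketch below is not his processor design. It assumes, purely for illustration, that the learned state is a table of 2-bit branch-prediction counters; the article does not say which processor state is checkpointed. All names in it (BranchPredictor, run_with_checkpoints, the "app-a" identifier, the trace) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BranchPredictor:
    """Toy predictor: one 2-bit saturating counter per branch address."""
    counters: dict = field(default_factory=dict)

    def predict(self, pc):
        # Unseen branches start at 1 (weakly not-taken); >= 2 predicts taken.
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)


def run_invocation(predictor, trace):
    """Run one short invocation and return the number of mispredictions."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


def run_with_checkpoints(store, app_id, trace):
    """Restore the app's saved state, run, then save the state again."""
    # State is keyed by app_id, so one app never sees another app's history.
    predictor = BranchPredictor(dict(store.get(app_id, {})))
    misses = run_invocation(predictor, trace)
    store[app_id] = dict(predictor.counters)
    return misses


# Hypothetical workload: 8 always-taken branches, each seen once per run,
# so a single invocation is too short for the predictor to benefit.
trace = [(pc, True) for pc in range(8)]

# Without checkpointing, every invocation starts from a cold predictor.
cold = [run_invocation(BranchPredictor(), trace) for _ in range(3)]

# With checkpointing, what was learned carries over between invocations.
store = {}
warm = [run_with_checkpoints(store, "app-a", trace) for _ in range(3)]

print("cold:", cold)
print("warm:", warm)
```

In this toy, the cold runs mispredict every branch on every invocation, while the checkpointed runs pay that cost only once. Keying the store by application mirrors the siloing Balkind mentions, though a real design would do this in hardware rather than in a Python dictionary.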

Open-Sourcing a Customized Serverless Processor

A key to Balkind’s CAREER award research is the OpenPiton platform, which he will use to enable prototyping for his open-source framework for building processors. “We have a design for a processor, and people can make modifications to it, either to add a feature they want or to test things as they change the parameters of that processor, such as the number of cores or the amount of cache,” Balkind explains.

The system has evolved from work that began in 2013, when he and fellow Princeton University PhD students designed it to serve as a research platform that would enable users to add their own components in order to validate particular research ideas. “We give out just about every component that's needed to design a new processor, and as a result, we've seen users be very productive — more than sixty research projects have used the platform,” Balkind says. “We have also seen a number of companies adopt OpenPiton, including Intel, which used the platform to develop an 8-core processor chip to demonstrate the effectiveness of its new fabrication facilities.”

Balkind is a significant contributor to the open-source hardware space, where, he says, “We’re providing these designs and trying to build a community and make better products in the future.” He received an Open Source Hardware Association Trailblazer Fellowship for his work in that field.

Making and customizing processors for specific applications is an important part of the evolution of computing. “In industry, companies routinely customize their processors for new applications as they emerge,” Balkind explains. “Our proposal for microarchitectural checkpointing will be demonstrated as a customization of OpenPiton, which can benefit serverless applications. By open-sourcing this processor design and providing a concrete implementation of the idea, we hope that it will see easier adoption into other industrial processors.”

On-demand Cloud Computing

“If you’re a developer, there are lots of ways for you to write an app,” Balkind notes. “But if your app suddenly gets discovered, and you have, almost instantaneously, a million users a week, you need to have the flexibility to go from one server to ten thousand servers handling your requests. Serverless computing is specifically designed to do this for you.”

Serverless computing emerged around 2016, when Amazon and other companies discovered that they were utilizing only about 65 percent of their data-center capacity, leaving about one-third of the resource unused. Amazon responded to that finding by inventing a paradigm that would be easy to program and make it possible to scale up and down at a moment’s notice. “So, the NFL moved a bunch of their web serving to this paradigm, because they have two or three days a week when everyone uses the website, and the rest of the time it’s much quieter,” Balkind says. “It’s the same with banks. At the end of the month, customers scan their paycheck to deposit with their phone, causing a huge spike in demand that lasts three days a month. Spikes also occur that can’t be predicted.”

The hope is that not too many demand spikes occur at once, so that, Balkind says, “You get an even distribution of usage over time.”

The system makes sense, but there’s a catch, which is that the additional, previously unused 35 percent of capacity Amazon and other cloud providers had available isn’t as reliable as the heavily used 65 percent. “They can’t guarantee you’ll get good performance when your app runs,” Balkind says. “To make up for that, they sell a plan that allows you to pay for only the time when your application is running, whereas, normally, you pay even when it’s idle. If you’re a small-scale start-up developer and you have no demand, it’s OK; you pay nothing. As people start to use your service, you pay in a way exactly commensurate with your usage.”

Amazon was the first to do this, with its Lambda platform, and, Balkind says, “Once they did it, everyone else followed. The problem, however, is that for each individual request, it turns out that they’re not getting great service — one command might run instantaneously, and the next might take thirty seconds. That’s what we’re trying to improve.”