High Performance Computing in the Cloud Can Be Cost Effective
- Published: Tuesday, August 25, 2020 12:13
By Naidu Annamaneni
In 2017 eSilicon, a semiconductor design and manufacturing company specializing in chips tailored to its customers’ specific requirements, faced an enormous challenge. The company was being held back by the inherent limitations of the “datacenter as a service” solution it had been using to run its High-Performance Computing (HPC) chip design workloads. The public cloud was the natural alternative, but the public cloud providers had not yet addressed the many technical challenges of doing this work in the cloud, which involved thousands of cores and petabytes of storage.
As eSilicon’s CIO and Vice President of Global IT at the time, I led the team that partnered with Google and several other companies to address this challenge and become the first semiconductor maker to do end-to-end design in the cloud. Along the way I learned quite a bit about how to successfully and cost-effectively run HPC in the cloud.
The #1 key to success: Mindset
The biggest thing I learned is the importance of mindset. If you view the cloud as being a “giant datacenter,” and plan to run your workloads the same way you ran them in a datacenter, you will miss out on the transformational nature of cloud computing. Instead of using the cloud as a datacenter, you need to change your mindset and use the cloud as a cloud.
Unlike a datacenter, the cloud is not static. The cloud gives you completely elastic, flexible provisioning that can expand and contract based on true demand. This makes the cloud ideal for many HPC applications whose demand, like that of chip design workloads, is highly variable and spiky.
Figure: Chip design. During the final phase of tape-out, computing resource requirements are typically 2x to 4x steady state.
One of the biggest problems with the static datacenter environment is waste. As the graphic above illustrates, much of the time you are creating waste in one of two forms: extended wait times when demand exceeds capacity, or unused capacity when demand falls below it. Your total capacity will rarely, if ever, match your actual demand exactly.
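To make the two kinds of waste concrete, here is a minimal sketch that runs a hypothetical hourly demand profile (steady state of 100 cores with a tape-out spike to 3x and 4x, echoing the figure) against a fixed-size cluster. The function name and the queueing model (excess work waits and carries into later hours) are illustrative assumptions, not eSilicon's tooling.

```python
from typing import Sequence, Tuple


def static_capacity_waste(demand: Sequence[int], capacity: int) -> Tuple[int, int]:
    """Return (idle_core_hours, waiting_core_hours) for a fixed-size cluster.

    demand[i] is the number of core-hours of work submitted in hour i
    (hypothetical units). Work that exceeds capacity is queued and carried
    into later hours, which is how wait time shows up in this model.
    """
    idle = 0       # capacity paid for but not used
    waiting = 0    # work left in the queue at the end of each hour, summed
    queue = 0
    for submitted in demand:
        queue += submitted
        served = min(queue, capacity)
        idle += capacity - served
        queue -= served
        waiting += queue
    return idle, waiting


# Hypothetical profile: steady state, then a tape-out spike.
demand = [100, 100, 100, 300, 400, 100]
for capacity in (100, 200, 400):
    print(capacity, static_capacity_waste(demand, capacity))
```

Sizing for steady state (100 cores) produces no idle capacity but a growing backlog; sizing for the peak (400 cores) clears the spike but leaves most capacity idle; any size in between gets some of both.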
With the right design, moving to the cloud can eliminate this waste. An important point here is something that many organizations overlook: it’s not enough to build a system that automatically provisions computing resources and scales up to run jobs. You must also automatically deprovision those resources as soon as each job is done, so you never pay for idle capacity.
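The difference deprovisioning makes can be sketched with a toy autoscaler. This assumes hypothetical hourly core demand, 32-core nodes billed per whole node-hour, and two policies: one that scales up and down to match demand, and one that only ever scales up. None of these names or numbers come from eSilicon's system.

```python
import math
from typing import Sequence


def billed_node_hours(demand: Sequence[int], cores_per_node: int, deprovision: bool) -> int:
    """Total node-hours billed for an hourly core-demand profile."""
    nodes = 0
    billed = 0
    for cores in demand:
        needed = math.ceil(cores / cores_per_node)
        if deprovision:
            nodes = needed               # scale up and back down to match demand
        else:
            nodes = max(nodes, needed)   # scale up only; finished nodes keep billing
        billed += nodes
    return billed


def utilization(demand: Sequence[int], cores_per_node: int, deprovision: bool) -> float:
    """Fraction of billed core-hours that did useful work."""
    billed_cores = billed_node_hours(demand, cores_per_node, deprovision) * cores_per_node
    return sum(demand) / billed_cores


demand = [0, 64, 256, 256, 32, 0, 0]  # hypothetical cores needed per hour
print(billed_node_hours(demand, 32, deprovision=True))   # 19
print(billed_node_hours(demand, 32, deprovision=False))  # 42
```

In this example, scaling down cuts billed node-hours by more than half. Utilization reaches 1.0 here only because every demand value is a whole multiple of the node size. With real jobs, rounding up to whole nodes usually leaves some idle cores.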
At eSilicon we took advantage of the cloud’s elastic nature by designing a system that uses machine learning for flexible, automatic provisioning, giving engineers exactly what they need, when they need it. Engineers no longer had to submit requests for a specific number of cores, amount of memory, and so on. That in turn eliminated the tremendous waste that engineers’ tendency to overestimate their needs “just to be safe” would otherwise have caused.
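The article does not describe eSilicon's machine learning model. As a much simpler stand-in, the sketch below shows the underlying idea of sizing jobs from what past runs actually used instead of from engineers' estimates. It takes the 90th percentile (nearest-rank method) of historical core usage for a job type and adds a small headroom. The job type, history, and 10% headroom are hypothetical.

```python
import math
from typing import Dict, List


def recommend_cores(history: Dict[str, List[int]], job_type: str,
                    headroom: float = 0.1, default: int = 8) -> int:
    """Suggest a core count for job_type from cores actually used by past runs.

    This is a percentile heuristic standing in for an ML model; it is not
    eSilicon's method.
    """
    past = sorted(history.get(job_type, []))
    if not past:
        return default  # no history yet: fall back to a conservative default
    rank = math.ceil(0.9 * len(past))  # nearest-rank 90th percentile
    p90 = past[rank - 1]
    return math.ceil(p90 * (1 + headroom))


# Hypothetical usage history for a design-rule-check job type.
history = {"drc": [40, 42, 38, 45, 41, 44, 39, 43, 46, 60]}
print(recommend_cores(history, "drc"))  # 51
```

Because the estimate is derived from measured usage, it tracks real demand and ignores the occasional outlier (the 60-core run) rather than padding every request.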
By replacing manual provisioning with automatic provisioning and autoscaling, we ensured that eSilicon’s cloud expands and contracts based on true, moment-by-moment demand. As a result, the company’s utilization of the cloud resources it pays for is 100%! Not a single minute of cloud service is wasted.
Strategy for cloud migration success
Many HPC organizations have avoided moving to the cloud because they believe the cost will be high. My experience at eSilicon demonstrates that when the move is done correctly, this is not the case. In fact, at eSilicon the total cost of ownership was about 20% less than what the company had been paying to do the same work on its datacenter-as-a-service solution.
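One way to see why elastic capacity can cost less even at a higher unit price is to compare a cluster sized for peak demand with pay-per-use capacity. The sketch below uses hypothetical per-core-hour rates and a demand profile with a 3x tape-out spike. It covers compute only. A real TCO comparison would also include storage, networking, licenses, and staff.

```python
from typing import Dict, Sequence


def compare_costs(demand: Sequence[int], static_rate: float,
                  elastic_rate: float) -> Dict[str, float]:
    """Compare a peak-sized fixed cluster with pay-per-use elastic capacity.

    demand: cores needed in each hour.
    static_rate, elastic_rate: hypothetical cost per core-hour.
    """
    hours = len(demand)
    peak = max(demand)
    static_cost = peak * hours * static_rate    # pay for peak capacity every hour
    elastic_cost = sum(demand) * elastic_rate   # pay only for core-hours used
    return {
        "static": static_cost,
        "elastic": elastic_cost,
        "savings_pct": 100 * (1 - elastic_cost / static_cost),
    }


# Hypothetical: steady 100 cores, one hour at 3x, elastic priced 1.5x higher.
print(compare_costs([100] * 9 + [300], static_rate=1.0, elastic_rate=1.5))
```

Under this simple model, elastic capacity is cheaper whenever its price premium (elastic_rate / static_rate) is below the peak-to-average demand ratio, which is 300 / 120 = 2.5 in this example. Spikier workloads leave more room for the premium.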
If you’re considering moving your HPC to the cloud, here’s my advice:
- Get executive and Board buy-in.
- Establish close collaboration/partnership with engineering teams.
- Get a clear understanding of your current baseline data center TCO.
- Approach the cloud journey with the mindset that you will be running the cloud as a cloud, not as a datacenter. Embrace the cloud’s elasticity.
- Plan for an incremental journey. Start small. Identify a workload that can be moved to the cloud or hybrid cloud and begin with that.
- Once you are over the initial learning curve and know the ecosystem, vendors, technical issues, etc., and have begun to experience the benefits, then make the full move.
After your move to the cloud is complete, you will discover that IT is no longer a limiting factor for your business’s growth. On a moment’s notice you’ll be able to dramatically expand your computing capacity, and then dial it back again when the need is gone. This true elasticity enables more parallel runs, faster design cycles, shorter time-to-market and, of course, lower costs, even for your most complex High-Performance Computing needs!
About Naidu Annamaneni
Naidu Annamaneni, Associate, CIO Professional Services, is a thought leader and expert in digital transformation, cloud, SaaS, AI/ML, security and agile development methodologies with over 25 years of experience. Most recently, he was CIO and vice president of global IT at eSilicon Corporation. As CIO, Naidu was responsible for overall IT strategy, security, high-performance computing infrastructure, and all business and design software. Naidu holds a B.Tech in Electronics and Communications Engineering from Sri Venkateswara University and a Master’s Degree in Computer Science from Florida Atlantic University. He also holds two US patents in supply chain automation.
About CIO Professional Services
Based in the San Francisco Bay area, CIO Professional Services LLC is a top-rated Information Technology (IT) consulting firm focused on integrating Business and Information Technology. Our consultants are all hands-on executives who are veteran CIOs and Partners of Big 4 consulting firms. Companies come to us seeking assistance with their information technology strategy as well as for interim or fractional CIO / CTOs, and negotiation and program management/project rescue assistance.