Cloud Crystal Clear: making HPC jobs more efficient

Cloud Crystal is an exciting project examining how historical queue data from HPC clusters can be used to more accurately forecast resource availability. The findings will help users of our on-demand HPC product, EPIC, to better plan their time and manage costs. Cloud Crystal is supported by Innovate UK, and we are working with the Institute of Risk and Uncertainty, University of Liverpool.

Researchers at the University of Liverpool have developed two machine learning models based on data provided by Zenotech: LSTM (Long Short-Term Memory) and OR (Ordinal Regression). We are integrating these models into the EPIC platform to provide intelligent feedback to users about expected queue times.

The LSTM model is queried when a user submits a job: once EPIC knows the runtime and number of cores required, it forecasts a queue time for each cluster. The Ordinal Regression model is integrated with the resources information in EPIC, which offers two views of this data: one showing the probability per queueing class, and the other the cumulative probability of a job running before a certain time.
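The relationship between the two Ordinal Regression views can be illustrated with a minimal sketch. The class labels and probabilities below are invented for illustration and are not output from the Cloud Crystal models or EPIC; the sketch only assumes that the queueing classes are ordered from shortest to longest wait.

```python
from itertools import accumulate

# Hypothetical ordered queueing classes and per-class probabilities.
# These values are illustrative assumptions, not real model output.
queue_classes = ["within 1 hour", "1 to 4 hours", "4 to 12 hours", "more than 12 hours"]
class_probabilities = [0.5, 0.25, 0.125, 0.125]


def cumulative_view(probabilities):
    """Running total of per-class probabilities over ordered classes.

    Entry k is the probability that the job starts in class k or any
    earlier (shorter-wait) class.
    """
    return list(accumulate(probabilities))


cumulative_probabilities = cumulative_view(class_probabilities)

for label, p, c in zip(queue_classes, class_probabilities, cumulative_probabilities):
    print(f"{label:>20}: class probability {p:.3f}, cumulative {c:.3f}")
```

Because the classes are ordered, the cumulative view is simply a running sum of the per-class view, so the two views carry the same information presented in different ways.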

Adding these two models to EPIC gives users a better-informed idea of how long jobs are likely to queue before they submit them. Combined with knowledge of each job's runtime, this will help them forecast more accurately when results will be ready and so improve workload planning, saving time and making budgeting more efficient.

EPIC is particularly well suited to Cloud Crystal as it can bring clarity to job planning. Once the models have been applied across a variety of HPC and cloud providers, giving a high-level overview of a vast pool of computing resource, EPIC can place jobs so as to reduce overall turnaround time.
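As a rough illustration of what placing jobs to reduce turnaround could look like, the sketch below picks the cluster with the smallest forecast queue time plus runtime. The cluster names, figures, and the `ClusterForecast` and `best_cluster` helpers are hypothetical; EPIC's actual placement logic is not described here and may weigh other factors, such as cost.

```python
from dataclasses import dataclass


@dataclass
class ClusterForecast:
    """Hypothetical per-cluster forecast for a single job."""
    name: str
    queue_hours: float    # forecast time spent waiting in the queue
    runtime_hours: float  # expected runtime of the job on this cluster

    @property
    def turnaround_hours(self) -> float:
        # Turnaround is taken here as queue time plus runtime.
        return self.queue_hours + self.runtime_hours


def best_cluster(forecasts):
    """Return the forecast with the lowest total turnaround time."""
    return min(forecasts, key=lambda f: f.turnaround_hours)


# Illustrative numbers only, not real forecasts.
forecasts = [
    ClusterForecast("cluster-a", queue_hours=0.5, runtime_hours=6.0),
    ClusterForecast("cluster-b", queue_hours=2.0, runtime_hours=3.0),
    ClusterForecast("cluster-c", queue_hours=4.0, runtime_hours=2.0),
]

choice = best_cluster(forecasts)
print(f"{choice.name}: {choice.turnaround_hours:.1f} hours to results")
```

In this made-up example the cluster with the shortest queue is not the best choice, since its longer runtime outweighs the quicker start.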

Follow the blog and our Twitter account for updates about Cloud Crystal.

 
