Companies are generating increasingly large amounts of data and need to run ever more complex tasks. For many of them, computationally intensive tasks are essential but occasional, so they may not want to build and sustain their own hardware infrastructure. Building a cluster means purchasing the hardware, configuring it, and maintaining it, which is costly and leaves resources idle between jobs.
Data scientists using SherlockML worked with an energy research organisation, applying advanced unsupervised machine learning to extract trends in energy usage. The data exceeded 120 terabytes, placing the project firmly in 'Big Data' territory: it required not only massive amounts of distributed computing for the data processing, but also an economically viable cloud storage solution.
SherlockML simplified the automated, rapid spawning of hundreds of servers and made it easy to launch a large batch map-reduce job across that cluster. Each worker was launched with a custom environment applied, so it came up with the required packages pre-installed. The cost was surprisingly low, and because usage is charged on a pay-per-use basis, it was also easy to bill.
SherlockML and its command-line interface (SML-CLI) allow all of this to be done programmatically and integrated seamlessly with our existing cloud storage solutions, in this case AWS S3. The instructions for both the master node and the worker nodes were written as simple Python scripts that call SML-CLI.
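To make the master/worker split concrete, here is a minimal sketch of the map-reduce pattern in plain Python. It is an illustration under stated assumptions, not the scripts used in the project: it makes no SML-CLI or S3 calls, a local thread pool stands in for the cluster of workers, and the record format (meter ID, kWh reading), the per-meter mean, and the names map_chunk, merge_partials and run_job are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical record format: (meter_id, kWh reading).
Reading = tuple[str, float]
# Partial aggregate: meter_id -> (sum of readings, number of readings).
Partial = dict[str, tuple[float, int]]


def map_chunk(chunk: list[Reading]) -> Partial:
    """Worker step: reduce one chunk of readings to per-meter (sum, count)."""
    partial: Partial = {}
    for meter_id, kwh in chunk:
        total, count = partial.get(meter_id, (0.0, 0))
        partial[meter_id] = (total + kwh, count + 1)
    return partial


def merge_partials(left: Partial, right: Partial) -> Partial:
    """Reduce step: combine two partial aggregates into one."""
    merged = dict(left)
    for meter_id, (total, count) in right.items():
        prev_total, prev_count = merged.get(meter_id, (0.0, 0))
        merged[meter_id] = (prev_total + total, prev_count + count)
    return merged


def run_job(chunks: list[list[Reading]], max_workers: int = 4) -> dict[str, float]:
    """Master step: fan chunks out to workers, then merge their results.

    The thread pool is only a local stand-in for remote worker servers.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    totals = reduce(merge_partials, partials, {})
    return {meter: total / count for meter, (total, count) in totals.items()}
```

Because each worker returns only a small (sum, count) summary per meter, the master never has to hold the raw data. In a real deployment, each chunk would more likely be a reference to an object in cloud storage than an in-memory list.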