1. Introduction

This document details how data security is implemented and enforced in the design and implementation of SherlockML. Maintaining the security of client data is ASI Data Science’s utmost priority, and data security has been built as an integral part of each component rather than as a separate layer. This model provides defence in depth and resilience against incidental and malicious threats to data security.

For the purposes of this document, the definition of data includes, but is not limited to, computer program source code, files containing numeric data, files containing textual data, image and video files, database dump files, program and server configurations, artefacts such as serialised models, customer information, employee information, and intellectual property in similar forms.

2. Data Classification

All data is classified based on its level of sensitivity and the impact to ASI Data Science and its clients should that data be disclosed, altered, or destroyed without authorisation. The classification of data helps determine what baseline security controls are appropriate for safeguarding that data. All client data is classified into one of three sensitivity levels, or classifications: Restricted data, Private data, and Public data.

Data is classified as Restricted when the unauthorised disclosure, alteration or destruction of that data could cause a significant level of risk to ASI Data Science or its clients. Examples of Restricted data include data protected by government-level privacy regulations and data protected by confidentiality agreements. The highest level of security controls should be applied to Restricted data.

Data is classified as Private when the unauthorised disclosure, alteration or destruction of that data could result in a moderate level of risk to ASI Data Science or its clients. By default, all institutional data that is not explicitly classified as Restricted or Public data should be treated as Private data. A reasonable level of security controls should be applied to Private data.

Data is classified as Public when the unauthorised disclosure, alteration or destruction of that data would result in little or no risk to ASI Data Science or its clients. Examples of Public data include press releases, open data, and research publications. While few or no controls are required to protect the confidentiality of Public data, some level of control is required to prevent its unauthorised modification or destruction.
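
The default rule above (anything not explicitly classified is treated as Private) can be illustrated with a minimal Python sketch. The registry contents and the function name are hypothetical and are not part of SherlockML.

```python
from enum import Enum


class Classification(Enum):
    RESTRICTED = "Restricted"
    PRIVATE = "Private"
    PUBLIC = "Public"


# Hypothetical registry of explicit classifications, keyed by dataset name.
EXPLICIT_CLASSIFICATIONS = {
    "press-release.pdf": Classification.PUBLIC,
    "client-a/records.csv": Classification.RESTRICTED,
}


def classify(name, explicit=EXPLICIT_CLASSIFICATIONS):
    """Return the explicit classification, falling back to Private."""
    return explicit.get(name, Classification.PRIVATE)
```

Here an unregistered file such as "notes.txt" would be treated as Private until someone classifies it explicitly.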

3. Data Security Model

The SherlockML data security model is designed to satisfy four broad requirements:

1. application security and access control;

2. security of data at rest and in transit;

3. logging and auditing;

4. preparedness and incident response.

The provisions for satisfying each of these requirements are described in the remainder of this document. Security is a continuous process and ASI Data Science has a responsibility to stay abreast of developments in network and device security. As such, this document is reviewed every three months to ensure it is up to date, relevant, and consistent with industry best practices.

4. Application Security and Access Control

SherlockML is hosted using Amazon Web Services (AWS). All ASI Data Science administrators are mandated to have two-factor authentication on their AWS accounts, and high-entropy passwords stored in a password manager. The administrator accounts do not allow programmatic access: tasks that do require programmatic modification of AWS resources are performed through system users with appropriate permissions.

ASI Data Science administrator access to the AWS servers hosting SherlockML services is permitted only by public key-based SSH, with password authenticated access prohibited by the server configuration. ASI Data Science administrators’ private SSH keys are not shared, and are stored in the SherlockML private data vault.

Connections to ports not required to be accessible for the essential functionality of each server are blocked by firewall rules implemented using AWS security groups. Connections to ports required only for communication between internal systems, rather than client applications, are accepted only when the connection is initiated from specific whitelisted addresses. A recurrent scheduled task checks the programs listening for each open port against a whitelist of allowed programs.
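
As an illustration of the scheduled listener check described above, the following Python sketch compares observed (port, program) pairs against a whitelist. The whitelist contents and the function name are assumptions made for the example; how the observed pairs are gathered on each server is not specified in this document and is left out.

```python
# Hypothetical whitelist: port number -> the only program allowed to listen on it.
ALLOWED_LISTENERS = {22: "sshd", 443: "nginx"}


def find_unexpected_listeners(observed, allowed=ALLOWED_LISTENERS):
    """Return the sorted (port, program) pairs that are not whitelisted.

    `observed` is an iterable of (port, program) pairs describing which
    program is currently listening on which open port.
    """
    return sorted(
        (port, program)
        for port, program in observed
        if allowed.get(port) != program
    )
```

A listener is reported both when its port is not whitelisted at all and when a whitelisted port is held by a different program.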

A daily scheduled task on each server hosting SherlockML services checks for system security patches and installs them automatically.

Every client has their own project and project user group. SherlockML users are added to the appropriate user group, and only users in this group can access data for that project. User management is contained within a single user management and permissions service, so that user credentials, access rights, and project memberships are managed from a single point. A user’s credentials can therefore be modified across all SherlockML resources (both storage and compute) in a single action, providing the ability to react rapidly to changing circumstances.

Each user of SherlockML signs up with a username and password. There are rules in place to encourage users to have a secure password. When a user logs into the SherlockML user interface, their browser is provided with a short-lived token, which is presented when authenticating with each SherlockML service. The long-lived user password is therefore not transmitted over the network to each SherlockML service. SherlockML services check the validity of the token with the user management and permissions service on each request, allowing token invalidation to be applied quickly across the entire SherlockML platform.
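
The token flow described above can be sketched as a small in-memory service in Python. This is an illustration only: the class name, the 15 minute lifetime, and the in-memory store are assumptions and do not describe the actual user management and permissions service.

```python
import secrets
import time


class TokenService:
    """Toy central service that issues, validates, and revokes tokens."""

    def __init__(self, ttl_seconds=900):
        self.ttl_seconds = ttl_seconds
        self._tokens = {}  # token -> (username, expiry timestamp)

    def issue(self, username, now=None):
        """Issue a random short-lived token after a successful login."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (username, now + self.ttl_seconds)
        return token

    def validate(self, token, now=None):
        """Return the username for a live token, or None if invalid."""
        now = time.time() if now is None else now
        entry = self._tokens.get(token)
        if entry is None or now >= entry[1]:
            return None
        return entry[0]

    def revoke_user(self, username):
        """Invalidate every token held by the given user."""
        self._tokens = {
            token: entry
            for token, entry in self._tokens.items()
            if entry[0] != username
        }
```

Because every service asks the central service on each request, a single call to `revoke_user` takes effect everywhere on the next request.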

User authentication is granular, with all permissions being granted on a per project and service basis. Therefore a user can be permitted to access the filesystem for a given client project, but not the reports for that project, or the reports or filesystem for any other client project. User access to individual projects is controlled by project admins. There are no system level accounts in SherlockML permitting access to all services and projects, even for ASI Data Science administrators, and there is no privileged account permitting universal access (such as a “root” account).
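
A minimal Python sketch of per project and per service permissions follows. The grant store and all names in it are hypothetical; the point is only that a grant names one user, one project, and one service, and that nothing acts as a wildcard.

```python
# Hypothetical grant store: each grant is one (user, project, service) triple.
GRANTS = {
    ("alice", "client-a", "filesystem"),
}


def is_permitted(user, project, service, grants=GRANTS):
    """Allow access only when this exact triple has been granted."""
    return (user, project, service) in grants
```

With these grants, alice can access the filesystem for client-a, but not the reports for client-a, and nothing at all for client-b.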

5. Security of Data at Rest and in Transit

Communication of data between the user’s web browser and the SherlockML user interface, and between SherlockML internal services, is carried out over HTTPS. To ensure the security of data in transit, all HTTPS communications are secured with the TLS (Transport Layer Security) protocol. TLS is used universally, for transfers of all data whether classified as Restricted, Private, or Public. We use a 2048-bit TLS certificate issued by Amazon Web Services with a validity of 1 year. Certificates are automatically renewed before expiration.

A modern TLS configuration is used across all SherlockML services, providing perfect forward secrecy (PFS) to sufficiently modern clients to prevent the compromise of past communications in the case of a future TLS vulnerability. Currently, TLS protocol versions 1.1 and 1.2 are supported. This list, and the corresponding list of cipher suites, are reviewed by ASI Data Science data engineers on a regular basis to ensure they follow evolving industry best practices.
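
As a rough illustration of how a negotiated connection could be checked against this policy, the Python sketch below tests a protocol version string and an OpenSSL-style cipher suite name. The function names are hypothetical, and the prefix test relies on the OpenSSL naming convention in which TLS 1.2 suites with ephemeral key exchange begin with "ECDHE-" or "DHE-"; it is not the configuration used by SherlockML.

```python
# Protocol versions listed in this document as currently supported.
ALLOWED_PROTOCOLS = {"TLSv1.1", "TLSv1.2"}

# OpenSSL-style name prefixes for suites with ephemeral key exchange.
PFS_PREFIXES = ("ECDHE-", "DHE-")


def protocol_allowed(protocol):
    """True if the negotiated protocol version is on the supported list."""
    return protocol in ALLOWED_PROTOCOLS


def offers_forward_secrecy(cipher_name):
    """True if the cipher suite name indicates ephemeral key exchange."""
    return cipher_name.startswith(PFS_PREFIXES)
```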

All communications between a SherlockML user’s workstation and their SherlockML compute instance initiated using the SherlockML command line client are carried over the SSH protocol. SSH authentication uses public keys rather than passwords, with a unique key for each compute instance. Client data remains in SherlockML and will not be downloaded to user workstations; this restriction is enforced by administrative controls. Separate SherlockML compute instances are used for each combination of user and project: data from two different clients will not be present on the same server, and two different users cannot access the same individual compute instance.

The SherlockML datasets backend uses Amazon’s Simple Storage Service (S3) for object storage. Restricted, Private, and Public data stored in SherlockML datasets is encrypted server-side using the AES block cipher, with a unique 256-bit key for each file. The per-file encryption keys are encrypted with a master key which is stored in Amazon’s key management infrastructure, and regularly rotated. S3 is configured to refuse uploads of data which do not specify encryption at rest.
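
One common way to make S3 refuse uploads that do not specify encryption at rest is a bucket policy that denies `s3:PutObject` requests lacking the `x-amz-server-side-encryption` header. The Python sketch below builds such a policy document. It is an illustration under assumptions: the bucket name is hypothetical, and this document does not state that SherlockML uses this exact policy.

```python
import json


def deny_unencrypted_uploads_policy(bucket):
    """Build a bucket policy denying uploads without server-side encryption."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # The "Null" condition matches requests where the header is absent.
                "Condition": {
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            }
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy_json = json.dumps(deny_unencrypted_uploads_policy("example-datasets"), indent=2)
```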

The SherlockML workspace backend uses Amazon’s Elastic File System (EFS) service for storage. Firewall rules ensure EFS file servers can only be accessed from servers within the SherlockML private network in AWS, and cannot, for example, be mounted on a laptop external to the AWS network.

ASI Data Science clients may load their data into SherlockML. Communications with the upload site are secured with TLS, and uploaded files are stored on Amazon S3 as described above.

On completion of a project, all compute instances associated with the project will be permanently terminated. All client data stored within SherlockML can be moved into encrypted glacial storage if requested by the client, and ASI Data Science keep audit logs for all data access during the lifetime of the project.

We state here explicitly that email must not be treated as a secure channel of communication. By default, email transits the network in plain text and the contents of an email message are therefore accessible to any party with access to any of the many connections and servers through which the email travels. Restricted and Private data must not be transferred in email attachments without the use of strong encryption, such as OpenPGP or S/MIME.

6. Logging and Auditing

All user attempts to authenticate with SherlockML, whether failed or successful, are logged by the user management and permissions service. Logging includes the name of the user making the authentication attempt, and the SherlockML service and project for which the attempt was made.

All reads of data stored on SherlockML datasets are logged, with logs including the full time and date the data was read, the IP address from which the read was made, the identity of the user, the name and path within the filesystem of the data read, and the user agent of the client used to make the read.
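
The fields listed above could be captured in a structured log record along the following lines. This Python sketch is illustrative only: the field names, the JSON serialisation, and the helper functions are assumptions, not the SherlockML log format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ReadLogEntry:
    timestamp: str   # full date and time of the read, ISO 8601, UTC
    ip_address: str  # address from which the read was made
    user: str        # identity of the user
    path: str        # name and path of the data within the filesystem
    user_agent: str  # user agent of the client used to make the read


def make_entry(ip_address, user, path, user_agent, when=None):
    """Create a log entry, stamping it with the current UTC time by default."""
    when = when or datetime.now(timezone.utc)
    return ReadLogEntry(when.isoformat(), ip_address, user, path, user_agent)


def to_json_line(entry):
    """Serialise an entry as a single JSON line with stable key order."""
    return json.dumps(asdict(entry), sort_keys=True)
```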

Filesystem logs are retained indefinitely, encrypted with AES-256 and stored on Amazon S3 in buckets accessible only by ASI Data Science administrators. Upon request from the client, ASI Data Science administrators can provide full access logs to the corresponding client project for audit.

7. Preparedness and Incident Response

In any case of suspicious activity on SherlockML, such as a user accessing SherlockML from a distant geographic location, or high-bandwidth downloads from SherlockML, ASI Data Science administrators will immediately disable the credentials of the user in question as the first action, preventing all further access to SherlockML by that user. All logs, servers, and databases will be retained to permit later examination as required.
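
As an illustration of one such signal, the Python sketch below totals download volume per user over a batch of events and reports the users who exceed a limit. The event shape, the limit, and the function name are assumptions made for the example; this document does not specify how suspicious activity is detected.

```python
from collections import defaultdict


def users_exceeding_download_limit(events, limit_bytes):
    """Return the sorted users whose total download volume exceeds the limit.

    `events` is an iterable of (user, bytes_downloaded) pairs covering
    whatever time window the caller has chosen.
    """
    totals = defaultdict(int)
    for user, size in events:
        totals[user] += size
    return sorted(user for user, total in totals.items() if total > limit_bytes)
```

The users returned would be candidates for having their credentials disabled while the incident is investigated.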

ASI Data Science administrators will investigate the nature and scope of the incident in order to apply the necessary remediations to SherlockML infrastructure, and will report their findings to ASI Data Science management. Once the severity of the incident has been assessed, and if that severity requires it, the client will be notified of the incident and of all remedial actions carried out by ASI Data Science. False alarms, such as a user accessing SherlockML from an unusual location because they are travelling to a conference, do not require notification.

