
How-To Guide

Optimize the stack with cloud storage


The amount of data being created by businesses is growing at an exponential rate. IDC predicts that there will be 175 zettabytes of data worldwide by 2025, up from an estimated 50 zettabytes in 2020. 1

Data growth is outpacing the ability to store data locally. Gartner reports that “by 2025, 99% of all files will be stored in cloud environments. Employees will create, collaborate on, and store files from any device, without knowing if the files are stored locally or in the cloud.” 2

Benefits of cloud storage

  • Real-time data availability

  • Persistent storage

  • Redundancy and fault tolerance

  • Near infinite scalability

  • Low latency

  • Pay only for what is used

Cloud providers offer durable storage that scales nearly infinitely. Data can be retrieved from the cloud nearly as quickly as it can from local disks, depending on the configuration. A single API integrates storage into applications, making storage less expensive and more convenient.
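For example, here is a minimal sketch of that single-API idea using Amazon S3 via the boto3 client. The bucket and object names are placeholder assumptions; other providers expose similar APIs.

```python
import boto3

s3 = boto3.client("s3")

# Write an object: one call, regardless of where the bucket physically lives.
with open("seg-0001.gz", "rb") as f:
    s3.put_object(Bucket="example-log-archive",
                  Key="segments/2020/07/seg-0001.gz", Body=f)

# Read it back: retrieval is a single call as well.
obj = s3.get_object(Bucket="example-log-archive",
                    Key="segments/2020/07/seg-0001.gz")
data = obj["Body"].read()
```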

Data stored in the cloud is an attractive solution for many users, yet financial, performance, and security pitfalls remain if it isn't configured correctly.

Humio engineers recently migrated the storage for Humio's SaaS log management platform from private network block storage to cloud object storage, gaining insights along the way. This guide points out common pitfalls and offers suggestions for avoiding them.

This article will show how to optimize cloud storage to:

  • Cut costs

  • Overcommit local disks

  • Extend retention time

  • Eliminate the need for replication and redundancy

  • Keep data safe from intruders and accidents

Types of cloud storage

There are three main types of data storage available from major cloud providers: object storage, file storage, and block storage; compute instances also offer ephemeral local storage. Each has unique benefits and tradeoffs that are worth understanding.

File storage: main features

  • Most compatible

  • Acts as a mounted drive

  • Most expensive

  • Slow for large files

Ephemeral instance storage: main features

  • Lowest latency

  • Ephemeral only

  • 60 TB limit

  • Data transfer charges

Block storage: main features

  • Best for databases and transactions

  • Persists independently

  • Limited metadata

Object storage: main features

  • Least expensive

  • Best for unstructured streaming data

  • Can't modify objects

This guide shows how Humio engineers migrated the storage for Humio's SaaS log management platform from private network block storage to cloud object storage. We share five steps as a way to get started, with more detailed information later in this guide.

Steps to optimize the stack with cloud storage

Choose the appropriate type of cloud storage

Explore each type of storage and its associated benefits and tradeoffs.

Challenges

Improvements in the speed, scale, and ease of use of cloud storage are making it an increasingly attractive option, especially as organizations scale out their digital business initiatives. But there are challenges that need to be addressed, especially when balancing the goals of optimizing costs while maximizing performance and security.

Hidden costs

From the start, cloud object storage can have significantly lower costs than local storage. Yet even with its inexpensive rates, users who move large volumes of data from one cluster to another can incur significant additions to their month-end bills. Some cloud providers charge for accessing data, especially when it's stored in "cold" storage.

Speed

Cloud networks are optimized for fast rates of data transfer, but they don't have the fastest storage disks. Providers offer options that can overcome speed limitations, but they come at a cost.

Volume

With digital business initiatives like the Internet of Things (IoT), volumes of data are growing exponentially. This data will be generated and processed at the edge, but it will likely be stored in the cloud.

Security

Once data is stored in the cloud, there are concerns that its security can't be completely controlled from the end user's perspective.

Compliance

There are some regulations that may affect what data can be collected, and how files are stored. For example, GDPR states that organizations can only collect personal data that is clearly related to a well-defined business objective. The recently enacted California Consumer Privacy Act (CCPA) protects consumer privacy rights. It regulates data belonging to individuals, including internet activity, IP addresses, cookies, and biometric data. It even regulates “household data” generated by IoT devices in the home.

Complexity

Cloud storage solutions come with varying levels of complexity. Optimizing for cloud storage often requires developer skills.

Data mobility

Making data accessible to applications at the performance users expect can be challenging. This can lead to complicated applications and higher costs to manage and maintain them.

5 Steps to optimize a stack with cloud storage

Each major cloud provider has several types of cloud storage available. The main types of storage are listed above (file, block, object), but looking at the offerings from major cloud providers can be overwhelming. Discussions of the main benefits usually focus on the costs of cloud storage, but it’s not a simple choice. There are options for storage amount, format, speed, region, availability, bandwidth, access, latency, and even SLA.

The best approach to making a choice is to focus on the type of cloud storage that is appropriate for each use case. This may take some time to determine, even though the initial choice seems obvious.

Be prepared to answer questions about use cases and requirements before looking at pricing. For example, these are just some of the variables used to determine the appropriate kind of cloud storage.

  • Storage format (file storage, block storage, object storage)

  • Storage class (for example, standard, tiered, infrequent access, cold storage, deep storage, etc.)

  • Speed and latency requirements

  • Location of data storage. There may be dozens of regions where data may be stored.

  • Size of storage

  • How long objects are stored during the month

Before choosing a cloud provider or storage type, make sure to understand all the costs associated with the use cases. Gartner reports that cost overruns will continue to be a problem for years to come.2

"By 2024, 60% of I&O leaders will see on-premises infrastructure budgets negatively impacted by cost overruns from public cloud deployments." 2

Balance storage features with costs

By design, cloud storage is built to scale infinitely with the number of files and the size of those files. The cost varies based on how much is stored in the cloud, but the architecture can be stretched indefinitely.

Determine if tiering is an option

For cloud providers that offer different prices for different levels of access, tiering services may be an option. Users set policies that move aging data after a specific period. This can be a benefit for storing rarely-used data for archival purposes. Be aware that there are charges for automatically moving data based on policies, and accessing the data.
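On AWS, for instance, tiering policies can be expressed as S3 lifecycle rules. Here is a minimal sketch with boto3, assuming a placeholder bucket and prefix and illustrative 30- and 90-day cutoffs; note that each automated transition is itself billed as a request.

```python
import boto3

s3 = boto3.client("s3")

# Move aging objects to cheaper tiers on a schedule set by policy.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-archive",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-aging-segments",
            "Filter": {"Prefix": "segments/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                {"Days": 90, "StorageClass": "GLACIER"},      # archival
            ],
        }]
    },
)
```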

Look closely at additional costs

Understand how baseline commitments can affect costs, and weigh the costs of effective capacity and provisioned capacity. Here are a few examples of additional costs that may be accrued.

  • Monitoring and automation fees for tiered storage, and per-request ingest fees when moving objects to a storage class

  • Per-GB retrieval fees from infrequent-access storage

  • Data transfer in and out of the cloud

  • Copies of data in an instance

  • Minimum size or duration charges for some configurations

  • Reservation terms

  • Payment options

To make things easier, use pricing calculators provided by major cloud providers.

Reduce costs by removing local data backups

Cloud storage users may not have to worry about having separate backups for data storage. Most types of cloud storage services automatically replicate data in multiple locations.

Compress data

Look for ways to compress data before moving it to or storing it in the cloud. Compression drives savings across every part of the system: cloud object storage is already an inexpensive option, and compression makes it significantly cheaper. Savings compound as data moves across the stack and as the volume of data scales. Compression saves on storage costs, saves on data transfer costs, and saves time by speeding up data movement.
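A minimal sketch of the idea, assuming a placeholder log file and bucket: gzip the file locally, then upload the smaller artifact. Text logs typically compress well, and every downstream charge shrinks with the file.

```python
import gzip
import shutil
import boto3

# Compress the file before it leaves the machine.
with open("events.log", "rb") as src, gzip.open("events.log.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Upload the compressed artifact; storage and transfer are billed on its size.
boto3.client("s3").upload_file("events.log.gz", "example-log-archive",
                               "segments/events.log.gz")
```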

Avoid EC2 data transfer costs by using Amazon VPC

To cut down the costs of network traffic on AWS, use Amazon VPC, the networking layer for Amazon EC2. Amazon VPC can move data between instances and services without routing through the public internet, and S3 traffic can bypass internet gateways entirely via a VPC gateway endpoint, which carries no data transfer charge.

To get a hands-on introduction to Amazon VPC, see Getting started with Amazon VPC.
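As an illustration, a gateway endpoint for S3 can be created with a single API call. This is a sketch; the VPC ID, route table ID, and region are placeholder assumptions.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Route S3 traffic through the VPC instead of the public internet.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```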

Over-commit for infinite local storage

Manage which files are kept on the local file system based on the amount of disk space used, and delete local files that also exist in cloud storage. This makes room for more files than the local disk can hold, effectively allowing infinite retention of events. There are no technical limitations in this over-committing scenario; the only limits are paying for additional cloud storage and potential transfer costs when the files required for a search are not present locally.
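Here is a sketch of such an eviction loop, assuming a hypothetical local segment directory, an 80% high-water mark, and an application-supplied in_cloud check that confirms a file already exists in cloud storage.

```python
import shutil
from pathlib import Path

DATA_DIR = Path("/var/lib/logs/segments")  # assumed local segment directory
HIGH_WATER = 0.80                          # assumed threshold: evict above 80% full

def disk_usage_fraction(path: Path) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def evict_synced_files(in_cloud) -> None:
    """Delete least-recently-read local files that cloud storage already holds."""
    # Oldest access time first, so frequently searched data stays local.
    candidates = sorted(DATA_DIR.glob("*.seg"), key=lambda p: p.stat().st_atime)
    for f in candidates:
        if disk_usage_fraction(DATA_DIR) < HIGH_WATER:
            break
        if in_cloud(f.name):  # never delete a file that is not safely uploaded
            f.unlink()
```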

Use ephemeral NVMe storage for low latency

NVMe SSD disks are dramatically faster than network-attached HDD storage. Traditional SATA storage devices use one command queue that holds up to 32 commands; NVMe supports 65,536 queues with 64K commands in each. Note that using network-attached disks in combination with cloud storage is discouraged in cloud environments, because it often consumes much more network bandwidth.

Keep in mind that the main concern of cloud service providers is uptime; the security of the data comes second to them. It's important to proactively manage the security of the data.

Security is a shared responsibility

While it may seem like cloud storage providers should be responsible for making sure data they store is compliant with regulations, they will be the first to say that compliance is a shared responsibility. Ultimately, the data being stored belongs to the organization and its customers.

The customer is responsible for keeping customer data secure, determining access and permission to data, keeping its platforms and applications secure, and maintaining a secure network and operating environment. They are also responsible for data encryption and data integrity, and protecting networking traffic.

Look out for open file shares

Open file shares and object storage are among the most common security risks of data in the public cloud. But they are also the easiest to prevent. Make sure that appropriate permissions are in place for anything online.
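On AWS, for example, public access can be disabled for an entire bucket with one call. This is a sketch; the bucket name is a placeholder, and other providers offer equivalent settings.

```python
import boto3

# Block every form of public access to the bucket, regardless of object ACLs.
boto3.client("s3").put_public_access_block(
    Bucket="example-log-archive",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```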

Encrypt data sent to cloud storage

Encrypt copies of data sent to cloud storage with AES-256 encryption while uploading. This ensures that even if read access to data is accidentally allowed, an attacker can’t read any events or other information from the data while in transit. When using a public cloud, no one at the cloud provider can look at the data.

Consider using an encryption key based on the seed key string set in the configuration. Each file gets encrypted using a unique key derived from that seed. The seed key is stored in a global file along with all the other information required to read and decrypt data contents.
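As an illustration of that pattern (a sketch under assumed details, not Humio's actual implementation), each file's key can be derived from the configured seed with HKDF and used with AES-256-GCM. This example uses the Python cryptography package; the seed value and file naming are placeholders.

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

SEED = b"seed-key-string-from-configuration"  # placeholder seed value

def encrypt_file_bytes(file_id: str, plaintext: bytes) -> bytes:
    # Derive a distinct 256-bit key per file from the shared seed.
    key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
               info=file_id.encode()).derive(SEED)
    nonce = os.urandom(12)                    # fresh nonce for every encryption
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
    return nonce + ciphertext                 # store nonce alongside ciphertext
```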

Pay close attention to compliance considerations

Organizations need to ensure that they are complying with every data security and privacy regulation. Consider how the organization ensures transparency and gives the owner of the data control of its use. Make sure that users understand the types of data collected and how it will be used. Consider how regulators and partners can be shown that the data collected and stored meets regulatory requirements.

The organization must stay informed about what is required for all types of data that is collected and stored. In many cases, there are specific requirements for creating policies for data governance, keeping data secure, protecting consumer data, and retaining records of compliance for auditing.

Here are a few examples of regulations that have significant impact on data collection and storage.

  • The General Data Protection Regulation (GDPR) is a European framework that was created to protect security and privacy for Personally Identifiable Information (PII). GDPR applies to any legal entity which stores, controls, or processes personal data for EU citizens.

  • The Health Insurance Portability and Accountability Act (HIPAA) pertains to organizations that transmit health information in electronic form in the United States. The HIPAA Security Management Process requires organizations to perform risk analysis, practice risk management, have a policy for data breaches, and conduct Information System Activity Reviews. Compliance documentation should be retained for six years.

  • The Payment Card Industry Data Security Standard (PCI DSS) intends to secure credit cardholder data from theft and misuse. There are 12 security areas for enhanced protection of data. It requires collecting system and security logs, and it specifies retaining audit trail history for at least one year, with a minimum of three months immediately available for analysis.

  • The Sarbanes-Oxley Act of 2002 (SOX) sets requirements for US public company boards, management, and accounting firms. COSO and COBIT are frameworks used by IT organizations to comply with SOX. It specifies retaining audit logs for seven years.

  • The California Consumer Privacy Act (CCPA) went into effect on January 1, 2020. It is meant to protect the privacy rights of California consumers. It regulates the use of data belonging to individuals, including internet activity, IP addresses, cookies, and biometric data. It also regulates "household data" generated by IoT devices in the home.

Develop a system for controlling deployments and monitoring changes, especially across clusters. After a change, compare the bill from one month to the next to reveal how changes affected the cost. Even those with years of experience using cloud storage find surprises on their bill. It’s hard to anticipate all the charges until everything is actually up and running.

Cloud providers know that understanding how data storage and usage translate into charges can be overwhelming, and each has tools to help show how the monthly bill is calculated. For example, AWS has an interactive Cost Explorer that walks through how the bill is calculated. It's a good idea to examine the bills closely as changes are made to the storage configuration.
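On AWS, the same numbers are available programmatically through the Cost Explorer API, which makes the month-over-month comparison scriptable. A sketch; the date range and service filter are example values.

```python
import boto3

ce = boto3.client("ce")

# Pull monthly S3 spend for two billing periods to compare before/after a change.
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2020-05-01", "End": "2020-07-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE",
                           "Values": ["Amazon Simple Storage Service"]}},
)
for period in resp["ResultsByTime"]:
    print(period["TimePeriod"]["Start"],
          period["Total"]["UnblendedCost"]["Amount"])
```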

Monitor the environment to optimize the configuration

Now that the cloud storage system is up and running, how can performance be tracked? Having the right observability tools provides precious visibility into a system that is, from the start, more opaque than a self-hosted option.

Track everything that is happening in the cloud system with a log management tool optimized from the start for live streaming data and historical searches.

The engineers at Humio monitor the Humio SaaS infrastructure being used by customers storing data in the cloud. They use a variety of data from the stack. From one viewing pane, they track things like the CPU, memory, and disk I/O being used by each pod. They monitor vital observability information about how each process is operating, from metrics and logs generated locally and by the cloud provider. As the team makes changes to the stack to optimize performance, they run tests and track how changes to the system influence the overall cost from the cloud provider.

NOTE: Humio SaaS runs in Kubernetes, but relies on persistent log data. The engineers designed the platform to store all retained data in S3-compatible object storage, using ephemeral NVMe disks to cache data for in-memory processing. They move data from cloud storage to the ephemeral disk when it is needed. Find out how we developed this solution in a video featuring Grant Schofield at Cloud Native London in May 2020: Stateful in a Stateless Land.

Humio log management uses bucket storage to increase retention and save storage costs for its customers

Humio recently added support for bucket storage, unlocking lower storage costs while fitting neatly into Humio’s index-free architecture. The minimal and flexible design of bucket storage facilitates getting data into the system as quickly as possible, and provides streaming access to it. Bucket storage provides the potential for unlimited scalability of data retention in the cloud. Humio makes unlimited retention possible because its index-free structure is designed for streaming data, the same kind of data bucket storage was designed for.

Bucket storage is ideal for Humio because it:

  • Works on cloud or self-hosted installations of Humio

  • Is optimized for write-once/read-many-times

  • Is not contingent upon editing files, which suits unchanging log files

  • Is appropriate for machine-based searching

  • Supports encryption

  • Allows overcommitment of local disks, saving on hardware costs

  • Keeps data safe with built-in redundancies

For SaaS or self-hosted installations, Humio supports using bucket storage for as much data as needed. It can search even months-old data in less than a second, just like it does with real-time streaming data. Using bucket storage, Humio treats all data as live data.

When you run a search, active data is automatically moved to the NVMe drives, and on to memory and the CPU cache, depending on how frequently it is read. The engineers behind the technology explain how it works in this short video.

Humio uses cloud storage to make all data live data

Get started with Humio

To see how fast Humio is with bucket storage, download a free trial and try storing, searching, and analyzing your organization's log data. Set up a free 30-day trial of Humio.

Have questions for Humio? Request a free 30-minute live demo.

Try Humio using a CloudFormation template

You can install Humio manually by following the steps in the Installation documentation, but we also provide a couple of easy installation options for AWS.

If you want to ship logs from AWS CloudWatch to Humio, see the AWS CloudWatch Logs integration.

Quick Start – Single Node Trial

Use the following Launch Stack button to quickly try Humio on a new instance using a CloudFormation Template.

Launch stack

The template will create an instance and a data volume, and start Humio in single-user mode.

When the template is done, you can click the output link to log into Humio – give it a few moments to start.

Log in using the developer user and use the EC2 instance ID of the node running Humio as the password.

Humio will listen for HTTP traffic on port 8080, but behind a single-user login page. You can restrict access based on IP range if you want. For a production setup, we advise you to put an HTTPS proxy in front of Humio, or place it inside your VPC.
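For example, restricting port 8080 to a single IP range can be done with one security-group rule. This is a sketch; the security group ID and CIDR are placeholder assumptions.

```python
import boto3

ec2 = boto3.client("ec2")

# Allow HTTP traffic to Humio's port 8080 only from one trusted network.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8080,
        "ToPort": 8080,
        "IpRanges": [{"CidrIp": "203.0.113.0/24",
                      "Description": "office network only"}],
    }],
)
```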

Sizing

Choosing the right instance size depends on your ingest volume and usage patterns. As a general guideline, the following list is a starting point for sizing your Humio instance.

  • Up to 15 GB/day: m4.large

  • Up to 35 GB/day: m4.xlarge

  • Up to 75 GB/day: m4.2xlarge

  • Up to 150 GB/day: m4.4xlarge

GitHub

You can see the CloudFormation template on GitHub.

AWS Marketplace

Humio is available directly from the AWS Marketplace.

About Humio

Humio's log management platform offers the lowest total cost of ownership, industry-leading unlimited plans, minimal maintenance and training costs, and remarkably low compute and storage requirements. Humio is the only log management solution that enables customers to log everything to answer anything in real time — at scale, self-hosted or in the cloud. Humio's modern, index-free architecture makes exploring and investigating all data blazing fast, even at scale. Founded in 2016, Humio is headquartered in London and backed by Accel and Dell Technologies Capital.

For more information, visit www.humio.com and follow @MeetHumio on Twitter.


  1. IDC/Seagate: The Digitization of the World from Edge to Core, Nov 2018, David Reinsel, John Gantz, John Rydning

  2. Gartner: How to Escape Network and Local File Storage, Jan 7, 2020, Lane Severson

  3. AWS: Protecting data with Amazon S3 Object Lock, Sep 5, 2019, Ruhi Dang

  4. AWS: How do I optimize the performance of my Amazon EBS Provisioned IOPS volumes?, Mar 31, 2020

  5. Gartner: Replace Assumptions With Accurate Estimates Before Migrating to Public Cloud, Feb 28, 2019, John McArthur (Gartner subscription required)

Start your free trial now, available self-hosted or as SaaS, or request a demo.