Optimize the stack with cloud storage
The amount of data created by businesses is growing at an exponential rate. IDC predicts that there will be 175 zettabytes of data worldwide by 2025, up from an estimated 50 zettabytes in 2020.
Annual Size of the Global Datasphere
Data growth is outpacing the ability to store data locally. Gartner reports that “by 2025, 99% of all files will be stored in cloud environments. Employees will create, collaborate on, and store files from any device, without knowing if the files are stored locally or in the cloud.” 2
Benefits of cloud storage
Real-time data availability
Redundancy and fault tolerance
Near infinite scalability
Pay only for what is used
Cloud providers offer durable storage that scales nearly infinitely. Data can be retrieved from the cloud nearly as quickly as it can from local disks, depending on the configuration. A single API integrates storage into applications, making storage less expensive and more convenient.
Data stored in the cloud is an attractive solution for many users, yet financial, performance, and security pitfalls remain if it isn’t configured correctly.
Humio engineers recently migrated the storage for its SaaS log management platform from private network block storage to cloud object storage, gaining insights along the way. This document points out pitfalls and offers suggestions for avoiding them.
This article will show how to optimize cloud storage to:
Overcommit local disks
Extend retention time
Eliminate the need for replication and redundancy
Keep data safe from intruders and accidents
At a glance, the three main types compare as follows:
File storage: acts as a mounted drive; slow for large files
Block storage: 60 TB volume limit; best for databases and transactions
Object storage: can’t modify objects; data transfer charges; best for unstructured streaming data
Types of cloud storage
There are three main types of data storage available from major cloud providers: object storage, file storage, and block storage. Each has unique benefits and tradeoffs that are worth understanding.
This guide shows how Humio engineers migrated the storage for its SaaS log management platform from private network block storage to cloud object storage. We share the following 5 steps as a way to get started, with more detailed information later in the page.
Steps to optimize a stack with cloud storage
Choose the appropriate type of cloud storage
Explore each type of storage and associated benefits and tradeoffs.
Improvements in the speed, scale, and ease of use of cloud storage are making it an increasingly attractive option, especially as organizations scale out their digital business initiatives. But there are challenges that need to be addressed, especially when balancing the goals of optimizing costs while maximizing performance and security.
From the start, cloud object storage can have significantly lower costs than local storage. Yet even with its inexpensive rates, users who move large volumes of data from one cluster to another can incur significant additions to their month-end bills. Some cloud providers charge for accessing data, especially when it’s stored in “cold” storage.
Cloud networks are optimized for fast rates of data transfer, but they don’t have the fastest storage disks. There are options provided that can overcome speed issues, but they come at a cost.
With digital business initiatives like the Internet of Things (IoT), volumes of data are growing exponentially. This data will be generated and processed at the edge, but it will likely be stored in the cloud.
Once data is stored in the cloud, there are concerns that its security can’t be completely controlled from the end user’s perspective.
There are some regulations that may affect what data can be collected, and how files are stored. For example, GDPR states that organizations can only collect personal data that is clearly related to a well-defined business objective. The recently enacted California Consumer Privacy Act (CCPA) protects consumer privacy rights. It regulates data belonging to individuals, including internet activity, IP addresses, cookies, and biometric data. It even regulates “household data” generated by IoT devices in the home.
Cloud storage solutions come with varying levels of complexity. Optimizing for cloud storage often requires developer skills.
Making data accessible to applications with the performance users expect can be challenging. This can lead to complicated applications and higher costs to manage and maintain them.
5 Steps to optimize a stack with cloud storage
Each major cloud provider has several types of cloud storage available. The main types of storage are listed above (file, block, object), but looking at the offerings from major cloud providers can be overwhelming. Discussions of the main benefits usually focus on the costs of cloud storage, but it’s not a simple choice. There are options for storage amount, format, speed, region, availability, bandwidth, access, latency, and even SLA.
The best approach to making a choice is to focus on the type of cloud storage that is appropriate for each use case. This may take some time to determine, even though the initial choice seems obvious.
Be prepared to answer questions about use cases and requirements before looking for pricing. For example, these are just some of the variables used to determine the appropriate kind of object storage.
Storage format (file storage, block storage, object storage)
Storage class (for example, standard, tiered, infrequent access, cold storage, deep storage, etc.)
Speed and latency requirements
Location of data storage. There may be dozens of regions where data may be stored.
Size of storage
How long objects are stored during the month
Before choosing a cloud provider or storage type, make sure to understand all the costs associated with the use cases. Gartner reports that cost overruns will continue to be a problem for years to come.2
"By 2024, 60% of I&O leaders will see on-premises infrastructure budgets negatively impacted by cost overruns from public cloud deployments." 2
Balance storage features with costs
By design, cloud storage is built to scale infinitely with the number of files and the size of those files. The cost varies based on how much is stored in the cloud, but the architecture can be stretched indefinitely.
Determine if tiering is an option
For cloud providers that offer different prices for different levels of access, tiering services may be an option. Users set policies that move aging data after a specific period. This can be a benefit for storing rarely-used data for archival purposes. Be aware that there are charges for automatically moving data based on policies, and accessing the data.
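As a concrete illustration, a tiering policy like the one described above can be expressed as a lifecycle configuration. This hypothetical AWS S3 lifecycle rule (the prefix, day counts, and storage classes are example values, not recommendations) moves aging objects to cheaper classes and eventually expires them:

```json
{
  "Rules": [
    {
      "ID": "tier-aging-logs",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```

Remember that each automated transition is itself a billable request, so a policy that churns many small objects through tiers can cost more than it saves.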
Look closely at additional costs
Understand how baseline commitments can affect costs, and weigh the costs of effective capacity and provisioned capacity. Here are a few examples of additional costs that may be accrued.
Monitoring and automation fees for tiered storage, and per-request ingest fees when moving objects to a storage class.
Retrieval per GB from infrequent access storage.
Data transfer in and out of the cloud
Copies of data in an instance
Minimum charges apply for size or duration for some configurations.
To make things easier, use pricing calculators provided by major cloud providers.
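The arithmetic behind those calculators can be sketched in a few lines, which makes it easy to compare scenarios before committing. The rates below are placeholders chosen for illustration, not any provider’s actual prices:

```python
# Illustrative monthly cost model for object storage.
# All rates are assumed example values; check each provider's
# pricing calculator for real figures.

def monthly_cost(stored_gb, get_requests, egress_gb,
                 storage_rate=0.023,      # $/GB-month (assumed)
                 get_rate=0.0004 / 1000,  # $ per GET request (assumed)
                 egress_rate=0.09):       # $/GB transferred out (assumed)
    """Return an estimated monthly bill in dollars."""
    return (stored_gb * storage_rate
            + get_requests * get_rate
            + egress_gb * egress_rate)

# Example: 10 TB stored, 5 million GETs, 500 GB of egress.
print(f"${monthly_cost(10_000, 5_000_000, 500):,.2f}")
```

Even a toy model like this makes the cost drivers visible: at these example rates, egress and request charges are small next to raw storage, but the balance flips quickly for access-heavy workloads.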
Reduce costs by removing local data backups
Cloud storage users may not have to worry about having separate backups for data storage. Most types of cloud storage services automatically replicate data in multiple locations.
Look for ways to compress data before moving it or storing it to the cloud. Compression is an engine that drives savings across every part of the system. Cloud object storage is already an inexpensive option for storage, and yet compression makes it significantly cheaper. Savings are realized as data is moved across the stack, and as the volume of data scales. Compression will save on storage costs, save on data transfer costs, and save time by speeding up all processes.
Avoid EC2 data transfer costs by using Amazon VPC
To cut down the costs of network traffic on AWS, use Amazon VPC, the networking layer for Amazon EC2. Route traffic between instances through the VPC local gateway rather than over the public internet; traffic between instances in the same availability zone over private IP addresses incurs no charge.
To get a hands-on introduction to Amazon VPC, see Getting started with Amazon VPC.
Over-commit for infinite local storage
Manage which files are kept on the local file system based on the amount of disk space used, and delete local files that also exist in cloud storage. This allows more files than the local disk has room for, allowing for infinite storage of events. There are no technical limitations in this over-committing scenario. The only limits are paying for additional cloud storage and potential transfer costs when the files required for a search are not present locally.
Use ephemeral NVMe storage for low latency
NVMe SSD disks are dramatically faster than network-attached HDD storage. Traditional SATA storage devices use one command queue of up to 32 commands; NVMe supports 65,536 queues with up to 64K commands in each. Note that using network-attached disks in combination with cloud storage is discouraged in cloud environments because it often consumes much more network bandwidth.
It’s important to keep in mind that the main concern of cloud service providers is uptime, and the security of the data comes second to them. It’s important to proactively manage the security of the data.
Security is a shared responsibility
While it may seem like cloud storage providers should be responsible for making sure data they store is compliant with regulations, they will be the first to say that compliance is a shared responsibility. Ultimately, the data being stored belongs to the organization and its customers.
The customer is responsible for keeping customer data secure, determining access and permission to data, keeping its platforms and applications secure, and maintaining a secure network and operating environment. They are also responsible for data encryption and data integrity, and protecting networking traffic.
Look out for open file shares
Open file shares and object storage are among the most common security risks of data in the public cloud. But they are also the easiest to prevent. Make sure that appropriate permissions are in place for anything online.
Encrypt data sent to cloud storage
Encrypt copies of data sent to cloud storage with AES-256 encryption while uploading. This ensures that even if read access to data is accidentally allowed, an attacker can’t read any events or other information from the data while in transit. When using a public cloud, no one at the cloud provider can look at the data.
Consider using an encryption key based on the seed key string set in the configuration. Each file gets encrypted using a unique key derived from that seed. The seed key is stored in a global file along with all the other information required to read and decrypt data contents.
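The idea of deriving a unique key per file from one configured seed can be illustrated with a generic HKDF-style derivation (RFC 5869). This is a sketch of the concept, not Humio’s actual implementation; the salt string and file identifiers are made up:

```python
import hashlib
import hmac
import os

def derive_file_key(seed_key: bytes, file_id: bytes) -> bytes:
    """Derive a unique 256-bit key for one file from the seed key,
    using an HKDF-style extract-and-expand with SHA-256."""
    # Extract: mix the configured seed into a pseudorandom key.
    prk = hmac.new(b"storage-encryption-salt", seed_key, hashlib.sha256).digest()
    # Expand: bind the key to this particular file's identifier.
    return hmac.new(prk, file_id + b"\x01", hashlib.sha256).digest()

seed = os.urandom(32)                       # the configured seed key
k1 = derive_file_key(seed, b"segment-0001")
k2 = derive_file_key(seed, b"segment-0002")
assert len(k1) == 32 and k1 != k2           # distinct 256-bit key per file
```

Because each key is derived deterministically from the seed plus the file identifier, any node holding the seed can decrypt any file, while a leaked per-file key exposes only that one file.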
Pay close attention to compliance considerations.
Organizations need to ensure that they are complying with every data security and privacy regulation. Consider how the organization ensures transparency and gives the owner of the data control of its use. Make sure that users understand the types of data collected and how it will be used. Consider how regulators and partners can be shown that the data collected and stored meets regulatory requirements.
The organization must stay informed about what is required for all types of data that is collected and stored. In many cases, there are specific requirements for creating policies for data governance, keeping data secure, protecting consumer data, and retaining records of compliance for auditing.
Here are a few examples of regulations that have significant impact on data collection and storage.
The General Data Protection Regulation (GDPR) is a European framework that was created to protect security and privacy for Personally Identifiable Information (PII). GDPR applies to any legal entity which stores, controls, or processes personal data for EU citizens.
The Health Insurance Portability and Accountability Act (HIPAA) pertains to organizations that transmit health information in electronic form in the United States. The HIPAA Security Management Process requires organizations to perform risk analysis, risk management, have a policy for data breaches, and conduct Information System Activity Reviews. Compliance data should be retained for up to six years.
The Payment Card Industry Data Security Standard (PCI DSS) is intended to secure credit cardholder data from theft and misuse. It defines 12 security areas for enhanced data protection, requires collecting system and security logs, and specifies retaining audit trail history for at least one year, with a minimum of three months immediately available for analysis.
The Sarbanes-Oxley Act of 2002 (SOX) sets requirements for US public company boards, management, and accounting firms. COSO and COBIT are frameworks used by IT organizations to comply with SOX, which specifies retaining audit logs for up to seven years.
The California Consumer Privacy Act (CCPA) went into effect on January 1, 2020. It is meant to protect the privacy rights of California consumers. It regulates the use of data belonging to individuals, including internet activity, IP addresses, cookies, and biometric data. It also regulates “household data” generated by IoT devices in the home.
Develop a system for controlling deployments and monitoring changes, especially across clusters. After a change, compare the bill from one month to the next to reveal how changes affected the cost. Even those with years of experience using cloud storage find surprises on their bill. It’s hard to anticipate all the charges until everything is actually up and running.
Cloud providers know that understanding how data storage and use can be overwhelming, and each has tools to help show how the monthly bill is calculated. For example, AWS has an interactive Cost Explorer that walks through how the bill is calculated. It’s a good idea to closely examine the bills as changes are made to the storage configuration.
Monitor the environment to optimize the configuration
Now that the cloud storage system is up and running, how can performance be tracked? The right observability tools provide valuable visibility into a system that is inherently more opaque than a self-hosted option.
Track everything that happens in the cloud system with a log management tool optimized from the start for live streaming data and historical searches.
The engineers at Humio monitor the Humio SaaS infrastructure being used by customers storing data in the cloud. They use a variety of data from the stack. From one viewing pane, they track things like CPU, memory, and disk I/O being used by each pod. They monitor vital observability information about how each process is operating from metrics and logs generated locally and by the cloud provider. As the team makes changes to the stack to optimize performance, they run tests and track how changes to the system influence the overall cost from the cloud provider.
NOTE: Humio SaaS runs in Kubernetes, but relies on persistent log data. The engineers designed the platform to store all retained data in S3-compatible object storage, using ephemeral NVMe disks to cache data for in-memory processing. They move data from cloud storage to the ephemeral disk when it is needed. Find out how we developed this solution in a video featuring Grant Schofield at Cloud Native London in May 2020: Stateful in a Stateless Land.
Overcommit fast expensive disks.
Getting the most out of storage depends on the use case. If the use case places greater value on recent data, design the system to overcommit fast expensive disks and direct overflow to less expensive cloud storage. By maximizing the use of local disks through overcommitting, the cost of data transfer to storage is reduced, while maximizing the speed of operation — transferring files back from cloud storage can be significantly slower than keeping them on local disks.
While it’s useful to overcommit expensive disks, it’s important to protect against overflowing the local file system. Write data to the fast disk, and immediately make a copy in cloud storage. When the ephemeral disk is at 80% usage, start deleting local files that are known to be in cloud storage. Find a policy that allows the app to delete the files least likely to be needed soon on this node to avoid fetching them from the cloud right after the delete.
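The eviction policy described above can be sketched as follows. The 80% high-water mark and least-recently-accessed ordering follow the text; the function and parameter names are invented for the example:

```python
import os

def evict_local_copies(cache_dir, in_cloud, capacity_bytes, high_water=0.80):
    """Delete local files that already exist in cloud storage,
    least-recently-accessed first, until disk usage drops below
    the high-water mark. `in_cloud` is the set of filenames known
    to be safely replicated in object storage."""
    entries = [(f, os.path.join(cache_dir, f)) for f in os.listdir(cache_dir)]
    used = sum(os.path.getsize(p) for _, p in entries)
    # Oldest access time first: least likely to be needed again soon.
    evictable = sorted((e for e in entries if e[0] in in_cloud),
                       key=lambda e: os.path.getatime(e[1]))
    deleted = []
    for name, path in evictable:
        if used <= high_water * capacity_bytes:
            break
        used -= os.path.getsize(path)
        os.remove(path)          # safe: a copy exists in cloud storage
        deleted.append(name)
    return deleted
```

A real implementation would also need to guard against deleting a file while a search is reading it, and should prefer evicting segments unlikely to be queried soon so they are not fetched back immediately.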
Use Block Storage to rewrite files.
If object storage is used, avoid frequently editing files stored in the cloud. If editing is required, network block storage, though more expensive, is much faster: unlike object storage, it allows files to be edited without rewriting them in their entirety. Because of this editing hurdle, object storage is ideal for write-once, read-many applications. Take full advantage of this by enabling the WORM feature (known on Amazon as S3 Object Lock), which prevents the data from being altered or deleted by intruders in the system. To see how this works on S3, read Protecting data with Amazon S3 Object Lock.3
Compress data before sending to cloud storage.
Compressed data saves on storage costs, improves performance, searches faster, and retains data longer. Speed things up even more to design the app to work with the data without decompressing it. If compression isn’t already being applied, take a look at Z-standard (https://github.com/facebook/zstd) for a very efficient implementation delivering better compression than older general-purpose compression algorithms.
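As a stdlib-only illustration of why compressing before upload pays off (the example uses zlib so it runs anywhere; Zstandard would typically compress both faster and tighter):

```python
import zlib

# A repetitive, log-like payload: structured machine data compresses well.
log_lines = b"\n".join(
    b"2020-05-04T12:00:%02d INFO request served status=200 path=/api/query"
    % (i % 60)
    for i in range(10_000)
)

compressed = zlib.compress(log_lines, level=9)
ratio = len(log_lines) / len(compressed)
print(f"{len(log_lines)} -> {len(compressed)} bytes ({ratio:.0f}x smaller)")
# Upload `compressed` instead of the raw bytes; storage and
# transfer charges then scale with the compressed size.
```

Because log data is highly repetitive, ratios of 10x or more are common, and the savings compound: less to store, less to transfer, and less to move between tiers.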
Enable replication and remove other backups.
Once the switch is made to object storage, take advantage of replication services by turning on cross-region replication on AWS S3, Azure geo-redundant storage, or Google’s dual-region and multi-region storage. These replication services cost money, but will likely save on the network traffic and storage costs of running separate backups, making them worthwhile. Once replication is in place, remove other backups to avoid paying for redundant copies.
Add additional data resilience by enabling versioning features. Versioning is useful in use cases that may involve some editing of files, but also need access to previous versions. Versioning prevents accidentally writing over files. When combined with replication services, data stored in object storage becomes more resilient than in block-storage systems offered by the main vendors. For example, enable AWS versioning in the Amazon S3 console.
Humio log management uses bucket storage to increase retention and save storage costs for its customers
Humio recently added support for bucket storage, unlocking lower storage costs while fitting neatly into Humio’s index-free architecture. The minimal and flexible design of bucket storage facilitates getting data into the system as quickly as possible, and provides streaming access to it. Bucket storage provides the potential for unlimited scalability of data retention in the cloud. Humio makes unlimited retention possible because its index-free structure is designed for streaming data, the same kind of data bucket storage was designed for.
Bucket data is ideal for Humio because it:
Works on cloud or self-hosted installations of Humio
Is optimized for write-once/read-many-times
Does not depend on editing files, accommodating unchanging log files
Is appropriate for machine-based searching
Allows overcommitment of local disks, saving on hardware costs
Keeps data safe with built-in redundancies
For SaaS or self-hosted installations, Humio supports using bucket storage for as much data as needed. It can search even months-old data in less than a second, just like it does with real-time streaming data. Using bucket storage, Humio treats all data as live data.
When you run a search, active data is automatically moved to the NVMe drives — the memory and the CPU cache — depending on how frequently it is read. The engineers behind the technology explain how it works in this short video.
HUMIO USES CLOUD STORAGE TO MAKE ALL DATA LIVE DATA
Get started with Humio
To see how fast Humio is with bucket storage, download a free trial and try storing, searching, and analyzing your organization’s log data. Set up a free 30-day trial of Humio.
Have questions for Humio? Request a free 30-minute live demo.
Try Humio using a CloudFormation template
You can install Humio manually by following the steps in the Installation documentation, but we also provide a couple of easy installation options for AWS.
If you want to ship logs from AWS CloudWatch to Humio, see the AWS CloudWatch Logs integration.
Quick Start – Single Node Trial
Use the following Launch Stack button to quickly try Humio on a new instance using a CloudFormation template.
The template will create an instance and a data volume, and start Humio in single-user mode.
When the template is done, you can click the output link to log into Humio – give it a few moments to start.
Log in using the developer user, with the EC2 instance ID of the node running Humio as the password.
Humio will listen for HTTP traffic on port 8080, but behind a single-user login page. You can restrict access based on IP range if you want. For a production setup, we advise you to put an HTTPS proxy in front of Humio, or place it inside your VPC.
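For the production setup mentioned above, a minimal HTTPS reverse proxy in front of Humio’s port 8080 might look like this nginx sketch; the hostname and certificate paths are placeholders:

```nginx
server {
    listen 443 ssl;
    server_name humio.example.com;              # placeholder hostname

    ssl_certificate     /etc/ssl/humio.crt;     # placeholder cert paths
    ssl_certificate_key /etc/ssl/humio.key;

    location / {
        proxy_pass http://127.0.0.1:8080;       # Humio's HTTP listener
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
```

Pair this with a firewall rule that blocks direct access to port 8080 from outside the VPC, so all traffic passes through the proxy.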
Choosing the right instance size depends on your ingest volume and usage patterns. As a general guideline, the following sizes are a starting point for your Humio instance.
Up to 15 GB/day: m4.large
Up to 35 GB/day: m4.xlarge
Up to 75 GB/day: m4.2xlarge
Up to 150 GB/day: m4.4xlarge
You can see the CloudFormation template on GitHub.
Humio is available directly from the AWS Marketplace.
Humio at Lunar: Logging everything to drive business value
Log management for a Kubernetes & cloud-native environment
"Just managing the servers for our old Elastic set-up costs more than the whole cost of using Humio."
David Højelsen, Co-Founder
Customer Case Studies
Join our Slack channel: meethumio.slack.com
You’ll also find lots of useful information on the Humio blog, and informative talks and demos on the Humio YouTube channel. To hear from Humio developers, customers, and partners, listen to our podcast series: The Hoot.
Humio's log management platform offers the lowest total cost of ownership, industry-leading unlimited plans, minimal maintenance and training costs, and remarkably low compute and storage requirements. Humio is the only log management solution that enables customers to log everything to answer anything in real time — at scale, self-hosted or in the cloud. Humio's modern, index-free architecture makes exploring and investigating all data blazing fast, even at scale. Founded in 2016, Humio is headquartered in London and backed by Accel and Dell Technologies Capital.
IDC/Seagate: The Digitization of the World from Edge to Core, Nov 2018, David Reinsel, John Gantz, John Rydning
Gartner: How to Escape Network and Local File Storage, Jan. 7, 2020, Lane Severson
AWS: Protecting data with Amazon S3 Object Lock, Sep 5, 2019, Ruhi Dang
AWS: How do I optimize the performance of my Amazon EBS Provisioned IOPS volumes?, Mar 31, 2020
Gartner: Replace Assumptions With Accurate Estimates Before Migrating to Public Cloud, Feb 28, 2019, John McArthur (Gartner subscription required)