Optimize the stack with cloud storage
The amount of data being created by businesses is growing at an exponential rate. IDC predicts that there will be 175 zettabytes of data by 2025, up from an estimated 50 zettabytes in 2020.
Data growth is outpacing the ability to store data locally. Gartner reports that “by 2025, 99% of all files will be stored in cloud environments. Employees will create, collaborate on, and store files from any device, without knowing if the files are stored locally or in the cloud.” 2
Cloud providers offer durable storage that scales nearly infinitely. Data can be retrieved from the cloud nearly as quickly as it can from local disks, depending on the configuration. A single API integrates storage into applications, making storage less expensive and more convenient.
Data stored in the cloud is an attractive solution for many users, yet financial, performance, and security pitfalls remain if it isn’t configured correctly.
Humio engineers recently migrated the storage for its SaaS log management platform from private network block storage to cloud object storage, gaining insights along the way. This document points out those pitfalls and offers suggestions for avoiding them.
This article will show how to optimize cloud storage to:
Overcommit local disks
Extend retention time
Eliminate the need for replication and redundancy
Keep data safe from intruders and accidents
Types of cloud storage
There are three main types of data storage available from major cloud providers: object storage, file storage, and block storage. Each has unique benefits and tradeoffs that are worth understanding.
This guide draws on Humio's migration from private network block storage to cloud object storage. We share the following 5 steps as a way to get started, with more detailed information later in the page.
Improvements in the speed, scale, and ease of use of cloud storage are making it an increasingly attractive option, especially as organizations scale out their digital business initiatives. But there are challenges to address, especially when balancing the goals of optimizing costs while maximizing performance and security.
From the start, cloud object storage can have significantly lower costs than local storage. Yet even with its inexpensive rates, users who move large volumes of data from one cluster to another can incur significant additions to their month-end bills. Some cloud providers charge for accessing data, especially when it’s stored in “cold” storage.
Cloud networks are optimized for fast rates of data transfer, but they don’t have the fastest storage disks. There are options provided that can overcome speed issues, but they come at a cost.
With digital business initiatives like the Internet of Things (IoT), volumes of data are growing exponentially. This data will be generated and processed at the edge, but it will likely be stored in the cloud.
Once data is stored in the cloud, there are concerns that the security of that data can’t be completely controlled from the end user’s perspective.
There are some regulations that may affect what data can be collected, and how files are stored. For example, GDPR states that organizations can only collect personal data that is clearly related to a well-defined business objective. The recently enacted California Consumer Privacy Act (CCPA) protects consumer privacy rights. It regulates data belonging to individuals, including internet activity, IP addresses, cookies, and biometric data. It even regulates “household data” generated by IoT devices in the home.
Cloud storage solutions come with varying levels of complexity. Optimizing for cloud storage often requires developer skills.
Making data accessible to applications with the performance users expect can be challenging. This can lead to complicated applications and higher costs of managing and maintaining them.
5 Steps to optimize a stack with cloud storage
Each major cloud provider has several types of cloud storage available. The main types of storage are listed above (file, block, object), but looking at the offerings from major cloud providers can be overwhelming. Discussions of the main benefits usually focus on the costs of cloud storage, but it’s not a simple choice. There are options for storage amount, format, speed, region, availability, bandwidth, access, latency, and even SLA.
The best approach to making a choice is to focus on the type of cloud storage that is appropriate for each use case. This may take some time to determine, even though the initial choice seems obvious.
Be prepared to answer questions about use cases and requirements before looking at pricing. For example, these are just some of the variables used to determine the appropriate kind of cloud storage:
Storage format (file storage, block storage, object storage)
Storage class (for example, standard, tiered, infrequent access, cold storage, deep storage, etc.)
Speed and latency requirements
Location of data storage (there may be dozens of regions to choose from)
Size of storage
How long objects are stored during the month
Before choosing a cloud provider or storage type, make sure to understand all the costs associated with the use cases. Gartner reports that cost overruns will continue to be a problem for years to come.2
Balance storage features with costs
By design, cloud storage is built to scale infinitely with the number of files and the size of those files. The cost varies based on how much is stored in the cloud, but the architecture can be stretched indefinitely.
Determine if tiering is an option
For cloud providers that offer different prices for different levels of access, tiering services may be an option. Users set policies that move aging data after a specific period. This can be a benefit for storing rarely-used data for archival purposes. Be aware that there are charges for automatically moving data based on policies, and accessing the data.
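As an illustration, tiering policies are usually expressed as lifecycle rules. The sketch below builds an AWS S3-style lifecycle configuration; the bucket prefix and day thresholds are assumptions for illustration, not recommendations:

```python
# Sketch: an S3-style lifecycle configuration that tiers aging objects
# to cheaper storage classes and eventually expires them.
# The "logs/" prefix and the day thresholds are illustrative assumptions.
def build_lifecycle_policy(ia_days=30, glacier_days=90, expire_days=365):
    """Move objects to Infrequent Access after ia_days, to Glacier
    after glacier_days, and delete them after expire_days."""
    return {
        "Rules": [{
            "ID": "tier-aging-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                {"Days": glacier_days, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": expire_days},
        }]
    }

# Applying it with boto3 would look like:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket",
#       LifecycleConfiguration=build_lifecycle_policy())
```

Remember that each automated transition and each retrieval from a colder tier is itself billed, so the thresholds should reflect how often the data is actually read.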
Look closely at additional costs
Understand how baseline commitments can affect costs, and weigh the costs of effective capacity and provisioned capacity. Here are a few examples of additional costs that may be accrued.
Monitoring and automation fees for tiered storage, and per-request ingest fees when moving objects to a storage class
Retrieval per GB from infrequent access storage
Data transfer in and out of the cloud
Copies of data in an instance
Minimum size or duration charges for some configurations
To make things easier, use pricing calculators provided by major cloud providers.
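A rough back-of-the-envelope model is also useful before opening a calculator. The sketch below adds up the three cost drivers discussed above; all rates are placeholders, not any provider's actual pricing:

```python
# Placeholder rates only -- real pricing varies by provider, region,
# and storage class.
def estimate_monthly_cost(gb_stored, gb_retrieved, requests,
                          storage_rate=0.023,    # $/GB-month (placeholder)
                          retrieval_rate=0.01,   # $/GB retrieved (placeholder)
                          request_rate=0.0004):  # $ per 1,000 requests
    """Rough monthly bill: storage + retrieval + per-request fees."""
    storage = gb_stored * storage_rate
    retrieval = gb_retrieved * retrieval_rate
    request_fees = (requests / 1000) * request_rate
    return round(storage + retrieval + request_fees, 2)

# Example: 10 TB stored, 500 GB retrieved, 2 million requests.
print(estimate_monthly_cost(10_000, 500, 2_000_000))
```

Even a crude model like this makes it obvious when a workload is dominated by retrieval or request fees rather than raw storage, which changes which storage class is cheapest.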
Reduce costs by removing local data backups
Cloud storage users may not have to worry about having separate backups for data storage. Most types of cloud storage services automatically replicate data in multiple locations.
Look for ways to compress data before moving it or storing it to the cloud. Compression is an engine that drives savings across every part of the system. Cloud object storage is already an inexpensive option for storage, and yet compression makes it significantly cheaper. Savings are realized as data is moved across the stack, and as the volume of data scales. Compression will save on storage costs, save on data transfer costs, and save time by speeding up all processes.
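The effect is easy to demonstrate with Python's standard library; log data is highly repetitive, so even general-purpose gzip compresses it dramatically (a purpose-built codec such as zstd typically does better):

```python
import gzip

# Repetitive, log-like data compresses extremely well.
log_lines = b"2020-01-01T00:00:00Z INFO request handled status=200\n" * 10_000
compressed = gzip.compress(log_lines)

ratio = len(log_lines) / len(compressed)
print(f"raw: {len(log_lines):,} bytes  "
      f"compressed: {len(compressed):,} bytes  "
      f"ratio: {ratio:.0f}x")
```

Because cloud providers bill for bytes stored and bytes transferred, a compression ratio like this reduces both line items at once.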
Over-commit for infinite local storage
Manage which files are kept on the local file system based on the amount of disk space used, and delete local files that also exist in cloud storage. This allows more files than the local disk has room for, allowing for infinite storage of events. There are no technical limitations in this over-committing scenario. The only limits are paying for additional cloud storage and potential transfer costs when the files required for a search are not present locally.
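A minimal sketch of such an eviction loop follows; the helper names are illustrative, not any product's API, and the disk-usage check is injectable so it can be stubbed out:

```python
import os

def evict_local_copies(directory, uploaded, max_fraction=0.8,
                       usage_fraction=None):
    """Delete local files that already exist in cloud storage, oldest
    first, until disk usage drops below max_fraction.

    `uploaded` is the set of file names known to be safely stored in
    the cloud. `usage_fraction` is a callable returning current disk
    usage (0.0-1.0); injectable so it can be stubbed in tests."""
    if usage_fraction is None:
        def usage_fraction():
            stat = os.statvfs(directory)  # POSIX only
            return 1.0 - stat.f_bavail / stat.f_blocks
    # Evict oldest files first: they are least likely to be searched soon.
    candidates = sorted(
        (name for name in os.listdir(directory) if name in uploaded),
        key=lambda name: os.path.getmtime(os.path.join(directory, name)))
    evicted = []
    for name in candidates:
        if usage_fraction() < max_fraction:
            break
        os.remove(os.path.join(directory, name))
        evicted.append(name)
    return evicted
```

The key invariant is that only files confirmed to exist in cloud storage are ever deleted locally, so eviction never loses data; it only trades search latency for disk space.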
Use ephemeral NVMe storage for low latency
Using NVMe SSD disks is dramatically faster than network-attached HDD storage. Traditional SATA storage devices use a single command queue that holds up to 32 commands, while NVMe supports 65,536 queues with up to 65,536 commands each. Note that using network-attached disks in combination with cloud storage is discouraged in cloud environments because it often consumes much more network bandwidth.
Keep in mind that the main concern of cloud service providers is uptime; the security of the data comes second to them. It’s important to proactively manage the security of that data.
Security is a shared responsibility
While it may seem like cloud storage providers should be responsible for making sure data they store is compliant with regulations, they will be the first to say that compliance is a shared responsibility. Ultimately, the data being stored belongs to the organization and its customers.
The customer is responsible for keeping customer data secure, determining access and permission to data, keeping its platforms and applications secure, and maintaining a secure network and operating environment. They are also responsible for data encryption and data integrity, and protecting networking traffic.
Look out for open file shares
Open file shares and object storage are among the most common security risks of data in the public cloud. But they are also the easiest to prevent. Make sure that appropriate permissions are in place for anything online.
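Automated checks catch this early. This sketch scans an S3-style ACL (the structure returned by boto3's `get_bucket_acl`) for the group URIs AWS uses to mean "everyone":

```python
# Group URIs AWS uses to grant access to everyone, or to any
# authenticated AWS account (which is effectively public too).
PUBLIC_GRANTEES = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}

def public_grants(acl):
    """Return the grants in an S3-style ACL that expose the bucket
    publicly (structure matches boto3's get_bucket_acl response)."""
    return [
        grant for grant in acl.get("Grants", [])
        if grant.get("Grantee", {}).get("URI") in PUBLIC_GRANTEES
    ]

# Typical use:
#   acl = boto3.client("s3").get_bucket_acl(Bucket="my-bucket")
#   if public_grants(acl): alert and fix the permissions.
```

Running a check like this periodically across all buckets turns an easy-to-miss misconfiguration into a routine alert.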
Encrypt data sent to cloud storage
Encrypt copies of data sent to cloud storage with AES-256 encryption while uploading. This ensures that even if read access to data is accidentally allowed, an attacker can’t read any events or other information from the data while in transit. When using a public cloud, no one at the cloud provider can look at the data.
Consider using an encryption key based on the seed key string set in the configuration. Each file gets encrypted using a unique key derived from that seed. The seed key is stored in a global file along with all the other information required to read and decrypt data contents.
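A common pattern for deriving per-file keys from a seed is to use HMAC-SHA256 as a pseudorandom function. This sketch uses only Python's standard library; the seed value is a placeholder standing in for the configured seed key string:

```python
import hmac
import hashlib

def derive_file_key(seed_key: bytes, file_id: str) -> bytes:
    """Derive a unique 256-bit key for one file from the configured
    seed key, using HMAC-SHA256 as a pseudorandom function."""
    return hmac.new(seed_key, file_id.encode(), hashlib.sha256).digest()

seed = b"configured-seed-key"  # placeholder; comes from configuration
key_a = derive_file_key(seed, "segment-0001")
key_b = derive_file_key(seed, "segment-0002")
# Each file gets its own 32-byte (256-bit) key, suitable for AES-256.
print(len(key_a), key_a != key_b)
```

Because the derivation is deterministic, only the seed needs to be stored; every per-file key can be recomputed on demand, and compromising one file's key reveals nothing about the others.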
Pay close attention to compliance considerations
Organizations need to ensure that they are complying with every data security and privacy regulation. Consider how the organization ensures transparency and gives the owner of the data control of its use. Make sure that users understand the types of data collected and how it will be used. Consider how regulators and partners can be shown that the data collected and stored meets regulatory requirements.
The organization must stay informed about what is required for all types of data that is collected and stored. In many cases, there are specific requirements for creating policies for data governance, keeping data secure, protecting consumer data, and retaining records of compliance for auditing.
Here are a few examples of regulations that have significant impact on data collection and storage.
The General Data Protection Regulation (GDPR) is a European framework that was created to protect security and privacy for Personally Identifiable Information (PII). GDPR applies to any legal entity which stores, controls, or processes personal data for EU citizens.
The Health Insurance Portability and Accountability Act (HIPAA) pertains to organizations that transmit health information in electronic form in the United States. The HIPAA Security Management Process requires organizations to perform risk analysis, risk management, have a policy for data breaches, and conduct Information System Activity Reviews. Compliance data should be retained for up to six years.
The Payment Card Industry Data Security Standard (PCI DSS) is intended to secure credit cardholder data from theft and misuse. It defines 12 security areas for enhanced data protection. It requires collecting system and security logs, and specifies that audit trail history be retained for at least one year, with a minimum of three months immediately available for analysis.
The Sarbanes-Oxley Act of 2002 (SOX) sets requirements for US public company boards, management, and accounting firms. COSO and COBIT are frameworks used by IT organizations to comply with SOX, which specifies retaining audit logs for up to seven years.
The California Consumer Privacy Act (CCPA) went into effect on January 1, 2020. It is meant to protect the privacy rights of California consumers. It regulates the use of data belonging to individuals, including internet activity, IP addresses, cookies, and biometric data. It also regulates “household data” generated by IoT devices in the home.
Develop a system for controlling deployments and monitoring changes, especially across clusters. After a change, compare the bill from one month to the next to reveal how changes affected the cost. Even those with years of experience using cloud storage find surprises on their bill. It’s hard to anticipate all the charges until everything is actually up and running.
Cloud providers know that understanding how data storage and use can be overwhelming, and each has tools to help show how the monthly bill is calculated. For example, AWS has an interactive Cost Explorer that walks through how the bill is calculated. It’s a good idea to closely examine the bills as changes are made to the storage configuration.
Humio log management uses bucket storage to increase retention and save storage costs for its customers
Humio recently added support for bucket storage, unlocking lower storage costs while fitting neatly into Humio’s index-free architecture. The minimal and flexible design of bucket storage facilitates getting data into the system as quickly as possible, and provides streaming access to it. Bucket storage provides the potential for unlimited scalability of data retention in the cloud. Humio makes unlimited retention possible because its index-free structure is designed for streaming data, the same kind of data bucket storage was designed for.
Bucket data is ideal for Humio because it:
Works on cloud or self-hosted installations of Humio
Is optimized for write-once/read-many-times
Is not contingent upon editing files — it accommodates unchanging log files
Is appropriate for machine-based searching
Allows overcommitment of local disks, saving on hardware costs
Keeps data safe with built-in redundancies
For SaaS or self-hosted installations, Humio supports using bucket storage for as much data as needed. It can search even months-old data in less than a second, just like it does with real-time streaming data. Using bucket storage, Humio treats all data as live data.
When you run a search, active data is automatically moved to the NVMe drives, memory, and CPU cache, depending on how frequently it is read. The engineers behind the technology explain how it works in this short video.
Customer Case Studies
Join our Slack channel: meethumio.slack.com
You’ll also find lots of useful information on the Humio blog, and informative talks and demos on the Humio YouTube channel. To hear from Humio developers, customers, and partners, listen to our podcast series: The Hoot.
Humio's log management platform offers the lowest total cost of ownership, industry-leading unlimited plans, minimal maintenance and training costs, and remarkably low compute and storage requirements. Humio is the only log management solution that enables customers to log everything to answer anything in real time — at scale, self-hosted or in the cloud. Humio's modern, index-free architecture makes exploring and investigating all data blazing fast, even at scale. Founded in 2016, Humio is headquartered in London and backed by Accel and Dell Technologies Capital.