The Indestructible Blob - Humio on Azure Object Storage
Use Humio with MinIO to store data with Azure Blob storage
July 30th, 2020
As Humio deployments have become more numerous and varied over recent weeks and months, there has been an increasing demand for Humio to support Azure Blob Storage for its long-term persistent data. This article discusses exactly why you would want to do that, but more importantly how to actually get it working (and working well!).
What is Object Storage anyway?
Firstly some background: what is object storage and why would Humio want to use it anyway? Let's turn to Wikipedia for a definition:
Object storage … is a computer data storage architecture that manages data as objects, as opposed to other storage architectures like file systems which manages data as a file hierarchy, and block storage which manages data as blocks within sectors and tracks. [1]
Or in other words, we treat the data we want to store as discrete immutable blobs of data and can forget all about filehandles, filesystems, etc. You send data to create objects, you retrieve objects to get data. In summary, object storage is a very useful abstraction and simplification of storing, managing, and retrieving data at any scale.
There are a few other critical features of object storage, and these, in particular, are of interest when running a platform like Humio:
Object Storage is essentially unlimited in capacity; no more concern about running out of storage and the consequences of a disk hitting 100%.
Durability is someone else's problem; once Humio has sent data to object storage, the problem of data durability is handed over.
Storage architectures are optimised; specialised object storage technologies can take advantage of much more sophisticated replication, redundancy, and resiliency configurations than an app that only sees attached storage volumes.
This means that for Humio, where we already see outstanding data compression rates, we can further optimise the infrastructure footprint required to provide the service. By how much? That depends on the object storage technology chosen, but a good object storage system has the potential to reduce your raw storage capacity requirement by over 30% compared to “dumb” data replication across multiple drives.
30% doesn’t sound like much? Let's run through a common scenario:
As an organisation ingesting 10 TB/day of data into Humio and wanting to keep that data available for 7 years, you can expect the following storage requirements:
Typical Humio compression factor can be 15:1; so new data per day requires additional persistent storage of around 660 GB/day
With a target replication factor of 2, the naive storage requirement is an additional 1.3 TB/day
If the data is backfilled, or by the end of year 7, the storage requirement is around 3.4 PB (1.7 PB of unique data, with a copy)
If the Object Storage is distributed across 4+ zones and making use of parity or erasure data durability techniques, that 1.7 PB of data can be provided, with high durability, with closer to 2.4 PB of raw storage. That potentially saves around 1 million GB of storage AND offers a better storage service to the application.
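The arithmetic in this scenario can be reproduced with a short script. This is a minimal sketch in Python using the figures quoted above and decimal units (1 TB = 1,000 GB, 1 PB = 1,000,000 GB). The variable names are illustrative, and the 2.4 PB figure is taken from the text as an input rather than derived from any particular erasure-coding layout, so the "implied overhead" is only what those two numbers suggest.

```python
# Inputs quoted in the scenario (decimal units throughout).
INGEST_TB_PER_DAY = 10
COMPRESSION_FACTOR = 15       # "typical" 15:1 compression
REPLICATION_FACTOR = 2        # naive full-copy replication
RETENTION_DAYS = 7 * 365      # 7 years, ignoring leap days
ERASURE_CODED_RAW_PB = 2.4    # raw capacity quoted for the object store

GB_PER_TB = 1_000
GB_PER_PB = 1_000_000

# New compressed data written per day, before and after replication.
daily_unique_gb = INGEST_TB_PER_DAY * GB_PER_TB / COMPRESSION_FACTOR
daily_replicated_gb = daily_unique_gb * REPLICATION_FACTOR

# Totals once the full retention window is populated.
unique_pb = daily_unique_gb * RETENTION_DAYS / GB_PER_PB
replicated_pb = unique_pb * REPLICATION_FACTOR

# Comparison against the quoted erasure-coded raw capacity.
saved_pb = replicated_pb - ERASURE_CODED_RAW_PB
implied_overhead = ERASURE_CODED_RAW_PB / unique_pb

print(f"Compressed data per day:      {daily_unique_gb:.0f} GB")
print(f"With replication factor 2:    {daily_replicated_gb / GB_PER_TB:.1f} TB")
print(f"Unique data after 7 years:    {unique_pb:.1f} PB")
print(f"Replicated after 7 years:     {replicated_pb:.1f} PB")
print(f"Saved versus replication:     {saved_pb:.1f} PB")
print(f"Implied raw-to-unique ratio:  {implied_overhead:.2f}x")
```

Changing COMPRESSION_FACTOR or RETENTION_DAYS shows how quickly the gap between full-copy replication and a parity-based layout grows with scale.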
The other thing to keep in mind about storage, and especially object storage, is that these objects are not like vegetables; they don't go bad with age. With Humio, when it comes to searching logs that are 5 weeks old or 5 years old, the performance is the same; the cluster retrieves the relevant segment files (objects) and conducts the search.
Why isn’t Azure Blob storage directly integrated?
Humio uses Alpakka, an open-source project for building data-streaming pipelines based on the Akka Streams library. Long story short: Alpakka does not directly support Microsoft Azure Blob Storage, so Humio does not support it directly either.
But does that have to be the end of the story? It does not, because if Azure Blob Storage can be made to look, swim, and quack like AWS S3 then we have options. To achieve this we turn to MinIO and its Azure Gateway mode of operation.
MinIO to the rescue
MinIO is an excellent project that allows organisations to make the most of object storage in many different modes of operation. We documented how to integrate Humio with MinIO directly if that’s something you want to do, for example, building your own object storage service on-prem.
The MinIO Azure Gateway mode allows Microsoft Azure Blob storage to be presented with an AWS S3-compatible interface, something that Humio supports. Reviewing the documentation for MinIO in this mode indicates there are some limitations — fortunately, none of these apply to the way Humio makes use of object storage.
Our target architecture therefore becomes:
MinIO also has some really interesting additional potential benefits (although these are subject to further testing). Specifically, MinIO can be used as an edge cache, potentially offering even greater performance enhancements over a direct connection to Azure blob storage.
At this point, it’s worth taking a few minutes to review the full MinIO marketing materials for this gateway mode: https://min.io/solutions/azure-s3-api-integration.
So how to deploy this?
Deployment is straightforward from the Humio point of view; in fact, it is identical to a standard MinIO-based configuration as outlined in the Humio documentation.
For testing, Humio is running in a Docker container configured through the environment variables listed further below.
When running in gateway mode, MinIO is stateless and ephemeral, which leaves a very straightforward configuration and a great deal of flexibility in how to scale the gateway if needed. The MinIO Docker container was run with the following configuration:
docker run \
  -d \
  -p 9000:9000 \
  --name minio-azure \
  -e "MINIO_ACCESS_KEY=azure-storage-account-name" \
  -e "MINIO_SECRET_KEY=azure-storage-key" \
  -e "MINIO_AZURE_CHUNK_SIZE_MB=1" \
  minio/minio gateway azure
The Humio configuration was also straightforward:
HUMIO_JVM_ARGS=-Xss2M
AUTHENTICATION_METHOD=single-user
SINGLE_USER_PASSWORD=password

# Use MinIO as S3 compatible storage
S3_STORAGE_ENDPOINT_BASE=http://172.17.0.3:9000
S3_STORAGE_PATH_STYLE_ACCESS=true
BUCKET_STORAGE_IGNORE_ETAG_UPLOAD=true

# S3 Bucket storage
S3_STORAGE_ACCESSKEY=azure-storage-account-name
S3_STORAGE_SECRETKEY=azure-storage-key
S3_STORAGE_BUCKET=humio-dev
S3_STORAGE_REGION=ignored
S3_STORAGE_ENCRYPTION_KEY=thisissomesecretkeythatshouldbemoresecret

# Don't overfill local storage
BETATESTING_LOCAL_STORAGE_PERCENTAGE=80
LOCAL_STORAGE_PERCENTAGE=80
USING_EPHEMERAL_DISKS=true
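One setting in this configuration, S3_STORAGE_PATH_STYLE_ACCESS=true, controls how the bucket name appears in request URLs. S3-compatible APIs generally accept two addressing styles, and the sketch below illustrates the difference in Python. The helper names and the object key are hypothetical, chosen only for illustration; they are not part of Humio or MinIO.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def path_style_url(endpoint: str, bucket: str, key: str) -> str:
    """Path-style addressing: the bucket is the first path segment."""
    parts = urlsplit(endpoint)
    path = f"/{bucket}/{quote(key)}"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def virtual_hosted_url(endpoint: str, bucket: str, key: str) -> str:
    """Virtual-hosted addressing: the bucket is prefixed to the host name."""
    parts = urlsplit(endpoint)
    netloc = f"{bucket}.{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, f"/{quote(key)}", "", ""))


# Endpoint and bucket from the configuration; the key is a made-up example.
ENDPOINT = "http://172.17.0.3:9000"
BUCKET = "humio-dev"
EXAMPLE_KEY = "example/segment.dat"

print(path_style_url(ENDPOINT, BUCKET, EXAMPLE_KEY))
print(virtual_hosted_url(ENDPOINT, BUCKET, EXAMPLE_KEY))
```

With an IP-based endpoint such as the one used here, the virtual-hosted form produces a host name that will not resolve, which is one reason path-style access is the natural choice for a gateway addressed this way.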
Breaking down some of the options being used here:
S3_STORAGE_ENDPOINT_BASE: This is the URL to the MinIO gateway instance. In this case, MinIO is also running under Docker, so this is the IP address of the container (this isn’t really how you want to do this in Docker, but it works for this example).
S3_STORAGE_ACCESSKEY and S3_STORAGE_SECRETKEY: These are the Azure Storage account name and key for accessing the Azure Blob storage. When MinIO is configured in Azure Gateway mode, it passes these authentication tokens directly through (more on that in a second).
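Because Humio authenticates to the gateway with these values, the credentials in the two configurations need to agree. The following is a minimal Python sketch of a consistency check, assuming both configurations are available as KEY=VALUE text. The function names are hypothetical, and the sample values are the placeholders used in this article, not real credentials.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def credential_mismatches(minio_env: dict, humio_env: dict) -> list:
    """Return the (MinIO, Humio) variable pairs whose values differ."""
    pairs = [
        ("MINIO_ACCESS_KEY", "S3_STORAGE_ACCESSKEY"),
        ("MINIO_SECRET_KEY", "S3_STORAGE_SECRETKEY"),
    ]
    return [(m, h) for m, h in pairs if minio_env.get(m) != humio_env.get(h)]


# Placeholder values copied from the two configurations in this article.
MINIO_ENV = """
MINIO_ACCESS_KEY=azure-storage-account-name
MINIO_SECRET_KEY=azure-storage-key
"""

HUMIO_ENV = """
# S3 Bucket storage
S3_STORAGE_ACCESSKEY=azure-storage-account-name
S3_STORAGE_SECRETKEY=azure-storage-key
"""

mismatches = credential_mismatches(parse_env(MINIO_ENV), parse_env(HUMIO_ENV))
print("credentials consistent" if not mismatches else f"mismatched: {mismatches}")
```

A check like this is cheap to run before starting either container and catches the most likely cause of authentication failures between Humio and the gateway.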
What do we see in Azure?
Once this is started, the following data structures will be visible in Azure:
Azure also gives us some really good insights into the performance of the storage over time:
Conclusion and next steps
MinIO offers an excellent solution for provisioning Humio in Azure and making direct use of Azure Blob storage. The performance is excellent so far in testing, and more formal benchmarking will take place. Watch this space!
1. https://en.wikipedia.org/wiki/Object_storage (retrieved 28th July 2020)