Saved EBS Costs by Cleaning Up 3 TiB of Duplicate Data in InfluxDB v1 Ryosuke Hara Wed, 28 May 2025 07:33:39 +0000 https://dev.to/axelspace/saved-ebs-costs-by-cleaning-up-3-tib-of-duplicate-data-in-influxdb-v1-23hp <p>Hi, I’m <a href="https://www.axelspace.com/" rel="noopener noreferrer">rhara</a>, a software engineer at Axelspace.</p> <p>In this post, I’ll share how we reduced the size of an <a href="https://docs.influxdata.com/influxdb/v1/" rel="noopener noreferrer">InfluxDB OSS v1</a> EBS volume from 5.9 TiB to 2.9 TiB by removing duplicate data. Since I couldn’t find much information on this process, I’m writing this as both a record and a reference.</p> <blockquote> <p>Note: This method requires creating a new InfluxDB database with a different name, meaning the original database name cannot be retained.<br><br> To restore the original name, additional steps such as using the <code>SELECT * INTO</code> clause again are required.</p> </blockquote> <h2> What is InfluxDB? </h2> <p><a href="https://docs.influxdata.com/influxdb/v1/" rel="noopener noreferrer">InfluxDB</a> is a time-series database optimized for storing and querying time-based data, especially in use cases where high write and read throughput is important—such as log aggregation. 
(<a href="https://docs.influxdata.com/influxdb/v1/concepts/crosswalk/#influxdb-is-not-crud" rel="noopener noreferrer">Docs</a>)</p> <h2> InfluxDB at Axelspace </h2> <p>At Axelspace, we use InfluxDB to store satellite telemetry data, such as power consumption data. While InfluxDB v2 is the mainstream version today, we've continued using the OSS version of v1 in Docker on an EC2 instance since our first satellite, GRUS-1A, was launched in 2018.</p> <p>Over time, we noticed that duplicate data had accumulated, leading to a bloated EBS volume. To reduce storage costs, we decided to clean up the duplicate data and migrate to a smaller EBS volume.</p> <h2> InfluxDB Cleanup </h2> <h3> How to Reduce EBS Costs? </h3> <p>Since EBS pricing depends on storage size, reducing the volume size can lower costs. However, AWS does not allow shrinking EBS volumes directly.<br><br> Following <a href="https://repost.aws/en/knowledge-center/ebs-increase-decrease-volume-size" rel="noopener noreferrer">this AWS article</a>, we opted to create a new, smaller volume and replace the existing one.</p> <h3> Cleaning Up Duplicate Data in InfluxDB </h3> <p>Our first idea was to use InfluxDB's <code>DELETE</code> command to remove duplicate data directly from the existing volume, then copy the cleaned data to the new volume.</p> <p>However, <a href="https://docs.influxdata.com/influxdb/v1/query_language/manage-database/#delete-series-with-delete" rel="noopener noreferrer">as noted in the official documentation</a>, <code>DELETE</code> only allows specifying data to delete by timestamp or tag value—making it unsuitable for fine-grained removal of duplicates.</p> <p>We considered several alternatives:</p> <ol> <li>Use the <a href="https://docs.influxdata.com/influxdb/v1/query_language/explore-data/#the-into-clause" rel="noopener noreferrer"><code>SELECT * INTO</code> clause</a> provided by InfluxQL </li> <li>Use <a 
href="https://docs.influxdata.com/flux/v0/" rel="noopener noreferrer">Flux</a> queries </li> <li>Export and deduplicate data with <a href="https://docs.influxdata.com/telegraf/v1/" rel="noopener noreferrer">Telegraf</a>, then re-ingest</li> </ol> <p>We ultimately chose Option 1: <code>SELECT * INTO</code>.</p> <p>This clause allows flexible data selection, including filtering or dropping fields, and writing the result into a new database—ideal for deduplication.<br><br> We ruled out Option 2 because some of the queries we needed were not implemented in Flux, and Option 3 because it required a full export and re-import of the data.</p> <h3> Challenges with <code>SELECT * INTO</code> </h3> <p>One downside is that <code>SELECT * INTO</code> creates a new copy of the data, temporarily increasing total data size. To avoid enlarging the existing EBS volume, we used the new volume as the destination for the copied data.</p> <p>Also, since new telemetry data was being written during the cleanup, we had to ensure that InfluxDB could remain online and that memory usage wouldn’t spike. We processed data in small time windows (e.g., one or three days) to keep memory usage manageable.</p> <h2> Step-by-Step Procedure </h2> <p>As mentioned, the core idea is to copy the data to a new, smaller EBS volume using <code>SELECT * INTO</code>—which deduplicates points as a side effect, because InfluxDB overwrites points that share the same measurement, tag set, and timestamp—and then replace the volumes.<br><br> This required making both volumes accessible to a single InfluxDB process, which took some workarounds.</p> <p>We followed these five steps:</p> <ol> <li>Set up a smaller EBS volume </li> <li>Symlink both old and new volumes into the InfluxDB directory structure </li> <li>Execute <code>SELECT * INTO</code> to deduplicate and copy data </li> <li>Copy any additional data from old to new volume </li> <li>Replace the old volume with the new one</li> </ol> <h3> 1. Prepare the New EBS Volume </h3> <p>We created a new EBS volume to hold the cleaned data. 
We refer to this as the <em>new</em> volume and the currently used one as the <em>old</em> volume.<br><br> Due to EBS limitations, we couldn't shrink the old volume, so our goal was to replace it.</p> <p>How to attach a volume is covered in the <a href="https://docs.aws.amazon.com/ebs/latest/userguide/ebs-attaching-volume.html" rel="noopener noreferrer">official AWS documentation</a>.</p> <h3> 2. Using Both Volumes in InfluxDB </h3> <p>InfluxDB OSS v1 doesn't natively support splitting storage across multiple volumes.<br><br> So we had to take some steps that aren't officially documented.</p> <h4> 2.1 Using Symbolic Links </h4> <p>By default, InfluxDB stores data in <code>/var/lib/influxdb/</code> with three subdirectories: <code>data</code>, <code>wal</code>, and <code>meta</code>. (<a href="https://docs.influxdata.com/influxdb/v1/administration/config/#dir--varlibinfluxdbdata" rel="noopener noreferrer">Docs</a>)</p> <p>Since each database has its own subdirectory, we could symlink only the new database's directories. In this explanation, we use <code>/mnt/new_ebs</code> as the mount point for the new volume:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>ln -s /mnt/new_ebs/wal/new_db /var/lib/influxdb/wal/new_db
ln -s /mnt/new_ebs/data/new_db /var/lib/influxdb/data/new_db
</code></pre> </div> <p><a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mmt9h1rv3j2wkiy856b.png" class="article-body-image-wrapper"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mmt9h1rv3j2wkiy856b.png" alt="Image description" width="800" 
height="469"></a></p> <p>Note: <a href="https://community.influxdata.com/t/how-can-i-find-my-data-after-upgrade/21233/4" rel="noopener noreferrer">Symlinks don’t always work reliably in InfluxDB</a> (the link discusses v2), so data may appear missing temporarily.<br><br> To avoid this, you can configure InfluxDB to use the real path directly rather than going through symlinks. </p> <blockquote> <p>NOTE: Using bind mounts instead of symlinks may be more robust:<br><br> (<a href="https://community.influxdata.com/t/how-to-move-var-lib-influxdb-to-a-different-location/30163/2" rel="noopener noreferrer">https://community.influxdata.com/t/how-to-move-var-lib-influxdb-to-a-different-location/30163/2</a>)</p> </blockquote> <h4> 2.2 Create the New Database </h4> <p>After setting up the symlinks, run the following InfluxQL command to create the new database:<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code><span class="k">CREATE</span> <span class="k">DATABASE</span> <span class="n">new_db</span> </code></pre> </div> <h3> 3. Copy and Deduplicate Data via <code>SELECT * INTO</code> </h3> <p>We ran commands like this, fully qualifying the destination and source with the database and retention policy (<code>autogen</code> is the default retention policy):<br> </p> <div class="highlight js-code-highlight"> <pre class="highlight sql"><code>SELECT * INTO "new_db"."autogen".:MEASUREMENT FROM "old_db"."autogen"./.*/ WHERE ... GROUP BY *
</code></pre> </div> <p>The <code>GROUP BY *</code> clause preserves tags as tags in the destination; without it, the tags of the source data are written as fields. You can also specify tags or fields instead of using <code>*</code>.<br><br> This was the most time-consuming step—it took months to process about 5 TiB of data spanning several years.</p> <h3> 4. 
Copy Additional Data to New Volume </h3> <p>Some remaining steps:</p> <ul> <li>Copy <code>/var/lib/influxdb/meta/meta.db</code> to <code>/mnt/new_ebs/meta/meta.db</code> </li> <li>If there are other databases beyond <code>old_db</code>, copy them as well: </li> </ul> <div class="highlight js-code-highlight"> <pre class="highlight shell"><code>cp -avi /var/lib/influxdb/data/other_db /mnt/new_ebs/data
cp -avi /var/lib/influxdb/wal/other_db /mnt/new_ebs/wal
</code></pre> </div> <p>Using <code>rsync</code> is also a valid option.</p> <h3> 5. Replace the Volumes </h3> <p>Finally, point InfluxDB to the new volume by adjusting mount points or configuration.<br><br> This can be done via the config file or environment variables (<a href="https://docs.influxdata.com/influxdb/v1/administration/config/#dir--varlibinfluxdbdata" rel="noopener noreferrer">Docs</a>).</p> <h2> Final Notes </h2> <p>While our main target was duplicate data, the <code>SELECT * INTO</code> clause offers flexibility to remove or transform data during migration.</p> <p>Again, note that this approach does <strong>not</strong> preserve the original database name.<br><br> Since InfluxDB doesn't support renaming databases, if you must retain the name, you'll need to re-create the database with the same name and run <code>SELECT * INTO</code> back into it.</p> About Axelspace's security initiatives (towards achieving zero-trust) mizu Fri, 11 Apr 2025 06:13:21 +0000 https://dev.to/axelspace/about-axelspaces-security-initiatives-towards-achieving-zero-trust-31bo <h2> Introduction </h2> <p>Hello! I’m mizu, a corporate IT and security engineer at <a href="https://www.axelspace.com/en/" rel="noopener noreferrer">Axelspace</a>, a startup in the satellite industry. 
It’s been two and a half years since I joined the company, and I wanted to take the opportunity to reflect on and share what I’ve worked on so far through this technical blog.</p> <p>If you’re a security engineer at a company with <strong>100+ employees and plans to scale rapidly</strong>, I hope our experience can serve as a helpful reference.</p> <p>In this post, I’ll focus on our initiatives aimed at realizing Zero Trust security.</p> <h2> Challenges and Our Approach to Zero Trust </h2> <p>With remote work becoming the norm and SaaS adoption expanding, several security challenges came to light within our organization:</p> <ul> <li>Security logs were siloed across products, making it difficult to gain a holistic view </li> <li>SaaS usage across employees was unmonitored (Shadow IT) </li> <li>Passwords were managed individually, lacking centralized control</li> </ul> <p>To address these issues, we selected and implemented the following tools in a phased manner. The key themes of this post are <strong>“visibility”</strong> and <strong>“control.”</strong></p> <h2> EDR: Endpoint Monitoring with CrowdStrike Complete </h2> <p>To strengthen endpoint security (PCs and servers), we replaced our legacy signature-based antivirus software with <strong>CrowdStrike Complete</strong>, a leading EDR (Endpoint Detection and Response) solution.</p> <p>While many organizations already use CrowdStrike, we opted for the Complete plan, which includes MDR (Managed Detection and Response). 
This allows CrowdStrike’s security analysts to handle triage and initial response actions, enabling swift and efficient incident handling.</p> <p><strong>Changes and highlights after implementation:</strong></p> <ul> <li>Visibility into all endpoint activities, including non-incidents, greatly improved traceability </li> <li>Triage and initial containment are handled by CrowdStrike MDR, significantly reducing operational burden </li> <li>Even complex threats like Living-off-the-Land (LotL) attacks can be detected and responded to effectively </li> </ul> <blockquote> <p><strong>[Operational Note]</strong><br><br> Many companies face a common issue after implementing EDRs: “We bought it, but can’t operate it.” By including MDR with our EDR, we significantly reduced alert fatigue and operational strain.<br><br> In my experience, incidents occurring several times a month are now resolved in under an hour each. If you’re struggling with EDR operations, MDR might be a solution worth exploring.</p> </blockquote> <h2> CASB/SWG: Shadow IT Visibility and Web Access Control with Netskope </h2> <p>We implemented <strong>Netskope</strong> as our CASB/SWG solution. 
This gave us the ability to log and control user access to SaaS and web services over HTTPS via endpoint-based agents.</p> <p><a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet5lb69e6cifhken0dvb.jpg" class="article-body-image-wrapper"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet5lb69e6cifhken0dvb.jpg" alt="Visibility and control via Netskope" width="800" height="408"></a></p> <p>Unlike traditional UTM or IDS tools, which can’t inspect HTTPS traffic, Netskope’s endpoint agent decrypts SSL/TLS traffic and allows inspection of HTTP methods, file names on Google Drive, and more.</p> <p><strong>Changes and highlights after implementation:</strong></p> <ul> <li>Shadow IT usage by employees became visible (<strong>Visibility</strong>) </li> <li>We gained policy-based control over risky services like online storage, chat tools, and social media (<strong>Control</strong>) </li> <li>Access via personal Gmail or Outlook accounts can now be blocked (<strong>Control</strong>) </li> </ul> <blockquote> <p><strong>[Operational Note]</strong><br><br> Due to HTTPS decryption, Netskope sometimes inadvertently blocks traffic from development environments. We maintain an exception list for specific executables and destination URLs to avoid interfering with development.<br><br> For now, we only use the CASB/SWG functions and have chosen not to implement NPA (Netskope Private Access, a VPN alternative) for several reasons.</p> </blockquote> <h2> Password Manager: Centralized and Shared Management with Keeper </h2> <p>We introduced <strong>Keeper Security</strong> as our password manager to replace free software and browser-based storage. 
This enables secure password sharing*, auto-fill, and generation capabilities.</p> <p>We also subscribed to the <strong>BreachWatch</strong> add-on, which provides dark web monitoring, weak password detection, and scoring by user and organization.</p> <p>* While password sharing should generally be avoided, there are still unavoidable scenarios where shared accounts are required.</p> <p><a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvh8qqlvgf9mcr95z2s9.png" class="article-body-image-wrapper"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvh8qqlvgf9mcr95z2s9.png" alt="Password Manager Migration" width="800" height="169"></a><br> <em>Password Manager migration illustration</em></p> <p>Because info-stealer malware that extracts browser-stored credentials remains prevalent, a password manager like Keeper also helps defend against malware-related threats.</p> <p><strong>Changes and highlights after implementation:</strong></p> <ul> <li>Standardized password management, policies, and sharing methods (<strong>Control</strong>) </li> <li>Improved operational efficiency via auto-fill </li> <li>Admins can now monitor password breaches, weak usage, and scoring (<strong>Visibility</strong>) </li> </ul> <blockquote> <p><strong>[Operational Note]</strong><br><br> Keeper supports importing from various tools (KeePass, Google Password Manager, etc.), which made the migration smooth and user-friendly.<br><br> We didn’t mandate usage and started with a small license count to reduce initial costs. 
Expanding Keeper’s usage across the org is our next goal.</p> </blockquote> <h2> SIEM: Centralized Logging and Visualization with Elastic Cloud </h2> <p>To manage and visualize logs from all our security tools, we implemented <strong>Elastic Cloud</strong> as our SIEM platform. By integrating logs into Elastic Cloud, we enhanced visibility, centralized monitoring, and enabled long-term storage.</p> <p>Notably, our CrowdStrike plan does not retain event logs long-term, so we strongly felt the need to centralize them in a SIEM for traceability during incidents.</p> <p><a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2p7bhjpwz8h8ktm7gta.png" class="article-body-image-wrapper"><img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2p7bhjpwz8h8ktm7gta.png" alt="Elastic integration" width="698" height="473"></a><br> <em>Elastic integration illustration</em></p> <p><strong>Changes and highlights after implementation:</strong></p> <ul> <li>Logs from various tools can now be viewed directly on Elastic Cloud (<strong>Visibility</strong>) </li> <li>Activity across users, devices, and email addresses can be traced across tools (<strong>Visibility</strong>) </li> <li>Dashboards provide executives with visual summaries of security incidents and risk scoring (<strong>Visibility</strong>) </li> </ul> <blockquote> <p><strong>[Operational Note]</strong><br><br> Since Elastic Cloud is SaaS-based, we could skip infrastructure setup and begin operations quickly. 
However, proper configuration of ILM (Index Lifecycle Management) is still required to manage data retention.<br><br> Fortunately, the Elastic Support Assistant (AI chatbot) was very effective in resolving issues.<br><br> We plan to cover the selection process and comparison with other SIEM tools in a future article.</p> </blockquote> <h2> Closing Thoughts </h2> <p>While we’ve successfully deployed these solutions, I’d say we’re at about <strong>50%</strong> when it comes to fully utilizing their features and implementing the right policies.</p> <p>Going forward, we aim to deepen our understanding of these tools and expand improvements to previously postponed security areas.</p> <p>Some of our next steps include:</p> <ul> <li> <strong>ZTNA</strong>: Building a secure, VPN-less network </li> <li> <strong>ASM</strong>: Risk evaluation and vulnerability assessment of IT assets </li> <li> <strong>Security Education</strong>: Phishing simulations and security e-learning programs </li> </ul> <h2> We are hiring!! </h2> <p>Axelspace is actively hiring across multiple roles, and we're especially looking for security engineers!</p> <p>We’d love to hear from you if any of the following apply:</p> <ul> <li>You’re interested in the space industry </li> <li>You want to work at a fast-growing startup </li> <li>You want to have autonomy and impact in a rapidly scaling company</li> </ul> <p>If you're curious, let’s start with a casual chat!<br> <a href="https://hrmos.co/pages/axelspace/jobs/3000000006" rel="noopener noreferrer">https://hrmos.co/pages/axelspace/jobs/3000000006</a></p>