Unlocking Fleetspeak Large Scale and Reliability with Cloud Spanner

Authored by Frank Tobia, Ike Okoro, Matt Pfeiffer and Dan Aschwanden

If GRR and Fleetspeak are foundational tools in your Digital Forensics and Incident Response (DFIR) work, you have likely experienced the Fleetspeak database becoming a key challenge as you scale towards tens of thousands of endpoints. The database handles the constant flow of messages, tracks client states, and manages operational data; under heavy load, performance and stability can become significant concerns. In this article we introduce a fundamental update to Fleetspeak's datastore infrastructure: Google Cloud Spanner as a new datastore option, which makes Fleetspeak ready for the most demanding, large-scale deployments.

Fleetspeak is GRR’s communication layer. The Fleetspeak frontend (server) exchanges Protocol Buffer based messages with Fleetspeak agents (clients) on behalf of GRR, providing GRR with a communication conduit. To operate GRR at large scale, Fleetspeak must perform smoothly for whatever client fleet size your GRR deployment requires.

Why Spanner is a Breakthrough for Fleetspeak

Spanner isn't just another database; it's Google Cloud's globally distributed, strongly consistent database service. It's built from the ground up for massive scale and high availability, making it an ideal fit for the rigorous demands of a large Fleetspeak deployment. For Fleetspeak, this integration means saying goodbye to some of the most significant scaling bottlenecks.

Here’s the high-level view of why Spanner is a game-changer for Fleetspeak:

  • Exceptional Performance: Spanner is engineered to handle immense throughput. This means that even with the hundreds of thousands of messages common for a large fleet, Fleetspeak can efficiently ingest and process data, leading to quicker client interrogations and faster command execution. It is a step change in how responsive your GRR client management becomes.
  • Bulletproof Stability & Reliability: In incident response, system stability is non-negotiable. Spanner's core architecture provides inherent high availability and robust disaster recovery capabilities. This gives Fleetspeak a dramatically more reliable foundation, letting you worry less about database bottlenecks or even outages impacting your critical operations.
  • Scale to 50,000+ Clients: This integration makes it possible to support GRR/Fleetspeak deployments of 50,000 clients and beyond. This is an important improvement for large organizations that previously hit scaling limits with traditional databases such as the MySQL-based datastore that Fleetspeak has supported for a long time. Note that Fleetspeak’s MySQL datastore option (either on Cloud SQL or self-hosted) is not going away and remains available if you operate GRR on a smaller scale.

Under the Hood: The Engineering Effort

Adding Spanner as a datastore involved several upgrades within the Fleetspeak datastore codebase. This included:
  • Flexible Configuration: The config.proto file was updated to include Spanner as a first-class, configurable datastore option. This allows Fleetspeak deployments to select Spanner via their components_config (see a sample configuration snippet below). The core components.go file was refactored to select either the MySQL datastore (the old default) or the new Spanner-based datastore, depending on the configuration.
  • Optimized Schemas: New database schemas were designed specifically to leverage Spanner's unique capabilities, ensuring efficient data storage and retrieval at scale.
  • Core Datastore Refactoring: Critical components responsible for data persistence (the stores) were substantially updated. This involved new implementations of clientstore.go, filestore.go, messagestore.go, and broadcaststore.go that reliably interact with the Spanner datastore to manage client data, file metadata, messages, and broadcasts.

components_config {
  spanner_config {
    project_id: "YOUR_GOOGLE_CLOUD_PROJECT_ID_HERE"
    instance_name: "fleetspeak-instance"
    database_name: "fleetspeak"
    pubsub_topic: "YOUR_GOOGLE_CLOUD_PUBSUB_TOPIC_HERE"
    pubsub_sub: "YOUR_GOOGLE_CLOUD_PUBSUB_SUBSCRIPTION_HERE"
  }

  https_config {
    ...
  }

  admin_config {
    ...
  }
}

This extensive update is visible across the commits in PR #561, which was successfully merged on March 20, 2025.

Getting Started with Fleetspeak and Spanner

Ready to deploy Fleetspeak with Spanner? The process is well-defined and starts with preparing your Google Cloud environment:

  1. Create Your Spanner Instance: You'll need to create a Spanner Instance in your Google Cloud project before setting up Fleetspeak. The setup.sh script within the Fleetspeak repository will then handle creating the default fleetspeak Database and its necessary Tables within that datastore.
  2. Set up Pub/Sub: Fleetspeak's Spanner integration also relies on Google Cloud Pub/Sub. For this reason you'll need to create a dedicated Pub/Sub Topic and Subscription. This system ensures that backlogged messages are processed efficiently by triggering the ProcessMessages() method.
  3. Configure Fleetspeak: Update your Fleetspeak components_config file (see the sample snippet above). You'll need to provide your Google Cloud Project ID, Spanner Instance and Database names, and the names of your Pub/Sub Topic and Subscription. This configuration points Fleetspeak to your new Spanner datastore. Note that the configuration now requires that you remove the MySQL data source name (mysql_data_source_name) and provide the Spanner configuration in its place.
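
The datastore choice in step 3 can be pictured with a small sketch. This is illustrative Python, not Fleetspeak's Go code from components.go: the names SpannerConfig, ComponentsConfig, and select_datastore are hypothetical, and rejecting a configuration that sets both datastores is an assumption drawn from the requirement above to remove mysql_data_source_name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpannerConfig:
    """Mirrors the spanner_config fields shown in the sample snippet."""
    project_id: str
    instance_name: str
    database_name: str
    pubsub_topic: str
    pubsub_sub: str


@dataclass(frozen=True)
class ComponentsConfig:
    """Hypothetical stand-in for the datastore part of components_config."""
    mysql_data_source_name: str = ""
    spanner_config: Optional[SpannerConfig] = None


def select_datastore(cfg: ComponentsConfig) -> str:
    """Return "spanner" or "mysql", requiring exactly one to be configured."""
    has_mysql = bool(cfg.mysql_data_source_name)
    has_spanner = cfg.spanner_config is not None
    if has_mysql == has_spanner:
        # Either both are set or neither is: assumed to be a config error.
        raise ValueError(
            "configure exactly one of mysql_data_source_name or spanner_config"
        )
    return "spanner" if has_spanner else "mysql"
```

A check like this, run before deployment, would catch a leftover mysql_data_source_name early instead of at server startup.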

Google Cloud Resources

You will have to provision the Spanner and Pub/Sub Google Cloud resources (a Spanner Instance, plus a Pub/Sub Topic and Subscription) for each GRR/Fleetspeak deployment.


These resources require that the Spanner API and the Pub/Sub API are enabled in the Google Cloud Project that runs the Google Kubernetes Engine (GKE) cluster hosting the Fleetspeak workloads.
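
As a rough illustration, the provisioning described here could be scripted by assembling gcloud command lines. The sketch below only builds the commands and does not run them. The function name provisioning_commands, the instance configuration regional-us-central1, the single node, and the description text are assumptions chosen for the example; pick values that fit your deployment.

```python
import shlex
from typing import List


def provisioning_commands(
    project_id: str,
    instance: str,
    topic: str,
    subscription: str,
    instance_config: str = "regional-us-central1",  # assumed region
    nodes: int = 1,  # assumed size
) -> List[List[str]]:
    """Build gcloud argv lists for the APIs, Spanner instance, and Pub/Sub."""
    project = f"--project={project_id}"
    return [
        # Enable the Spanner and Pub/Sub APIs in the project.
        ["gcloud", "services", "enable",
         "spanner.googleapis.com", "pubsub.googleapis.com", project],
        # Create the Spanner instance; setup.sh creates the database later.
        ["gcloud", "spanner", "instances", "create", instance,
         f"--config={instance_config}",
         "--description=Fleetspeak datastore",
         f"--nodes={nodes}", project],
        # Create the dedicated Pub/Sub topic and its subscription.
        ["gcloud", "pubsub", "topics", "create", topic, project],
        ["gcloud", "pubsub", "subscriptions", "create", subscription,
         f"--topic={topic}", project],
    ]


if __name__ == "__main__":
    # Print shell-quoted commands for review instead of executing them.
    for argv in provisioning_commands(
        "my-project", "fleetspeak-instance", "fleetspeak-topic", "fleetspeak-sub"
    ):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before running each one, for example with subprocess.run(argv, check=True).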


To access both the Spanner and the Pub/Sub resources the Fleetspeak workloads on GKE will require the following roles:


We recommend granting the roles through Workload Identity Federation for GKE (WIF). WIF lets you narrow the scope of the assigned roles to specific Kubernetes Service Accounts and Kubernetes Namespaces. This gives fine-grained control over the permissions and supports a least-privilege approach.
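
To make the scoping concrete, the sketch below builds the IAM principal identifiers that WIF uses for a single Kubernetes Service Account and for a whole Namespace. The identifier formats follow the GKE documentation as we understand it, so verify them against the current docs before use. The helper names, the namespace fleetspeak, and the service account fleetspeak-frontend are hypothetical.

```python
def ksa_principal(
    project_number: str, project_id: str, namespace: str, service_account: str
) -> str:
    """IAM principal for one Kubernetes Service Account in one Namespace."""
    pool = f"{project_id}.svc.id.goog"  # the project's workload identity pool
    return (
        f"principal://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool}"
        f"/subject/ns/{namespace}/sa/{service_account}"
    )


def namespace_principal_set(
    project_number: str, project_id: str, namespace: str
) -> str:
    """IAM principal set covering every workload in one Namespace."""
    pool = f"{project_id}.svc.id.goog"
    return (
        f"principalSet://iam.googleapis.com/projects/{project_number}"
        f"/locations/global/workloadIdentityPools/{pool}"
        f"/namespace/{namespace}"
    )


if __name__ == "__main__":
    # Hypothetical project, namespace, and service account names.
    print(ksa_principal("123456789", "my-project", "fleetspeak", "fleetspeak-frontend"))
    print(namespace_principal_set("123456789", "my-project", "fleetspeak"))
```

Either string can be passed as the --member value of gcloud projects add-iam-policy-binding together with the role to grant. The service-account form is the narrower of the two and is usually the better fit for a least-privilege setup.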

Seamless Development and Testing with Emulators

One of the developer-friendly aspects is the support for Spanner and Pub/Sub emulators. This allows you to fully develop and test your Fleetspeak setup with Spanner locally. No need for live Cloud connectivity or costs during development or testing cycles! A convenient docker-compose.yaml is also provided to easily spin up these emulators and run the Fleetspeak test suite locally. The build.yml workflow in the Fleetspeak GitHub repository itself utilizes these emulators for testing the Spanner integration (see the build-test-linux step for more details).
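
Google Cloud client libraries generally detect the emulators through the SPANNER_EMULATOR_HOST and PUBSUB_EMULATOR_HOST environment variables. The following sketch prepares such an environment for a local test run. The helper name emulator_env is hypothetical, and the host:port values are the emulators' usual defaults, assumed here rather than taken from the repository's docker-compose.yaml.

```python
import os
from typing import Dict, Mapping, Optional


def emulator_env(
    spanner_host: str = "localhost:9010",  # assumed Spanner emulator gRPC address
    pubsub_host: str = "localhost:8085",   # assumed Pub/Sub emulator address
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment that points clients at the emulators."""
    env = dict(os.environ if base is None else base)
    env["SPANNER_EMULATOR_HOST"] = spanner_host
    env["PUBSUB_EMULATOR_HOST"] = pubsub_host
    return env
```

The returned mapping can be handed to a child process, for example subprocess.run(["go", "test", "./..."], env=emulator_env(), check=True). The exact test command is an assumption; the build.yml workflow shows how the project itself invokes the tests.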

This major enhancement, finalized with the merge of PR #561, marks a new chapter for large-scale Fleetspeak deployments. It promises a significantly smoother, more performant, and fundamentally more scalable and stable experience.

For more detailed technical information on setting up and configuring Spanner for Fleetspeak, refer to the SPANNER.md documentation file in the Fleetspeak GitHub repository.

And last but not least, for a podcast version of this blog post you can listen in here.

Stay tuned as we continue to explore these powerful tools! We are also working on porting the GRR datastore to Spanner to allow your GRR instance to scale to even larger client fleets.

As always, the GRR user group is the place for questions and collaboration.
