Troubleshooting Plaso Issues - Memory Edition

TL;DR:
This blog post describes my thought process when debugging an issue that came up in plaso. It is just meant as easy reading for people that come across issues in the tool and want to start debugging them.

Longer version:
Despite plaso of course being close to perfect, as everyone clearly knows, there come times when issues arise and the time to do some debugging is upon us.

We've tried to make some of it easier by building debugging capabilities into the tool and documenting some of this effort in our glorious troubleshooting page. However, sometimes these methods and tips may not be completely sufficient, so I'm going to quickly go over a short exercise I just completed.

We've known for some time that the tool consumed quite a lot of memory, a bit too much for our own taste, yet we didn't really dig too deep into it until an issue got raised on our issue site (if this isn't an incentive to create an issue if you detect one, then I don't know what is).

The issue in question had to do with the tool's memory footprint, essentially boiling down to this lovely quote:
"On a 20-core, 20GB machine Plaso has 11 of 17 workers that are consuming more than 500MB apiece, two peaking out at 1.2GB and totaling 14GB."
This made us realize we needed to address this issue now, not later. What we also realized is that we really didn't have a whole lot of options to profile the tool. There is of course the pprof.py tool, yet that has its limitations since it is designed to parse a single file (or to profile the storage), and comes in handy mostly when troubleshooting a slow parser or plugin.

Then we've got our debug log entries, which again don't really help out here since there wasn't any issue with any of the parsers. The same goes for running the tool in single process mode with debugging turned on (-d --single_process for those not familiar with it). What that does is drop you into a Python debug shell, an immensely useful tool if you discover a crash. However, nothing was crashing here; the tool was behaving completely as designed, no crashes, no issues... it was just consuming way too much memory.

The first step we took in troubleshooting this issue was to add some memory profiling capabilities to the tool. That code was pushed in here and provided us with a mechanism for seeing what each worker was storing in memory. Performing the memory profiling is simple enough:

log2timeline.py --profile --profile-sample-rate=5000 plaso.db image.raw

This will create a #.hpy file per worker, where # is the number of the worker.

Guppy has a built-in profile browser to view the .hpy files, e.g.:
from guppy import hpy
heapy = hpy()
heapy.pb('0.hpy')  # opens the profile browser on worker 0's output

Looking at the .hpy files revealed, among other things, this:





That is, we can see that each worker starts to "hoard" unicode objects and event objects. Why would the tool do that? What would cause each worker to collect all these event objects? (The unicode objects are simply pickled event objects.)

The obvious answer is the storage queue: we have several workers (N workers) that continuously parse files and extract events out of them, pushing each extracted event onto a storage queue. Then there is a single storage process that monitors this queue. And not only is there just a single storage process, it is also burdened with the enormous task of serializing the event objects as protobufs, a notoriously slow operation in Python (and one that is easily seen by profiling the storage process using pprof).
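
To make the shape of that problem concrete, here is a minimal, self-contained sketch (simplified Python, not actual plaso code) of the same pattern: several fast producers feeding a single slow consumer through an unbounded multiprocessing queue. The event counts and the sleep that stands in for serialization are made up; the point is simply that the backlog, and with it memory usage, grows for as long as extraction outpaces storage.

import multiprocessing
import time


def Worker(event_queue):
  """Simulates an extraction worker: produces events quickly."""
  for number in range(5000):
    event_queue.put({u'event': number})


def StorageProcess(event_queue):
  """Simulates the single storage process: consumes events slowly."""
  while True:
    event_object = event_queue.get()
    if event_object is None:
      break
    time.sleep(0.001)  # stands in for the slow protobuf serialization


if __name__ == '__main__':
  # An unbounded queue: nothing stops the backlog from growing.
  event_queue = multiprocessing.Queue()
  workers = [
      multiprocessing.Process(target=Worker, args=(event_queue,))
      for _ in range(4)]
  storage = multiprocessing.Process(
      target=StorageProcess, args=(event_queue,))
  for process in workers + [storage]:
    process.start()
  for process in workers:
    process.join()
  event_queue.put(None)  # signal the storage process to stop
  storage.join()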

At this point we were pretty satisfied that we had discovered where the issue was. There are a few solutions to this problem:

  • Create more storage processes, that is to do more parallel processing on the storage
  • Make the storage process faster so that it can keep up
  • Create an upper limit on the number of events pushed to the storage queue to prevent memory buildup in each worker
Since the current design of the storage doesn't really allow scaling it up, we were basically left with a quick fix of putting an upper limit on the queue length, plus the longer term goal of rewriting the storage layer.

That resulted in this code change: the creation of an upper limit on the number of events that can be pushed onto the storage queue.
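
Conceptually, the fix amounts to giving the queue a maximum size, so that workers block instead of piling up events once the storage process falls behind. A trivial illustration using the standard library (the actual change lives in plaso's own queue classes, and the limit of 10000 here is just a made up number):

import multiprocessing

# A bounded queue: once maxsize items are waiting, put() blocks until the
# consumer catches up, which caps how much data ends up buffered in memory.
event_queue = multiprocessing.Queue(maxsize=10000)

# A worker would block on this call whenever the storage process is more
# than maxsize events behind:
# event_queue.put(event_object)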

Quick testing revealed that the tool now consumes considerably less memory.... and thus we are done, right?

I still wasn't completely satisfied, so I decided to do some more testing. This time I wanted to record, for each read from the storage queue (by the storage process), the time of the read and how many items were left in the queue. The idea was to graph that and confirm our initial theory, and also to measure the effect the previous upper limit change really had.

Thus I went ahead and conducted some very scientific research, where I would take a single test disk image I had lying around and run a modified version of the tool against it, both before and after the code change.

The disk image of my choice was taken from the SANS 508 class, the Windows 7 64-bit disk image that belonged to the infamous user nfury.

Parsing that disk image resulted in 1,324,689 events being extracted, or roughly ~1.3M events.

To generate the graphs I only had to make minor changes to the code. The goal was to record each event object read from the storage queue and added to the storage library, and the time at which that was done. This is a one-off change that was only required for debugging purposes, so it will not be checked in at any point.

Since all the storage related activity lives in a single file, the storage.py file, that was the only file that required any changes.

The first change was at the top of the file:

import time

This simply adds the time library to the file, so that I can call the "time.time()" function. The way the tool works is that the engine creates an object called the storage writer, which is a queue consumer object (no need to go into full details here). Basically, when the engine starts the storage it calls a function called "_ConsumeEventObject" for each entry read from the queue, which in turn adds that object to the storage file. This is the perfect function in which to add my time measurements. To make it a bit clearer I'm posting the entire content of the storage writer here, with the changes I made marked with comments:

class StorageFileWriter(queue.EventObjectQueueConsumer):
  """Class that implements a storage file writer object."""

  def __init__(self, storage_queue, output_file, buffer_size=0, pre_obj=None):
    """Initializes the storage file writer.

    Args:
      storage_queue: the storage queue (instance of Queue).
      output_file: The path to the output file.
      buffer_size: The estimated size of a protobuf file.
      pre_obj: A preprocessing object (instance of PreprocessObject).
    """
    super(StorageFileWriter, self).__init__(storage_queue)
    self._buffer_size = buffer_size
    self._output_file = output_file
    self._pre_obj = pre_obj
    self._storage_file = None
    # Added for this experiment: open a hard coded log file for the
    # measurements.
    self._logfile = open('/tmp/mylogfile', 'wb')

  def _ConsumeEventObject(self, event_object, **unused_kwargs):
    """Consumes an event object callback for ConsumeEventObjects."""
    self._storage_file.AddEventObject(event_object)
    # Added for this experiment: record the current time and the number of
    # items left in the queue.
    self._logfile.write(u'{0:f} {1:d}\n'.format(time.time(), len(self._queue)))

  def WriteEventObjects(self):
    """Writes the event objects that are pushed on the queue."""
    self._storage_file = StorageFile(
        self._output_file, buffer_size=self._buffer_size, pre_obj=self._pre_obj)
    self.ConsumeEventObjects()
    self._storage_file.Close()
    # Added for this experiment: close the log file when done.
    self._logfile.close()

It boiled down to a measly three lines of code (well, four if you include the import statement)... When the storage writer is initialized I create a text file on my system. To make this extremely simple I just hard coded the filename for this file.

Then I need to write the time and the current length of the queue to this text file each time an event object is read from the queue. That is done inside the _ConsumeEventObject function with a simple write operation on the log file.

And finally I need to close the file when all event objects have been consumed. And that's it; now I just build the tool and run it against the disk image.
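
For reference, the test runs themselves were just normal invocations of the tool against the disk image, along the lines of the following (these file names are simply examples):

log2timeline.py test.plaso nfury_image.dd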

All we now need is some way to create simple graphs from this text data, and gnuplot comes to the rescue there. Let's create this really simple gnuplot script:

$ cat make_simple_storage_graph.gnu 
#!/usr/bin/gnuplot

set term png
set output "my_picture.png"
set title "Storage Queue Length"
set xlabel "Time"
set ylabel "Number of objects in queue"

plot "/tmp/mylogfile" title "" with lines
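
Generating the picture is then just a matter of pointing gnuplot at the script (or making the script executable, thanks to the shebang line):

$ gnuplot make_simple_storage_graph.gnu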

Now we are ready to do some profiling. Armed with the changes to the code and my super 31337 gnuplot script, I ran the tool against the disk image, using the codebase prior to all changes.


Remember when looking at this graph that there were roughly ~1.3M events extracted in total. And the peak of the storage queue length is close to 900k... so at one point the tool's storage queue consists of almost all the events extracted from the disk image. This is obviously not optimal.


It looks like the tool handles the load quite fine to begin with, but then completely loses control and things start to grow exponentially. It isn't until the actual processing or extraction is done that the queue length starts to drop (I confirmed that the time when the curve starts dropping matches when the tool signaled the end of extraction). This obviously has a very negative effect on memory usage, since all of this data is kept in memory until it gets flushed to disk.

One of the ideas I had (besides what Joachim implemented) was to add threading to the storage and offload the queue flushing to a separate thread. That way, I thought, we could get rid of some of the time it takes to process events into the storage and serialize them as protobufs.
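
For the curious, the experiment was roughly shaped like the sketch below (simplified, not the actual experimental CL): the consumer pulls events off the storage queue as fast as it can and hands them to a helper thread that performs the slow AddEventObject call. Everything here apart from AddEventObject and _ConsumeEventObject is made up for illustration. In hindsight, since the serialization itself is no faster, this mostly just moves the backlog from the storage queue into an in-process buffer.

import Queue  # Python 2 module name for the in-process buffer
import threading


class ThreadedStorageWriter(object):
  """Sketch: drain the storage queue quickly, serialize on a helper thread."""

  def __init__(self, storage_file):
    self._storage_file = storage_file
    self._buffer = Queue.Queue()
    self._flush_thread = threading.Thread(target=self._FlushLoop)
    self._flush_thread.start()

  def _FlushLoop(self):
    """Serializes buffered event objects in the background."""
    while True:
      event_object = self._buffer.get()
      if event_object is None:
        break
      self._storage_file.AddEventObject(event_object)  # the slow protobuf step

  def _ConsumeEventObject(self, event_object, **unused_kwargs):
    """Pulls events off the storage queue as fast as possible."""
    self._buffer.put(event_object)

  def Close(self):
    """Flushes the remaining buffer and stops the helper thread."""
    self._buffer.put(None)
    self._flush_thread.join()
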
That experiment resulted in this:




As you can see, very similar results, although flatter to begin with, BUT the peak is even worse: it goes ABOVE 1M events, while the other topped out at around 900k...

So that CL will never see the light of day... I may rethink that one and come up with a different way of offloading the serialization work to make the queue empty faster. There are a few problems with the current storage layer that make it hard to scale up by adding more storage processes. This will all be re-thought and re-engineered, hopefully soon enough.

Then after the memory CL got pushed I re-ran the tests, and this is the result:






As you can now see, considerably less load... the memory usage got drastically reduced, the cap is around 120k events, and interestingly enough this did not have any negative impact on the total time it took to parse the disk image. It took roughly the same amount of time to process it.

This also tells me that by optimizing the storage we can considerably reduce the time it takes to process a disk image, since the bottleneck is the storage processing... so hold on tight for updates to the storage that will hopefully drastically reduce the time it takes to fully process a disk image.









And that is it; this is how one can debug a tool like plaso and validate that a proposed fix actually helped solve the issue. I realize there are probably tons of other ways to achieve this, and this is in no way meant as the "golden" method for this type of debugging, but this is the way I did it this time and it worked for this particular case, so I wanted to share it in case someone else comes across a similar issue.
