Find the needle faster with hashR data

Co-author: Janosch Köpper

A challenge in compromise investigations is the volume of data to be analysed. In a previous article we showed how hashR can be used to generate custom hash sets. In this article we demonstrate how such a custom hash set can greatly speed up your investigation by helping you find files (new binaries, modified configs) that are not part of a base (operating system) image.

In this article we are going to walk through investigating a compromised GCE VM running CentOS. Let’s assume we get an alert from our detection systems that this VM connected to an IP associated with a nation state (APT) actor.

Processing and preparing the data 

First, we will run dfTimewolf’s gcp_forensics to acquire the disk from the compromised VM and prepare our investigative environment:

dftimewolf gcp_forensics --instances <compromised_vm_name> --analysis_project_name <analysis_project_name> <compromised_vm_project_name>

This command will perform the following steps: 

  1. Copy the disk from the compromised VM to our analysis GCP project
  2. Create an analysis VM in our analysis GCP project 
  3. Attach the disk from the compromised VM to the analysis VM
  4. Install Plaso and other open-source forensic tooling 

Once this process is complete we can ssh into the analysis VM and run the following command to process the source disk with Plaso: 

sudo log2timeline.py --yara_rules rules.yar --storage-file timeline.plaso /dev/<compromised_disk_name>

Plaso will scan the content of all files and tag them if they match the YARA rules in the rules.yar file. In our case rules.yar contains a rule to detect ELF files; we will use that later to quickly filter all events related to ELF files in Timesketch. Once Plaso processing is complete, we upload the output .plaso file to Timesketch using dfTimewolf or via the web UI.
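For illustration, such a rule could look like the sketch below (the actual rules.yar used in this scenario is not shown in the article). It matches the ELF magic bytes at offset 0 and uses the rule name referenced later in our Timesketch queries:

```
rule executables_ELF
{
    meta:
        description = "Matches ELF executables by their magic bytes"
    strings:
        $elf_magic = { 7f 45 4c 46 }  // \x7fELF
    condition:
        $elf_magic at 0
}
```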

Timesketch has a hashR lookup analyzer that checks SHA256 hash values extracted from events against a hashR database and tags events in Timesketch accordingly. For the analyzer to work properly with our hashR database, it needs to be configured by setting the database connection information in the timesketch.conf configuration file of our Timesketch installation.

hashR config in timesketch.conf:

#-- hashR integration --#
# Uncomment and fill this section if you want to use the hashR lookup analyzer.
# Provide hashR postgres database connection information below:
HASHR_DB_USER = 'hashRuser'
HASHR_DB_PW = 'hashRpass'
HASHR_DB_ADDR = '127.0.0.1'
HASHR_DB_PORT = '5432'
HASHR_DB_NAME = 'hashRdb'

# The total number of unique hashes that are checked against the database is
# split into multiple batches. This number defines how many unique hashes are
# checked per query. 50000 is the default value.
HASHR_QUERY_BATCH_SIZE = '50000'

# Set as True if you want to add the source of the hash ([repo:imagename]) as
# an attribute to the event. WARNING: This will increase the processing time
# of the analyzer!
HASHR_ADD_SOURCE_ATTRIBUTE = False
The configuration options shown above are commented out in the default timesketch.conf; they need to be uncommented and updated to refer to the hashR PostgreSQL database in use.

The HASHR_QUERY_BATCH_SIZE option can be left commented out and only needs to be tweaked if you notice performance issues, such as a slow connection to the database or limited memory availability. The value defines how many unique hashes are checked against the database in one batched select statement.

hashR also records which base images each hash value was seen in. Setting HASHR_ADD_SOURCE_ATTRIBUTE = True will add an attribute to each event containing this source image information. For example, the resulting event attribute would look like this:

hashR_sample_sources: ["GCP:ubuntu-os-cloud-ubuntu-2004-focal-v20220905"]

This will increase the processing time of the analyzer, so activate this feature only if you definitely need the information. We can always query the hashR database directly if we need this information afterwards.

Important: for the changes in the timesketch.conf file to take effect, the Timesketch Docker container needs to be restarted (docker-compose restart)!

Since our hashR instance is configured to ingest all public GCP base operating system (OS) images, it contains the hashes of all known files in the compromised CentOS instance of this scenario. To run the analyzer on the imported Plaso data, we wait for the Plaso file to be fully indexed and then start the "hashR lookup" analyzer via the Analyze tab in Timesketch: we select the timeline that we just uploaded and the "hashR lookup" analyzer from the list, then hit the green "Run 1 analyzers on 1 timelines" button.

Analyzing the data 

In this article we are dealing with a malware sample that uses an encrypted configuration file and was specifically built for this compromise scenario. Running grep to find the bad IP address on the disk does not yield any results. Nor does querying all hashes against our favorite intelligence platform. Let's see what we can do with the hashR data.

At this point we already have our timeline uploaded to Timesketch and ready to be analyzed. Our timeline contains 308,107 events, which is quite a lot to look at given that we don’t have a good starting point for our investigation. Let’s run a query to see how many files are allocated on the file system: 

data_type:"fs:stat" AND timestamp_desc:"Creation Time"

This leaves us with 72,244 events, which is still quite a lot. It is a reasonable assumption that the malware is some type of binary file. We can search for all ELF files using the YARA rule we passed to Plaso: 

yara_match:"executables_ELF" AND timestamp_desc:"Creation Time"

4,256 hits, much better, but still too many for manual analysis. We can now filter out the files that were tagged by the hashR analyzer as originating from base OS images:

yara_match:"executables_ELF" AND timestamp_desc:"Creation Time" AND NOT tag:known-hash 

This gives us 141 binaries, a number that sounds reasonable for a manual review. While scrolling through the results we notice a lot of hits related to ELF files being created in usr/lib/apache2; let's filter these out, as this is likely an artifact of a legitimate apache2 installation.

yara_match:"executables_ELF" AND timestamp_desc:"Creation Time" AND NOT tag:known-hash AND NOT filename:"*usr/lib/apache2*"

We are down to 23 files for review. One of the first files on the list has an interesting path: /usr/bin/sshd. The sshd binary is usually located at /usr/sbin/sshd and present in a known base image, so let's take a closer look at the file.
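The excerpt below can be reproduced with a hex dump tool such as xxd, seeking to the offset of interest with -s and limiting the length with -l (the demo below uses a small stand-in file; on the analysis VM you would point xxd at the suspicious binary on the mounted evidence disk):

```shell
# xxd -s seeks to a byte offset, -l limits the dump length.
# Stand-in file for the demo; the real target would be the
# suspicious /usr/bin/sshd on the mounted evidence disk.
printf 'XXXXHIDE_THIS_SHELL=demo' > /tmp/suspect-demo
xxd -s 4 -l 8 /tmp/suspect-demo
```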


00003e80: c7b8 0000 0000 e875 ecff ff48 8d85 e0ef .......u...H....
00003e90: ffff 488d 3539 6b04 0048 89c7 e86f ecff ..H.59k..H...o..
00003ea0: ff48 8945 e848 8b45 e848 89c7 e86f ecff .H.E.H.E.H...o..
00003eb0: ff48 c745 e800 0000 00e9 8d02 0000 488d .H.E..........H.
00003ec0: 85e0 efff ff48 bb48 4944 455f 5448 4948 .....H.HIDE_THIH
00003ed0: 8918 48b9 535f 5348 454c 4c3d 4889 4808 ..H.S_SHELL=H.H.
00003ee0: 48bb 5820 7365 6420 272f 4889 5810 48b9 H.X sed '/H.X.H.
00003ef0: 2373 7368 6466 6c61 4889 4818 48bb 672f #sshdflaH.H.H.g/
00003f00: 2c24 2164 2720 4889 5820 48b9 2f73 6269 ,$!d' H.X H./sbi
00003f10: 6e2f 6966 4889 4828 48bb 7570 2d6c 6f63 n/ifH.H(H.up-loc
00003f20: 616c 4889 5830 48b9 203e 202f 746d 702f alH.X0H. > /tmp/
00003f30: 4889 4838 48bb 7379 7374 656d 642d 4889 H.H8H.systemd-H.
00003f40: 5840 48b9 7072 6976 6174 652d 4889 4848 X@H.private-H.HH
00003f50: 66c7 4050 7565 c640 5200 488d 85e0 efff f.@Pue.@R.H.....
00003f60: ff48 8d35 6a6a 0400 4889 c7e8 a0eb ffff .H.5jj..H.......
00003f70: 4889 45e8 488b 45e8 4889 c7e8 a0eb ffff H.E.H.E.H.......
00003f80: 48c7 45e8 0000 0000 488d 85e0 efff ff48 H.E.....H......H
00003f90: bb48 4944 455f 5448 4948 8918 48b9 535f .HIDE_THIH..H.S_
00003fa0: 5348 454c 4c3d 4889 4808 48bb 7820 7365 SHELL=H.H.H.x se
00003fb0: 6420 2d69 4889 5810 48b9 202d 6520 2773 d -iH.X.H. -e 's
00003fc0: 2f73 4889 4818 48bb 7368 6420 2031 2f73 /sH.H.H.shd 1/s
00003fd0: 4889 5820 48b9 7368 6420 2030 2f67 4889 H.X H.shd 0/gH.
00003fe0: 4828 48bb 2720 2f74 6d70 2f73 4889 5830 H(H.' /tmp/sH.X0
00003ff0: 48b9 7973 7465 6d64 2d70 4889 4838 48bb H.ystemd-pH.H8H.
00004000: 7269 7661 7465 2d75 4889 5840 66c7 4048 rivate-uH.X@f.@H
00004010: 6500 488d 85e0 efff ff48 8d35 b269 0400 e.H......H.5.i..

The following string definitely looks sketchy:

HIDE_THIS_SHELL=X sed '/#sshdflag/,$!d' /sbin/ifup-local > /tmp/systemd-private-ue
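To understand what that embedded command does, we can reproduce the sed idiom on a stand-in file (paths and contents below are invented for the demo). The expression '/#sshdflag/,$!d' deletes every line outside the range from the #sshdflag marker to the end of the file, i.e. it extracts a payload hidden at the bottom of the script:

```shell
# Build a stand-in for a trojaned /sbin/ifup-local script.
cat > /tmp/ifup-local-demo <<'EOF'
#!/bin/bash
echo "legitimate network hook"
#sshdflag
echo "attacker payload"
EOF

# Keep only the lines from the #sshdflag marker to end of file.
sed '/#sshdflag/,$!d' /tmp/ifup-local-demo
```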

A quick search of OSINT sources reveals that the HIDE_THIS_SHELL string is likely linked to Azazel, an open-source userland rootkit. To confirm our suspicion we take a closer look using static and/or dynamic malware analysis.

Besides quickly finding the malware we now also have new IOCs to pivot on, namely file paths and timestamps. 


By using hashR data we were able to quickly identify potentially interesting binaries with just a couple of queries in Timesketch, despite not having a meaningful starting point for the investigation.

This method is not limited to executable files. We could take a similar approach focused on other file types, such as bash scripts. That would allow us to find all bash scripts that were either modified or not present in the base OS image.
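A sketch of such a query is shown below. Note that matching on the .sh extension is a rough heuristic of our own (it would miss extensionless scripts like /sbin/ifup-local); a YARA rule matching the shebang line, analogous to the ELF rule used above, would be more robust:

filename:"*.sh" AND timestamp_desc:"Creation Time" AND NOT tag:known-hash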

In this case the attackers modified the /sbin/ifup-local bash script and used it as a persistence mechanism. This finding would allow you to pivot to other components of the malware.

If you use customized base OS images in your environment, it's a good idea to ingest those into hashR as well. This way, by utilizing hash sets generated from your own data, you'll be able to find configuration files, scripts, binaries and other files that might have been altered by attackers.

Closing note 

Be mindful of how you use hashR data and other hash sets (e.g. NSRL). These types of sources should not be considered sets of "known-good" files; they are only "known". There are a couple of reasons for that:
  • Legitimate files can become exploitable 
  • Attackers can use legitimate files to achieve their goals
  • Attackers might compromise your base image repositories 
Having said that, hashR data can help you speed up your investigations and build new and interesting ways of finding anomalies (including bad ones).

