What’s in a (file) path?



Background

For the experienced reader this might seem a very basic topic; however, file paths are things we easily take for granted. I rarely come across DFIR articles that discuss (file) paths, though they are key to many file systems and data formats. There are numerous edge cases that make it challenging to ensure reproducibility [1] of paths in tooling. This article covers several of these edge cases and possible ways of handling them.


What is a path?

According to Wikipedia [2]: “A path is a string of characters used to uniquely identify a location in a directory structure”. However, paths are not limited to file systems; for example, the Windows Registry uses key paths.


So, more generally, a path is typically a single string that identifies the location of an element (or object) in a hierarchical structure.


What does a path consist of?

Let’s take a file path as an example:


C:\Windows\System32


This path identifies the “System32” directory (or file system entry) within the “Windows” directory on the volume with drive letter “C”. This path follows the Windows path convention: it starts with the drive letter, uses “:” to separate the drive letter from the rest of the path, and uses “\” to separate individual path segments.
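As a minimal sketch, Python's standard pathlib module can split this example into its components without touching the local file system. PureWindowsPath applies the Windows path convention on any platform; the variable names are chosen for illustration only.

```python
from pathlib import PureWindowsPath

# Parse the example using the Windows path convention, regardless of the
# operating system this code runs on.
path = PureWindowsPath(r"C:\Windows\System32")

drive = path.drive      # "C:"
segments = path.parts   # ("C:\\", "Windows", "System32")
name = path.name        # "System32"
```

Note that pathlib keeps the drive letter and the root together as the first element of the segments.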


This path segment separator, which is backslash (\) in the case of the example, is typically a character defined by the operating system or application and not by the file system or data format itself. More on this nuance later.


Note that a path segment separator is something different from a path separator. A path separator commonly refers to the separator (character) used in the PATH environment variable [3].
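The difference between both separators can be illustrated with Python's standard ntpath and posixpath modules, which expose the Windows and POSIX conventions independent of the platform the code runs on. The search path value below is a made-up example.

```python
import ntpath
import posixpath

# Path segment separator: separates the segments within a single path.
windows_segment_separator = ntpath.sep      # "\\"
posix_segment_separator = posixpath.sep     # "/"

# Path separator: separates entire paths in a list such as PATH.
windows_path_separator = ntpath.pathsep     # ";"
posix_path_separator = posixpath.pathsep    # ":"

# Hypothetical value of a Windows PATH environment variable.
search_path = r"C:\Windows\System32;C:\Windows"
directories = search_path.split(windows_path_separator)
```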


An application such as Windows Explorer can take the path as input and locate the corresponding directory within the corresponding file system.


What makes handling paths tricky?

Multiple aspects make handling paths tricky, from both a tooling and a reporting perspective, namely:


  1. The path segment separator and other separators;

  2. The original encoding of the data format and path string;

  3. Environment variables and other types of placeholders used in the path string.


1. The path segment separator and other separators

Previously we indicated that the operating system (or application) controls what path segment separator is used. Different operating systems use different path segment separators; for example, Windows uses the backslash (\) while Linux and macOS use the forward slash (/). For a more comprehensive list see Wikipedia [2].


What if the file system (or data format) allows the path segment separator (or other separators like the drive letter separator or alternate data stream separator) to be used as part of a file (entry) name? This is the case with, for example, NTFS.


Let’s assume we have a directory named “base” that is represented by the path “\base”. This directory has 2 entries: “sub” (which is a directory) and “sub/marine” (which is a file). The path of the latter file would be “\base\sub/marine”.


What if the operating system uses the forward slash (/) as path segment separator? Suddenly we are no longer able to distinguish between the file “/base/sub/marine” and a file entry “marine” in the directory “/base/sub” based on the path alone, and have a potential issue with reproducibility.
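The ambiguity can be shown with a small sketch. The helper to_path_string is hypothetical and deliberately naive: it joins the segments without any escaping.

```python
# Two different locations, each expressed as a list of path segments.
file_in_base = ["base", "sub/marine"]      # file "sub/marine" in "base"
file_in_sub = ["base", "sub", "marine"]    # file "marine" in "base/sub"


def to_path_string(segments, separator="/"):
    """Naively joins segments without escaping the separator."""
    return separator + separator.join(segments)


# With "/" both locations collapse into the string "/base/sub/marine".
is_ambiguous = to_path_string(file_in_base) == to_path_string(file_in_sub)

# With "\" as separator the two locations remain distinguishable.
is_ambiguous_on_windows = (
    to_path_string(file_in_base, "\\") == to_path_string(file_in_sub, "\\"))
```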


So what if we want to preserve such a path for automation or to improve reproducibility? Several options exist, namely to:


  • specify the file system identifier, such as an inode number or (NTFS/ReFS) file reference in combination with the path string;

  • escape the path segment separator in the name of a segment, such as “/base/sub\/marine”, where the backslash (\) is the escape character;

  • represent the path as a list of segments, such as [“base”, “sub/marine”].
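The last two options can be sketched as follows. The function names are hypothetical and the escaping scheme is one possible choice, not that of a specific tool: the backslash is the escape character and is itself escaped, so that the conversion remains reversible.

```python
ESCAPE = "\\"
SEPARATOR = "/"


def escape_segment(segment):
    """Escapes the escape character first, then the separator."""
    segment = segment.replace(ESCAPE, ESCAPE + ESCAPE)
    return segment.replace(SEPARATOR, ESCAPE + SEPARATOR)


def join_segments(segments):
    """Converts a list of segments into an escaped path string."""
    return SEPARATOR + SEPARATOR.join(escape_segment(s) for s in segments)


def split_path(path):
    """Converts an escaped path string back into a list of segments."""
    segments, current = [], []
    characters = iter(path)
    for character in characters:
        if character == ESCAPE:
            # Take the next character literally, whatever it is.
            current.append(next(characters, ""))
        elif character == SEPARATOR:
            segments.append("".join(current))
            current = []
        else:
            current.append(character)
    segments.append("".join(current))
    # Drop the empty segment produced by the leading separator.
    return [segment for segment in segments if segment]


path_string = join_segments(["base", "sub/marine"])   # /base/sub\/marine
```

Splitting the resulting string with split_path returns the original list of segments, which the unescaped string “/base/sub/marine” cannot guarantee.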


2. The original encoding of the data format and path string

Certain file systems (such as ext2) treat file (entry) names as byte sequences [4], where the operating system defines the encoding. For other file systems, such as NTFS [5], the format defines the encoding, namely UCS-2 with support for surrogate pairs.


Path strings, on the other hand, are typically stored as Unicode strings, since that is the de facto way to encode a textual string on modern computer systems.


The fact that we are dealing with multiple different encodings can lead to various challenges:


  1. The path string is incorrect due to assumptions or missing information about the original encoding;

  2. The path string is ambiguous, as in it can be represented in multiple ways in the original encoding;

  3. The original encoding contains characters that cannot be expressed in the path string.


A. Assumptions about the original encoding

Let’s take the ext2 file system as an example. While modern versions of Linux mostly use UTF-8 to encode file (entry) names, older versions can use a single-byte-character (SBC) [6] or multi-byte-character (MBC) encoding [7]. The Honeynet Project provides an ext2 file system image that uses SBC encoding [8].


The “honeypot.hda5.dd” storage media image contains a file with inode number 32180 (within the directory “/lib/linuxconf/images”) that has the following directory entry:


00000000: b4 7d 00 00 18 00 0e 01  45 78 70 6f 72 74 6f 76   .}...... Exportov

00000010: 61 bb 2e 67 69 66 00 00                            a..gif..
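As a sketch, the fields of this directory entry can be unpacked with Python's struct module. This assumes the ext2 directory entry layout that includes a file type byte; the variable names are illustrative.

```python
import struct

# The 24 bytes of the directory entry shown above.
entry = bytes.fromhex(
    "b47d000018000e014578706f72746f76"
    "61bb2e6769660000")

# Little-endian: 32-bit inode number, 16-bit entry size, 8-bit name size
# and 8-bit file type (assumes the layout with a file type byte).
inode, entry_size, name_size, file_type = struct.unpack_from("<IHBB", entry, 0)

# The name is a raw byte sequence; no encoding is applied at this point.
name = entry[8:8 + name_size]
```

The inode number (32180) and the name size (14) match the description above. The name itself stays a byte sequence until a decision about its encoding is made.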


The byte 0xbb in this name is not valid UTF-8 at that position: it is a continuation byte that is not preceded by a lead byte. Decoding the name with a “strict” decoder results in an error, for example with Python 3.11:


bytes([0x45, 0x78, 0x70, 0x6f, 0x72, 0x74, 0x6f, 0x76, 0x61, 0xbb, 0x2e, 0x67, 0x69, 0x66, 0x00]).decode('utf-8')


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 9: invalid start byte


This approach is explicit about an unhandled format (edge) case, but not necessarily desirable when automatically processing such a file system. A commonly used alternative is to replace the invalid UTF-8 byte sequences with the Unicode replacement character (U+FFFD), also referred to as a “replace” decoder.


bytes([0x45, 0x78, 0x70, 0x6f, 0x72, 0x74, 0x6f, 0x76, 0x61, 0xbb, 0x2e, 0x67, 0x69, 0x66, 0x00]).decode('utf-8', 'replace')


'Exportova�.gif\x00'


Though this allows the tooling to continue parsing, we have now lost the link to the original name and potentially have an issue with reproducibility. The same applies to decoders that omit the invalid UTF-8 byte sequences from the path string.
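The loss of information can be made explicit with a short sketch. The second name is hypothetical and only serves to show that two distinct on-disk names become indistinguishable.

```python
original = b"Exportova\xbb.gif"
different = b"Exportova\xbc.gif"    # hypothetical second name

replaced_original = original.decode("utf-8", "replace")
replaced_different = different.decode("utf-8", "replace")

# Two distinct on-disk names now map to the same string.
names_collide = replaced_original == replaced_different

# Encoding the string again does not give back the original bytes.
round_trips = replaced_original.encode("utf-8") == original

# A decoder that omits invalid bytes loses them without leaving a trace.
ignored = original.decode("utf-8", "ignore")    # 'Exportova.gif'
```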


Alternative options to ensure reproducibility could be to:


  • ensure the original encoding is used;

  • specify the file system identifier in combination with the path string;

  • escape the invalid UTF-8 code points in the file (entry) name, for example 'Exportova'$'\273''.gif' as provided by the native Linux ext2 file system implementation.
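A sketch of how the first and last option could look in Python, using only standard error handlers. Note that ISO-8859-2 is an assumption made for this example; the file system itself does not record which encoding was used.

```python
name = b"Exportova\xbb.gif"

# Option: decode with the original encoding. ISO-8859-2 is assumed here
# for illustration only.
decoded = name.decode("iso8859_2")
decoded_round_trips = decoded.encode("iso8859_2") == name

# Option: escape invalid bytes so that they remain visible in the string.
escaped = name.decode("utf-8", "backslashreplace")    # 'Exportova\\xbb.gif'

# Option: carry invalid bytes through as lone surrogates (PEP 383), which
# can be converted back into the original bytes.
smuggled = name.decode("utf-8", "surrogateescape")    # 'Exportova\udcbb.gif'
restored = smuggled.encode("utf-8", "surrogateescape")
```

The “backslashreplace” form is readable in reports but only reversible if backslashes that are part of the name are escaped as well. The “surrogateescape” form round-trips, but introduces unpaired surrogates into the string, which brings challenges of its own.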


B. Ambiguous path string

Given our previous ext2 example, assume we know the original encoding and that this is code page cp932 [9]. We might run into another challenge, namely that this encoding maps certain distinct byte sequences to the same Unicode character [10], which poses a potential issue with reproducibility.
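Rather than relying on a fixed list, the affected byte sequences can be searched for. The sketch below uses Python's built-in cp932 codec, which may differ in details from other cp932 implementations; the function name is hypothetical.

```python
def find_lossy_sequences(encoding="cp932"):
    """Finds 2-byte sequences that do not survive decode and re-encode."""
    lossy = []
    for lead_byte in range(0x81, 0x100):
        for trail_byte in range(0x40, 0x100):
            sequence = bytes([lead_byte, trail_byte])
            try:
                character = sequence.decode(encoding)
                encoded = character.encode(encoding)
            except UnicodeError:
                # Not a valid sequence in this encoding; skip it.
                continue
            if encoded != sequence:
                lossy.append((sequence, character, encoded))
    return lossy


lossy_sequences = find_lossy_sequences()
```

Each result is a byte sequence whose decoded form re-encodes to a different byte sequence, so the decoded string alone cannot tell which of the two was stored on disk.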


Alternative options to ensure reproducibility could be to:


  • specify the file system identifier in combination with the path string;

  • preserve the original encoding.


C. Troublesome codepoints

Meet U+d800, a troublesome character (pun intended). This code point is part of the Unicode surrogate range [11]. Strict Unicode requires surrogates to come in pairs, where the combination of the pair describes a Unicode character that does not fit in 16 bits.


Several modern Unicode implementations require surrogates to come in pairs, for example Python 3.11 can represent U+d800 but does not allow it to be encoded as UTF-8:


'\ud800'.encode('utf-8')

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed


Note that Python here is an example and that many other programming languages enforce strict Unicode.


The challenge is that surrogates have been retrofitted on top of earlier (proposed) versions of Unicode. Windows, in particular, adopted UTF-16 (which is presumably UCS-2 with support for surrogate pairs) before the strict enforcement. As a result, NTFS (and ReFS) allow unpaired surrogates such as U+d800 in a file (entry) name.


The use of unpaired surrogates in file (entry) names is trivial on Windows and has been observed in real (non-contrived) data sets.
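A minimal sketch of what this means for tooling written in Python: a strict UTF-16 decoder rejects such a name, while the standard “surrogatepass” error handler preserves it. The file name bytes below are a made-up example.

```python
# UTF-16 little-endian bytes of a hypothetical file name: "a", an unpaired
# U+d800 and ".exe".
name_data = bytes.fromhex("6100" "00d8" "2e00" "6500" "7800" "6500")

try:
    name_data.decode("utf-16-le")
    strict_decoding_failed = False
except UnicodeDecodeError:
    strict_decoding_failed = True

# The "surrogatepass" error handler keeps the unpaired surrogate as is.
name = name_data.decode("utf-16-le", "surrogatepass")
round_trips = name.encode("utf-16-le", "surrogatepass") == name_data
```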


The implications of this are far reaching, since such file names propagate into many other Windows data formats, for example the following Windows Shortcut (LNK):


lnkinfo unicode_U+0000d800.exe-Shortcut.lnk 

lnkinfo 20230713


Windows Shortcut information:

Contains a link target identifier

Contains a relative path string

Contains a working directory string

Number of data blocks : 2


Link information:

Creation time : Jul 10, 2023 04:01:20.797107600 UTC

Modification time : Dec 06, 2019 21:29:00.000000000 UTC

Access time : Jul 10, 2023 04:03:17.058054000 UTC

File size : 11264 bytes

Icon index : 0

Show Window value : 0x00000001

Hot Key value : 0

File attribute flags : 0x00000020

Should be archived (FILE_ATTRIBUTE_ARCHIVE)

Drive type : Fixed (3)

Drive serial number : 0x2ca3d1ae

Volume label

Local path : C:\\test\\unicode_U+0000d800_\U0000d800.exe

Relative path : .\\unicode_U+0000d800_\U0000d800.exe

Working directory : C:\\test


Link target identifier:

Shell item list

Number of items : 4


Shell item: 1

Item type : Root folder

Class type indicator : 0x1f (Root folder)

Shell folder identifier : 20d04fe0-3aea-1069-a2d8-08002b30309d

Shell folder name : My Computer


Shell item: 2

Item type : Volume

Class type indicator : 0x2f (Volume)

Volume name : C:\


Shell item: 3

Item type : File entry

Class type indicator : 0x31 (File entry: Directory)

Name : test

Modification time : Jul 10, 2023 03:59:12

File attribute flags : 0x00000010

Is directory (FILE_ATTRIBUTE_DIRECTORY)

Extension block: 1

Signature : 0xbeef0004 (File entry extension)

Long name : test

Creation time : Jul 10, 2023 03:59:12

Access time : Jul 10, 2023 03:59:12

NTFS file reference : MFT entry: 92882, sequence: 56


Shell item: 4

Item type : File entry

Class type indicator : 0x32 (File entry: File)

Name : UNICOD~1.EXE

Modification time : Dec 06, 2019 21:29:00

File attribute flags : 0x00000020

Should be archived (FILE_ATTRIBUTE_ARCHIVE)

Extension block: 1

Signature : 0xbeef0004 (File entry extension)

Long name : unicode_U+0000d800_\U0000d800.exe

Creation time : Jul 10, 2023 04:01:22

Access time : Jul 10, 2023 04:01:28

NTFS file reference : MFT entry: 386618, sequence: 63


Data block: 1

Signature : 0xa0000003 (Distributed link tracker properties)

Machine identifier : test

Droid volume identifier : d28804b0-3144-48de-b6da-6bbf800a0016

Droid file identifier : 63206bc4-1ed5-11ee-a2f7-525400eeb605

Birth droid volume identifier : d28804b0-3144-48de-b6da-6bbf800a0016

Birth droid file identifier : 63206bc4-1ed5-11ee-a2f7-525400eeb605


Data block: 2

Signature : 0xa0000009 (Metadata property store)

{dabd30ed-0043-4789-a7f8-d013a4736622}/100 (PKEY_ItemFolderPathDisplayNarrow)

Value (0x001f) : test (C:)


{b725f130-47ef-101a-a5f1-02608c9eebac}/10 (PKEY_ItemNameDisplay)

Value (0x001f) : unicode_U+0000d800_\U0000d800.exe


{b725f130-47ef-101a-a5f1-02608c9eebac}/15 (PKEY_DateCreated)

Value (0x0040) : Jul 10, 2023 04:01:22.000000000 UTC


{b725f130-47ef-101a-a5f1-02608c9eebac}/12 (Unknown)

Value (0x0015) : 11264


{b725f130-47ef-101a-a5f1-02608c9eebac}/4 (PKEY_ItemTypeText)

Value (0x001f) : Application


{b725f130-47ef-101a-a5f1-02608c9eebac}/14 (PKEY_DateModified)

Value (0x0040) : Dec 06, 2019 21:29:00.000000000 UTC


{28636aa6-953d-11d2-b5d6-00c04fd918d0}/30 (PKEY_ParsingPath)

Value (0x001f) : C:\\test\\unicode_U+0000d800_\U0000d800.exe


{446d16b1-8dad-4870-a748-402ea43d788c}/104 (System.VolumeId)

Value (0x0048) : 9cdf8dfa-0000-0000-0000-501f00000000


Note that in the output above the Windows path segment separator (\) is escaped with a backslash to be able to represent U+d800 in the "\U########" notation. The short-hand variant "\u####" is not used to prevent ambiguity in case-insensitive path representation.


Here U+d800 propagated into the shortcut (LNK) [12], shell items [13] and property store [14] data formats. It has been observed in NTFS, ReFS, Windows Prefetch and Jump List formats as well.


Such unpaired surrogates can be problematic when converting to formats that require strict Unicode, such as XML, or formats that do not define an encoding, such as the body file format [15]. Other special (Unicode) code points, such as U+0 [16], code points that fall outside the valid Unicode ranges, or characters that are restricted by the operating system might be similarly problematic.


Alternative options to ensure reproducibility could be to:


  • escape troublesome codepoints in the path string;

  • preserve the original encoding.
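The escaping option could be sketched as follows, using the "\U########" notation seen in the lnkinfo output above. This is an illustration under stated assumptions and not the implementation used by lnkinfo; the function names and the choice to also escape control characters are hypothetical.

```python
import re


def escape_path(path):
    """Escapes backslashes, surrogates and control characters."""
    escaped = []
    for character in path:
        code_point = ord(character)
        if character == "\\":
            escaped.append("\\\\")
        elif code_point < 0x20 or 0xd800 <= code_point <= 0xdfff:
            escaped.append(f"\\U{code_point:08x}")
        else:
            escaped.append(character)
    return "".join(escaped)


def unescape_path(escaped):
    """Reverses escape_path; assumes its input was produced by it."""
    def replace(match):
        token = match.group(1)
        return "\\" if token == "\\" else chr(int(token[1:], 16))

    return re.sub(r"\\(\\|U[0-9a-fA-F]{8})", replace, escaped)


path = "C:\\test\\unicode_U+0000d800_\ud800.exe"
escaped = escape_path(path)
```

Because the backslash itself is escaped, the escaped string contains only characters that strict Unicode encoders accept and can be converted back into the original path string with unescape_path.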


3. Environment variables and other types of placeholders

So far we have mostly discussed paths in their original format, but paths are often stored in other formats as references or as part of a configuration. In such cases we often only have a path string and rarely corresponding information like a file system identifier. 


Sometimes placeholders are used, such as environment variables or known folder identifiers [17], for example:


C:\Users\%USERNAME%\Documents


Where “%USERNAME%” is a placeholder for the username of the active user.


To reconstruct what such a path referred to at runtime one needs to tap into other system sources, for example environment variables stored in the Windows Registry of the system [18]. Sometimes the path deliberately refers to a different location in another context.
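A sketch of such a reconstruction is shown below. The function name and the variable values are hypothetical; in practice the values would have to come from the analyzed system, not from the system the tooling runs on.

```python
import re


def expand_placeholders(path, variables):
    """Replaces %NAME% placeholders using the provided variables."""
    # Windows environment variable names are case-insensitive.
    lookup = {name.upper(): value for name, value in variables.items()}

    def replace(match):
        # Leave unknown placeholders untouched instead of guessing.
        return lookup.get(match.group(1).upper(), match.group(0))

    return re.sub(r"%([^%\\]+)%", replace, path)


# Hypothetical values for illustration.
variables = {"UserName": "alice", "SystemRoot": "C:\\Windows"}
expanded = expand_placeholders("C:\\Users\\%USERNAME%\\Documents", variables)
```

Leaving unknown placeholders in place keeps it visible in the output that part of the path could not be resolved.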


Note that some tooling uses predefined reference lists to translate such placeholders. However, many such lists are based solely on empirical evidence. Therefore, when using such lists, tread with caution and ensure they are comprehensive and appropriate for your use case.


Impact of naive path handling

Naive handling of paths can lead to various issues. Some issues that have been observed:


  • An indicator of compromise (IOC) contained U+d800 in a path but the tooling to scan for paths translated this into U+fffd (replacement character) and therefore no results were found.

  • A live collection tool failed to collect files due to strict handling of an unpaired surrogate in a file path.

  • U+d800 was written to a body file and Python 3.11 refuses to read it as (strict) UTF-8.

  • A file system analysis tool was unable to correctly represent ext2 file (entry) names, since it required information about the original encoding that it did not have, and output arbitrary characters instead.

  • A “de facto” commercial forensics tool converts U+0001-U+0008 to U+00ba, strips U+0009 and U+000a, and several more alterations without any notice and is perfectly happy to write the results into a logical image file in the altered form without any record of the modification.
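The first issue can be reproduced with a few lines of Python. The file name is a made-up example; the point is only that a “replace” decoder silently changes the string that is compared against the indicator.

```python
# Name as it appears in the indicator of compromise (hypothetical).
ioc_name = "unicode_U+0000d800_\ud800.exe"

# Name as stored on disk, in UTF-16 little-endian.
stored_name = ioc_name.encode("utf-16-le", "surrogatepass")

# Tooling that replaces the unpaired surrogate no longer finds a match.
replaced = stored_name.decode("utf-16-le", "replace")
match_after_replace = ioc_name == replaced

# Tooling that preserves the unpaired surrogate does.
preserved = stored_name.decode("utf-16-le", "surrogatepass")
match_after_preserve = ioc_name == preserved
```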


Conclusion

Ensuring (digital forensics) reproducibility even for basic things such as paths is apparently complex. It also does not seem a topic widely discussed by digital forensics analysts or tool authors. Most of the discussions appear to revolve around the latest shiny data format or tool.


Though there are various software development sources available that provide more insight into the topic [19, 20], it seems the topic is rarely covered in DFIR-related publications. If we as a field want better automation, we have to embrace that data formats have edge cases. Evaluation of edge cases should be a central part of our research, validation, methodologies and tooling.


If you feel you have additional interesting edge cases, or have a good solution to a complicated reproducibility problem, do not hesitate to reach out on the Open Source DFIR Slack community.

