What’s in a (file) path?
Background
For the experienced reader this might seem like a very basic topic, but file paths are easy to take for granted. I rarely come across DFIR articles that discuss (file) paths, even though they are key to many file systems and data formats. There are numerous edge cases that make it challenging to ensure reproducibility [1] of paths in tooling. This article covers several of these edge cases and possible ways of handling them.
What is a path?
According to Wikipedia [2]: “A path is a string of characters used to uniquely identify a location in a directory structure”. However, paths are not limited to file systems; for example, the Windows Registry uses key paths.
So, more generally, a path is typically a single string that identifies the location of an element (or object) in a hierarchical structure.
What does a path consist of?
Let’s take a file path as an example:
This path identifies the “System32” directory (or file system entry) within the “Windows” directory on the volume with drive letter “C”. The path follows the Windows path convention: it starts with the drive letter, uses “:” to separate the drive letter from the rest of the path, and uses “\” to separate the individual path segments.
This path segment separator, which is backslash (\) in the case of the example, is typically a character defined by the operating system or application and not by the file system or data format itself. More on this nuance later.
Note that a path segment separator is something different from a path separator. A path separator commonly refers to the separator (character) used in the PATH environment variable [3].
An application such as Windows Explorer can take the path as input and locate the corresponding directory within the corresponding file system.
What makes handling paths tricky?
Multiple aspects make handling paths tricky, from both a tooling and a reporting perspective, namely:
The path segment separator and other separators;
The original encoding of the data format and path string;
Environment variables and other types of placeholders used in the path string.
1. The path segment separator and other separators
Previously we indicated that the operating system (or application) controls which path segment separator is used. Different operating systems use different path segment separators; for example, Windows uses the backslash (\) while Linux and macOS use the forward slash (/). For a more comprehensive list see Wikipedia [2].
What if the file system (or data format) allows the path segment separator (or other separators, such as the drive letter separator or the alternate data stream separator) to be used as part of a file (entry) name? This is the case with, for example, NTFS.
Let’s assume we have a directory named “base” that is represented by the path “\base”. This directory has 2 entries: “sub” (which is a directory) and “sub/marine” (which is a file). The path of the latter file would be “\base\sub/marine”.
What if the operating system uses the forward slash (/) as path segment separator? Suddenly, based on the path alone, we are no longer able to distinguish between the file “sub/marine” in the directory “/base” and a file entry “marine” in the directory “/base/sub”, since both are written as “/base/sub/marine”. We now have a potential issue with reproducibility.
So what if we want to preserve such a path for automation or to improve reproducibility? There are several options:
specify the file system identifier, such as an inode number or (NTFS/ReFS) file reference in combination with the path string;
escape the path segment separator in the name of a segment, such as “/base/sub\/marine”, where the backslash (\) is the escape character;
represent the path as a list of segments, such as [“base”, “sub/marine”].
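The last two options can be sketched in Python as follows. This is a minimal illustration, not how any particular tool does it; the function names join_segments and split_path are hypothetical, and the sketch assumes absolute paths with a single-character separator and escape character.

```python
def join_segments(segments, separator="/", escape="\\"):
    """Builds a path string, escaping the separator inside segment names."""
    escaped_segments = []
    for segment in segments:
        # Escape the escape character first, so the result stays reversible.
        segment = segment.replace(escape, escape + escape)
        segment = segment.replace(separator, escape + separator)
        escaped_segments.append(segment)
    return separator + separator.join(escaped_segments)


def split_path(path, separator="/", escape="\\"):
    """Splits an escaped absolute path string back into its segments."""
    if not path.startswith(separator):
        raise ValueError("expected an absolute path")
    segments, current = [], []
    characters = iter(path[len(separator):])
    for character in characters:
        if character == escape:
            # Take the next character literally.
            current.append(next(characters, escape))
        elif character == separator:
            segments.append("".join(current))
            current = []
        else:
            current.append(character)
    segments.append("".join(current))
    return segments


# The file "sub/marine" in "base" versus the file "marine" in "base/sub".
print(join_segments(["base", "sub/marine"]))      # /base/sub\/marine
print(join_segments(["base", "sub", "marine"]))   # /base/sub/marine
```

With the escaping in place the two entries no longer collapse into the same string, and split_path recovers the original list of segments.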
2. The original encoding of the data format and path string
Certain file systems (such as ext2) treat file (entry) names as byte sequences [4], where the operating system defines the encoding. For other file systems, such as NTFS [5], the format defines the encoding, namely UCS-2 with support for surrogate pairs.
Path strings, on the other hand, are typically stored as Unicode strings, since that is the de facto way to encode a textual string on modern computer systems.
The fact that we are dealing with multiple different encodings can lead to various challenges:
The path string is incorrect due to assumptions or missing information about the original encoding;
The path string is ambiguous, as in it can be represented in multiple ways in the original encoding;
The original encoding contains characters that cannot be expressed in the path string.
A. Assumptions about the original encoding
Let’s take the ext2 file system as an example. While modern versions of Linux mostly use UTF-8 to encode file (entry) names, older versions can use a single-byte-character (SBC) [6] or multi-byte-character (MBC) encoding [7]. The Honeynet Project provides an ext2 file system image that uses SBC encoding [8].
The “honeypot.hda5.dd” storage media image contains a file with inode number 32180 (within the directory “/lib/linuxconf/images”) that contains the following directory entry:
Here the byte 0xbb is not valid UTF-8: it is a continuation byte that is not preceded by a lead byte. An attempt to decode the name with a “strict” decoder results in a decoding error, for example with Python 3.11:
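A minimal sketch of the strict decoding failure. The directory entry itself is not reproduced here; the raw name bytes are an assumption, inferred from the escaped name 'Exportova'$'\273''.gif' given later in this article (octal 273 is 0xbb).

```python
# Assumed raw file (entry) name bytes, inferred from the escaped name.
raw_name = b"Exportova\xbb.gif"

try:
    raw_name.decode("utf-8")  # errors="strict" is the default
    error = None
except UnicodeDecodeError as exception:
    error = exception

print(error)
# 'utf-8' codec can't decode byte 0xbb in position 9: invalid start byte
```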
This approach is explicit about an unhandled format (edge) case, but is not necessarily desirable when automatically processing such a file system. A commonly used alternative is to replace the invalid UTF-8 byte sequences with the Unicode replacement character (U+FFFD), also referred to as a “replace” decoder.
Though this allows the tooling to continue parsing, we have now lost the link to the original name and potentially have an issue with reproducibility. The same applies to decoders that omit the invalid UTF-8 byte sequences from the path string.
Alternative options to ensure reproducibility could be to:
ensure the original encoding is used;
specify the file system identifier in combination with the path string;
escape the invalid UTF-8 byte sequences in the file (entry) name, for example 'Exportova'$'\273''.gif' as provided by the native Linux ext2 file system implementation.
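The difference between a lossy and a reversible decoder can be sketched with Python's standard error handlers. The raw name bytes are again an assumption inferred from the escaped name above; which handler is appropriate depends on the tool and output format.

```python
# Assumed raw file (entry) name bytes, inferred from the escaped name.
raw_name = b"Exportova\xbb.gif"

# "replace" decoder: parsing continues, but the original byte is lost.
replaced = raw_name.decode("utf-8", errors="replace")
print(ascii(replaced))  # 'Exportova\ufffd.gif'

# "surrogateescape" maps each undecodable byte to U+DC80..U+DCFF, which
# can be mapped back to the exact original bytes.
escaped = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(escaped))  # 'Exportova\udcbb.gif'
restored = escaped.encode("utf-8", errors="surrogateescape")

# "backslashreplace" produces a printable form for reporting.
printable = raw_name.decode("utf-8", errors="backslashreplace")
print(printable)  # Exportova\xbb.gif
```

Note that the surrogateescape result itself contains an unpaired surrogate, so it cannot be written out as strict UTF-8 without encoding it with the same error handler.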
B. Ambiguous path string
Given our previous ext2 example, assume we know the original encoding and that this is code page cp932 [9]. We might run into another challenge, namely that this encoding allows multiple distinct code points to be converted into the same Unicode character [10], which poses a potential issue with reproducibility.
Alternative options to ensure reproducibility could be to:
specify the file system identifier in combination with the path string;
preserve the original encoding.
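One way to check whether a given decoder has this property is to brute-force the double-byte range and group byte sequences by the character they decode to. The sketch below does this for Python's cp932 codec; the function names are hypothetical and the lead and trail byte ranges are assumptions based on the Shift JIS layout. It lists any collisions the codec has rather than assuming specific ones.

```python
from collections import defaultdict


def find_collisions(encoding="cp932"):
    """Maps each character to the double-byte sequences that decode to it."""
    by_character = defaultdict(list)
    # Assumed Shift JIS style lead byte ranges.
    lead_bytes = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))
    for lead in lead_bytes:
        for trail in range(0x40, 0xFD):
            sequence = bytes([lead, trail])
            try:
                character = sequence.decode(encoding)
            except UnicodeDecodeError:
                continue
            if len(character) == 1:
                by_character[character].append(sequence)
    # Keep only characters reachable from more than one byte sequence.
    return {c: s for c, s in by_character.items() if len(s) > 1}


def round_trips(character, sequence, encoding="cp932"):
    """Returns True if encoding the character yields the original bytes."""
    try:
        return character.encode(encoding) == sequence
    except UnicodeEncodeError:
        return False


collisions = find_collisions()
for character, sequences in sorted(collisions.items())[:5]:
    lossy = [s.hex() for s in sequences if not round_trips(character, s)]
    print(f"U+{ord(character):04X}", [s.hex() for s in sequences], "lossy:", lossy)
```

Every byte sequence reported as lossy is one where the original name cannot be reconstructed from the Unicode path string alone.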
C. Troublesome codepoints
Meet U+d800, a troublesome character (pun intended). This character is part of the Unicode surrogate range [11]. Strict Unicode requires surrogates to come in pairs, where the combination of the pair describes a Unicode character that does not fit in 16 bits.
Several modern Unicode implementations require surrogates to come in pairs, for example Python 3.11 can represent U+d800 but does not allow it to be encoded as UTF-8:
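A minimal sketch of this behavior, using a hypothetical file name that contains U+d800. The “surrogatepass” error handler is shown as one way to carry the value through; it is not something the strict codecs do by default.

```python
# Hypothetical file name containing an unpaired (high) surrogate.
name = "te\ud800st"

try:
    name.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as exception:
    strict_error = exception

print(strict_error.reason)  # surrogates not allowed

# "surrogatepass" encodes the surrogate as if it were a regular code point.
utf8_bytes = name.encode("utf-8", errors="surrogatepass")
print(utf8_bytes)  # b'te\xed\xa0\x80st'

# 16-bit code units, little-endian, as used for names on NTFS.
utf16_bytes = name.encode("utf-16-le", errors="surrogatepass")
restored = utf16_bytes.decode("utf-16-le", errors="surrogatepass")
```

The resulting bytes are not valid strict UTF-8, so any consumer has to use the same error handler to read them back.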
Note that Python is used here as an example and that many other programming languages enforce strict Unicode as well.
The challenge is that surrogates were retrofitted on top of earlier (proposed) versions of Unicode. Windows in particular adopted UTF-16 (essentially UCS-2 with support for surrogate pairs) before strict enforcement became common. As a result, NTFS (and ReFS) allow unpaired surrogates such as U+d800 to be used in a file (entry) name.
Creating file (entry) names with unpaired surrogates is trivial on Windows, and such names have been observed in real (non-contrived) data sets.
The implications of this are far reaching, since such file names propagate into many other Windows data formats, for example the following Windows Shortcut (LNK):
Note that in the output above the Windows path segment separator (\) is escaped with a backslash to be able to represent U+d800 in the "\U########" notation. The short-hand variant "\u####" is not used to prevent ambiguity in case-insensitive path representation.
Here U+d800 propagated into the shortcut (LNK) [12], shell items [13] and property store [14] data formats. It has been observed in NTFS, ReFS, Windows Prefetch and Jump List formats as well.
Such unpaired surrogates can be problematic when converting to formats that require strict Unicode, such as XML, or formats that do not define an encoding, such as the body file format [15]. Other special (Unicode) code points, such as U+0 [16], code points that fall outside the valid Unicode ranges, or characters that are restricted by the operating system might be similarly problematic.
Alternative options to ensure reproducibility could be to:
escape troublesome codepoints in the path string;
preserve the original encoding.
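The escaping option can be sketched along the lines of the "\U########" notation described earlier: the backslash is escaped with a backslash, and troublesome code points are written as 8 hexadecimal digits. The function names and the exact set of escaped code points (control characters and surrogates) are assumptions made for this example.

```python
def escape_path(path):
    """Escapes backslashes and troublesome code points in a path string."""
    parts = []
    for character in path:
        code_point = ord(character)
        if character == "\\":
            parts.append("\\\\")
        elif (code_point < 0x20 or code_point == 0x7F
              or 0xD800 <= code_point <= 0xDFFF):
            parts.append(f"\\U{code_point:08x}")
        else:
            parts.append(character)
    return "".join(parts)


def unescape_path(escaped):
    """Reverses escape_path."""
    parts = []
    index = 0
    while index < len(escaped):
        character = escaped[index]
        next_character = escaped[index + 1:index + 2]
        if character != "\\":
            parts.append(character)
            index += 1
        elif next_character == "\\":
            parts.append("\\")
            index += 2
        elif next_character == "U":
            parts.append(chr(int(escaped[index + 2:index + 10], 16)))
            index += 10
        else:
            raise ValueError(f"unsupported escape at offset {index}")
    return "".join(parts)


# Hypothetical Windows path with an unpaired surrogate in a file name.
path = "C:\\Users\\te\ud800st"
escaped = escape_path(path)
print(escaped)  # C:\\Users\\te\U0000d800st
```

The escaped form is plain text that can be encoded as strict UTF-8, and because the segment separator is itself escaped, a directory named “Users” is not mistaken for the start of a "\U" escape sequence.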
3. Environment variables and other types of placeholders
So far we have mostly discussed paths in their original format, but paths are often stored in other formats as references or as part of a configuration. In such cases we often only have a path string and rarely corresponding information like a file system identifier.
Sometimes placeholders are used, such as environment variables or known folder identifiers [17], for example:
Where “%USERNAME%” is a placeholder for the username of the active user.
To reconstruct what such a path referred to at runtime one needs to tap into other system sources, for example environment variables stored in the Windows Registry of the system [18]. Sometimes the path deliberately refers to a different location in another context.
Note that some tooling uses predefined reference lists to translate such placeholders. However, many such lists are based solely on empirical evidence. Therefore, when using such lists, tread with caution and ensure they are comprehensive and appropriate for your use case.
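A minimal sketch of expanding such placeholders with values taken from the evidence, rather than from the analysis system or a predefined list. The function name, the example path and the variable values are hypothetical; the sketch assumes %NAME% style placeholders with case-insensitive names and leaves unknown placeholders untouched, so that gaps remain visible.

```python
import re


def expand_placeholders(path, variables):
    """Expands %NAME% placeholders using variables from the evidence."""
    lookup = {name.upper(): value for name, value in variables.items()}

    def replace(match):
        # Keep the placeholder as-is when no value is known.
        return lookup.get(match.group(1).upper(), match.group(0))

    return re.sub(r"%([^%\\]+)%", replace, path)


# Hypothetical values, for example recovered from the Windows Registry.
variables = {"SystemRoot": "C:\\Windows", "USERNAME": "alice"}

print(expand_placeholders("C:\\Users\\%UserName%\\Desktop", variables))
# C:\Users\alice\Desktop
print(expand_placeholders("%APPDATA%\\example", variables))
# %APPDATA%\example
```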
Impact of naive path handling
Naive handling of paths can lead to various issues. Some issues that have been observed:
An indicator of compromise (IOC) contained U+d800 in a path but the tooling to scan for paths translated this into U+fffd (replacement character) and therefore no results were found.
A live collection tool failed to collect files due to strict handling of an unpaired surrogate in a file path.
U+d800 was written to a body file and Python 3.11 refuses to read it as (strict) UTF-8.
A file system analysis tool was unable to correctly represent ext2 file (entry) names, since it requires information about the original encoding, and output random characters instead.
A “de facto” commercial forensics tool converts U+0001-U+0008 to U+00ba, strips U+0009 and U+000a, and several more alterations without any notice and is perfectly happy to write the results into a logical image file in the altered form without any record of the modification.
Conclusion
Ensuring (digital forensics) reproducibility, even for basic things such as paths, is apparently complex. It also does not seem to be a topic widely discussed by digital forensics analysts or tool authors. Most of the discussions appear to revolve around the latest shiny data format or tool.
Though there are various software development sources available that provide more insight into the topic [19, 20], it seems to be rarely covered in DFIR related publications. If we as a field want better automation, we have to embrace that data formats have edge cases. Evaluation of edge cases should be a central part of our research, validation, methodologies and tooling.
If you feel you have additional interesting edge cases, or have a good solution to a complicated reproducibility problem, do not hesitate to reach out on the Open Source DFIR Slack community.