Targeted timelines - Part I
Targeted Timeline Collection
by Kristinn Gudjonsson
First there are analysts that prefer the "kitchen-sink" approach. This school of thought prefers to run log2timeline against an image, extract each and every timestamp it supports and do the analysis on the full dataset after collection, what has often been called a super-timeline.
Secondly there are analysts that like to do a more targeted acquisition of data. This school of thought prefers to only collect data from files relevant to their needs. Those analysts tend to either use the more advanced features of log2timeline or a mixture of log2timeline and other tools, e.g. one-off scripts, when those are more useful than any available tool.
Both approaches have their pros and cons and it all really depends on the type of case you are dealing with, at least IMHO. And I'm not about to start a debate here about which one is better, since again it really, really depends...
There is of course a hybrid approach where the analyst starts by kicking off a full kitchen-sink run and then extracts a mini-timeline or timelines from that super-timeline for analysis. That way you still have the full and complete timeline just in case it is ever needed, yet the analyst only needs to look at the data contained in the mini-timeline.
I tend to allow the questions that arise in the case at hand to tell me which approach is better suited. Therefore my aim with plaso is to support both schools of thought, including the hybrid solution.
What We've Got Now
The first version of plaso that got released (1.0alpha) had no good support for anything other than the full kitchen-sink approach. In this particular case that is not a big deal, since the dataset this version of plaso produces is still relatively small, largely because of the limited number of parsers currently included. The filtering language this version of plaso includes did however give the analyst the power to extract mini-timelines, essentially supporting a version of the hybrid approach.
Times change though, and with new versions of plaso there will be more parsers that extract even more events, leading to a larger dataset to correlate and analyze. At the same time this will cause the tool to take more time to parse all that data out. Take into account a scenario where an analyst is looking at 10 to 100 possibly compromised hosts and you'll realize that the dataset keeps growing and growing, and time starts to become a problem.
What if you really have a well defined goal of your investigation? Or what if you know beforehand what data you most likely need to extract based on other sources? In these cases there really isn't any need to go about and collect everything just to filter it out later.
New World
Enter the world of "targeted timeline collection" in plaso. This is the first approach, be it a very basic one, to doing a kitchen-sink run on a limited, or targeted, set of input. For this plaso uses a text file that defines the paths to the files that should be parsed, ignoring the rest. This opens up a fast method of extracting just the dataset you are interested in using a single tool. It also has the benefit of producing the same output and storage, making it easier to correlate events from different data sources and filter them for more granular analysis, using the same methods you've become so fond of with the "traditional" kitchen-sink super-timelines.
The current format of the text file is very straightforward: one entry per line, where each line defines a single path (each part of the path separated by a forward slash) and can contain regular expressions, as many as you please. In addition you can also use attributes that are extracted by the preprocessing modules.
An example file could be something as simple as this, or even simpler for that matter:
/(Users|Documents And Settings)/.+/NTUSER.DAT
{sysregistry}/(SAM|SOFTWARE|SECURITY|SYSTEM)
/(Users|Documents And Settings)/.+/AppData/Local/Google/Chrome/Default
/(Users|Documents And Settings)/.+/Local Settings/Application Data/Google/Chrome/Default
/(Users|Documents And Settings)/.+/AppData/Roaming/Mozilla/Firefox/Profiles/.+/places.sqlite
/(Users|Documents And Settings)/.+/Local Settings/Application Data/Mozilla/Firefox/Profiles/.+/places.sqlite
/Users/.+/AppData/Roaming/Microsoft/Windows/Recent/.+\.LNK
/Documents And Settings/.+/Recent/.+\.LNK
{systemroot}/winevt/Logs/.+evtx
{sysregistry}/.+evt
Let's take one of these lines:
/(Users|Documents And Settings)/.+/NTUSER.DAT
This simple regular expression states that we are looking for a file inside either the folder "Users" or "Documents And Settings", and then, for every subfolder therein (.+), include the NTUSER.DAT file. Note that all these lookups are case insensitive.
In other words this line defines that plaso should parse the NTUSER.DAT file for every user on the system.
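To make the "case insensitive" part concrete, here is a tiny sketch (with made-up user names) showing that both the Vista-and-later and the XP-style profile paths match that line when it is compiled as a case-insensitive regular expression:

import re

# The same line from the filter file above, compiled case-insensitively.
pattern = re.compile(r'/(Users|Documents And Settings)/.+/NTUSER.DAT', re.IGNORECASE)

print(bool(pattern.match('/Users/alice/NTUSER.DAT')))                 # True
print(bool(pattern.match('/documents and settings/bob/ntuser.dat')))  # True
print(bool(pattern.match('/Windows/Temp/NTUSER.DAT')))                # False, not a user profile path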
Let's take another line:
{systemroot}/winevt/Logs/.+evtx
In other words this line defines that plaso should parse all EVTX files that are stored under the "winevt\Logs" path under the system root directory.
Let's take another example:
/(Users|Documents And Settings)/.+/AppData/Local/Google/Chrome/Default
/(Users|Documents And Settings)/.+/Local Settings/Application Data/Google/Chrome/Default
/(Users|Documents And Settings)/.+/AppData/Roaming/Mozilla/Firefox/Profiles/.+/places.sqlite
/(Users|Documents And Settings)/.+/Local Settings/Application Data/Mozilla/Firefox/Profiles/.+/places.sqlite
Here we are simply defining that plaso should include a few sources of browser history as well.
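Since the whole mechanism boils down to attribute expansion plus case-insensitive regular expression matching, here is a minimal Python sketch of how such a filter file could be interpreted. This is purely illustrative and is not plaso's actual implementation; the attribute values are made-up placeholders standing in for whatever the preprocessing modules determined.

import re

# Made-up placeholder values; in plaso these come from the preprocessing modules.
preprocess_attributes = {
    'sysregistry': '/WINDOWS/System32/config',
    'systemroot': '/WINDOWS/System32',
}

def load_collection_filter(filter_path, attributes):
    """Read a filter file and return compiled, case-insensitive path patterns."""
    patterns = []
    with open(filter_path, 'r') as filter_file:
        for line in filter_file:
            line = line.strip()
            if not line:
                continue
            # Fill in {attribute} placeholders with values from preprocessing.
            expanded = line.format(**attributes)
            patterns.append(re.compile(expanded, re.IGNORECASE))
    return patterns

def is_included(path, patterns):
    """Return True if a file path matches any line of the collection filter."""
    return any(pattern.match(path) for pattern in patterns)

With something like that in place, a path such as /Users/alice/NTUSER.DAT would be included while a random file under /Windows/Temp would not, which is exactly the behavior the walkthrough above describes.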
How To Use It?
You might be asking: how can I use this with the tool? Well, I'm glad you asked, since I did not discuss this earlier in the blog post... the answer is to use the "-f" switch, e.g.:
log2timeline.py -f /cases/l2t_filters/simple.txt -o 63 -w /cases/12345/l2t.dump myrandomimage.dd
How Much Difference Does It Really Make?
You may ask yourself that question, so let's look at some very simple data from two very simple test images. Both the targeted and non-targeted collections were run on the same machine with the same load, using the same parameters to log2timeline, the only difference being that one run used the targeted collection approach and the other did not.
Image 1 - small XP test image.
Without targeted approach: 6:36 to complete, 138,893 events collected.
With targeted approach: 0:46 to complete, 41,849 events collected.
Difference: targeted collection took about 11% of the time of a full collection and gave back about 30% of all the collected events.
Image 2 - small Win 7 test image with four VSS snapshots.
N.b. evtx and hachoir parsers were turned off in this run.
Without targeted approach: 20:21 to complete, 1,843,864 events collected.
With targeted approach: 5:11 to complete, 742,027 events collected.
Difference: targeted collection took about 25% of the time of a full collection and gave back about 40% of the collected events.
These numbers may not have any real specific meaning, other than stating the obvious: it takes considerably less time to parse only a select list of files than to recursively go through an entire image and include every file found.
Summary
The difference between using targeted collection and the kitchen-sink approach can be quite significant, a difference that will only grow with each new parser introduced into the tool. The amount of data collected can also be quite intimidating and overwhelming to an analyst. With a carefully constructed targeted collection you can therefore save considerable time both during processing and during the analysis phase, making it less likely that the events of interest drown in a sea of irrelevant events. Targeted collection also makes scaling up the investigation simpler, and actually possible, when collecting targeted data from hundreds or thousands of machines.
And this is only the first step of many to come in which you will see log2timeline move further into the realm of targeted collection and parsing of data. There are ideas on the table for defining these files in a more structured manner, which will make either the collection or the post-processing of data even more targeted.
This blog should contain more posts in the future discussing the use of these targeted collections alongside the use of filters to make the dataset even more targeted. Any ideas for improvements, novel filter files or any other feedback is much appreciated, whether that is via comments, emails or other types of communication to any of the developers.
Kristinn,
It's very interesting to see this discussion continuing, particularly as it pertains to the new tool being created.
I tend to see the value in both approaches. For example, the "kitchen sink" approach is great for folks with limited experience or background in artifacts, artifact analysis, and timeline analysis. It provides a great deal of data, yes, but the fact is that it also allows someone to ultimately see the additional artifacts that provide that much-needed context.
On the opposite end of the spectrum is, as you've pointed out, a very targeted approach, based on knowing which artifacts one needs. For example, in the initial stages of analyzing a SQLi attack, perhaps the user's hives aren't something that you'd necessarily want to have included in the timeline.
I personally like to have a mix of the two, along with external analysis to use and augment my pivot points in the timeline. One of the things I've found very beneficial is to include general hive data (LastWrite times on keys), particularly when an infection occurs...I've picked up a bunch of different artifacts that were either created by the particular variant, or not addressed in previous write-ups because they were deemed unimportant...but these have really helped answer some questions.
Thanks for a well-thought-out post.
I clearly see the value in both approaches and I want the tool to support both, whether that is for statistical analysis of the result set, for determining a baseline of "normal" behavior of a user and comparing it against the behavior at a particular point in time, or, as you pointed out, for people less experienced in artifacts, where I agree the kitchen-sink is sometimes the best way to go (and there are more than likely other scenarios as well).
And I agree that general hive data (as in the last written timestamps of all keys) has often proved its value, and in this first approach to targeted collection in log2timeline that is exactly what it is: "a kitchen-sink approach on a targeted set of files"... meaning you choose which files to include, and then all the parsers are run against them, extracting everything from those files. Thus if you include a registry hive, all keys are included, both the ones that have "specific" parsing and those that don't. Then you can augment that with the use of filters to really get the targeted dataset you were looking for.
In the "future", whenever that comes, there will be more targeted approaches, where both targeted files will be collected as well as only events of interest are extracted out. Something on the lines of "give me all user downloads from the user "joe", which would find all browser history of "joe" and just include events that contain files downloaded. That would be a more targeted approach that requires some structure on how to define those specifications, as in where to find the files, which parser to use and what criteria used to filter out the dataset. This can all be done now using the collection filter to parse out and then define filters yourself on the resulting dataset, it just requires more manual labor as of now ;)
Kristinn,
Thanks for sharing this post - I agree also that both approaches have their place depending on the challenge at hand. I do tend to leverage the hybrid method you referred to as well, so this -f option looks promising... I am really looking forward to getting some time to test plaso.
On another note, the "TimeWrinkle" concept introduced by Mandiant in Redline is something that would be very helpful when using the kitchen-sink type approach with l2t, where you can quickly pick out some artifacts, then filter down to a minute or so before and after each of those artifacts/events into one 'wrinkled' timeline. I guess this could be done manually now with psort (? haven't had the opportunity to test yet), but something similar would be a nice automation aid.
Kristinn,
I don't think that, for the most part, analysis is thought of in those terms.
For example, if you were to talk to a group of analysts, I have no doubt that what you described is exactly what most of them would want...the ability to do that. However, if you were to get down to the mechanics of those tasks, how it would be done and via what artifacts, and even down to the point of identifying the artifacts, that's where the discussion would break down.
My point is that I do agree with what you're saying, but the part about "whenever that comes"...that relies on factors other than just analysts wanting it.
Another thing I would add to the mix is the need for including intelligence and past experiences of analysts into the tools themselves, to help identify those critical pivot points and 'wrinkles in time'.
Drew:
Yes, the "TimeWrinkle" approach has been suggested before, and it can be done today with a bit of manual labor in the psort command; however it is not yet built in as a quick feature. That will be added to the tool, thanks for the reminder.
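(For anyone who wants to try the manual route in the meantime, the shape of such a run would be roughly the following; the exact switches and filter syntax depend on the plaso version you are running, so treat this as a sketch only, and the timestamps are made up:

psort.py /cases/12345/l2t.dump "date > '2012-06-10 14:31:00' and date < '2012-06-10 14:33:00'"

That is, the storage file from the earlier log2timeline run, plus a filter describing a narrow window around the event of interest.)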
Harlan:
Yes I agree with you, and I even had a lengthy answer written up but somehow it didn't get posted, so here it goes again hoping this time it will be ;)
That is to say, I agree with you that the tool needs to be able to leverage the investigator's previous experience; building intelligence into the tool is important IMHO. The current stable branch already does this, or at least somewhat: look for the "find all evil" script and the lightning talk at last year's DFIR summit (a talk called "Automating Your Timeline Analysis in 360 Seconds"). There I talked about the addition of YARA signature matching against timelines. Using this option you could easily build up your own YARA signatures, either for your own personal "intelligence stash" or to share the love, and use them as an initial step in your investigation: just run them over the timeline and you would get hits from all your previous experience/known pivot points/bad things.
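To give a flavor of what such a signature might look like, here is a made-up, purely illustrative rule written against the text of timeline entries (not one that shipped with the tool): it simply flags entries that mention a Run key pointing at an executable in a Temp directory.

rule exe_run_from_temp
{
    strings:
        $run_key = "\\CurrentVersion\\Run" nocase
        $temp_dir = "\\Temp\\" nocase
        $exe = ".exe" nocase
    condition:
        all of them
}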
This feature has not yet been re-introduced into the new tool, however it is on the roadmap. I'm still not fully sure what the best approach is here: is it to continue down that path and build in YARA support (easy enough), or is it rather to use a larger framework, like OpenIOC, where limited support could be built into the tool to match against some of the known criteria there?
Whatever approach is ultimately chosen, I believe it is important to support this. Having the ability to just press a button and automatically run the timeline through your previous experience to find known pivot points could drastically reduce the response time, no doubt, and also help junior team members quickly discover those pivot points, derived from the experience of more senior people on the team.
I will most likely add the YARA support back into the tool with some extensions, such as fully supporting the range of attributes, etc., and write a quick blog post detailing the use of these rules. For some reason I didn't really discuss this feature of the tool much outside of that brief 360 talk, so I'm not sure how many people are aware of it or actually use it.