Let’s get our hands dirty. The first step in building a fraud investigation and security analytics platform with TeaLeaf is making TeaLeaf’s data available to Splunk. Splunk will then take care of the deep security queries and specialized investigative dashboarding.
Disclaimer: all data you see on this site was autogenerated for demonstration purposes. It illustrates concepts and ideas but does not show any real names, IP addresses, or any other information that matches real-world events.
TeaLeaf comes with the cxConnect for Data Analysis component.
“Tealeaf cxConnect for Data Analysis is an application that enables the transfer of data from your Tealeaf CX datastore to external reporting environments. Tealeaf cxConnect for Data Analysis can deliver data in real-time to external systems such as event processing systems or enable that data to be retrieved in a batch mode. Extraction of customer interaction data into log files, SAS, Microsoft SQL Server or Oracle databases are supported. Data extraction jobs can be run on a scheduled or ad-hoc basis. Flexible filters and controls can be used to include or exclude any sessions or parts of sessions, according to your business reporting needs.”
Source: IBM TeaLeaf.
In my experience the “real-time” claim is a long shot (at least I didn’t find a way to accomplish the above in real time), but I managed to set up quite successful regular, hourly, detailed TeaLeaf log exports.
If you use cxConnect for log exports right off the bat with all default options selected, you’ll end up with a humongous set of files containing mountains of data you don’t really need, wasting your disk space. It took me quite a while to configure cxConnect to export the data I need and exclude the data I don’t.
Within cxConnect’s “Configured Tasks” menu you can create scheduled tasks. For our purpose I created two tasks: one hourly and one daily.
The hourly task gives close-to-real-time capability to search the data with Splunk. The reason I wanted a daily task as well (which essentially duplicates the amount of data) is that the TeaLeaf server might miss hourly extractions due to a restart or maintenance. In case of a TeaLeaf reporting server restart, data will still be cached and saved within the TeaLeaf Health Based Routing servers or Packet Capture Appliance servers. The daily extraction runs nightly and contains the full set of data, while the hourly extraction allows close-to-real-time access to log data.
I set up TeaLeaf to export daily logs and hourly logs into separate locations, and configured Splunk to store them in separate indexes (database locations). This way I can clean up the hourly data (mostly a duplicate of the daily data) on a regular basis to free disk space, and keep the daily logs for as long as I want.
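For that regular hourly cleanup, a minimal sketch like the following could work. The path and the retention window here are my assumptions, not anything cxConnect or Splunk dictates:

```python
import os
import time

# Hypothetical location of the hourly exports; adjust to your setup.
HOURLY_LOG_ROOT = r"\\server\name\dir\hourly"
RETENTION_DAYS = 7  # assumed retention window for the duplicate hourly data

cutoff = time.time() - RETENTION_DAYS * 24 * 3600

for dirpath, _dirnames, filenames in os.walk(HOURLY_LOG_ROOT):
    for filename in filenames:
        path = os.path.join(dirpath, filename)
        # cxConnect writes logs under year/month/day subdirectories;
        # file modification time is a good-enough proxy for log age here.
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
```

Scheduled as a nightly job, this keeps the hourly index directory from growing indefinitely while the daily logs remain untouched.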
Once you’ve defined your log extraction schedule, check the “CX Servers” tab; most likely you need to select all servers (TeaLeaf canisters) listed there.
Next tab: “Data Set”. Nothing is checked there. We do want to include single-hit sessions, as these quite often come from malicious sources.
I didn’t risk getting clever with the “Enable Custom Search String” and “Custom Search String appears on the same page” options, so I left them alone.
On the “Data Filters” tab, “URL” should be set to include all, unless you know what you’re doing.
The same goes for “HTTP status code”: set it to “Include All”.
This is very valuable information for any purpose (marketing, business, or security), and I have no clue why IBM would add an option to exclude it.
“Data Filters” -> “URL Fields” is actually very useful. It gives you a chance to reduce your extracted logs by 50% or more by eliminating lots of possibly useless data.
In my client’s case there was plenty of front-end eventing and front-end state data generated by JavaScript that wouldn’t carry much value for security investigations.
You can specify a comma-delimited list of fields to exclude from the logs. These are specific to every client and every web application. One way to discover them is to leave this field blank, run the extraction task for a short period of time, review which data is not needed, and adjust the settings accordingly, as sketched below.
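To make that review less tedious, here is a minimal sketch of a field-frequency tally. The log path, the pipe delimiter (configured later in this post), and the column position of cs-uri-query are all assumptions that depend entirely on your own export settings:

```python
import glob
import os
from collections import Counter
from urllib.parse import parse_qsl

# Hypothetical export location and column position; both depend on your
# "Destination" settings and your field selection.
LOG_ROOT = r"\\server\name\dir\hourly"
CS_URI_QUERY_COL = 10  # assumed 0-based column index of cs-uri-query

field_counts = Counter()

for log_file in glob.glob(os.path.join(LOG_ROOT, "**", "*.log"), recursive=True):
    with open(log_file) as f:
        for line in f:
            cols = line.rstrip("\n").split("|")
            if len(cols) <= CS_URI_QUERY_COL:
                continue
            # Count how often each query-string field name appears.
            for name, _value in parse_qsl(cols[CS_URI_QUERY_COL], keep_blank_values=True):
                field_counts[name] += 1

# Frequent fields with no investigative value are good exclusion candidates.
for name, count in field_counts.most_common(25):
    print(f"{count:>10}  {name}")
```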
“Data Filters” -> “cookies” deserves a bit of special attention. In the first incarnation of my system I excluded them all.
I am currently considering updating this specific setting to enable extra tracking capabilities.
Cookies contain valuable browser-specific persistent data, and selective access to this information can add a critical piece to the toolkit of a smart security investigator.
I won’t elaborate much here, but I encourage you to study your raw TeaLeaf data to see if there is anything in there that would make sense to include. Start with “Include All” and see whether it gives you some benefit.
For “Data Filters” -> “appdata” I included all.
You may want to look carefully at “Data Filters” -> “Event ID”. I chose “Exclude Specific” to avoid unnecessary clutter. Enterprises might have hundreds of events created, and we want to pass event values to the exported log files; however, many events are redundant or were created for specific test cases, and these can be safely excluded.
The last important step in configuring TeaLeaf data exporting is the “Destination” tab. You specify an output directory, which can also be a UNC path like \\server\name\dir
TeaLeaf will create subdirectories for each year/month/day where the actual logs will be written.
“Max Log File Size” is in megabytes. I haven’t played with the “Concurrent logs” value; possibly setting it above 1 would speed up the daily exports. I chose the pipe (|) as the delimiter, since it works best for me; using spaces or tabs as delimiters makes life hard.
“Populate cs-uri-query with TeaLeaf Events” is the only reasonable way to inject TeaLeaf eventing data (such as usernames, POST variables, session variables, response fields, etc.) into the logs.
If you check “Tealeaf events as individual entries”, you’ll end up with absolutely huge and messy-looking log files.
“Populate cs-uri-query with TeaLeaf Events” will make your query string field look like this:
“some=stuff” – the real query string value before “Populate cs-uri-query with TeaLeaf Events” was selected.
“123=hello&234=world&some=stuff” – the query string value after “Populate cs-uri-query with TeaLeaf Events” was selected.
Here ‘123’ and ‘234’ are Event IDs, and “hello” and “world” are the event values defined for those events.
For example, if you have an event that tracks the username of a successfully logged-in user, and this event’s ID is ‘567’, then you’ll see this:
“123=hello&234=world&567=johnsmith&some=stuff”
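To pull those event values back out downstream, a minimal sketch like this could work, assuming (as in the examples above) that event IDs are purely numeric keys while real query-string fields are not; the EVENT_NAMES mapping is hypothetical:

```python
from urllib.parse import parse_qsl

# Hypothetical mapping of TeaLeaf event IDs to friendly names; yours will differ.
EVENT_NAMES = {"123": "greeting", "234": "target", "567": "login_username"}

def split_tealeaf_query(cs_uri_query):
    """Separate injected TeaLeaf event values from the real query string.

    Assumes event IDs are purely numeric keys, while genuine query-string
    field names are not.
    """
    events, real = {}, {}
    for key, value in parse_qsl(cs_uri_query, keep_blank_values=True):
        if key.isdigit():
            events[EVENT_NAMES.get(key, key)] = value
        else:
            real[key] = value
    return events, real

events, real = split_tealeaf_query("123=hello&234=world&567=johnsmith&some=stuff")
print(events)  # {'greeting': 'hello', 'target': 'world', 'login_username': 'johnsmith'}
print(real)    # {'some': 'stuff'}
```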
Injecting events into cs-uri-query is a really clunky and messy way to transfer TeaLeaf event data to the logs, but I didn’t find a better one. Why IBM TeaLeaf does not allow creating an extra “event data” field that would store only the eventing data, and instead decided to shove all this into the real query string, is beyond me. On top of all that, I discovered an unfixable and annoying bug: if your real query string contains a field that starts with ‘d’, such as ‘data’, you will not be able to exclude it via “Data Filters” -> “URL Fields” -> “Exclude Specific”. Even if you add it there, it will not be removed.
So when your real query string looks like “data=12345&some=stuff”, TeaLeaf will produce “data=12345567=johnsmith&some=stuff”, corrupting the previous value. I didn’t find a way to make TeaLeaf insert the ‘&’ character there. Just keep that in mind.
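Since the missing ‘&’ cannot be reliably reconstructed, the most you can do downstream is flag the suspect pairs. A small heuristic sketch: after splitting on ‘&’, any value that still contains ‘=’ points to an event value fused onto a real field:

```python
def find_corrupted_pairs(cs_uri_query):
    """Flag query-string pairs where TeaLeaf omitted the '&' separator.

    After splitting on '&', a well-formed pair contains exactly one '='.
    A value with another '=' inside (e.g. 'data=12345567=johnsmith')
    suggests an event value was fused onto a real field.
    """
    suspects = []
    for pair in cs_uri_query.split("&"):
        key, _, value = pair.partition("=")
        if "=" in value:
            suspects.append((key, value))
    return suspects

print(find_corrupted_pairs("data=12345567=johnsmith&some=stuff"))
# [('data', '12345567=johnsmith')]
```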
Prepare for really long and messy-looking query strings in the output logs, but all the good data will be there.
One of the last steps is to select the fields that you actually need in the logs. After multiple iterations, I ended up with the fields shown in the images above.
I disagree with TeaLeaf’s field naming convention (why should a field name include brackets and unpredictably mixed case?), but luckily Splunk is smart enough to parse and index messy things like these.
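If you ever need to clean up such names yourself outside of Splunk, a tiny normalizer along these lines could help; the sample names below are made up for illustration, not taken from an actual export:

```python
import re

def normalize_field_name(raw_name):
    """Turn a messy exported field name into a lowercase identifier.

    Strips brackets and parentheses, then converts remaining separators
    to underscores, e.g. '[Cs(User-Agent)]' -> 'cs_user_agent'.
    """
    cleaned = re.sub(r"[\[\]()]", " ", raw_name)
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", cleaned.strip())
    return cleaned.strip("_").lower()

# Hypothetical messy names, for illustration only:
for name in ["[Cs(User-Agent)]", "sc-Status", "TLT_Event(Values)"]:
    print(normalize_field_name(name))
# cs_user_agent
# sc_status
# tlt_event_values
```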
Final setting: specify a notification email for log task completion. These may get annoying when one arrives every hour; the choice is yours.
We’re done! Once the task is saved and you’ve made sure your cxConnect task is marked as active, it will run by itself and start creating logs. You may then duplicate this task; say, if this is the hourly task and you also want daily/nightly log extracts like I did, just duplicate it, make a small adjustment to the schedule, and make sure to also adjust “Destination” -> “Log Files” -> “Log Directory”. It’s a good idea to keep hourly and daily logs in separate directories.
Gleb Esman is currently working as Senior Product Manager for Security/Anti-fraud solutions at Splunk leading efforts to build next generation security products covering advanced fraud cases across multiple industry verticals.