User Behavior Analysis with Splunk

Back in my days at IBM T.J. Watson Research Center, where we were working on techniques to detect known and unknown malware, the fast-growing challenge was the rising ability of malware to become polymorphic.

Malicious snippets of code encrypted themselves, which made it very difficult to apply conventional signature-based detection techniques.

We developed a tiny virtual machine in C that was able to load malware code in real time and analyze its behavior without the need to figure out how to decrypt it. Score metrics were assigned to key points and function calls, and logic was put in place to trigger an alert if the “risk” score exceeded a certain heuristic threshold.

That technique allowed us to deliver a top-quality enterprise security solution (later purchased by Symantec) that was capable of detecting previously unknown threats. That was more than 15 years ago.

While working with financial clients and technology companies today, I can see that old behavior pattern analysis remains as strong as ever, helping enterprises discover new types of suspicious behavior and investigate malicious activities with previously unknown patterns from previously unknown sources.

Industry leaders seem to agree that some of the recent high-profile breaches could have been thwarted with a properly configured behavior analysis SIEM system in place.

With attackers and fraudsters changing their approaches and techniques daily, solutions based on behavior analysis seem to be the most promising in offering detection and protection layers against previously unknown threats.

The advantage of Splunk in tackling such complicated tasks is that it is a very open and scalable framework.

You don’t need to “pay extra to enable feature A, B and C” (as is usually the case with many appliance-based offerings) and you don’t need to hire an army of vendor consultants to tweak their solution to your business specifics.

In fact, I have helped clients use Splunk’s free license to successfully detect, investigate and eliminate very nasty malware cyber attacks on web application hosting services.

I. Capabilities of user behavior analysis systems

Modern user behavior analysis systems generally encompass the following capabilities:

  1. They are able to sessionize user activity, in other words, to group isolated hits and events coming from possibly different sources into clusters of activities driven by the same user.
    This is a very important step that adds identity metadata to every hit and event and allows for further analysis and for establishing baselines of typical user behavior.
  2. They offer a risk scoring approach where certain scores are assigned to significant events (money transactions, securities trading, account updates, password changes).
    Scores may also be assigned to more complex event dependencies such as order of appearance, timing between events, a “rush” factor in money-moving transactions, and others.
    A summary risk score is calculated automatically and security alerts are issued when the risk score threshold is exceeded.
  3. They offer an automated machine learning approach where a system-wide baseline of behavior is calculated automatically and alerts are issued when user session activities exhibit a strong enough deviation from the established baselines.
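To make capability 3 a little more concrete, here is a minimal Python sketch of a baseline check. It assumes the baseline is simply the mean and standard deviation of past session durations, and that "strong enough deviation" means more than three standard deviations. The function name, the sample numbers and the threshold are illustrative assumptions, not taken from any product.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, value, z_threshold=3.0):
    """Return True when value is more than z_threshold standard
    deviations away from the mean of history (a simple z-score test)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # every past value was identical
    return abs(value - mu) / sigma > z_threshold


# Hypothetical past session durations, in seconds.
baseline_durations = [300, 320, 280, 310, 295, 305, 290, 315]

typical_flagged = deviates_from_baseline(baseline_durations, 302)
rushed_flagged = deviates_from_baseline(baseline_durations, 20)
print(typical_flagged, rushed_flagged)
```

A real deployment would likely keep separate baselines per site or per user group, which this sketch does not attempt.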

In this post I will show you how to use Splunk to implement all three of the above.

I will cater this post somewhat to the financial services sector, securities trading and e-commerce enterprises. These are usually high-profile targets for attacks, where breaches can cause significant damage to both money and trust.

II. Sessionizing user activity and implementing risk score based alerting

In the first part of this post I’ll show how to set up alerts when suspicious activity is detected within an active online financial application. This will cover points #1 and #2 of “Capabilities of user behavior analysis systems” above.

In the second part I’ll cover point #3.

Just like in earlier posts, we need to make a set of assumptions. You’ll be able to substitute your own names later on if you want to.

  1. Let’s assume you have your web traffic logs with all the event data coming into Splunk.
    All web events are located within the index named: logs.
  2. Field names (or aliases):
    1. HTTP request method (GET, POST, HEAD, etc..): method
    2. Session tracking cookie (such as: ASP.NET_SessionId, PHPSESSID or JSESSIONID): session_id
    3. URL of page accessed: page
    4. Referrer for each web page hit: referer
    5. Username field: username
    6. IP address of visitor: ip
    7. USER_AGENT value: ua
    8. Name of website: site (Could be: www.your-bank.com or www.your-brokerage.com)
    9. HTTP response status code (used by the search below): code
  3. Data is coming into Splunk in real time.
    Note: If you have user activity traffic data coming in with delays (on a scheduled basis), the alerting schedules will need to change slightly to accommodate that.

Here’s what needs to be done:

  • Incoming traffic activity will be grouped into sessions (transactions in Splunk terms).
  • A risk score will be assigned to specific events.
  • Each session will be scanned for the presence of high risk events and its risk score will be updated.
  • An email alert will be issued automatically if any session is detected with a score exceeding a predetermined threshold.

And here’s the beauty of Splunk: we can accomplish all of the above with one search, in the single step of creating a scheduled alert:

  1. From the Splunk menu, select: Settings -> Searches, reports and alerts
  2. From App context, pick the App you want to create the alert in, or “Search & Reporting” if you don’t have a custom app.
  3. Click the [New] button.
  4. Fill in the form according to this image:

 

[Image: user-behavior-analysis-alert-1]

Once saved – this search will be triggered every 5 minutes (cron setting: */5 * * * *).
The actual search (see the source code below) does the following:

  1. Scans 1 hour worth of the most recent data and sessionizes it by session_id cookie value using the Splunk transaction command.
    In this example we consider a “session” to be a logical set of hits originating from a single user (bound together with the help of the session_id cookie) that ends either with a hit to the “Logout” page or when more than 15 minutes of inactivity is detected.
  2. Along the way it constructs the xpages field, which converts a web page URL from a simple value like:
    /Welcome.aspx  to an enriched value like this:
    [2015-05-12 18:59:55] [GET] [200] [www.your-site.com] /Welcome.aspx
    This representation will help down the road to examine the sequence of pages the user accessed, along with the timing and metadata of each page hit, both visually and programmatically. The transaction command is instructed *not* to deduplicate the xpages field (via the mvlist=xpages parameter).
  3. Evaluates business-specific risk scoring for each session.
  4. Filters out sessions with low risk scores.
  5. Emits all sessions that exceeded the behavior risk score threshold. If that happens, an email alert will be sent immediately with all the necessary information about the suspicious sessions.
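Step 1 above can be sketched outside of Splunk as follows. This is only a rough Python approximation of the grouping that the transaction command performs with maxpause=15m and a logout end condition. It is not a description of Splunk internals, and the event dictionaries and the sessionize helper are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

MAX_PAUSE = timedelta(minutes=15)  # mirrors maxpause=15m in the search


def sessionize(events, max_pause=MAX_PAUSE):
    """Group hits by session_id; close a session after a logout hit
    or when the gap between consecutive hits exceeds max_pause."""
    by_cookie = defaultdict(list)
    for event in sorted(events, key=lambda e: e["time"]):
        by_cookie[event["session_id"]].append(event)

    sessions = []
    for hits in by_cookie.values():
        current = []
        for hit in hits:
            # Too long a pause: the previous session is over.
            if current and hit["time"] - current[-1]["time"] > max_pause:
                sessions.append(current)
                current = []
            current.append(hit)
            # A logout hit closes the session (case-insensitive match).
            if "/logout" in hit["page"].lower():
                sessions.append(current)
                current = []
        if current:
            sessions.append(current)
    return sessions


# Hypothetical sample traffic for two session cookies.
t0 = datetime(2015, 5, 12, 18, 59, 55)
events = [
    {"time": t0, "session_id": "abc", "page": "/Welcome.aspx"},
    {"time": t0 + timedelta(minutes=2), "session_id": "abc", "page": "/Secure/FundsTransfer.aspx"},
    {"time": t0 + timedelta(minutes=3), "session_id": "abc", "page": "/Logout.aspx"},
    {"time": t0 + timedelta(minutes=1), "session_id": "xyz", "page": "/Welcome.aspx"},
    {"time": t0 + timedelta(minutes=30), "session_id": "xyz", "page": "/Welcome.aspx"},
]
sessions = sessionize(events)
print([len(s) for s in sessions])
```

In this sample the "abc" cookie yields one three-hit session ended by the logout page, while the "xyz" cookie is split into two sessions because of the 29-minute gap.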

Here’s the complete source of the alert query:

index=logs earliest=-1h
| eval mytime=strftime(_time,"%Y-%m-%d %H:%M:%S")
| eval pages=page
| eval xpages = "["+mytime+"] ["+method+"] ["+code+"] ["+site+"] "+page
| transaction session_id maxpause=15m endswith=(page=*/logout*) maxevents=-1 maxopentxn=500000 maxopenevents=20000000 keepevicted=1 mvlist=xpages
| eval usernames=username
  | eval username_lf=lower(mvindex(usernames, 0))
  | eval username=username_lf
  | eval username_ll=lower(mvindex(usernames, -1))
  | eval username=if(len(username)>0, username, "")
| eval ips=ip 
  | eval ip=mvindex(ips, 0)
| eval referer = mvindex(referer, 0)
| eval uas=ua 
  | eval ua=mvindex(uas, 0)
| eval sites=site
  | eval site=mvindex(sites, 0)

| where eventcount>=5
| eval riskscore=0
| eval riskmsg=""

| eval test=mvfind(xpages, "(?i)\[POST.*?/fundstransfer") | eval addscore=10
  | eval riskscore=if(isnull(test), riskscore, riskscore+addscore) 
  | eval riskmsg=if(isnull(test), riskmsg, riskmsg+"|(+"+addscore+") Money movement detected")
  | eval addscore=15 | eval riskscore=if(isnull(test) OR test>=6, riskscore, riskscore+addscore) 
    | eval riskmsg=if(isnull(test) OR test>=6, riskmsg, riskmsg+"|(+"+addscore+") Immediate Money movement detected")

| eval test=mvfind(xpages, "(?i)\[POST.*?/updateuserprofile") | eval addscore=15 
  | eval riskscore=if(isnull(test), riskscore, riskscore+addscore) 
  | eval riskmsg=if(isnull(test), riskmsg, riskmsg+"|(+"+addscore+") Profile edit detected")
  | eval addscore=15 | eval riskscore=if(isnull(test) OR test>=6, riskscore, riskscore+addscore) 
    | eval riskmsg=if(isnull(test) OR test>=6, riskmsg, riskmsg+"|(+"+addscore+") Immediate Profile edit detected")

| eval test=mvfind(xpages, "(?i)\[POST.*?/updatepassword") | eval addscore=20
  | eval riskscore=if(isnull(test), riskscore, riskscore+addscore) 
  | eval riskmsg=if(isnull(test), riskmsg, riskmsg+"|(+"+addscore+") Password update detected")

| eval oldscore=riskscore | eval test=mvfind(xpages, "(?i)\[POST.*?/(stock|options)tradeorder") | eval addscore=10
  | eval riskscore=if(isnull(test), riskscore, riskscore+addscore) 
  | eval riskmsg=if(isnull(test), riskmsg, riskmsg+"|(+"+addscore+") Security Trading detected")
| where riskscore>=45
| makemv delim="|" riskmsg
| iplocation allfields=1 ip | eval Country=if(Country="United States", "USA", Country) | eval Location=Region+" - "+City
| eval ht=round(duration/eventcount, 2)
| table _time, eventcount, duration, ht, ip, Country, Location, username, riskscore, riskmsg, xpages
| rename ht as "Avg seconds/hit", ip as IP

Here are step-by-step explanations:

index=logs earliest=-1h
| eval mytime=strftime(_time,"%Y-%m-%d %H:%M:%S")
| eval pages=page
| eval xpages = "["+mytime+"] ["+method+"] ["+code+"] ["+site+"] "+page
| transaction session_id maxpause=15m endswith=(page=*/logout*) maxevents=-1 maxopentxn=500000 maxopenevents=20000000 keepevicted=1 mvlist=xpages

A new field, mytime, is constructed to hold a human-readable time of the hit. Another new field, xpages, is constructed to contain extra page hit metadata: the human-readable time, HTTP request method, response code and site name (useful in case your log receives traffic hits from multiple sites/subdomains).
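For illustration only, the same enrichment can be written in Python. The make_xpage helper is a hypothetical name; the format string simply reproduces the layout of the xpages value built by the eval above.

```python
from datetime import datetime


def make_xpage(time, method, code, site, page):
    """Build the enriched page string: [time] [method] [code] [site] page."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"[{stamp}] [{method}] [{code}] [{site}] {page}"


xpage = make_xpage(
    datetime(2015, 5, 12, 18, 59, 55), "GET", 200,
    "www.your-site.com", "/Welcome.aspx",
)
print(xpage)  # [2015-05-12 18:59:55] [GET] [200] [www.your-site.com] /Welcome.aspx
```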

The transaction command is set to contain an unlimited number of events and to terminate either upon reaching the logout page or after being idle for more than 15 minutes. It is also instructed to keep all records (do not deduplicate) within the xpages field.
As a result, when an alert fires, the xpages field will contain a multivalue sequence of the hits belonging to the session, like this:
[Image: splunk-session-behavior-risk-score-alert]

| eval usernames=username
  | eval username_lf=lower(mvindex(usernames, 0))
  | eval username=username_lf
  | eval username_ll=lower(mvindex(usernames, -1))
  | eval username=if(len(username)>0, username, "")
| eval ips=ip 
  | eval ip=mvindex(ips, 0)
| eval referer = mvindex(referer, 0)
| eval uas=ua 
  | eval ua=mvindex(uas, 0)
| eval sites=site
  | eval site=mvindex(sites, 0)

Once the transaction is finished, new fields are created: username_lf (username, lowercase, first) and username_ll (username, lowercase, last). Two usernames are needed in case the transaction contains more than one username. This could happen if the user is managing multiple subaccounts or has forgotten (or mistyped) their username. We need to keep all data within the session to allow for full visibility down the road. Transaction creates a multivalue field for most fields, so we build a plural field name for each multivalue field and also create a singular field name for the first occurrence within that multivalue field. Note that only xpages is listed in mvlist, so the other multivalue fields hold unique values in alphabetical order rather than in arrival order; “first” and “last” refer to positions in that list.
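The first/last extraction can be sketched in Python as below. The list stands in for the multivalue usernames field, and first_last_lower is a hypothetical helper that plays the role of lower(mvindex(usernames, 0)) and lower(mvindex(usernames, -1)), with an empty string as the fallback.

```python
def first_last_lower(values):
    """Return the lowercased first and last entries of a list,
    or a pair of empty strings when the list is empty."""
    if not values:
        return "", ""
    return values[0].lower(), values[-1].lower()


# Hypothetical multivalue field with two distinct usernames in one session.
usernames = ["JSmith", "jsmith2"]
username_lf, username_ll = first_last_lower(usernames)

# Differing first and last values hint at more than one username in the session.
multiple_usernames = username_lf != username_ll
print(username_lf, username_ll, multiple_usernames)
```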

We do not need multivalue field for referers as we are only interested in the very first referer.

| where eventcount>=5
| eval riskscore=0
| eval riskmsg=""

After sessions are extracted, we want to throw away tiny sessions with fewer than 5 hits (this may need to be adjusted for the specifics of your business). Two new fields are created: riskscore (initially 0) and riskmsg (initially empty).

Riskscore will be calculated automatically depending on the specific behaviors detected within the session, and riskmsg will contain explanatory information describing why a given session was assigned a higher risk value.

| eval test=mvfind(xpages, "(?i)\[POST.*?/fundstransfer") | eval addscore=10
  | eval riskscore=if(isnull(test), riskscore, riskscore+addscore) 
  | eval riskmsg=if(isnull(test), riskmsg, riskmsg+"|(+"+addscore+") Money movement detected")
  | eval addscore=15 | eval riskscore=if(isnull(test) OR test>=6, riskscore, riskscore+addscore) 
    | eval riskmsg=if(isnull(test) OR test>=6, riskmsg, riskmsg+"|(+"+addscore+") Immediate Money movement detected")

| eval test=mvfind(xpages, "(?i)\[POST.*?/updateuserprofile") | eval addscore=15 
  | eval riskscore=if(isnull(test), riskscore, riskscore+addscore) 
  | eval riskmsg=if(isnull(test), riskmsg, riskmsg+"|(+"+addscore+") Profile edit detected")
  | eval addscore=15 | eval riskscore=if(isnull(test) OR test>=6, riskscore, riskscore+addscore) 
    | eval riskmsg=if(isnull(test) OR test>=6, riskmsg, riskmsg+"|(+"+addscore+") Immediate Profile edit detected")

| eval test=mvfind(xpages, "(?i)\[POST.*?/updatepassword") | eval addscore=20
  | eval riskscore=if(isnull(test), riskscore, riskscore+addscore) 
  | eval riskmsg=if(isnull(test), riskmsg, riskmsg+"|(+"+addscore+") Password update detected")

| eval test=mvfind(xpages, "(?i)\[POST.*?/(stock|options)tradeorder") | eval addscore=10
  | eval riskscore=if(isnull(test), riskscore, riskscore+addscore) 
  | eval riskmsg=if(isnull(test), riskmsg, riskmsg+"|(+"+addscore+") Security Trading detected")

This is one of the most important parts of the search, where we assign a risk score to the session depending on the activity detected within it. In this example I want to assign a higher risk score to the following actions:

  • Any money transfer action
  • User profile update action
  • User password update action
  • Securities trading action

I also assign a higher risk score to a session where a money transfer or user profile update happens quickly, close to the very beginning of the session (within the first 6 hits). This accounts for typical fraudster behavior, where malicious activity usually happens shortly after a successful login. This metric might need to be adjusted for your specific business application.
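The scoring logic can be mirrored in a short Python sketch. The patterns, scores and messages are copied from the search above; the mvfind and score_session helpers, the RULES table and the sample session are hypothetical and only approximate what the SPL does.

```python
import re


def mvfind(values, pattern):
    """Index of the first value matching the regex, or None
    (intended to mirror Splunk's mvfind, which is 0-based)."""
    regex = re.compile(pattern)
    for index, value in enumerate(values):
        if regex.search(value):
            return index
    return None


# (pattern, base score, early bonus or None, message)
RULES = [
    (r"(?i)\[POST.*?/fundstransfer", 10, 15, "Money movement detected"),
    (r"(?i)\[POST.*?/updateuserprofile", 15, 15, "Profile edit detected"),
    (r"(?i)\[POST.*?/updatepassword", 20, None, "Password update detected"),
    (r"(?i)\[POST.*?/(stock|options)tradeorder", 10, None, "Security Trading detected"),
]
EARLY_HITS = 6  # "within the first 6 hits", i.e. index 0 through 5


def score_session(xpages):
    """Return (riskscore, riskmsg list) for one session's xpages values."""
    riskscore, riskmsg = 0, []
    for pattern, base, bonus, message in RULES:
        position = mvfind(xpages, pattern)
        if position is None:
            continue
        riskscore += base
        riskmsg.append(f"(+{base}) {message}")
        if bonus is not None and position < EARLY_HITS:
            riskscore += bonus
            riskmsg.append(f"(+{bonus}) Immediate {message}")
    return riskscore, riskmsg


# Hypothetical session: login page, password change, then a funds transfer.
xpages = [
    "[2015-05-12 18:59:55] [GET] [200] [www.your-bank.com] /Welcome.aspx",
    "[2015-05-12 19:00:10] [POST] [200] [www.your-bank.com] /Secure/Account/UpdatePassword.aspx",
    "[2015-05-12 19:00:40] [POST] [200] [www.your-bank.com] /Secure/FundsTransfer.aspx",
]
score, messages = score_session(xpages)
print(score, messages)
```

With this sample the session collects 10 + 15 for the early funds transfer and 20 for the password update, which reaches the threshold of 45 used in the search.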

Most account updates and money transfer requests are initiated via the HTTP POST method, and the regular expression takes this into account.

Also, in this example I assume that the URL of the password update page is /Secure/Account/UpdatePassword.aspx, and thus if we want to detect a POST request to this page, the matching regular expression is: “(?i)\[POST.*?/updatepassword”
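A quick way to sanity-check such a pattern before putting it into an alert is to try it against sample xpages values. The Python snippet below does that with the standard re module; the two sample strings are made up for illustration.

```python
import re

# Same pattern as in the search: case-insensitive, must follow "[POST".
PASSWORD_UPDATE = re.compile(r"(?i)\[POST.*?/updatepassword")

# Hypothetical xpages values for the same page, differing only in method.
post_hit = "[2015-05-12 19:00:10] [POST] [200] [www.your-bank.com] /Secure/Account/UpdatePassword.aspx"
get_hit = "[2015-05-12 19:00:02] [GET] [200] [www.your-bank.com] /Secure/Account/UpdatePassword.aspx"

post_matches = PASSWORD_UPDATE.search(post_hit) is not None
get_matches = PASSWORD_UPDATE.search(get_hit) is not None
print(post_matches, get_matches)
```

Only the POST hit matches, so merely viewing the password page does not raise the score.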

Every time a hit to a higher-risk page is detected, the session risk score is updated, along with the riskmsg text field.

You may add more tests for risky behaviors within the session, possibly introducing more complex variables that consider combinations of different factors appearing within a certain proximity of each other.

To manage false positives you may also include factors that lower the riskscore, such as detecting whether the user session was initiated from a familiar location or device. This part is beyond the scope of this series, as it requires more complex logic.
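Purely as an illustration of the idea, and not as something this series implements, a score-lowering factor could look like the Python sketch below. It assumes you already maintain sets of IP addresses and user agents previously seen for the user; the helper name, the discount value and the sample data are all hypothetical.

```python
def adjust_for_familiarity(riskscore, ip, ua, known_ips, known_uas, discount=10):
    """Lower the score once for a familiar IP and once for a familiar
    user agent, never letting it drop below zero."""
    if ip in known_ips:
        riskscore -= discount
    if ua in known_uas:
        riskscore -= discount
    return max(riskscore, 0)


# Hypothetical history for one user (documentation-range IP address).
known_ips = {"203.0.113.7"}
known_uas = {"Mozilla/5.0"}

familiar_ip_only = adjust_for_familiarity(45, "203.0.113.7", "curl/7.0", known_ips, known_uas)
fully_familiar = adjust_for_familiarity(5, "203.0.113.7", "Mozilla/5.0", known_ips, known_uas)
print(familiar_ip_only, fully_familiar)
```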

In the second part I will show you how to use Splunk to automatically establish a baseline for normal session duration and normal session hit counts, and to create alerts when a session that is too long or too fast is detected.

Gleb Esman is currently working as Senior Product Manager for Security/Anti-fraud solutions at Splunk leading efforts to build next generation security products covering advanced fraud cases across multiple industry verticals.