Network Security Through Data Analysis: Building Situational Awareness
by Michael Collins

O'Reilly 2014.
ISBN 978-1-449-35790-0. Amazon.com USD 27.74

Reviewed by Richard Austin 3/12/2014

Table of Contents

We are drowning in data logs from our network infrastructure, security devices, servers, etc. They vomit potentially relevant data in shocking volumes that challenge our ability to merely collect and store it. We suspect that there is much useful information in there, somewhere, if only we could retrieve it, organize it and present it in a timely, actionable form. And, sad to say, it is becoming too common for post-breach investigations to reveal that the affected organization had information that would have enabled detecting the breach and mitigating its severity, if only they had realized they had it and could have acted on it in a timely fashion.

Collins's new book is a worthwhile contribution to the continuing conversation on how we can profitably make use of the wealth of data available to us in developing that elusive awareness of what is going on around us (situational awareness). Collins organizes his presentation into three logical phases: data, tools and analytics.

The "Data" section provides a good walkthrough of the different types of sensors and the logic that governs their placement. He provides sound advice on a critical consideration in sensor placement: determining what a given sensor can "see" (vantage) and how to avoid placing multiple sensors with overlapping vantage into the same data. I was glad to see coverage of NetFlow data as it is commonly overlooked though readily available from many modern networking devices. Chapter 4 ("Data Storage for Analysis: Relational Databases, Big Data and Other Options") was a bit of a disappointment in its coverage of "big data". Collins provides some tantalizing hints but I would have appreciated more detail on what contributions to expect from "big data" technologies.

The "Tools" section covers "SiLK" and "R" in addition to more familiar tools. "SiLK" is a tool for working with NetFlow data, while "R" is a full-function statistical package. Both are available without cost. SiLK allows the analyst to start working with NetFlow data and develop an appreciation for its value without having to justify budget, etc., for purchasing one of the commercial packages. And, while many security professionals quickly run for the nearest exit when "statistics" are mentioned, a full-function statistical package is a valuable addition to your toolbox when searching for meaning in data. Collins provides a gentle introduction to R that prepares the way for its more extensive use in the "Analytics" section. I particularly recommend Chapter 7 (Classification and Event Tools) for its discussion of event detection as a problem in binary classification with consideration of "Receiver Operating Characteristic" curves and the base rate fallacy. The words may be long but these are important concepts in understanding why systems such as IDSs so often fail to meet our expectations.
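To make the base rate fallacy concrete, here is a minimal sketch in Python. It is not taken from the book: the function name and the detection rates are made-up assumptions chosen only for illustration. It applies Bayes' rule to ask what fraction of alerts are real when attacks are rare.

```python
def alert_precision(true_positive_rate, false_positive_rate, base_rate):
    """Probability that an alert corresponds to a real attack (Bayes' rule)."""
    # Share of all events that are attacks AND raise an alert.
    true_alerts = true_positive_rate * base_rate
    # Share of all events that are benign AND still raise an alert.
    false_alerts = false_positive_rate * (1.0 - base_rate)
    return true_alerts / (true_alerts + false_alerts)


# Hypothetical detector: catches 99% of attacks, misfires on 1% of benign
# events, and only 1 event in 1,000 is actually hostile.
precision = alert_precision(0.99, 0.01, 0.001)
print(f"Share of alerts that are real attacks: {precision:.1%}")
```

Under these assumed numbers, roughly nine alerts in a hundred point at a real attack, even though the detector looks excellent on paper. That gap is one plausible reading of why an IDS can disappoint in practice.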

With a good grounding in "Data" and "Tools", Collins turns his attention to the real meat of the matter in the "Analytics" section. One of the many advantages of working with a full-function statistical package versus a set of canned vendor displays is the ability to dive into the data and do exploratory data analysis (EDA). Collins provides great coverage of how this works and earns many kudos from me for addressing a pet peeve of mine: distributional assumptions ("Why did you assume Gaussian? Because it's the normal assumption!"). How many times have we been assured that 68% of packet sizes, etc., fall within one standard deviation of the mean? That would be true if our packet sizes did indeed follow the Gaussian distribution, but do we know that? Collins recognizes the problem and covers using quantile-quantile plots to validate that our assumed distribution actually matches the data.
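The 68% question can be checked directly. The sketch below is an illustration in Python rather than the book's R code, and it uses synthetic data, not real packet sizes: one Gaussian sample and one skewed (exponential) sample, both assumptions made for the example. It measures the share of values within one standard deviation of the mean, then lists a few quantile pairs of the kind a quantile-quantile plot would draw.

```python
import random
import statistics


def fraction_within_one_sd(values):
    """Share of values lying within one standard deviation of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= sd)
    return inside / len(values)


def qq_pairs(values, probs=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Pair quantiles of a fitted normal with the observed sample quantiles."""
    fitted = statistics.NormalDist(statistics.fmean(values),
                                   statistics.pstdev(values))
    ordered = sorted(values)
    pairs = []
    for p in probs:
        observed = ordered[int(p * (len(ordered) - 1))]
        pairs.append((fitted.inv_cdf(p), observed))
    return pairs


rng = random.Random(42)  # fixed seed so the run is repeatable
# Synthetic stand-ins for a measurement such as packet size (hypothetical).
gaussian = [rng.gauss(500, 100) for _ in range(50_000)]
skewed = [rng.expovariate(1 / 500) for _ in range(50_000)]

gaussian_share = fraction_within_one_sd(gaussian)
skewed_share = fraction_within_one_sd(skewed)
print(f"Gaussian sample within one SD: {gaussian_share:.1%}")
print(f"Skewed sample within one SD:   {skewed_share:.1%}")

# If the data were Gaussian, each pair would hold two nearly equal numbers.
for theoretical, observed in qq_pairs(skewed):
    print(f"normal quantile {theoretical:9.1f}   observed {observed:9.1f}")
```

For the skewed sample the share comes out well above 68%, and the fitted normal even predicts negative values at the low end where the data has none. A quantile-quantile plot makes that kind of mismatch visible at a glance.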

Visualization is a powerful component of EDA and Collins provides many topical illustrations of how to productively visualize data when searching to understand its meaning and identify patterns (e.g., what traffic volume is really normal and what is anomalous?). As can be seen from the table of contents, Collins provides wide coverage of the different types of analysis which can be applied to gain insight from the data. His presentation on graph analysis (Chapter 13) highlights an important technique for using the connectedness of nodes as an indicator of their importance in events (e.g., when investigating a malware outbreak, the most connected of infected nodes is a good candidate for "patient 0" and a good starting place to determine how the malcode got into your environment).
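As a rough illustration of the "most connected node" idea, the following Python sketch counts distinct peers per host. It is an assumption-laden toy, not the book's method: the addresses and connections are invented, and degree (number of distinct peers) is only the simplest of several measures of connectedness.

```python
def degree_counts(edges):
    """Count the distinct peers of each host in an undirected edge list."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {host: len(peers) for host, peers in neighbors.items()}


# Hypothetical connections observed between infected hosts.
edges = [
    ("10.0.0.5", "10.0.0.7"),
    ("10.0.0.5", "10.0.0.9"),
    ("10.0.0.5", "10.0.0.12"),
    ("10.0.0.7", "10.0.0.9"),
    ("10.0.0.12", "10.0.0.20"),
]

degrees = degree_counts(edges)
# The host with the most distinct peers is a candidate starting point.
most_connected = max(degrees, key=degrees.get)
print(most_connected, degrees[most_connected])
```

In this made-up data, 10.0.0.5 talks to three other infected hosts, so it would be the first machine to examine. Real flow records would need deduplication and time ordering before a ranking like this means much.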

This book is an excellent introduction to the cornucopia of techniques that can be profitably applied in searching for understanding in the vast array of data our infrastructures produce. Collins introduces the techniques through solid examples and clearly explains what the techniques do, their limitations and how an analyst would actually use them in practice. If I had a criticism of the book, it would be that it is too short! However, the chapters include a good set of references that will guide further study. Definitely a recommended read for the technical security professional.

Special thanks to the kindly reader who suggested this book for review.


It has been said "Be careful, for writing books is endless, and much study wears you out" so Richard Austin (http://cse.spsu.edu/raustin2) fearlessly samples the wares of the publishing houses and opines as to which might most profitably occupy your scarce reading time. He welcomes your thoughts and comments via raustin at ieee dot org