Cyber Security

Luke Hally

Intro to Data

October 26, 2021

A new course starts this week: Data Security and Privacy. I’m looking forward to this one. We learn how to actually hack into real computers in this course, and I’ll be exploring the security of biometric authentication for my project.

First up is a look at data. Data is valuable; apparently all the data in the world is worth more than all the oil in the world. It comes in many forms: code, trade secrets, book content, processes/procedures, customer/supplier lists, financial records. We even have data about data: metadata. The internet, and the hyperconnectivity that has stemmed from it, has enabled a significant (understatement) growth in data collection.

Big Data

Big Data is one of those things we’ve all heard about, but it doesn’t really have a clear definition. It is a foundational technology, which means that it enables progress and innovation in multiple domains. Big Data is critical for:

  • Artificial Intelligence (AI) and Machine Learning (ML)
  • The Internet of Things, Industrial Internet of Things, Industry 4.0

So how do we define it? It’s something that is used by data-led businesses; it’s the collection and analysis of large and complex datasets; it draws insights from multiple sources; or, very simply, it is too large to store on one computer.

The 5 Vs of Big Data

Volume: Big Data needs a lot of data. Applications such as machine learning need that volume to get the variety of data required to recognise patterns and to learn.
Velocity: This is the rate at which the data is gathered. Data is not static and is always growing.
Veracity: How accurate is the data? It needs to be as error free as possible. A lot of data management comes down to pre-processing (making sure the data is workable), so structured data is best.
Variety: How complex is the data? Where is it from? How structured is it?
  • Structured: The data follows rules on how it is created, and the content is validated. An example is when you enter your address and choose from a list of options. The address has the same structure every time and the content is validated against a source of truth.
  • Semi structured: The data has rules for input but no validation. For example, a form may have fields for different parts of your address, but it’s up to you to type it in. This could lead to spelling mistakes, wrong postcodes etc. The address has structure, but may contain errors.
  • Unstructured: There are no rules or validation in place. For example, to collect my address a website would just have a free text field for me to type in.
Value: How much is the data worth? This is very contextual. It may have a monetary value, or it may be required to undertake money making activities. There are companies collecting data in the hope that they can use it, without yet knowing how. In these cases, data is a burden: they need to protect and store it, maybe process it, but they aren’t getting any value out of it.
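
To make the structured / semi structured / unstructured distinction a bit more concrete, here is a minimal Python sketch using the address example. The `Address` class, the `KNOWN_POSTCODES` table and the `is_validated` function are all names I made up for illustration; a real system would validate against a proper address database rather than a two-entry dictionary.

```python
from dataclasses import dataclass

# Hypothetical, tiny "source of truth": postcode -> suburbs valid for it.
KNOWN_POSTCODES = {
    "2000": {"Sydney"},
    "3000": {"Melbourne"},
}


@dataclass
class Address:
    street: str
    suburb: str
    postcode: str


def is_validated(addr: Address) -> bool:
    """Return True if the suburb/postcode pair matches the source of truth."""
    return addr.suburb in KNOWN_POSTCODES.get(addr.postcode, set())


# Structured: fixed fields, content chosen from a validated list.
structured = Address("1 George St", "Sydney", "2000")

# Semi structured: same fields, but typed by hand, so a typo slipped in.
semi_structured = Address("1 George St", "Sydny", "2000")

# Unstructured: one free text field, no rules and nothing to check against.
unstructured = "1 george st sydney nsw"

print(is_validated(structured))
print(is_validated(semi_structured))
```

The semi structured record has the right shape but fails the check, which is the kind of error pre-processing has to catch. The unstructured string can't even be checked until something parses it into fields.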

Big Data Vs Business Intelligence

These terms are sometimes used interchangeably, but they look through opposite ends of the telescope. Big Data is data-led: what is the data telling us? This is interesting because it can give us answers to questions that we might not have asked, such as retail data that reveals that men who buy nappies also buy beer. This can then lead to ideas about how to get these men to spend more, such as putting the nappies and beer at opposite ends of the store, or selling a bundle of nappies, beer and a profit-making item.

Business Intelligence is inquiry-led, asking specific questions such as: what is the gender of the people who purchased these nappies?

Note that we still have to ask questions of Big Data, but they are more general, such as: which items are commonly purchased together?
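
As a rough illustration of the two styles of question, here is a small Python sketch. The baskets are invented for the example (not real retail data), and `co_purchase_counts` and `baskets_containing` are hypothetical helper names. The first function asks the general, data-led question (which items are bought together?), while the second asks a specific, inquiry-led one.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets, purely for illustration.
baskets = [
    {"nappies", "beer", "milk"},
    {"nappies", "beer"},
    {"bread", "milk"},
    {"nappies", "beer", "bread"},
]


def co_purchase_counts(baskets):
    """Data-led: count how often each pair of items shares a basket."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("beer", "nappies") and ("nappies", "beer") one key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def baskets_containing(item, baskets):
    """Inquiry-led: answer one specific question about one item."""
    return [basket for basket in baskets if item in basket]


pair_counts = co_purchase_counts(baskets)
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
print(len(baskets_containing("nappies", baskets)))
```

With these toy baskets the most common pair is beer and nappies, a pattern nobody had to ask about by name. Real Big Data tooling does this across far more data than fits on one machine, which this sketch does not attempt.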

Data Lifecycle

The data lifecycle is an understanding of data from its creation through to its destruction. It helps us understand why it was created, what is collected, how it is stored and its pathway to destruction. It is useful in security, because it lets us consider security at each stage of the lifecycle. Data lifecycles present challenges in that they are often very context specific and there are many to choose from. We will be looking at data lifecycles in more detail this week.

Data, Privacy and Cyber Security

As more and more organisations become data-led, or at least data collectors, the data they hold is repurposed. This isn’t going to change any time soon. The trend is to collect data, so we need to keep it secure from a variety of threats, including:

  • Theft
  • Ransomware
  • Leaks
  • Data poisoning
  • Use in future attacks against us or others

As we learned in Foundations of Cyber Security and Intro to Security Engineering, privacy is an emerging field. At the moment utility usually wins over privacy. There are a few reasons for this:

  • Privacy has no measurable immediate benefit
  • Privacy is hard to define, quantify and make a case for
  • There is often a view that “if we don’t, someone else will”

We also generate a lot of data in undertaking security: logs, metadata etc. We need to keep this safe too.

Metadata

Metadata is data about data. It could be within the data itself or stored separately in the file system. An example is a phone call. The call itself is the data. The time of the call, its duration and the phone numbers involved are metadata. Sometimes the metadata can be enough to infer real knowledge, for example, knowing that a certain person made a call to someone, for this long, at this time. Some types of metadata are:

  • Descriptive – what the data is about
  • Structural – how the data is structured (maybe a schema)
  • Preservation – integrity checks (hashes), rights management
  • Provenance – details on the data’s origin, changes
  • Admin – copyright, ownership
  • Use – how it’s been used, when it was accessed.
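
To show what a couple of these look like in practice, here is a small Python sketch that collects metadata about a file without looking at what the content means. `describe_file` is a hypothetical helper name, and mapping each value to one of the categories above is my own rough reading, not a formal standard. It uses only the standard library (`os.stat` for file system metadata, `hashlib` for an integrity hash).

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Collect a few pieces of metadata about the file at `path`."""
    info = os.stat(path)
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        # Structural: how big the data is.
        "size_bytes": info.st_size,
        # Provenance: when the content last changed.
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
        # Preservation: a hash that changes if the content changes.
        "sha256": digest,
    }


# Demo on a throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

meta = describe_file(demo_path)
os.remove(demo_path)
print(meta["size_bytes"], meta["sha256"][:12])
```

Even without reading the file's content, the size, timestamp and hash say something about it, which is the same idea as the phone call example above.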
