This is Part II of my article that distils the key terminology of data science down into simple explanations (see Part I here). Here are some more topics everyone who wants to learn about data science should have a general understanding of.
Dataset
The entire collection of data that will be used in a particular data science initiative. In modern, complex big data projects, it can involve many types of data gathered from different sources.
Data Scientist
A person who applies the scientific method of observing, recording, analyzing, and reporting results to understand information and use that information to solve problems. You can read more about the skills of a data scientist here.
Democratization of Data Science
The idea that data science tools and techniques are increasingly accessible to a growing number of people, rather than to only those in academia or industry with access to large budgets. See also, Citizen Data Scientist in Part I.
Decision Tree
A basic decision-making structure that can be used by a computer to understand and classify information. Each data item fed into the decision tree is put through a series of questions, and its answers channel it along different branches toward different outcomes, typically resulting in a label or classification for that piece of data.
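As a rough sketch of the idea in Python (the fruit features and thresholds here are invented for illustration), a decision tree is just a series of questions, each answer routing the item down a branch:

```python
# A toy decision tree that classifies a fruit from two made-up features.
def classify_fruit(weight_g, color):
    """Walk the tree: each question routes the item down a branch."""
    if weight_g > 120:              # first question: is it a heavy fruit?
        if color == "orange":
            return "orange"
        return "apple"
    else:                           # lighter fruit take the other branch
        if color == "yellow":
            return "lemon"
        return "plum"

print(classify_fruit(150, "orange"))  # -> orange
print(classify_fruit(80, "yellow"))   # -> lemon
```

In real systems the questions and thresholds are learned from training data rather than written by hand, but the branching structure is the same.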
Dimensions
Data can be stored in a database that has one dimension (a list) or two dimensions (a grid of rows and columns). It can also be stored in multi-dimensional databases, which can take the form of a grid with three axes, or even more complex arrangements that cannot be pictured as familiar geometric objects but can still be handled thanks to modern processing power. More complex dimensional structures typically allow more connections to be observed between the data objects being analyzed.
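A minimal illustration of a three-axis structure, using hypothetical sales figures keyed by region, product, and month:

```python
# Each value sits at a point along three axes: region, product, month.
sales = {
    ("north", "widget", "jan"): 120,
    ("north", "gadget", "jan"): 85,
    ("south", "widget", "feb"): 140,
}

# Slicing along one axis reveals connections the flat list would hide:
# here, all January sales regardless of region or product.
jan = {k: v for k, v in sales.items() if k[2] == "jan"}
print(jan)
```

Adding a fourth axis (say, sales channel) is just a longer key; the structure no longer corresponds to any shape we can picture, but the computer handles it the same way.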
In-Memory Database
A database that is held in a computer’s RAM, where it can be accessed and operated on far more quickly than if the data were read from a disk each time it is needed. This was very difficult to do with large data sets in the past but has become possible in recent years due to the increase in the size of available memory and the fall in the cost of physical RAM chips.
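Python’s built-in SQLite module gives a small taste of the idea: passing `":memory:"` creates a fully functional database that lives entirely in RAM and never touches the disk.

```python
import sqlite3

# ":memory:" tells SQLite to hold the whole database in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("a", 1.5), ("b", 2.5), ("a", 3.0)])

# Queries run against memory, with no disk reads involved.
total = conn.execute("SELECT SUM(value) FROM readings").fetchone()[0]
print(total)  # 7.0
conn.close()
```

The trade-off is that the data disappears when the process ends, which is why in-memory databases in production are paired with persistence or replication strategies.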
Metadata
Data about data, or data attached to other data. For example, metadata for an image file would be information about its size, when it was created, what camera was used to take it, or which version of a software package it was created in.
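File-system metadata is an everyday example: every file carries information about itself alongside its contents. A quick sketch in Python:

```python
import os
import tempfile

# Write a small file, then read back its metadata from the file system.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"hello metadata")
    path = f.name

info = os.stat(path)
# The file's *content* is the 14 bytes we wrote; its *metadata* includes
# the size and the last-modification timestamp.
print(info.st_size, info.st_mtime)
os.remove(path)
```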
Outlier
A variable whose value is very different from what would be expected given the other values in the dataset. Outliers can be indicators of rare or unexpected events, or of unreliable data.
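One common sketch of outlier detection flags any value that sits more than a chosen number of standard deviations from the mean (the sensor readings and threshold here are invented):

```python
from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Six ordinary readings and one anomaly.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 55.0]
print(find_outliers(readings))  # -> [55.0]
```

Whether 55.0 represents a sensor fault (unreliable data) or a genuine rare event is a judgment the data scientist still has to make.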
Predictive Modelling
Using data to predict the future. Rather than a crystal ball or tea leaves, data scientists use probability and statistics to determine what is most likely to happen next. The more data that is available from past events, the more likely that algorithms can make a prediction with a high probability of proving correct. Predictive modelling involves running a large number of simulated events in order to determine the variables most likely to produce a desired outcome.
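As a toy illustration of the simulation idea, here is a Monte Carlo estimate of a probability we can check by hand: the chance that two dice sum to 7. Running many simulated events converges on the true likelihood.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

def simulate(trials=100_000):
    """Estimate P(two dice sum to 7) by running many simulated rolls."""
    hits = sum(random.randint(1, 6) + random.randint(1, 6) == 7
               for _ in range(trials))
    return hits / trials

estimate = simulate()
print(round(estimate, 3))  # close to the true probability 6/36 ≈ 0.167
```

Real predictive models replace the dice with models fitted to historical data, but the principle is the same: simulate many possible futures and see which outcomes dominate.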
Python
Python is a programming language that has become very popular with data scientists in recent years due to its relative ease of use and the sophisticated ways it can be used to work with large, fast-moving datasets. Its open source nature (anyone can add to it or change it) means its capabilities are constantly being expanded, and new resources are becoming available.
Quartile
A set of values that has been sorted and then divided into a number of equal-sized groups. The groups are called “quartiles” if there are four of them, “quintiles” if there are five, and so on. Depending on convention, the “first quartile” can refer either to the lowest quarter of values (the usual statistical convention) or, in ranked lists, to the top quarter of entries.
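In Python, the standard library can compute the three cut points that split a dataset into four quartiles (the exam scores below are made up):

```python
from statistics import quantiles

scores = [12, 31, 44, 50, 58, 63, 70, 76, 82, 90, 95, 99]

# quantiles() with n=4 returns the three cut points between the quartiles;
# the middle one is the median.
q1, q2, q3 = quantiles(scores, n=4)
print(q1, q2, q3)
```

Setting `n=5` instead would return the four cut points between quintiles, and so on for any number of groups.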
R
R is another programming language; it has been around for longer than Python and was traditionally the choice of statisticians working with large data sets. Although Python is quickly gaining in popularity, R is still heavily used by data scientists and is commonly taught in data science courses at universities.
Random Forest
A random forest is a method of statistical analysis that takes the output of a large number of decision trees (see above) and analyzes them together to provide a more complex and detailed understanding or classification of data than would be possible with just one tree. As with decision trees, this is a technique that has been around in statistics for a long time, but modern computers allow for far more complex trees and forests, leading to more accurate predictions.
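A stripped-down sketch of the idea in pure Python, using one-split “stumps” in place of full trees (the data and voting scheme are deliberately simplified; real implementations such as scikit-learn’s are far more sophisticated):

```python
import random
from collections import Counter

# Toy data: a single feature, labelled 1 when the feature exceeds 50.
data = [(x, int(x > 50)) for x in range(100)]

def train_stump(sample):
    """A one-split 'tree': pick the threshold that best separates the sample."""
    best_t, best_acc = None, -1.0
    for t, _ in sample:
        acc = sum((x > t) == (y == 1) for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def random_forest(data, n_trees=25):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    return [train_stump(random.choices(data, k=len(data)))
            for _ in range(n_trees)]

def predict(forest, x):
    """Majority vote across all the trees in the forest."""
    votes = Counter(int(x > t) for t in forest)
    return votes.most_common(1)[0][0]

random.seed(0)
forest = random_forest(data)
print(predict(forest, 80), predict(forest, 10))
```

Because each tree sees a slightly different bootstrap sample, the trees disagree in different ways, and the majority vote averages out their individual mistakes.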
Standard Deviation
A common calculation in data science used to measure how far removed a variable, statistic, or measurement is from the average. This can be used to determine how closely a piece of data fits to the norm of whatever it represents (speed of movement, temperature of a piece of machinery, population size of a developed area) and allows inferences to be made on why it differs from the norm.
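The sample standard deviation can be written out by hand and checked against Python’s standard library (the speed readings below are invented):

```python
import math
from statistics import stdev

# Hypothetical speed-of-movement readings.
speeds = [12.0, 14.5, 13.2, 15.1, 12.8]

mu = sum(speeds) / len(speeds)
# Sample standard deviation: square root of the average squared
# deviation from the mean, using an (n - 1) denominator.
sigma = math.sqrt(sum((v - mu) ** 2 for v in speeds) / (len(speeds) - 1))

print(sigma, stdev(speeds))  # the two calculations agree
```

A reading roughly one sigma from the mean is unremarkable; one that sits several sigmas away is exactly the kind of value the outlier entry above describes.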
Don’t forget to check out Part I of this article to see explanations for other key data science concepts and terminologies.
Bernard Marr is an internationally best-selling business author, keynote speaker and strategic advisor to companies and governments. He is one of the world’s most highly respected voices anywhere when it comes to data in business and has been recognized by LinkedIn as one of the world’s top 5 business influencers. In addition, he is a member of the Data Informed Board of Advisers. You can join Bernard’s network simply by clicking here or follow him on Twitter @bernardmarr
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise.