Data science sits at the core of any analytical exercise conducted in a big data or Internet of Things (IoT) environment. It draws on a wide array of technologies, business knowledge, and machine-learning algorithms. The purpose of data science is not only to perform machine learning or statistical analysis, but also to derive insights from the data that a user with no statistics background can understand.
In a fast-paced environment such as big data and IoT, where the nature of the data can vary over time, it becomes difficult to maintain and re-create models again and again. This gap calls for an automated way to manage data-science algorithms in those environments. The rise of data science was intended to move us away from rules-based systems toward systems in which a machine learns the rules by itself. Machine learning therefore makes data science partially automated by nature. The other half of data science, the part that still requires manual intervention, has yet to be automated. Those areas draw on the experience and wisdom of people: a data scientist, a business expert, a software developer, a data integrator, everyone who currently contributes to making a data-science project operational. This makes it difficult to automate every aspect of data science. However, we can think of data-science automation as a two-level architecture, wherein:
– Different data science disciplines/components are automated
– All the individual automated components are interconnected to form a coherent data-science system
We can think of a data-science system as automated when it is capable of solving our problem whenever we throw a data set at it. It should also be intelligent enough to present all possible solutions in a language we can understand.
Data preparation, machine learning, domain knowledge, and result interpretation are four major tasks required to execute a data-science project successfully. All these tasks have to be converted to automated modules to create an automated data-science system (Figure 1).
Data Preparation Automation
Data preparation is a repetitive task that has to be done every time models are created. Data extraction, data cleaning, and data transformations such as imputing null values and algorithm-specific transformations are some tasks that fall into this category. Many organizations automate these tasks and brand the resulting engine as a data-science automation tool. However, most of these tools use rule-based logic to automate data-preprocessing tasks. Is this the right approach? Do we need rule-based systems to automate data science, a discipline that was born to end rule-based systems? No. We need data preprocessing automated by machine learning itself. For example, the decision about which preprocessing function to apply to the data for a given problem should be made by the machine itself.
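As a minimal sketch of this idea, the snippet below lets the machine, rather than a hand-written rule, choose an imputation strategy: each candidate strategy is scored by cross-validated model accuracy on the downstream task, and the best-scoring one wins. (The data set, model, and candidate strategies are illustrative choices, not prescriptions from this article.)

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X = X.copy()
X[rng.random(X.shape) < 0.1] = np.nan  # knock out ~10% of values

# Instead of a fixed rule ("always impute with the mean"), let the data decide:
# score each candidate preprocessing step by downstream model accuracy.
best_strategy, best_score = None, -np.inf
for strategy in ("mean", "median", "most_frequent"):
    pipe = make_pipeline(SimpleImputer(strategy=strategy),
                         LogisticRegression(max_iter=1000))
    score = cross_val_score(pipe, X, y, cv=5).mean()
    if score > best_score:
        best_strategy, best_score = strategy, score

print(f"chosen imputation: {best_strategy} (accuracy {best_score:.3f})")
```

The same pattern generalizes to any preprocessing decision: wrap each candidate transformation in a pipeline and let cross-validated performance pick the winner.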
Feature engineering is another area of data preparation that requires automation. Feature engineering is a technique to convert raw data into attributes/predictors that improve the accuracy of a machine-learning project. Feature-engineering automation is still at a nascent stage and an active area of research. Data scientists from MIT are making incredible progress toward developing a “deep feature synthesis” algorithm capable of generating features from raw data.
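The essence of deep feature synthesis is to walk the relationships between raw tables and stack aggregation primitives to generate candidate features automatically. A hand-rolled pandas sketch of that idea, using made-up `customers` and `orders` tables (the table names and columns are illustrative assumptions, not from the article):

```python
import pandas as pd

# Two related raw tables: one parent (customers), one child (orders).
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [20.0, 35.0, 10.0, 15.0, 5.0, 50.0],
})

# Apply aggregation primitives along the customer -> orders relationship to
# synthesize features for each customer from its raw child rows.
feats = orders.groupby("customer_id")["amount"].agg(["count", "mean", "max"])
feats.columns = [f"orders.amount.{c}" for c in feats.columns]
features = customers.merge(feats, on="customer_id", how="left")
print(features)
```

A real deep-feature-synthesis implementation automates exactly this walk across many tables, relationships, and primitives at once, instead of a data scientist writing each aggregation by hand.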
Automated Machine Learning/Statistician
This is the area of data-science automation in which statistical routines are automated. The system executes the best algorithm based on the provided data set. It hides the intricacies and mathematical complexity of the algorithms from the user, making data science available to the masses. The user only needs to supply the automated statistician with data; it understands the data, builds different mathematical models, and returns results based on the model that best explains the data. Building an automated statistician is a complex undertaking, as the system must learn the patterns in the input data, find best-fit values, and self-optimize its parameters using several statistical and machine-learning algorithms. This requires generalizing the constraints of various algorithms, as well as enormous computing power.
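At its simplest, an automated statistician can be sketched as a loop that fits several candidate models on the same data set and reports whichever best explains it under cross-validation. (The candidate models and data set below are illustrative choices.)

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate model families the "automated statistician" knows about.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Score each candidate by 5-fold cross-validation and keep the best.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(f"best model: {best} ({scores[best]:.3f} accuracy)")
```

A production system adds far more on top of this loop (data-type detection, model search, parameter optimization), but the select-by-evidence core is the same.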
Automated machine learning is gradually maturing by leveraging cloud-based servers to meet the requirement for high computational power. Organizations creating data products are progressively including features such as meta learning, a process of automatically selecting a suitable machine-learning algorithm based on the metadata of the data set. Organizations like H2O.ai are generalizing the model-building process by introducing several built-in functionalities and providing many model-tuning options that deliver greater control over the algorithms. They have also introduced hyperparameter tuning as a feature in almost all of their algorithms to free data scientists from the cumbersome process of testing models with different parameters. Hyperparameter tuning automates the trial and error of re-running machine-learning models several times to find appropriate parameters for a model on a given data set.
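That trial-and-error loop can be automated in a few lines. As a sketch, scikit-learn's GridSearchCV refits a model for every combination in a parameter grid and keeps the best one (the model and grid here are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The parameter grid replaces manual re-runs: every combination is tried,
# each scored by 5-fold cross-validation.
grid = {"max_depth": [2, 3, 4, None], "min_samples_leaf": [1, 2, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

More sophisticated tuners replace the exhaustive grid with random or Bayesian search, but the contract is the same: the user declares the search space, and the system finds the parameters.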
Insights Generation Automation
The results of a data-science project are not useful unless a business user, or an audience with no statistics background, can comprehend them. The cream of a data-science activity is the storytelling, where a data scientist explains the results to people in a clear and accessible manner. Automating this task requires generating user-friendly text automatically from statistician-friendly results. Natural Language Generation (NLG) is the current frontrunner among approaches that can translate machine output into natural language. Nlgserv and simplenlg are two NLG frameworks we can use for this task. We can also use Markov chains to generate sentences and assemble stories automatically.
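A bare-bones illustration of the idea: translate "statistician-friendly" model output into a plain-English sentence with a template. Real NLG frameworks such as simplenlg build grammatically aware text; the `narrate` function below is only a hypothetical stub to show the shape of the task.

```python
def narrate(model_name, accuracy, top_feature):
    """Turn raw model metrics into a sentence a business user can read."""
    quality = ("excellent" if accuracy >= 0.9
               else "good" if accuracy >= 0.75
               else "modest")
    return (f"The {model_name} model showed {quality} performance, "
            f"correctly classifying {accuracy:.0%} of cases. "
            f"The strongest driver of the prediction was '{top_feature}'.")

# Example: the metrics here are made up for illustration.
print(narrate("random forest", 0.92, "petal length"))
```

Template-based generation like this covers many reporting needs; Markov-chain or neural generators take over when the stories need more variety than fixed templates allow.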
Innovation in data-science automation has begun and will evolve gradually in the years to come. We are currently at a stage where we have begun to tackle automation of individual data-science modules. From here, we need to move toward a more generic data-science platform, with all modules automated and integrated. This is how change starts, just as room-sized computers were transformed into credit-card-sized ones.
Sibanjan Das is a Business Analytics and Data Science consultant. He has over six years of experience in the IT industry, working on ERP systems and implementing predictive-analytics solutions in business systems and the Internet of Things. An enthusiastic professional who is passionate about technology and innovation, he has had a passion for wrangling data since the early days of his career. He also enjoys reading, writing, and networking. His writing has appeared in various analytics magazines, and Klout has rated him among the top 2% of professionals in the world talking about Artificial Intelligence, Machine Learning, Data Science, and the Internet of Things.
Sibanjan holds a Master of IT degree with a major in Business Analytics from Singapore Management University and is a Computer Science Engineering graduate of the Institute of Technical Education and Research, India. He is a Six Sigma Green Belt from the Institute of Industrial Engineers and also holds several industry certifications, such as OCA, OCP, CSCMS, and ITIL V3. Follow him on Twitter: @sibanjandas.
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise, plus get instant access to more than 20 eBooks.