Below I rattle through a load of Data Science buzzwords (not a deep techie explanation of Data Science) and give short definitions with a finance/accounting example where appropriate.
Here I have taken a number of buzzwords and attempted to group them by the process stage in which you will encounter them. Note: this is not a linear process, and you will move back and forth between stages to build accurate results.
Feature selection: process to decide which features of the observations will be used to predict the outcome variable. This process can be automated — for example, using a step-wise method or through a lasso regression — or manual, with the practitioner’s domain expertise playing an important role in considering which features to include.
Machine learning: set of methods whose aim is to automate human tasks so that they can be performed at scale. These tasks generally involve classifying observations into categories (sentiment, topic, purchase…) or predicting a continuous outcome variable (price, risk, …).
Topic model: common type of unsupervised learning method applied to textual data. It assumes that documents are a mixture of topics, and topics are a mixture of words. It produces a set of probabilities that each document is associated with each topic. The output can be used to identify the general themes or issues present in a corpus of documents.
Classifier: a statistical method used to extrapolate from the training set to the test set in a supervised machine learning problem. It identifies how values of the relevant features predict the value of the outcome variable.
Features: characteristics of a unit that will be used in a machine learning method in order to predict the value of the outcome variable.
Outcome variable: characteristic of a unit that a machine learning method will try to predict.
Training set: set of observations that have been manually annotated or classified by humans. A supervised machine learning method will use this data to identify how different features can help predict an outcome variable.
Test (or validation) set: randomly selected set of observations from the same population as the training set that is left out from the training stage. It is used to evaluate the quality of the classifier in unseen units, offering an approximation of out-of-sample performance.
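To make the training/test split concrete, here is a minimal sketch using only the Python standard library. The invoice data, the `train_test_split` helper and the 25% hold-out fraction are all made up for illustration; they are not taken from any particular library.

```python
import random

def train_test_split(observations, test_fraction=0.25, seed=0):
    """Randomly hold out a fraction of the observations as a test set."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(observations)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows are held out; the rest are for training
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled invoices: (amount, flagged_as_fraud_by_a_human)
invoices = [(100 + 10 * i, i % 5 == 0) for i in range(20)]
train, test = train_test_split(invoices)
print(len(train), "training rows;", len(test), "test rows")
```

The classifier only ever sees `train`; `test` is kept aside until the end to approximate out-of-sample performance.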
Supervised machine learning: type of machine learning method that automates human decisions by identifying the features of a specific unit that make it more likely to be associated with a given outcome category. It requires a training set.
Unsupervised machine learning: type of machine learning method that automatically detects and classifies the relevant categories within a given sample of units. It does not require a training set.
Regularized regression: common type of classifier. It builds upon the standard (linear or logistic) regression model but adds a penalty term to reduce the size of the coefficients. This helps identify the model and improves its out-of-sample performance by reducing overfitting (see below).
Lasso regression: a popular type of regularized regression that uses the sum of the absolute values of the coefficients in the regression as the penalty term. One of its distinctive properties is that features that are not predictive are assigned a coefficient of exactly zero; that is, they are excluded from the estimation automatically. This is equivalent to a feature selection process: from all the possible features in the model, the Lasso regression will only keep those that improve its predictive accuracy.
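The way a lasso penalty pushes weak coefficients to exactly zero can be sketched in a few lines. This is a toy coordinate-descent version written for illustration, not how any particular library implements it. It assumes the features and outcome are already centred (so there is no intercept) and minimises (1/2n) × the sum of squared errors plus `alpha` × the sum of absolute coefficients. The data and the names `lasso` and `soft_threshold` are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0.0 if it gets there."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso(X, y, alpha, n_iter=200):
    """Toy lasso via coordinate descent (no intercept, centred data assumed)."""
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * coefs[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            coefs[j] = soft_threshold(rho, alpha) / scale
    return coefs

# Hypothetical data: the outcome is driven by feature 0; feature 1 is mostly noise
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2.2, 1.8, -1.8, -2.2]
coefs = lasso(X, y, alpha=0.5)
print(coefs)
```

With this made-up data, an ordinary regression would give the second feature a small non-zero coefficient, whereas the penalty sets it to zero and shrinks the first one.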
Principal components analysis (PCA): unsupervised learning method that reduces the dimensionality of a feature matrix. Its goal is to combine multiple features into a smaller set of principal components or indices that explain as much of the variation across all the original features as possible. It works best when features are highly correlated (e.g. stock market prices).
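For just two features, the first principal component can be worked out by hand from the covariance matrix, which makes the idea easy to see. The sketch below assumes exactly two features and uses the closed-form eigenvalues of a 2×2 symmetric matrix; real applications would use a linear algebra library. The price series and the function name `pca_two_features` are invented for illustration.

```python
import math

def pca_two_features(xs, ys):
    """Return (direction, explained_share) of the first principal component."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample covariance matrix [[a, b], [b, c]] of the two features
    a = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    c = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix (closed form)
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Matching eigenvector, scaled to unit length
    if b != 0:
        vx, vy = b, largest - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    length = math.hypot(vx, vy)
    # Total variance is a + c, so this ratio is the share explained
    return (vx / length, vy / length), largest / (a + c)

# Hypothetical daily prices of two stocks that move closely together
price_a = [1.0, 2.0, 3.0, 4.0, 5.0]
price_b = [1.1, 1.9, 3.2, 3.8, 5.0]
direction, share = pca_two_features(price_a, price_b)
print("first component direction:", direction)
print("share of variance explained:", round(share, 4))
```

Because the two made-up series are so highly correlated, a single component captures almost all of the variation, which is the situation where PCA is most useful.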
K-means clustering: unsupervised learning method that partitions the observations in a dataset into groups of similar units, assigning each observation to the group whose centre (mean) is closest. This method is scalable and simple, but requires setting the number of groups in advance. The output can be used to create new features within a dataset or as a descriptive tool, for example in applications of consumer segmentation.
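A bare-bones version of k-means on a single feature shows the two repeating steps: assign each observation to the nearest centre, then move each centre to the mean of its group. This is a sketch under simplifying assumptions: one feature only, and a deterministic starting point picked from the sorted data (random starts are more usual). The customer spend figures and the name `k_means_1d` are hypothetical.

```python
def k_means_1d(values, k, n_iter=100):
    """Toy k-means for a single numeric feature."""
    ordered = sorted(values)
    # Assumed starting centres: evenly spaced picks from the sorted data
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        # Step 1: assign every value to its nearest centre
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            clusters[nearest].append(v)
        # Step 2: move each centre to the mean of its cluster
        new_centres = [
            sum(members) / len(members) if members else centres[i]
            for i, members in enumerate(clusters)
        ]
        if new_centres == centres:  # nothing moved, so we have converged
            break
        centres = new_centres
    return centres, clusters

# Hypothetical monthly spend per customer, to be split into three segments
spend = [12, 15, 14, 11, 95, 102, 98, 310, 295, 305]
centres, clusters = k_means_1d(spend, k=3)
for centre, members in zip(centres, clusters):
    print(round(centre, 1), members)
```

Note that the number of segments (`k=3`) had to be chosen up front, which is the main practical limitation mentioned above.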
In-sample performance: it measures the quality of the predictions generated by the classifier within the training set. It can be very high when there is overfitting. It may not be indicative of the overall performance of the classifier.
Out-of-sample performance: it measures how well the classifier does at predicting the outcome variable for the entire population of units. This is the quantity to maximize when building a supervised learning classifier. It can be difficult to estimate.
Overfitting: problem that arises when a classifier has very high in-sample performance but low out-of-sample performance, which is generally due to an emphasis on identifying patterns that are specific to the training set that may not generalize to the entire population of units.
RMSE (root mean squared error): metric used to evaluate classifiers with continuous outcome variables. It is the square root of the average squared difference between predicted and true values, so it approximates how far off the classifier's predictions typically are, in the same units as the outcome variable.
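RMSE is simple enough to compute by hand: square each prediction error, average the squares, and take the square root. A minimal sketch follows; the revenue figures are invented for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between true and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical quarterly revenue (in thousands) versus a model's forecast
actual_revenue = [100, 200, 300, 400]
forecast_revenue = [110, 190, 320, 380]
error = rmse(actual_revenue, forecast_revenue)
print(round(error, 2))
```

Because the errors are squared before averaging, a few large misses increase RMSE more than many small ones.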
Confusion matrix: a table (2×2 with a binary outcome variable) that compares the predictions from the supervised learning classifier with the labels from human annotators in the test set. Predictive accuracy will be high when most of the observations fall on the diagonal. It can be used to compute accuracy, precision, recall, etc.
Accuracy: metric used to evaluate classifiers with categorical outcome variables. It is computed as the proportion of units that are correctly classified, that is, those for which the human label and the machine prediction agree.
Precision: metric used to evaluate classifiers with categorical outcome variables. Unlike accuracy, this metric is the ratio of units correctly classified into a given category to all units predicted to be in that category.
Recall: metric used to evaluate classifiers with categorical outcome variables. Unlike accuracy, this metric is the ratio of units correctly classified into a given category to all units known (because of human annotation) to be in that category. There is generally a trade-off between precision and recall.
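The confusion matrix and the three metrics above fit together as follows. This sketch counts the four cells of a 2×2 confusion matrix for a binary outcome and derives accuracy, precision and recall from them. The expense-claim labels are made up, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (true_pos, false_pos, false_neg, true_neg) for a binary outcome."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical expense claims: 1 = suspicious, 0 = fine
human_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(human_labels, model_labels)
accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all claims classified correctly
precision = tp / (tp + fp)                  # of claims predicted suspicious, share truly suspicious
recall = tp / (tp + fn)                     # of truly suspicious claims, share the model caught
print(accuracy, precision, recall)
```

In this toy example the diagonal cells (`tp` and `tn`) hold 8 of the 10 claims, so accuracy is 0.8, while precision and recall are both 0.75.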
Cross-validation: procedure used to identify the best parameters of a classifier using only a training set. It consists of splitting the training set into a number of folds (random subsamples) and, for each candidate set of parameters, building one classifier per fold: each is trained on all the other folds and evaluated on the fold that was left out. The best parameters are then selected based on average performance across all folds.
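Here is a small sketch of k-fold cross-validation used to choose a penalty for a one-feature regularized regression. Everything in it is illustrative: the data, the candidate penalties, and the helper names `k_fold_indices`, `fit_ridge_1d` and `cv_error` are assumptions, and the folds are built by simple striding, which presumes the rows are already in random order.

```python
def k_fold_indices(n, k):
    """Split row indices 0..n-1 into k folds by striding."""
    return [list(range(i, n, k)) for i in range(k)]

def fit_ridge_1d(xs, ys, penalty):
    """Slope of a no-intercept regression with a squared-coefficient penalty."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

def cv_error(xs, ys, penalty, k=5):
    """Average held-out mean squared error across k folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        # Train on every row that is NOT in the held-out fold
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        slope = fit_ridge_1d(train_x, train_y, penalty)
        # Evaluate on the held-out fold only
        mse = sum((ys[i] - slope * xs[i]) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)

# Hypothetical training data: the outcome is roughly twice the feature
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

candidates = [0.0, 1.0, 10.0]
best_penalty = min(candidates, key=lambda p: cv_error(xs, ys, p))
print("penalty with the lowest cross-validated error:", best_penalty)
```

The test set is never touched here: the choice of penalty is made entirely within the training data, and the test set remains available for a final, unbiased check.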
Semantic validity: metric used to evaluate the quality of a topic model. High semantic validity is determined by whether the emerging topics from the model are semantically coherent (all words have similar or related meanings) and the topics are mutually exclusive (each topic refers to a different issue or theme).
Predictive validity: metric used to evaluate the quality of a topic model. A model will have high predictive validity when variation in topic usage (e.g. over time) is correlated with other variables (e.g. events) in an expected way.
Data visualisation: visual representation of model data that clearly explains the outcomes of the data science work. It normally takes the form of a dashboard of charts built to agreed standards of functionality and usability.