# Machine learning

- A concise definition of artificial intelligence would be: *the effort to automate intellectual tasks normally performed by humans.*
- Machine learning arises from this question: could a computer go beyond "what we know how to order it to perform" and learn on its own how to perform a specified task?
- In classical programming, rules and data are used to generate answers. With machine learning, data and answers are used to generate rules.
- Most machine learning algorithms perform poorly on perceptual problems (those that involve interpreting or becoming aware of something through the senses), such as seeing and hearing, even though these skills seem natural and intuitive to humans.
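
The contrast between classical programming and machine learning can be sketched with a toy example. The temperature data, the 25-degree rule, and the function names below are all hypothetical and chosen only for illustration; the "learning" step is a brute-force threshold search, not any particular library's algorithm.

```python
# Classical programming: a human writes the rule; data goes in, answers come out.
def is_hot_rule(temp_c):
    return temp_c >= 25.0  # hand-written threshold (hypothetical)


# Machine learning: data and answers go in; a rule (here, a threshold) comes out.
def learn_threshold(temps, answers):
    """Return the candidate threshold that classifies the most examples correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(temps):
        correct = sum((t >= candidate) == a for t, a in zip(temps, answers))
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


# Toy training data (made up): temperatures and whether each was labelled "hot".
temps = [10.0, 18.0, 22.0, 27.0, 31.0, 35.0]
answers = [False, False, False, True, True, True]

threshold = learn_threshold(temps, answers)


def is_hot_learned(temp_c):
    # The rule was derived from the data rather than written by hand.
    return temp_c >= threshold
```

In the first function the rule is fixed by the programmer; in the second, the rule is whatever best explains the labelled examples, so different data would yield a different rule.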

# Brief history

Probabilistic modeling (Naive Bayes) -> logistic regression -> early neural networks (backpropagation) -> kernel methods -> decision trees, random forests, and gradient boosting (a way to improve models by iteratively training new models that specialise in addressing the weak points of the previous models) -> deep learning
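
The idea behind gradient boosting mentioned above (each new model targets the weak points of the previous ones) can be sketched for squared-error regression, where the "weak points" are simply the current residuals. This is a minimal illustration under stated assumptions: one-dimensional inputs, decision stumps as the weak models, and hypothetical names (`fit_stump`, `gradient_boost`) and toy data. It is not how any production library implements boosting.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump; return (threshold, left_mean, right_mean)."""
    best = None
    # Skip the smallest x so that both sides of every split are non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Residuals are where the current ensemble is still wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (made up): a non-linear relationship between x and y.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 6.0, 6.2]

model = gradient_boost(xs, ys)
baseline_mse = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mean_squared_error(ys, [predict(model, x) for x in xs])
```

Each round shrinks the training error a little; the learning rate controls how much of each new stump's correction is applied.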

# Glossary

- Features = independent variables = predictor variables
- Target variable = dependent variable = response variable

Definition of some machine learning terms from the review "Machine learning applications in genetics and genomics":

- **Machine learning**: A field concerned with the development and application of computer algorithms that improve with experience.
- **Artificial intelligence**: A field concerned with the development of computer algorithms that replicate human skills, including learning, visual perception and natural language understanding.
- **Heterogeneous data sets**: A collection of data sets from multiple sources or experimental methodologies. Artefactual differences between data sets can confound analysis.
- **Likelihood**: The probability of a data set given a particular model.
- **Label**: The target of a prediction task. In classification, the label is discrete (for example, ‘expressed’ or ‘not expressed’); in regression, the label is real-valued (for example, a gene expression value).
- **Examples**: Data instances used in a machine learning task.
- **Supervised learning**: Machine learning based on an algorithm that is trained on labelled examples and used to predict the label of unlabelled examples.
- **Unsupervised learning**: Machine learning based on an algorithm that does not require labels, such as a clustering algorithm.
- **Semi-supervised learning**: A machine-learning method that requires labels but that also makes use of unlabelled examples.
- **Prediction accuracy**: The fraction of predictions that are correct. It is calculated by dividing the number of correct predictions by the total number of predictions.
- **Generative models**: Machine learning models that build a full model of the distribution of features.
- **Discriminative models**: Machine learning approaches that model only the distribution of a label when given the features.
- **Features**: Single measurements or descriptors of examples used in a machine learning task.
- **Probabilistic framework**: A machine learning approach based on a probability distribution over the labels and features.
- **Missing data**: An experimental condition in which some features are available for some, but not all, examples.
- **Feature selection**: The process of choosing a smaller set of features from a larger set, either before applying a machine learning method or as part of training.
- **Input space**: A set of features chosen to be used as input for a machine learning method.
- **Uniform prior**: A prior distribution for a Bayesian model that assigns equal probabilities to all models.
- **Dirichlet mixture priors**: Prior distributions for a Bayesian model over the relative frequencies of, for example, amino acids.
- **Kernel methods**: A class of machine learning methods (for example, support vector machine) that use a type of similarity measure (called a kernel) between feature vectors.
- **Bayesian network**: A representation of a probability distribution that specifies the structure of dependencies between variables as a network.
- **Curse of dimensionality**: The observation that analysis can sometimes become more difficult as the number of features increases, particularly because overfitting becomes more likely.
- **Overfitting**: A common pitfall in machine learning analysis that occurs when a complex model is trained on too few data points and becomes specific to the training data, resulting in poor performance on other data.
- **Label skew**: A phenomenon in which two labels in a supervised learning problem are present at different frequencies.
- **Sensitivity**: (Also known as recall). The fraction of positive examples identified; it is given by the number of positive predictions that are correct divided by the total number of positive examples.
- **Precision**: The fraction of positive predictions that are correct; it is given by the number of positive predictions that are correct divided by the total number of positive predictions.
- **Precision-recall curve**: For a binary classifier applied to a given data set, a curve that plots precision (y axis) versus recall (x axis) for a variety of classification thresholds.
- **Marginalization**: A method for handling missing data points by summing over all possibilities for that random variable in the model.
- **Transitive relationships**: An observed correlation between two features that is caused by direct relationships between these two features and a third feature.
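
The definitions of prediction accuracy, sensitivity (recall), and precision given above translate directly into a few lines of code. The sketch below assumes binary labels encoded as 1 (positive) and 0 (negative); the function names and the toy label vectors are hypothetical and not taken from the review.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    # Correct predictions divided by all predictions.
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def precision(y_true, y_pred):
    # Correct positive predictions divided by all positive predictions.
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    # Correct positive predictions divided by all positive examples.
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels (made up): 3 positive examples, 5 negative examples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)
rec = recall(y_true, y_pred)
```

Under label skew, accuracy alone can be misleading: a classifier that always predicts the majority label scores high accuracy while its recall for the minority label is zero, which is one reason precision and recall are reported alongside it.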

# Clustering

- https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
- https://en.wikipedia.org/wiki/Akaike_information_criterion
- https://en.wikipedia.org/wiki/Bayesian_information_criterion
- https://en.wikipedia.org/wiki/Deviance_information_criterion
- https://en.wikipedia.org/wiki/Rate%E2%80%93distortion_theory
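
Two of the criteria linked above have simple closed forms: AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of fitted parameters, n is the number of data points, and L is the maximised likelihood. A minimal sketch of using them to choose a number of clusters follows. The log-likelihoods and parameter counts are invented for illustration; in practice they would come from fitting a model (for example, a mixture model) once per candidate cluster count.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


# Hypothetical fits: number of clusters -> (maximised log-likelihood, parameter count).
n_samples = 200
fits = {
    1: (-520.0, 2),
    2: (-450.0, 5),
    3: (-445.0, 8),
    4: (-443.0, 11),
}

aic_scores = {k: aic(ll, p) for k, (ll, p) in fits.items()}
bic_scores = {k: bic(ll, p, n_samples) for k, (ll, p) in fits.items()}

# Select the cluster count that minimises each criterion.
best_k_aic = min(aic_scores, key=aic_scores.get)
best_k_bic = min(bic_scores, key=bic_scores.get)
```

With these made-up numbers AIC selects three clusters and BIC selects two, which reflects BIC's heavier penalty on extra parameters once n is larger than about 7 (where ln(n) exceeds 2).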