PHY 421: Data Science for Physical Sciences

3 credits | Prerequisites: CSC 101+L, MAT 212

Course rationale

This is an elective course designed for students majoring in physics, mathematics, engineering, or computer science. It gives an overview of data science as practiced in various branches of the physical sciences. Tools from data science and artificial intelligence are introduced and then used to extract answers from big data. Knowledge of the Python programming language is a prerequisite.

Course content

Probability theory: scientific inference, Bayes’ theorem, probability and frequency, marginalization, fundamentals of logic, logic functions. Bayesian inference: parameter estimation, nuisance parameters, Occam’s razor, ignorance priors, systematic errors, spectral analysis, Bayesian inference with Poisson sampling. Frequentist statistical inference: sampling theory, probability distributions, discrete and continuous distributions, central limit theorem, chi-squared distribution, sample variance, Student’s t distribution, confidence intervals. Hypothesis testing: testing with the chi-squared statistic, goodness-of-fit test, problems with frequentist hypothesis testing and their Bayesian resolution. Model fitting: Bayesian inference with Gaussian errors, linear model fitting, regression analysis, model parameter errors, correlated data errors; nonlinear model fitting: iterative linearization, Levenberg-Marquardt method. Markov chain Monte Carlo: Metropolis-Hastings algorithm, simulated and parallel tempering, automated MCMC. Neural networks: feed-forward network functions, network training, error backpropagation, Hessian matrix, regularization, Bayesian neural networks. Kernel methods: dual representations, building kernels, radial basis function networks, Gaussian processes. Sparse kernel machines: maximum margin classifiers, support vector machines (SVM) and relevance vector machines (RVM), RVM for regression and classification. Graphical models: Bayesian networks, conditional independence, Markov random fields, inference in graphical models. Mixture models and EM: K-means clustering, mixtures of Gaussians, maximum likelihood, expectation-maximization (EM) for Bayesian linear regression. Approximate inference: variational inference, variational mixture of Gaussians, variational linear regression, variational logistic regression, expectation propagation.

Course objectives

  1. Understand the basics of probability theory and its central role in hypothesis testing in the physical sciences.
  2. Distinguish between frequentist and Bayesian methods of inference and apply both.
  3. Comprehend the fundamentals of linear model fitting and the linearization of nonlinear models.
  4. Apply MCMC algorithms to realistic datasets to uncover meaningful answers.
  5. Explain the probabilistic basis of artificial intelligence and machine learning (ML).
  6. Understand the basics of neural networks and construct simple ML models.

References

  1. P. C. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge University Press, 2005.
  2. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
  3. E. T. Jaynes, Probability Theory: The Logic of Science, Cambridge University Press, 2003.