Data science is an approach that draws on areas such as big data and big issues, data mining, cryptocurrencies, data visualization, GDPR, digital encryption and artificial intelligence to extract new insights from data that benefit science and society. It is also a method for making rational, data-based judgements that lead to sound decisions.
In this 10-week course, students will learn about practices commonly used by data science professionals. Below is an outline of what students can expect to cover each week. As this is a highly condensed course, students may also be asked to do extra reading outside their tutorial sessions in order to practise what they have learnt and to gain a firm understanding of key concepts.
Cambridge Introduction to Data Science Online Course (5 hours)
What does a data scientist do?
Typical workflow patterns
Tech and software used
Approach to business problems
What is machine learning?
What is a model?
Supervised vs Unsupervised ML
Regression vs Classification
What the typical workflow of a data scientist is, the software and technologies used, including common approaches to business problems.
What machine learning is, and an intuitive understanding of what “a model” represents. This is followed by examples of supervised and unsupervised learning, and by the differences between regression and classification models and the use cases for each.
Variables:
independent, dependent
continuous vs categorical
Common statistics:
mean, median, mode (when to use which?)
variance, standard deviation
covariance, correlation
What variables are; the difference between independent and dependent variables; continuous, categorical and ordinal variables.
An intuitive understanding of common statistics, such as mean, median, mode (and when to use which), variance, standard deviation, covariance and correlation.
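The statistics listed above can be sketched with Python's standard library alone. The numbers below are hypothetical and chosen only for illustration: the salary list contains one deliberate outlier to show why the median is often preferred over the mean for skewed data, and the helper names `sample_covariance` and `correlation` are not part of the course material.

```python
import statistics

# Hypothetical salaries in thousands; the last value is a deliberate outlier.
salaries = [30, 32, 35, 35, 38, 40, 200]

mean_salary = statistics.mean(salaries)      # pulled upwards by the outlier
median_salary = statistics.median(salaries)  # middle value, robust to the outlier
mode_salary = statistics.mode(salaries)      # most frequent value

# Hypothetical paired data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]

var_hours = statistics.variance(hours)  # sample variance (n - 1 denominator)
std_hours = statistics.stdev(hours)     # square root of the sample variance


def sample_covariance(xs, ys):
    """Sum of paired deviations from the means, divided by n - 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance rescaled by both standard deviations; lies in [-1, 1]."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


cov = sample_covariance(hours, scores)
corr = correlation(hours, scores)

print("mean:", mean_salary, "median:", median_salary, "mode:", mode_salary)
print("variance:", var_hours, "std:", std_hours, "cov:", cov, "corr:", corr)
```

Covariance depends on the units of both variables, whereas correlation is unit-free, which is why correlation is usually the easier of the two to interpret.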
Command line interface (intro)
GitHub (intro)
Jupyter notebooks (intro and orientation)
Python
Datatypes: int, float, dict, list, str, bool, tuple, set
Loops (for, while, list comprehensions) and if/elif/else statements
An introduction to the common tech stack used by data scientists, including the command line interface, version control with Git and GitHub, and Jupyter notebooks.
An introduction to Python, including the datatypes available (e.g. int, float, dict, list, str, bool, tuple and set), the use of conditional statements (if, elif, else) and iteration using loops (for, while, and list comprehensions).
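As a minimal sketch of the Python topics above, the snippet below creates one value of each listed datatype and then uses a conditional, a for loop, a list comprehension and a while loop. All variable names and values are made up for illustration.

```python
# One value of each built-in type named in the outline.
count = 3                           # int
price = 2.5                         # float
name = "widget"                     # str
in_stock = True                     # bool
sizes = [1, 2, 3, 4, 5, 6]          # list: ordered and mutable
point = (10, 20)                    # tuple: ordered and immutable
tags = {"new", "sale", "new"}       # set: duplicates collapse to one entry
stock = {"widget": 3, "gadget": 0}  # dict: maps keys to values


def label(n):
    """Classify a number with if / elif / else."""
    if n < 3:
        return "small"
    elif n < 5:
        return "medium"
    else:
        return "large"


# for loop: visit every element in turn.
labels = []
for n in sizes:
    labels.append(label(n))

# List comprehension: build a new list in one expression, with a filter.
squares_of_evens = [n * n for n in sizes if n % 2 == 0]

# while loop: repeat until a condition stops holding.
total = 0
i = 0
while i < len(sizes) and total < 10:
    total += sizes[i]
    i += 1

print(labels)
print(squares_of_evens)
print(total, i, len(tags))
```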
Cambridge Data Science Foundations Online Course (10 hours)
Pandas
Reading in datasets
EDA: .describe(), .shape
Data cleaning techniques
Groupby, value_counts
Data Visualization
Seaborn gallery
Matplotlib
Intro to personal project: Ames Housing Dataset
Common data wrangling practices using the Pandas library (reading in data sets, EDA techniques, and data cleaning).
Common data visualization libraries such as Matplotlib and Seaborn.
Introduction to the personal project (this is a machine learning project to be completed by the individual and to be presented during their session in week 10 for feedback from the tutor).
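The Pandas steps named for this week (reading in data, quick EDA, cleaning, grouping and counting) might look like the sketch below. It assumes pandas is installed. The tiny inline CSV is hypothetical and is not the Ames Housing Dataset; its column names are invented for the example.

```python
import io

import pandas as pd

# Hypothetical data with one missing price; a real project would read a file path.
csv_text = """neighbourhood,area_sqft,price
North,1200,200000
North,1500,250000
South,900,
South,1100,180000
East,2000,400000
"""
df = pd.read_csv(io.StringIO(csv_text))

# Quick EDA. Note that shape is an attribute, so it takes no parentheses.
print(df.shape)
print(df.describe())

# Data cleaning: count the missing prices, then fill them with the median.
missing = int(df["price"].isna().sum())
df["price"] = df["price"].fillna(df["price"].median())

# Aggregation: mean price per neighbourhood, and rows per neighbourhood.
mean_price = df.groupby("neighbourhood")["price"].mean()
counts = df["neighbourhood"].value_counts()

print(mean_price)
print(counts)
```

Filling with the median is only one possible cleaning choice; dropping the row or modelling the missing value are alternatives, and which is appropriate depends on the dataset.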
Intro to scikit-learn
Regression
Simple linear regression
Interpretation of linear regression results (coefficients: gradients, intercept)
Multivariate regression
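Feature scaling: StandardScaler() (standardisation), MinMaxScaler() (min-max scaling)
Fitting criterion and performance metrics: OLS (ordinary least squares, the fitting method), RSS, TSS, R^2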
Introduction to scikit-learn and some of the regression models available.
Fitting of regression models and interpretation of their coefficient results.
Common data pre-processing techniques such as standardisation and why this is important.
Evaluation metrics of regression models.
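A minimal sketch of the regression topics above, assuming NumPy and scikit-learn are installed. The five data points are hypothetical and lie close to the line y = 3x + 2. The snippet fits a simple linear regression, reads off the gradient and intercept, computes RSS, TSS and R^2 by hand, and standardises the feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one feature, roughly y = 3x + 2 with small noise.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

model = LinearRegression()  # fitted by ordinary least squares
model.fit(X, y)
slope = model.coef_[0]        # change in y per unit change in x
intercept = model.intercept_  # predicted y when x = 0

# Performance metrics computed from their definitions.
y_pred = model.predict(X)
rss = np.sum((y - y_pred) ** 2)    # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
r2 = 1 - rss / tss                 # share of variance explained

# Standardisation: rescale the feature to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

print("slope:", slope, "intercept:", intercept)
print("RSS:", rss, "TSS:", tss, "R^2:", r2)
```

The hand-computed `r2` should agree with `model.score(X, y)`, which returns the same quantity for regressors.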
Generalization
What is overfitting?
Bias-Variance tradeoff
Regularization: Ridge, Lasso, ElasticNet
K-fold Cross validation
The concept of generalization, including what overfitting is and the bias-variance tradeoff. Regularization techniques (Ridge, Lasso, ElasticNet) to address overfitting, and evaluation techniques such as k-fold cross-validation.
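Regularization and k-fold cross-validation can be sketched together as follows, assuming NumPy and scikit-learn are installed. The synthetic data is an assumption made for the example: only the first two of ten features actually influence the target. The `alpha` values are arbitrary rather than tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: 10 features, but only the first two carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# 5-fold cross-validation: each fold is held out once for scoring.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

ridge = Ridge(alpha=1.0)  # L2 penalty: shrinks coefficients towards zero
lasso = Lasso(alpha=0.5)  # L1 penalty: can set coefficients exactly to zero

ridge_scores = cross_val_score(ridge, X, y, cv=kfold, scoring="r2")
lasso_scores = cross_val_score(lasso, X, y, cv=kfold, scoring="r2")

# Refit on all data to inspect which coefficients Lasso kept.
lasso.fit(X, y)
n_nonzero_lasso = int(np.count_nonzero(lasso.coef_))

print("Ridge mean R^2:", ridge_scores.mean())
print("Lasso mean R^2:", lasso_scores.mean())
print("Non-zero Lasso coefficients:", n_nonzero_lasso)
```

Scoring on held-out folds rather than on the training data is what makes the comparison a check of generalization and not of fit.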
The problem with “accuracy” for imbalanced datasets
Fitting of common models for classification problems and their evaluation metrics; use of grid search to try combinations of hyperparameters systematically and select the “best” ones. Use of performance tools such as the confusion matrix.
Exploring the meaning of constituent parts of the confusion matrix (true positives, true negatives, false positives, false negatives) and how to calculate accuracy, precision and recall.
Considerations that need to be made when handling highly imbalanced datasets.
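The confusion-matrix quantities above, and the problem with accuracy on imbalanced data, can be illustrated in plain Python. The labels are hypothetical: 5 positives among 100 samples. The helper names `confusion_counts` and `metrics` are made up for this sketch.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(y_true, y_pred):
    """Accuracy, precision and recall; a ratio with a zero denominator is reported as 0.0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical imbalanced data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# A "model" that always predicts the majority class.
always_negative = [0] * 100

# A model that finds 3 of the 5 positives and raises 2 false alarms.
some_model = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

print("always negative:", metrics(y_true, always_negative))
print("some model:     ", metrics(y_true, some_model))
```

The always-negative predictor scores 95% accuracy while finding none of the positives, which is why precision and recall are reported alongside accuracy for imbalanced datasets.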
Cambridge Data Science Main Online Course (15 hours)
Databases: SQL
Ensemble models:
Decision Trees
Bootstrap Aggregating (Bagging)
Random Forest
Boosting
An intuitive understanding of what ensemble models are and why they often improve performance; this includes decision trees, bootstrap aggregating (bagging), random forests and boosting models.
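To compare a single decision tree with the ensemble methods above, one possible sketch is shown below, assuming scikit-learn is installed. The dataset is synthetic, the hyperparameters are arbitrary, and gradient boosting is used here as one representative boosting method. No particular ranking of the models is guaranteed on this or any other dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

models = {
    # A single tree: easy to interpret but prone to high variance.
    "decision tree": DecisionTreeClassifier(random_state=0),
    # Bagging: many trees, each trained on a bootstrap sample, then averaged.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
    ),
    # Random forest: bagging plus a random subset of features at each split.
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Boosting: trees added one after another, each correcting earlier errors.
    "boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```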
Unsupervised learning:
Clustering, k-means
DBSCAN
Time Series Analysis
Autocorrelation
ARIMA, SARIMA
Introduction to unsupervised learning, such as clustering techniques.
Time series analysis, such as ARIMA models.
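Two of this week's ideas, k-means clustering and autocorrelation, can be sketched as follows, assuming NumPy, pandas and scikit-learn are installed. Both datasets are synthetic assumptions: three well-separated blobs for clustering, and a pure sine wave with a period of 12 steps standing in for a monthly seasonal series. Fitting ARIMA or SARIMA models is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Clustering: three synthetic groups of points with known centres.
points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 5], [0, 5]],
    cluster_std=0.5,
    random_state=0,
)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)  # no target labels are used
centres = kmeans.cluster_centers_

print("cluster sizes:", np.bincount(labels))
print("estimated centres:", centres)

# Autocorrelation: correlation of a series with a lagged copy of itself.
t = np.arange(120)
series = pd.Series(np.sin(2 * np.pi * t / 12))  # repeats every 12 steps

lag_12 = series.autocorr(lag=12)  # one full period apart
lag_6 = series.autocorr(lag=6)    # half a period apart

print("autocorrelation at lag 12:", lag_12)
print("autocorrelation at lag 6:", lag_6)
```

A strong positive autocorrelation at the seasonal lag is the kind of pattern that motivates the seasonal terms in a SARIMA model.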
Presentation of personal project: Regression project on Ames Housing Dataset
Presentation of personal project, including feedback and further discussion with the tutor.
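The tutors for the one-to-one lessons each hold a degree from one of the G5 universities (the University of Oxford, the University of Cambridge, and University of London institutions such as the London School of Economics, Imperial College London and King’s College London) and have extensive teaching experience. They mainly practise the PBL (project-based learning) tutorial teaching of Oxford and Cambridge, an approach in which study is built around a concrete problem or project.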