Data Science and Predictive Analytics: Biomedical and Health Applications Using R
Dinov, Ivo D.
Springer International Publishing AG, 2018
Foreword (6)
Preface (10)
Genesis (10)
Purpose (11)
Limitations/Prerequisites (11)
Scope of the Book (12)
Acknowledgements (13)
DSPA Application and Use Disclaimer (14)
Biomedical, Biosocial, Environmental, and Health Disclaimer (15)
Notations (16)
Contents (17)
Chapter 1: Motivation (33)
1.1 DSPA Mission and Objectives (33)
1.2 Examples of Driving Motivational Problems and Challenges (34)
1.2.1 Alzheimer's Disease (34)
1.2.2 Parkinson's Disease (34)
1.2.3 Drug and Substance Use (35)
1.2.4 Amyotrophic Lateral Sclerosis (36)
1.2.5 Normal Brain Visualization (36)
1.2.6 Neurodegeneration (36)
1.2.7 Genetic Forensics: 2013-2016 Ebola Outbreak (37)
1.2.8 Next Generation Sequence (NGS) Analysis (38)
1.2.9 Neuroimaging-Genetics (39)
1.3 Common Characteristics of Big (Biomedical and Health) Data (40)
1.4 Data Science (41)
1.5 Predictive Analytics (41)
1.6 High-Throughput Big Data Analytics (42)
1.7 Examples of Data Repositories, Archives, and Services (42)
1.8 DSPA Expectations (43)
Chapter 2: Foundations of R (45)
2.1 Why Use R? (45)
2.2 Getting Started (47)
2.2.1 Install Basic Shell-Based R (47)
2.2.2 GUI-Based R Invocation (RStudio) (47)
2.2.3 RStudio GUI Layout (47)
2.2.4 Some Notes (48)
2.3 Help (48)
2.4 Simple Wide-to-Long Data Format Translation (49)
2.5 Data Generation (50)
2.6 Input/Output (I/O) (54)
2.7 Slicing and Extracting Data (56)
2.8 Variable Conversion (57)
2.9 Variable Information (57)
2.10 Data Selection and Manipulation (59)
2.11 Math Functions (62)
2.12 Matrix Operations (64)
2.13 Advanced Data Processing (64)
2.14 Strings (69)
2.15 Plotting (71)
2.16 QQ Normal Probability Plot (73)
2.17 Low-Level Plotting Commands (77)
2.18 Graphics Parameters (77)
2.19 Optimization and Model Fitting (79)
2.20 Statistics (80)
2.21 Distributions (81)
2.21.1 Programming (81)
2.22 Data Simulation Primer (82)
2.23 Appendix (88)
2.23.1 HTML SOCR Data Import (88)
2.23.2 R Debugging (89)
Example (92)
2.24 Assignments: 2. R Foundations (92)
2.24.1 Confirm that You Have Installed R/RStudio (92)
2.24.2 Long-to-Wide Data Format Translation (93)
2.24.3 Data Frames (93)
2.24.4 Data Stratification (93)
2.24.5 Simulation (93)
2.24.6 Programming (94)
References (94)
Chapter 3: Managing Data in R (95)
3.1 Saving and Loading R Data Structures (95)
3.2 Importing and Saving Data from CSV Files (96)
3.3 Exploring the Structure of Data (98)
3.4 Exploring Numeric Variables (98)
3.5 Measuring the Central Tendency: Mean, Median, Mode (99)
3.6 Measuring Spread: Quartiles and the Five-Number Summary (100)
3.7 Visualizing Numeric Variables: Boxplots (102)
3.8 Visualizing Numeric Variables: Histograms (103)
3.9 Understanding Numeric Data: Uniform and Normal Distributions (104)
3.10 Measuring Spread: Variance and Standard Deviation (105)
3.11 Exploring Categorical Variables (108)
3.12 Exploring Relationships Between Variables (109)
3.13 Missing Data (111)
3.13.1 Simulate Some Real Multivariate Data (116)
3.13.2 TBI Data Example (130)
3.13.3 Imputation via Expectation-Maximization (154)
Types of Missing Data (154)
General Idea of the EM Algorithm (154)
EM-Based Imputation (155)
A Simple Manual Implementation of EM-Based Imputation (156)
Plotting Complete and Imputed Data (159)
Validation of EM-Imputation Using the Amelia R Package (160)
Comparison (160)
Density Plots (162)
3.14 Parsing Webpages and Visualizing Tabular HTML Data (162)
3.15 Cohort-Rebalancing (for Imbalanced Groups) (167)
3.16 Appendix (170)
3.16.1 Importing Data from SQL Databases (170)
3.16.2 R Code Fragments (171)
3.17 Assignments: 3. Managing Data in R (172)
3.17.1 Import, Plot, Summarize and Save Data (172)
3.17.2 Explore Some Bivariate Relations in the Data (172)
3.17.3 Missing Data (173)
3.17.4 Surface Plots (173)
3.17.5 Unbalanced Designs (173)
3.17.6 Aggregate Analysis (173)
References (173)
Chapter 4: Data Visualization (174)
4.1 Common Questions (174)
4.2 Classification of Visualization Methods (175)
4.3 Composition (175)
4.3.1 Histograms and Density Plots (175)
4.3.2 Pie Chart (178)
4.3.3 Heat Map (180)
4.4 Comparison (183)
4.4.1 Paired Scatter Plots (183)
4.4.2 Jitter Plot (188)
4.4.3 Bar Plots (190)
4.4.4 Trees and Graphs (195)
4.4.5 Correlation Plots (198)
4.5 Relationships (202)
4.5.1 Line Plots Using ggplot (202)
4.5.2 Density Plots (204)
4.5.3 Distributions (204)
4.5.4 2D Kernel Density and 3D Surface Plots (205)
4.5.5 Multiple 2D Image Surface Plots (207)
4.5.6 3D and 4D Visualizations (209)
4.6 Appendix (214)
4.6.1 Hands-on Activity (Health Behavior Risks) (214)
4.6.2 Additional ggplot Examples (218)
Housing Price Data (218)
Modeling the Home Price Index Data (Fig. 4.48) (220)
Map of the Neighborhoods of Los Angeles (LA) (222)
Latin Letter Frequency in Different Languages (224)
4.7 Assignments 4: Data Visualization (229)
4.7.1 Common Plots (229)
4.7.2 Trees and Graphs (229)
4.7.3 Exploratory Data Analytics (EDA) (230)
References (230)
Chapter 5: Linear Algebra and Matrix Computing (231)
5.1 Matrices (Second Order Tensors) (232)
5.1.1 Create Matrices (232)
5.1.2 Adding Columns and Rows (233)
5.2 Matrix Subscripts (234)
5.3 Matrix Operations (234)
5.3.1 Addition (234)
5.3.2 Subtraction (235)
5.3.3 Multiplication (235)
Elementwise Multiplication (235)
Matrix Multiplication (235)
5.3.4 Element-wise Division (237)
5.3.5 Transpose (237)
5.3.6 Multiplicative Inverse (237)
5.4 Matrix Algebra Notation (239)
5.4.1 Linear Models (239)
5.4.2 Solving Systems of Equations (240)
5.4.3 The Identity Matrix (242)
5.5 Scalars, Vectors and Matrices (243)
5.5.1 Sample Statistics (Mean, Variance) (245)
Mean (245)
Variance (246)
Applications of Matrix Algebra: Linear Modeling (246)
Finding Function Extrema (Min/Max) Using Calculus (247)
5.5.2 Least Square Estimation (248)
The R lm Function (249)
5.6 Eigenvalues and Eigenvectors (249)
5.7 Other Important Functions (250)
5.8 Matrix Notation (Another View) (250)
5.9 Multivariate Linear Regression (254)
5.10 Sample Covariance Matrix (257)
5.11 Assignments: 5. Linear Algebra and Matrix Computing (259)
5.11.1 How Is Matrix Multiplication Defined? (259)
5.11.2 Scalar Versus Matrix Multiplication (259)
5.11.3 Matrix Equations (259)
5.11.4 Least Square Estimation (260)
5.11.5 Matrix Manipulation (260)
5.11.6 Matrix Transpose (260)
5.11.7 Sample Statistics (260)
5.11.8 Least Square Estimation (260)
5.11.9 Eigenvalues and Eigenvectors (261)
References (261)
Chapter 6: Dimensionality Reduction (262)
6.1 Example: Reducing 2D to 1D (262)
6.2 Matrix Rotations (266)
6.3 Notation (271)
6.4 Summary (PCA vs. ICA vs. FA) (271)
6.5 Principal Component Analysis (PCA) (272)
6.5.1 Principal Components (272)
6.6 Independent Component Analysis (ICA) (279)
6.7 Factor Analysis (FA) (283)
6.8 Singular Value Decomposition (SVD) (285)
6.9 SVD Summary (287)
6.10 Case Study for Dimension Reduction (Parkinson's Disease) (287)
6.11 Assignments: 6. Dimensionality Reduction (294)
6.11.1 Parkinson's Disease Example (294)
6.11.2 Allometric Relations in Plants Example (295)
Load Data (295)
Dimensionality Reduction (295)
References (295)
Chapter 7: Lazy Learning: Classification Using Nearest Neighbors (296)
7.1 Motivation (297)
7.2 The kNN Algorithm Overview (298)
7.2.1 Distance Function and Dummy Coding (298)
7.2.2 Ways to Determine k (299)
7.2.3 Rescaling of the Features (299)
7.2.4 Rescaling Formulas (300)
7.3 Case Study (300)
7.3.1 Step 1: Collecting Data (300)
7.3.2 Step 2: Exploring and Preparing the Data (301)
7.3.3 Normalizing Data (302)
7.3.4 Data Preparation: Creating Training and Testing Datasets (303)
7.3.5 Step 3: Training a Model on the Data (303)
7.3.6 Step 4: Evaluating Model Performance (303)
7.3.7 Step 5: Improving Model Performance (304)
7.3.8 Testing Alternative Values of k (305)
7.3.9 Quantitative Assessment (Tables 7.2 and 7.3) (311)
7.4 Assignments: 7. Lazy Learning: Classification Using Nearest Neighbors (315)
7.4.1 Traumatic Brain Injury (TBI) (315)
7.4.2 Parkinson's Disease (315)
7.4.3 kNN Classification in a High Dimensional Space (316)
7.4.4 kNN Classification in a Lower Dimensional Space (316)
References (316)
Chapter 8: Probabilistic Learning: Classification Using Naive Bayes (317)
8.1 Overview of the Naive Bayes Algorithm (317)
8.2 Assumptions (318)
8.3 Bayes Formula (318)
8.4 The Laplace Estimator (320)
8.5 Case Study: Head and Neck Cancer Medication (321)
8.5.1 Step 1: Collecting Data (321)
8.5.2 Step 2: Exploring and Preparing the Data (321)
Data Preparation: Processing Text Data for Analysis (322)
Data Preparation: Creating Training and Test Datasets (323)
Visualizing Text Data: Word Clouds (325)
Data Preparation: Creating Indicator Features for Frequent Words (326)
8.5.3 Step 3: Training a Model on the Data (327)
8.5.4 Step 4: Evaluating Model Performance (328)
8.5.5 Step 5: Improving Model Performance (329)
8.5.6 Step 6: Compare Naive Bayesian against LDA (330)
8.6 Practice Problem (331)
8.7 Assignments 8: Probabilistic Learning: Classification Using Naive Bayes (332)
8.7.1 Explain These Two Concepts (332)
8.7.2 Analyzing Textual Data (333)
References (333)
Chapter 9: Decision Tree Divide and Conquer Classification (334)
9.1 Motivation (334)
9.2 Hands-on Example: Iris Data (335)
9.3 Decision Tree Overview (337)
9.3.1 Divide and Conquer (338)
9.3.2 Entropy (339)
9.3.3 Misclassification Error and Gini Index (340)
9.3.4 C5.0 Decision Tree Algorithm (340)
9.3.5 Pruning the Decision Tree (342)
9.4 Case Study 1: Quality of Life and Chronic Disease (343)
9.4.1 Step 1: Collecting Data (343)
9.4.2 Step 2: Exploring and Preparing the Data (343)
Data Preparation: Creating Random Training and Test Datasets (345)
9.4.3 Step 3: Training a Model on the Data (346)
9.4.4 Step 4: Evaluating Model Performance (349)
9.4.5 Step 5: Trial Option (350)
9.4.6 Loading the Misclassification Error Matrix (351)
9.4.7 Parameter Tuning (352)
9.5 Compare Different Impurity Indices (358)
9.6 Classification Rules (358)
9.6.1 Separate and Conquer (358)
9.6.2 The One Rule Algorithm (359)
9.6.3 The RIPPER Algorithm (359)
9.7 Case Study 2: QoL in Chronic Disease (Take 2) (359)
9.7.1 Step 3: Training a Model on the Data (359)
9.7.2 Step 4: Evaluating Model Performance (360)
9.7.3 Step 5: Alternative Model 1 (361)
9.7.4 Step 5: Alternative Model 2 (361)
9.8 Practice Problem (364)
9.9 Assignments 9: Decision Tree Divide and Conquer Classification (369)
9.9.1 Explain These Concepts (369)
9.9.2 Decision Tree Partitioning (369)
References (370)
Chapter 10: Forecasting Numeric Data Using Regression Models (371)
10.1 Understanding Regression (371)
10.1.1 Simple Linear Regression (371)
10.2 Ordinary Least Squares Estimation (373)
10.2.1 Model Assumptions (375)
10.2.2 Correlations (375)
10.2.3 Multiple Linear Regression (376)
10.3 Case Study 1: Baseball Players (378)
10.3.1 Step 1: Collecting Data (378)
10.3.2 Step 2: Exploring and Preparing the Data (378)
10.3.3 Exploring Relationships Among Features: The Correlation Matrix (382)
10.3.4 Visualizing Relationships Among Features: The Scatterplot Matrix (382)
10.3.5 Step 3: Training a Model on the Data (384)
10.3.6 Step 4: Evaluating Model Performance (385)
10.4 Step 5: Improving Model Performance (387)
10.4.1 Model Specification: Adding Non-linear Relationships (395)
10.4.2 Transformation: Converting a Numeric Variable to a Binary Indicator (396)
10.4.3 Model Specification: Adding Interaction Effects (397)
10.5 Understanding Regression Trees and Model Trees (399)
10.5.1 Adding Regression to Trees (399)
10.6 Case Study 2: Baseball Players (Take 2) (400)
10.6.1 Step 2: Exploring and Preparing the Data (400)
10.6.2 Step 3: Training a Model on the Data (401)
10.6.3 Visualizing Decision Trees (401)
10.6.4 Step 4: Evaluating Model Performance (403)
10.6.5 Measuring Performance with Mean Absolute Error (404)
10.6.6 Step 5: Improving Model Performance (404)
10.7 Practice Problem: Heart Attack Data (406)
10.8 Assignments: 10. Forecasting Numeric Data Using Regression Models (407)
References (407)
Chapter 11: Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines (408)
11.1 Understanding Neural Networks (408)
11.1.1 From Biological to Artificial Neurons (408)
11.1.2 Activation Functions (409)
11.1.3 Network Topology (411)
11.1.4 The Direction of Information Travel (411)
11.1.5 The Number of Nodes in Each Layer (411)
11.1.6 Training Neural Networks with Backpropagation (412)
11.2 Case Study 1: Google Trends and the Stock Market: Regression (413)
11.2.1 Step 1: Collecting Data (413)
Variables (413)
11.2.2 Step 2: Exploring and Preparing the Data (414)
11.2.3 Step 3: Training a Model on the Data (416)
11.2.4 Step 4: Evaluating Model Performance (417)
11.2.5 Step 5: Improving Model Performance (418)
11.2.6 Step 6: Adding Additional Layers (419)
11.3 Simple NN Demo: Learning to Compute (419)
11.4 Case Study 2: Google Trends and the Stock Market - Classification (421)
11.5 Support Vector Machines (SVM) (423)
11.5.1 Classification with Hyperplanes (424)
Finding the Maximum Margin (424)
Linearly Separable Data (424)
Non-linearly Separable Data (427)
Using Kernels for Non-linear Spaces (428)
11.6 Case Study 3: Optical Character Recognition (OCR) (428)
11.6.1 Step 1: Prepare and Explore the Data (429)
11.6.2 Step 2: Training an SVM Model (430)
11.6.3 Step 3: Evaluating Model Performance (431)
11.6.4 Step 4: Improving Model Performance (433)
11.7 Case Study 4: Iris Flowers (434)
11.7.1 Step 1: Collecting Data (434)
11.7.2 Step 2: Exploring and Preparing the Data (434)
11.7.3 Step 3: Training a Model on the Data (436)
11.7.4 Step 4: Evaluating Model Performance (437)
11.7.5 Step 5: RBF Kernel Function (438)
11.7.6 Parameter Tuning (438)
11.7.7 Improving the Performance of Gaussian Kernels (440)
11.8 Practice (441)
11.8.1 Problem 1: Google Trends and the Stock Market (441)
11.8.2 Problem 2: Quality of Life and Chronic Disease (441)
11.9 Appendix (445)
11.10 Assignments: 11. Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines (446)
11.10.1 Learn and Predict a Power-Function (446)
11.10.2 Pediatric Schizophrenia Study (446)
References (447)
Chapter 12: Apriori Association Rules Learning (448)
12.1 Association Rules (448)
12.2 The Apriori Algorithm for Association Rule Learning (449)
12.3 Measuring Rule Importance by Using Support and Confidence (449)
12.4 Building a Set of Rules with the Apriori Principle (450)
12.5 A Toy Example (451)
12.6 Case Study 1: Head and Neck Cancer Medications (452)
12.6.1 Step 1: Collecting Data (452)
12.6.2 Step 2: Exploring and Preparing the Data (452)
Visualizing Item Support: Item Frequency Plots (454)
Visualizing Transaction Data: Plotting the Sparse Matrix (455)
12.6.3 Step 3: Training a Model on the Data (457)
12.6.4 Step 4: Evaluating Model Performance (458)
12.6.5 Step 5: Improving Model Performance (460)
Sorting the Set of Association Rules (460)
Taking Subsets of Association Rules (461)
Saving Association Rules to a File or Data Frame (463)
12.7 Practice Problems: Groceries (463)
12.8 Summary (466)
12.9 Assignments: 12. Apriori Association Rules Learning (467)
References (467)
Chapter 13: k-Means Clustering (468)
13.1 Clustering as a Machine Learning Task (468)
13.2 Silhouette Plots (471)
13.3 The k-Means Clustering Algorithm (472)
13.3.1 Using Distance to Assign and Update Clusters (472)
13.3.2 Choosing the Appropriate Number of Clusters (473)
13.4 Case Study 1: Divorce and Consequences on Young Adults (473)
13.4.1 Step 1: Collecting Data (473)
Variables (474)
13.4.2 Step 2: Exploring and Preparing the Data (474)
13.4.3 Step 3: Training a Model on the Data (475)
13.4.4 Step 4: Evaluating Model Performance (476)
13.4.5 Step 5: Usage of Cluster Information (479)
13.5 Model Improvement (480)
13.5.1 Tuning the Parameter k (482)
13.6 Case Study 2: Pediatric Trauma (484)
13.6.1 Step 1: Collecting Data (484)
13.6.2 Step 2: Exploring and Preparing the Data (485)
13.6.3 Step 3: Training a Model on the Data (486)
13.6.4 Step 4: Evaluating Model Performance (487)
13.6.5 Practice Problem: Youth Development (490)
13.7 Hierarchical Clustering (492)
13.8 Gaussian Mixture Models (495)
13.9 Summary (497)
13.10 Assignments: 13. k-Means Clustering (497)
References (498)
Chapter 14: Model Performance Assessment (499)
14.1 Measuring the Performance of Classification Methods (499)
14.2 Evaluation Strategies (501)
14.2.1 Binary Outcomes (501)
14.2.2 Confusion Matrices (502)
14.2.3 Other Measures of Performance Beyond Accuracy (504)
14.2.4 The Kappa (κ) Statistic (505)
Summary of the Kappa Score for Calculating Prediction Accuracy (508)
14.2.5 Computation of Observed Accuracy and Expected Accuracy (508)
14.2.6 Sensitivity and Specificity (509)
14.2.7 Precision and Recall (510)
14.2.8 The F-Measure (511)
14.3 Visualizing Performance Tradeoffs (ROC Curve) (512)
14.4 Estimating Future Performance (Internal Statistical Validation) (515)
14.4.1 The Holdout Method (515)
14.4.2 Cross-Validation (516)
14.4.3 Bootstrap Sampling (518)
14.5 Assignment: 14. Evaluation of Model Performance (519)
References (520)
Chapter 15: Improving Model Performance (521)
15.1 Improving Model Performance by Parameter Tuning (521)
15.2 Using caret for Automated Parameter Tuning (521)
15.2.1 Customizing the Tuning Process (525)
15.2.2 Improving Model Performance with Meta-learning (526)
15.2.3 Bagging (527)
15.2.4 Boosting (529)
15.2.5 Random Forests (530)
Training Random Forests (530)
Evaluating Random Forest Performance (531)
15.2.6 Adaptive Boosting (532)
15.3 Assignment: 15. Improving Model Performance (534)
15.3.1 Model Improvement Case Study (535)
References (535)
Chapter 16: Specialized Machine Learning Topics (536)
16.1 Working with Specialized Data and Databases (536)
16.1.1 Data Format Conversion (537)
16.1.2 Querying Data in SQL Databases (538)
16.1.3 Real Random Number Generation (544)
16.1.4 Downloading the Complete Text of Web Pages (545)
16.1.5 Reading and Writing XML with the XML Package (546)
16.1.6 Web-Page Data Scraping (547)
16.1.7 Parsing JSON from Web APIs (548)
16.1.8 Reading and Writing Microsoft Excel Spreadsheets Using XLSX (549)
16.2 Working with Domain-Specific Data (550)
16.2.1 Working with Bioinformatics Data (550)
16.2.2 Visualizing Network Data (551)
16.3 Data Streaming (556)
16.3.1 Definition (556)
16.3.2 The stream Package (557)
16.3.3 Synthetic Example: Random Gaussian Stream (557)
k-Means Clustering (557)
16.3.4 Sources of Data Streams (559)
Static Structure Streams (559)
Concept Drift Streams (559)
Real Data Streams (560)
16.3.5 Printing, Plotting and Saving Streams (560)
16.3.6 Stream Animation (561)
16.3.7 Case-Study: SOCR Knee Pain Data (563)
16.3.8 Data Stream Clustering and Classification (DSC) (565)
16.3.9 Evaluation of Data Stream Clustering (568)
16.4 Optimization and Improving the Computational Performance (569)
16.4.1 Generalizing Tabular Data Structures with dplyr (570)
16.4.2 Making Data Frames Faster with data.table (571)
16.4.3 Creating Disk-Based Data Frames with ff (571)
16.4.4 Using Massive Matrices with bigmemory (572)
16.5 Parallel Computing (572)
16.5.1 Measuring Execution Time (573)
16.5.2 Parallel Processing with Multiple Cores (573)
16.5.3 Parallelization Using foreach and doParallel (575)
16.5.4 GPU Computing (576)
16.6 Deploying Optimized Learning Algorithms (576)
16.6.1 Building Bigger Regression Models with biglm (576)
16.6.2 Growing Bigger and Faster Random Forests with bigrf (576)
16.6.3 Training and Evaluating Models in Parallel with caret (577)
16.7 Practice Problem (577)
16.8 Assignment: 16. Specialized Machine Learning Topics (578)
16.8.1 Working with Website Data (578)
16.8.2 Network Data and Visualization (578)
16.8.3 Data Conversion and Parallel Computing (578)
References (579)
Chapter 17: Variable/Feature Selection (580)
17.1 Feature Selection Methods (580)
17.1.1 Filtering Techniques (580)
17.1.2 Wrapper Methods (581)
17.1.3 Embedded Techniques (581)
17.2 Case Study: ALS (582)
17.2.1 Step 1: Collecting Data (582)
17.2.2 Step 2: Exploring and Preparing the Data (582)
17.2.3 Step 3: Training a Model on the Data (583)
17.2.4 Step 4: Evaluating Model Performance (587)
Comparing with RFE (587)
Comparing with Stepwise Feature Selection (589)
17.3 Practice Problem (592)
17.4 Assignment: 17. Variable/Feature Selection (594)
17.4.1 Wrapper Feature Selection (594)
17.4.2 Use the PPMI Dataset (594)
References (595)
Chapter 18: Regularized Linear Modeling and Controlled Variable Selection (596)
18.1 Questions (597)
18.2 Matrix Notation (597)
18.3 Regularized Linear Modeling (597)
18.3.1 Ridge Regression (599)
18.3.2 Least Absolute Shrinkage and Selection Operator (LASSO) Regression (602)
18.3.3 Predictor Standardization (605)
18.3.4 Estimation Goals (605)
18.4 Linear Regression (605)
18.4.1 Drawbacks of Linear Regression (606)
18.4.2 Assessing Prediction Accuracy (606)
18.4.3 Estimating the Prediction Error (606)
18.4.4 Improving the Prediction Accuracy (607)
18.4.5 Variable Selection (608)
18.5 Regularization Framework (609)
18.5.1 Role of the Penalty Term (609)
18.5.2 Role of the Regularization Parameter (609)
18.5.3 LASSO (610)
18.5.4 General Regularization Framework (610)
18.6 Implementation of Regularization (611)
18.6.1 Example: Neuroimaging-Genetics Study of Parkinson's Disease Dataset (611)
18.6.2 Computational Complexity (613)
18.6.3 LASSO and Ridge Solution Paths (613)
18.6.4 Choice of the Regularization Parameter (621)
18.6.5 Cross Validation Motivation (622)
18.6.6 n-Fold Cross Validation (622)
18.6.7 LASSO 10-Fold Cross Validation (623)
18.6.8 Stepwise OLS (Ordinary Least Squares) (624)
18.6.9 Final Models (625)
18.6.10 Model Performance (627)
18.6.11 Comparing Selected Features (627)
18.6.12 Summary (628)
18.7 Knock-off Filtering: Simulated Example (628)
18.7.1 Notes (630)
18.8 PD Neuroimaging-Genetics Case-Study (631)
18.8.1 Fetching, Cleaning and Preparing the Data (631)
18.8.2 Preparing the Response Vector (632)
18.8.3 False Discovery Rate (FDR) (640)
Graphical Interpretation of the Benjamini-Hochberg (BH) Method (641)
FDR Adjusting the p-Values (642)
18.8.4 Running the Knockoff Filter (643)
18.9 Assignment: 18. Regularized Linear Modeling and Knockoff Filtering (644)
References (645)
Chapter 19: Big Longitudinal Data Analysis (646)
19.1 Time Series Analysis (646)
19.1.1 Step 1: Plot Time Series (649)
19.1.2 Step 2: Find Proper Parameter Values for the ARIMA Model (651)
19.1.3 Check the Differencing Parameter (652)
19.1.4 Identifying the AR and MA Parameters (653)
19.1.5 Step 3: Build an ARIMA Model (655)
19.1.6 Step 4: Forecasting with the ARIMA Model (660)
19.2 Structural Equation Modeling (SEM) - Latent Variables (661)
19.2.1 Foundations of SEM (661)
19.2.2 SEM Components (664)
19.2.3 Case Study - Parkinson's Disease (PD) (665)
Step 1 - Collecting Data (665)
Step 2 - Exploring and Preparing the Data (665)
Step 3 - Fitting a Model on the Data (668)
19.2.4 Outputs of Lavaan SEM (670)
19.3 Longitudinal Data Analysis - Linear Mixed Models (671)
19.3.1 Mean Trend (671)
19.3.2 Modeling the Correlation (675)
19.4 GLMM/GEE Longitudinal Data Analysis (676)
19.4.1 GEE Versus GLMM (678)
19.5 Assignment: 19. Big Longitudinal Data Analysis (680)
19.5.1 Imaging Data (680)
19.5.2 Time Series Analysis (681)
19.5.3 Latent Variables Model (681)
References (681)
Chapter 20: Natural Language Processing/Text Mining (682)
20.1 A Simple NLP/TM Example (683)
20.1.1 Define and Load the Unstructured-Text Documents (684)
20.1.2 Create a New VCorpus Object (686)
20.1.3 To-Lower Case Transformation (687)
20.1.4 Text Pre-processing (687)
Remove Stopwords (687)
Remove Punctuation (688)
Stemming: Removal of Plurals and Action Suffixes (688)
20.1.5 Bags of Words (689)
20.1.6 Document Term Matrix (690)
20.2 Case-Study: Job Ranking (692)
20.2.1 Step 1: Make a VCorpus Object (693)
20.2.2 Step 2: Clean the VCorpus Object (693)
20.2.3 Step 3: Build the Document Term Matrix (693)
20.2.4 Area Under the ROC Curve (697)
20.3 TF-IDF (699)
20.3.1 Term Frequency (TF) (699)
20.3.2 Inverse Document Frequency (IDF) (699)
20.3.3 TF-IDF (700)
20.4 Cosine Similarity (708)
20.5 Sentiment Analysis (709)
20.5.1 Data Preprocessing (709)
20.5.2 NLP/TM Analytics (712)
20.5.3 Prediction Optimization (715)
20.6 Assignment: 20. Natural Language Processing/Text Mining (717)
20.6.1 Mining Twitter Data (717)
20.6.2 Mining Cancer Clinical Notes (718)
References (718)
Chapter 21: Prediction and Internal Statistical Cross Validation (719)
21.1 Forecasting Types and Assessment Approaches (719)
21.2 Overfitting (720)
21.2.1 Example (US Presidential Elections) (720)
21.2.2 Example (Google Flu Trends) (720)
21.2.3 Example (Autism) (722)
21.3 Internal Statistical Cross-Validation is an Iterative Process (723)
21.4 Example (Linear Regression) (724)
21.4.1 Cross-Validation Methods (725)
21.4.2 Exhaustive Cross-Validation (725)
21.4.3 Non-Exhaustive Cross-Validation (726)
21.5 Case-Studies (726)
21.5.1 Example 1: Prediction of Parkinson's Disease Using Adaptive Boosting (AdaBoost) (727)
21.5.2 Example 2: Sleep Dataset (730)
21.5.3 Example 3: Model-Based (Linear Regression) Prediction Using the Attitude Dataset (732)
21.5.4 Example 4: Parkinson's Data (ppmi_data) (733)
21.6 Summary of CV Output (734)
21.7 Alternative Predictor Functions (734)
21.7.1 Logistic Regression (735)
21.7.2 Quadratic Discriminant Analysis (QDA) (736)
21.7.3 Foundation of LDA and QDA for Prediction, Dimensionality Reduction, and Forecasting (737)
LDA (Linear Discriminant Analysis) (738)
QDA (Quadratic Discriminant Analysis) (738)
21.7.4 Neural Networks (739)
21.7.5 SVM (740)
21.7.6 k-Nearest Neighbors Algorithm (k-NN) (741)
21.7.7 k-Means Clustering (k-MC) (742)
21.7.8 Spectral Clustering (749)
Iris Petal Data (749)
Spirals Data (750)
Income Data (751)
21.8 Compare the Results (752)
21.9 Assignment: 21. Prediction and Internal Statistical Cross-Validation (755)
References (756)
Chapter 22: Function Optimization (757)
22.1 Free (Unconstrained) Optimization (757)
22.1.1 Example 1: Minimizing a Univariate Function (Inverse-CDF) (758)
22.1.2 Example 2: Minimizing a Bivariate Function (760)
22.1.3 Example 3: Using Simulated Annealing to Find the Maximum of an Oscillatory Function (761)
22.2 Constrained Optimization (762)
22.2.1 Equality Constraints (762)
22.2.2 Lagrange Multipliers (762)
22.2.3 Inequality Constrained Optimization (763)
Linear Programming (LP) (763)
Mixed Integer Linear Programming (MILP) (768)
22.2.4 Quadratic Programming (QP) (769)
22.3 General Non-linear Optimization (770)
22.3.1 Dual Problem Optimization (771)
Motivation (771)
Example 1: Linear Example (772)
Example 2: Quadratic Example (773)
Example 3: More Complex Non-linear Optimization (774)
Example 4: Another Linear Example (775)
22.4 Manual Versus Automated Lagrange Multiplier Optimization (775)
22.5 Data Denoising (778)
22.6 Assignment: 22. Function Optimization (783)
22.6.1 Unconstrained Optimization (783)
22.6.2 Linear Programming (LP) (783)
22.6.3 Mixed Integer Linear Programming (MILP) (784)
22.6.4 Quadratic Programming (QP) (784)
22.6.5 Complex Non-linear Optimization (784)
22.6.6 Data Denoising (785)
References (785)
Chapter 23: Deep Learning, Neural Networks (786)
23.1 Deep Learning Training (787)
23.1.1 Perceptrons (787)
23.2 Biological Relevance (789)
23.3 Simple Neural Net Examples (791)
23.3.1 Exclusive OR (XOR) Operator (791)
23.3.2 NAND Operator (792)
23.3.3 Complex Networks Designed Using Simple Building Blocks (793)
23.4 Classification (794)
23.4.1 Sonar Data Example (795)
23.4.2 MXNet Notes (802)
23.5 Case-Studies (803)
23.5.1 ALS Regression Example (804)
23.5.2 Spirals 2D Data (806)
23.5.3 IBS Study (810)
23.5.4 Country QoL Ranking Data (813)
23.5.5 Handwritten Digits Classification (816)
Configuring the Neural Network (820)
Training (821)
Forecasting (821)
Examining the Network Structure Using LeNet (825)
23.6 Classifying Real-World Images (827)
23.6.1 Load the Pre-trained Model (827)
23.6.2 Load, Preprocess and Classify New Images - US Weather Pattern (827)
23.6.3 Lake Mapourika, New Zealand (831)
23.6.4 Beach Image (832)
23.6.5 Volcano (833)
23.6.6 Brain Surface (835)
23.6.7 Face Mask (836)
23.7 Assignment: 23. Deep Learning, Neural Networks (837)
23.7.1 Deep Learning Classification (837)
23.7.2 Deep Learning Regression (838)
23.7.3 Image Classification (838)
References (838)
Summary (839)
Glossary (842)
Index (844)