- Type of Document: Book
- Publisher: Springer International Publishing AG, Switzerland, 2018
- Keywords:
- Big data; Mathematical statistics; Medical records -- Data processing; R (Computer program language)
- Foreword
- Preface
- DSPA Application and Use Disclaimer
- Notations
- Contents
- Chapter 1: Motivation
- 1.1 DSPA Mission and Objectives
- 1.2 Examples of Driving Motivational Problems and Challenges
- 1.3 Common Characteristics of Big (Biomedical and Health) Data
- 1.4 Data Science
- 1.5 Predictive Analytics
- 1.6 High-Throughput Big Data Analytics
- 1.7 Examples of Data Repositories, Archives, and Services
- 1.8 DSPA Expectations
- Chapter 2: Foundations of R
- 2.1 Why Use R?
- 2.2 Getting Started
- 2.3 Help
- 2.4 Simple Wide-to-Long Data Format Translation
- 2.5 Data Generation
- 2.6 Input/Output (I/O)
- 2.7 Slicing and Extracting Data
- 2.8 Variable Conversion
- 2.9 Variable Information
- 2.10 Data Selection and Manipulation
- 2.11 Math Functions
- 2.12 Matrix Operations
- 2.13 Advanced Data Processing
- 2.14 Strings
- 2.15 Plotting
- 2.16 QQ Normal Probability Plot
- 2.17 Low-Level Plotting Commands
- 2.18 Graphics Parameters
- 2.19 Optimization and Model Fitting
- 2.20 Statistics
- 2.21 Distributions
- 2.22 Data Simulation Primer
- 2.23 Appendix
- 2.24 Assignments: 2. R Foundations
- References
- Chapter 3: Managing Data in R
- 3.1 Saving and Loading R Data Structures
- 3.2 Importing and Saving Data from CSV Files
- 3.3 Exploring the Structure of Data
- 3.4 Exploring Numeric Variables
- 3.5 Measuring the Central Tendency: Mean, Median, Mode
- 3.6 Measuring Spread: Quartiles and the Five-Number Summary
- 3.7 Visualizing Numeric Variables: Boxplots
- 3.8 Visualizing Numeric Variables: Histograms
- 3.9 Understanding Numeric Data: Uniform and Normal Distributions
- 3.10 Measuring Spread: Variance and Standard Deviation
- 3.11 Exploring Categorical Variables
- 3.12 Exploring Relationships Between Variables
- 3.13 Missing Data
- 3.14 Parsing Webpages and Visualizing Tabular HTML Data
- 3.15 Cohort-Rebalancing (for Imbalanced Groups)
- 3.16 Appendix
- 3.17 Assignments: 3. Managing Data in R
- References
- Chapter 4: Data Visualization
- Chapter 5: Linear Algebra and Matrix Computing
- 5.1 Matrices (Second Order Tensors)
- 5.2 Matrix Subscripts
- 5.3 Matrix Operations
- 5.4 Matrix Algebra Notation
- 5.5 Scalars, Vectors and Matrices
- 5.6 Eigenvalues and Eigenvectors
- 5.7 Other Important Functions
- 5.8 Matrix Notation (Another View)
- 5.9 Multivariate Linear Regression
- 5.10 Sample Covariance Matrix
- 5.11 Assignments: 5. Linear Algebra and Matrix Computing
- References
- Chapter 6: Dimensionality Reduction
- 6.1 Example: Reducing 2D to 1D
- 6.2 Matrix Rotations
- 6.3 Notation
- 6.4 Summary (PCA vs. ICA vs. FA)
- 6.5 Principal Component Analysis (PCA)
- 6.6 Independent Component Analysis (ICA)
- 6.7 Factor Analysis (FA)
- 6.8 Singular Value Decomposition (SVD)
- 6.9 SVD Summary
- 6.10 Case Study for Dimension Reduction (Parkinson's Disease)
- 6.11 Assignments: 6. Dimensionality Reduction
- References
- Chapter 7: Lazy Learning: Classification Using Nearest Neighbors
- 7.1 Motivation
- 7.2 The kNN Algorithm Overview
- 7.3 Case Study
- 7.3.1 Step 1: Collecting Data
- 7.3.2 Step 2: Exploring and Preparing the Data
- 7.3.3 Normalizing Data
- 7.3.4 Data Preparation: Creating Training and Testing Datasets
- 7.3.5 Step 3: Training a Model on the Data
- 7.3.6 Step 4: Evaluating Model Performance
- 7.3.7 Step 5: Improving Model Performance
- 7.3.8 Testing Alternative Values of k
- 7.3.9 Quantitative Assessment (Tables 7.2 and 7.3)
- 7.4 Assignments: 7. Lazy Learning: Classification Using Nearest Neighbors
- References
- Chapter 8: Probabilistic Learning: Classification Using Naive Bayes
- 8.1 Overview of the Naive Bayes Algorithm
- 8.2 Assumptions
- 8.3 Bayes Formula
- 8.4 The Laplace Estimator
- 8.5 Case Study: Head and Neck Cancer Medication
- 8.6 Practice Problem
- 8.7 Assignments 8: Probabilistic Learning: Classification Using Naive Bayes
- References
- Chapter 9: Decision Tree Divide and Conquer Classification
- 9.1 Motivation
- 9.2 Hands-on Example: Iris Data
- 9.3 Decision Tree Overview
- 9.4 Case Study 1: Quality of Life and Chronic Disease
- 9.5 Compare Different Impurity Indices
- 9.6 Classification Rules
- 9.7 Case Study 2: QoL in Chronic Disease (Take 2)
- 9.8 Practice Problem
- 9.9 Assignments 9: Decision Tree Divide and Conquer Classification
- References
- Chapter 10: Forecasting Numeric Data Using Regression Models
- 10.1 Understanding Regression
- 10.2 Ordinary Least Squares Estimation
- 10.3 Case Study 1: Baseball Players
- 10.3.1 Step 1: Collecting Data
- 10.3.2 Step 2: Exploring and Preparing the Data
- 10.3.3 Exploring Relationships Among Features: The Correlation Matrix
- 10.3.4 Visualizing Relationships Among Features: The Scatterplot Matrix
- 10.3.5 Step 3: Training a Model on the Data
- 10.3.6 Step 4: Evaluating Model Performance
- 10.4 Step 5: Improving Model Performance
- 10.5 Understanding Regression Trees and Model Trees
- 10.6 Case Study 2: Baseball Players (Take 2)
- 10.7 Practice Problem: Heart Attack Data
- 10.8 Assignments: 10. Forecasting Numeric Data Using Regression Models
- References
- Chapter 11: Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
- 11.1 Understanding Neural Networks
- 11.2 Case Study 1: Google Trends and the Stock Market: Regression
- 11.3 Simple NN Demo: Learning to Compute
- 11.4 Case Study 2: Google Trends and the Stock Market: Classification
- 11.5 Support Vector Machines (SVM)
- 11.6 Case Study 3: Optical Character Recognition (OCR)
- 11.7 Case Study 4: Iris Flowers
- 11.8 Practice
- 11.9 Appendix
- 11.10 Assignments: 11. Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
- References
- Chapter 12: Apriori Association Rules Learning
- 12.1 Association Rules
- 12.2 The Apriori Algorithm for Association Rule Learning
- 12.3 Measuring Rule Importance by Using Support and Confidence
- 12.4 Building a Set of Rules with the Apriori Principle
- 12.5 A Toy Example
- 12.6 Case Study 1: Head and Neck Cancer Medications
- 12.7 Practice Problems: Groceries
- 12.8 Summary
- 12.9 Assignments: 12. Apriori Association Rules Learning
- References
- Chapter 13: k-Means Clustering
- 13.1 Clustering as a Machine Learning Task
- 13.2 Silhouette Plots
- 13.3 The k-Means Clustering Algorithm
- 13.4 Case Study 1: Divorce and Consequences on Young Adults
- 13.5 Model Improvement
- 13.6 Case Study 2: Pediatric Trauma
- 13.7 Hierarchical Clustering
- 13.8 Gaussian Mixture Models
- 13.9 Summary
- 13.10 Assignments: 13. k-Means Clustering
- References
- Chapter 14: Model Performance Assessment
- 14.1 Measuring the Performance of Classification Methods
- 14.2 Evaluation Strategies
- 14.3 Visualizing Performance Tradeoffs (ROC Curve)
- 14.4 Estimating Future Performance (Internal Statistical Validation)
- 14.5 Assignment: 14. Evaluation of Model Performance
- References
- Chapter 15: Improving Model Performance
- Chapter 16: Specialized Machine Learning Topics
- 16.1 Working with Specialized Data and Databases
- 16.1.1 Data Format Conversion
- 16.1.2 Querying Data in SQL Databases
- 16.1.3 Real Random Number Generation
- 16.1.4 Downloading the Complete Text of Web Pages
- 16.1.5 Reading and Writing XML with the XML Package
- 16.1.6 Web-Page Data Scraping
- 16.1.7 Parsing JSON from Web APIs
- 16.1.8 Reading and Writing Microsoft Excel Spreadsheets Using XLSX
- 16.2 Working with Domain-Specific Data
- 16.3 Data Streaming
- 16.3.1 Definition
- 16.3.2 The stream Package
- 16.3.3 Synthetic Example: Random Gaussian Stream
- 16.3.4 Sources of Data Streams
- 16.3.5 Printing, Plotting and Saving Streams
- 16.3.6 Stream Animation
- 16.3.7 Case-Study: SOCR Knee Pain Data
- 16.3.8 Data Stream Clustering and Classification (DSC)
- 16.3.9 Evaluation of Data Stream Clustering
- 16.4 Optimization and Improving the Computational Performance
- 16.5 Parallel Computing
- 16.6 Deploying Optimized Learning Algorithms
- 16.7 Practice Problem
- 16.8 Assignment: 16. Specialized Machine Learning Topics
- References
- Chapter 17: Variable/Feature Selection
- Chapter 18: Regularized Linear Modeling and Controlled Variable Selection
- 18.1 Questions
- 18.2 Matrix Notation
- 18.3 Regularized Linear Modeling
- 18.4 Linear Regression
- 18.5 Regularization Framework
- 18.6 Implementation of Regularization
- 18.6.1 Example: Neuroimaging-Genetics Study of Parkinson's Disease Dataset
- 18.6.2 Computational Complexity
- 18.6.3 LASSO and Ridge Solution Paths
- 18.6.4 Choice of the Regularization Parameter
- 18.6.5 Cross Validation Motivation
- 18.6.6 n-Fold Cross Validation
- 18.6.7 LASSO 10-Fold Cross Validation
- 18.6.8 Stepwise OLS (Ordinary Least Squares)
- 18.6.9 Final Models
- 18.6.10 Model Performance
- 18.6.11 Comparing Selected Features
- 18.6.12 Summary
- 18.7 Knock-off Filtering: Simulated Example
- 18.8 PD Neuroimaging-Genetics Case-Study
- 18.9 Assignment: 18. Regularized Linear Modeling and Knockoff Filtering
- References
- Chapter 19: Big Longitudinal Data Analysis
- Chapter 20: Natural Language Processing/Text Mining
- Chapter 21: Prediction and Internal Statistical Cross Validation
- 21.1 Forecasting Types and Assessment Approaches
- 21.2 Overfitting
- 21.3 Internal Statistical Cross-Validation is an Iterative Process
- 21.4 Example (Linear Regression)
- 21.5 Case-Studies
- 21.6 Summary of CV Output
- 21.7 Alternative Predictor Functions
- 21.8 Compare the Results
- 21.9 Assignment: 21. Prediction and Internal Statistical Cross-Validation
- References
- Chapter 22: Function Optimization
- Chapter 23: Deep Learning, Neural Networks
- Summary
- Glossary
- Index