Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter
"Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter" is a book that teaches the reader how to use the Python programming language and its associated libraries, such as pandas, NumPy, and Jupyter, for data analysis and manipulation. The book covers topics such as how to work with pandas data structures, how to use pandas for data wrangling, how to use NumPy for numerical computation, and how to use Jupyter for interactive data visualization. The book also provides practical examples and exercises to help the reader gain hands-on experience with these tools. Overall, the book aims to provide a comprehensive introduction to using Python for data analysis and manipulation.
"Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter" is a book that covers the use of the Python programming language for data analysis, specifically using the pandas, NumPy, and Jupyter libraries. The book covers topics such as cleaning and manipulating data, working with missing data, and handling time series data. It also covers advanced topics such as merging and joining data, reshaping data, and working with data in different formats. Additionally, the book includes practical examples and exercises to help the reader apply the concepts learned.
Table of Contents
1. Preliminaries 1
1.1 What Is This Book About? 1
What Kinds of Data? 1
1.2 Why Python for Data Analysis? 2
Python as Glue 3
Solving the “Two-Language” Problem 3
Why Not Python? 3
1.3 Essential Python Libraries 4
NumPy 4
pandas 5
matplotlib 6
IPython and Jupyter 6
SciPy 7
scikit-learn 8
statsmodels 8
Other Packages 9
1.4 Installation and Setup 9
Miniconda on Windows 9
GNU/Linux 10
Miniconda on macOS 11
Installing Necessary Packages 11
Integrated Development Environments and Text Editors 12
1.5 Community and Conferences 13
1.6 Navigating This Book 14
Code Examples 15
Data for Examples 15
Import Conventions 16
2. Python Language Basics, IPython, and Jupyter Notebooks 17
2.1 The Python Interpreter 18
2.2 IPython Basics 19
Running the IPython Shell 19
Running the Jupyter Notebook 20
Tab Completion 23
Introspection 25
2.3 Python Language Basics 26
Language Semantics 26
Scalar Types 34
Control Flow 42
2.4 Conclusion 45
3. Built-In Data Structures, Functions, and Files 47
3.1 Data Structures and Sequences 47
Tuple 47
List 51
Dictionary 55
Set 59
Built-In Sequence Functions 62
List, Set, and Dictionary Comprehensions 63
3.2 Functions 65
Namespaces, Scope, and Local Functions 67
Returning Multiple Values 68
Functions Are Objects 69
Anonymous (Lambda) Functions 70
Generators 71
Errors and Exception Handling 74
3.3 Files and the Operating System 76
Bytes and Unicode with Files 80
3.4 Conclusion 82
4. NumPy Basics: Arrays and Vectorized Computation 83
4.1 The NumPy ndarray: A Multidimensional Array Object 85
Creating ndarrays 86
Data Types for ndarrays 88
Arithmetic with NumPy Arrays 91
Basic Indexing and Slicing 92
Boolean Indexing 97
Fancy Indexing 100
Transposing Arrays and Swapping Axes 102
4.2 Pseudorandom Number Generation 103
4.3 Universal Functions: Fast Element-Wise Array Functions 105
4.4 Array-Oriented Programming with Arrays 108
Expressing Conditional Logic as Array Operations 110
Mathematical and Statistical Methods 111
Methods for Boolean Arrays 113
Sorting 114
Unique and Other Set Logic 115
4.5 File Input and Output with Arrays 116
4.6 Linear Algebra 116
4.7 Example: Random Walks 118
Simulating Many Random Walks at Once 120
4.8 Conclusion 121
5. Getting Started with pandas 123
5.1 Introduction to pandas Data Structures 124
Series 124
DataFrame 129
Index Objects 136
5.2 Essential Functionality 138
Reindexing 138
Dropping Entries from an Axis 141
Indexing, Selection, and Filtering 142
Arithmetic and Data Alignment 152
Function Application and Mapping 158
Sorting and Ranking 160
Axis Indexes with Duplicate Labels 164
5.3 Summarizing and Computing Descriptive Statistics 165
Correlation and Covariance 168
Unique Values, Value Counts, and Membership 170
5.4 Conclusion 173
6. Data Loading, Storage, and File Formats 175
6.1 Reading and Writing Data in Text Format 175
Reading Text Files in Pieces 182
Writing Data to Text Format 184
Working with Other Delimited Formats 185
JSON Data 187
XML and HTML: Web Scraping 189
6.2 Binary Data Formats 193
Reading Microsoft Excel Files 194
Using HDF5 Format 195
6.3 Interacting with Web APIs 197
6.4 Interacting with Databases 199
6.5 Conclusion 201
7. Data Cleaning and Preparation 203
7.1 Handling Missing Data 203
Filtering Out Missing Data 205
Filling In Missing Data 207
7.2 Data Transformation 209
Removing Duplicates 209
Transforming Data Using a Function or Mapping 211
Replacing Values 212
Renaming Axis Indexes 214
Discretization and Binning 215
Detecting and Filtering Outliers 217
Permutation and Random Sampling 219
Computing Indicator/Dummy Variables 221
7.3 Extension Data Types 224
7.4 String Manipulation 227
Python Built-In String Object Methods 227
Regular Expressions 229
String Functions in pandas 232
7.5 Categorical Data 235
Background and Motivation 236
Categorical Extension Type in pandas 237
Computations with Categoricals 240
Categorical Methods 242
7.6 Conclusion 245
8. Data Wrangling: Join, Combine, and Reshape 247
8.1 Hierarchical Indexing 247
Reordering and Sorting Levels 250
Summary Statistics by Level 251
Indexing with a DataFrame’s columns 252
8.2 Combining and Merging Datasets 253
Database-Style DataFrame Joins 254
Merging on Index 259
Concatenating Along an Axis 263
Combining Data with Overlap 268
8.3 Reshaping and Pivoting 270
Reshaping with Hierarchical Indexing 270
Pivoting “Long” to “Wide” Format 273
Pivoting “Wide” to “Long” Format 277
8.4 Conclusion 279
9. Plotting and Visualization 281
9.1 A Brief matplotlib API Primer 282
Figures and Subplots 283
Colors, Markers, and Line Styles 288
Ticks, Labels, and Legends 290
Annotations and Drawing on a Subplot 294
Saving Plots to File 296
matplotlib Configuration 297
9.2 Plotting with pandas and seaborn 298
Line Plots 298
Bar Plots 301
Histograms and Density Plots 309
Scatter or Point Plots 311
Facet Grids and Categorical Data 314
9.3 Other Python Visualization Tools 317
9.4 Conclusion 317
10. Data Aggregation and Group Operations 319
10.1 How to Think About Group Operations 320
Iterating over Groups 324
Selecting a Column or Subset of Columns 326
Grouping with Dictionaries and Series 327
Grouping with Functions 328
Grouping by Index Levels 328
10.2 Data Aggregation 329
Column-Wise and Multiple Function Application 331
Returning Aggregated Data Without Row Indexes 335
10.3 Apply: General split-apply-combine 335
Suppressing the Group Keys 338
Quantile and Bucket Analysis 338
Example: Filling Missing Values with Group-Specific Values 340
Example: Random Sampling and Permutation 343
Example: Group Weighted Average and Correlation 344
Example: Group-Wise Linear Regression 347
10.4 Group Transforms and “Unwrapped” GroupBys 347
10.5 Pivot Tables and Cross-Tabulation 351
Cross-Tabulations: Crosstab 354
10.6 Conclusion 355
11. Time Series 357
11.1 Date and Time Data Types and Tools 358
Converting Between String and Datetime 359
11.2 Time Series Basics 361
Indexing, Selection, Subsetting 363
Time Series with Duplicate Indices 365
11.3 Date Ranges, Frequencies, and Shifting 366
Generating Date Ranges 367
Frequencies and Date Offsets 370
Shifting (Leading and Lagging) Data 371
11.4 Time Zone Handling 374
Time Zone Localization and Conversion 375
Operations with Time Zone-Aware Timestamp Objects 377
Operations Between Different Time Zones 378
11.5 Periods and Period Arithmetic 379
Period Frequency Conversion 380
Quarterly Period Frequencies 382
Converting Timestamps to Periods (and Back) 384
Creating a PeriodIndex from Arrays 385
11.6 Resampling and Frequency Conversion 387
Downsampling 388
Upsampling and Interpolation 391
Resampling with Periods 392
Grouped Time Resampling 394
11.7 Moving Window Functions 396
Exponentially Weighted Functions 399
Binary Moving Window Functions 401
User-Defined Moving Window Functions 402
11.8 Conclusion 403
12. Introduction to Modeling Libraries in Python 405
12.1 Interfacing Between pandas and Model Code 405
12.2 Creating Model Descriptions with Patsy 408
Data Transformations in Patsy Formulas 410
Categorical Data and Patsy 412
12.3 Introduction to statsmodels 415
Estimating Linear Models 415
Estimating Time Series Processes 419
12.4 Introduction to scikit-learn 420
12.5 Conclusion 423
13. Data Analysis Examples 425
13.1 Bitly Data from 1.USA.gov 425
Counting Time Zones in Pure Python 426
Counting Time Zones with pandas 428
13.2 MovieLens 1M Dataset 435
Measuring Rating Disagreement 439
13.3 US Baby Names 1880–2010 443
Analyzing Naming Trends 448
13.4 USDA Food Database 457
13.5 2012 Federal Election Commission Database 463
Donation Statistics by Occupation and Employer 466
Bucketing Donation Amounts 469
Donation Statistics by State 471
13.6 Conclusion 472
A. Advanced NumPy 473
A.1 ndarray Object Internals 473
NumPy Data Type Hierarchy 474
A.2 Advanced Array Manipulation 476
Reshaping Arrays 476
C Versus FORTRAN Order 478
Concatenating and Splitting Arrays 479
Repeating Elements: tile and repeat 481
Fancy Indexing Equivalents: take and put 483
A.3 Broadcasting 484
Broadcasting over Other Axes 487
Setting Array Values by Broadcasting 489
A.4 Advanced ufunc Usage 490
ufunc Instance Methods 490
Writing New ufuncs in Python 493
A.5 Structured and Record Arrays 493
Nested Data Types and Multidimensional Fields 494
Why Use Structured Arrays? 495
A.6 More About Sorting 495
Indirect Sorts: argsort and lexsort 497
Alternative Sort Algorithms 498
Partially Sorting Arrays 499
numpy.searchsorted: Finding Elements in a Sorted Array 500
A.7 Writing Fast NumPy Functions with Numba 501
Creating Custom numpy.ufunc Objects with Numba 502
A.8 Advanced Array Input and Output 503
Memory-Mapped Files 503
HDF5 and Other Array Storage Options 504
A.9 Performance Tips 505
The Importance of Contiguous Memory 505
B. More on the IPython System 509
B.1 Terminal Keyboard Shortcuts 509
B.2 About Magic Commands 510
The %run Command 512
Executing Code from the Clipboard 513
B.3 Using the Command History 514
Searching and Reusing the Command History 514
Input and Output Variables 515
B.4 Interacting with the Operating System 516
Shell Commands and Aliases 517
Directory Bookmark System 518
B.5 Software Development Tools 519
Interactive Debugger 519
Timing Code: %time and %timeit 523
Basic Profiling: %prun and %run -p 525
Profiling a Function Line by Line 527
B.6 Tips for Productive Code Development Using IPython 529
Reloading Module Dependencies 529
Code Design Tips 530
B.7 Advanced IPython Features 532
Profiles and Configuration 532
B.8 Conclusion 533