Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter

    "Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter" teaches the reader how to use the Python programming language and its data tools for data analysis and manipulation. It covers pandas data structures and data wrangling, NumPy for numerical computation, and IPython and Jupyter for interactive computing, with plotting handled through matplotlib and seaborn.


    Topics include loading, cleaning, and transforming data, handling missing data, merging, joining, and reshaping datasets, working with data in different file formats, and time series. Practical examples throughout help the reader apply the concepts.


    Table of Contents


    1. Preliminaries

    1.1 What Is This Book About?

    What Kinds of Data?

    1.2 Why Python for Data Analysis?

    Python as Glue

    Solving the “Two-Language” Problem

    Why Not Python?

    1.3 Essential Python Libraries

    NumPy

    pandas

    matplotlib

    IPython and Jupyter

    SciPy

    scikit-learn

    statsmodels

    Other Packages

    1.4 Installation and Setup

    Miniconda on Windows

    GNU/Linux

    Miniconda on macOS

    Installing Necessary Packages

    Integrated Development Environments and Text Editors

    1.5 Community and Conferences

    1.6 Navigating This Book

    Code Examples

    Data for Examples

    Import Conventions

    2. Python Language Basics, IPython, and Jupyter Notebooks

    2.1 The Python Interpreter

    2.2 IPython Basics

    Running the IPython Shell

    Running the Jupyter Notebook

    Tab Completion

    Introspection

    2.3 Python Language Basics

    Language Semantics

    Scalar Types

    Control Flow

    2.4 Conclusion

    3. Built-In Data Structures, Functions, and Files

    3.1 Data Structures and Sequences

    Tuple

    List

    Dictionary

    Set

    Built-In Sequence Functions

    List, Set, and Dictionary Comprehensions

    3.2 Functions

    Namespaces, Scope, and Local Functions

    Returning Multiple Values

    Functions Are Objects

    Anonymous (Lambda) Functions

    Generators

    Errors and Exception Handling

    3.3 Files and the Operating System

    Bytes and Unicode with Files

    3.4 Conclusion

    4. NumPy Basics: Arrays and Vectorized Computation

    4.1 The NumPy ndarray: A Multidimensional Array Object

    Creating ndarrays

    Data Types for ndarrays

    Arithmetic with NumPy Arrays

    Basic Indexing and Slicing

    Boolean Indexing

    Fancy Indexing

    Transposing Arrays and Swapping Axes

    4.2 Pseudorandom Number Generation

    4.3 Universal Functions: Fast Element-Wise Array Functions

    4.4 Array-Oriented Programming with Arrays

    Expressing Conditional Logic as Array Operations

    Mathematical and Statistical Methods

    Methods for Boolean Arrays

    Sorting

    Unique and Other Set Logic

    4.5 File Input and Output with Arrays

    4.6 Linear Algebra

    4.7 Example: Random Walks

    Simulating Many Random Walks at Once

    4.8 Conclusion

    5. Getting Started with pandas

    5.1 Introduction to pandas Data Structures

    Series

    DataFrame

    Index Objects

    5.2 Essential Functionality

    Reindexing

    Dropping Entries from an Axis

    Indexing, Selection, and Filtering

    Arithmetic and Data Alignment

    Function Application and Mapping

    Sorting and Ranking

    Axis Indexes with Duplicate Labels

    5.3 Summarizing and Computing Descriptive Statistics

    Correlation and Covariance

    Unique Values, Value Counts, and Membership

    5.4 Conclusion

    6. Data Loading, Storage, and File Formats

    6.1 Reading and Writing Data in Text Format

    Reading Text Files in Pieces

    Writing Data to Text Format

    Working with Other Delimited Formats

    JSON Data

    XML and HTML: Web Scraping

    6.2 Binary Data Formats

    Reading Microsoft Excel Files

    Using HDF5 Format

    6.3 Interacting with Web APIs

    6.4 Interacting with Databases

    6.5 Conclusion

    7. Data Cleaning and Preparation

    7.1 Handling Missing Data

    Filtering Out Missing Data

    Filling In Missing Data

    7.2 Data Transformation

    Removing Duplicates

    Transforming Data Using a Function or Mapping

    Replacing Values

    Renaming Axis Indexes

    Discretization and Binning

    Detecting and Filtering Outliers

    Permutation and Random Sampling

    Computing Indicator/Dummy Variables

    7.3 Extension Data Types

    7.4 String Manipulation

    Python Built-In String Object Methods

    Regular Expressions

    String Functions in pandas

    7.5 Categorical Data

    Background and Motivation

    Categorical Extension Type in pandas

    Computations with Categoricals

    Categorical Methods

    7.6 Conclusion

    8. Data Wrangling: Join, Combine, and Reshape

    8.1 Hierarchical Indexing

    Reordering and Sorting Levels

    Summary Statistics by Level

    Indexing with a DataFrame’s columns

    8.2 Combining and Merging Datasets

    Database-Style DataFrame Joins

    Merging on Index

    Concatenating Along an Axis

    Combining Data with Overlap

    8.3 Reshaping and Pivoting

    Reshaping with Hierarchical Indexing

    Pivoting “Long” to “Wide” Format

    Pivoting “Wide” to “Long” Format

    8.4 Conclusion

    9. Plotting and Visualization

    9.1 A Brief matplotlib API Primer

    Figures and Subplots

    Colors, Markers, and Line Styles

    Ticks, Labels, and Legends

    Annotations and Drawing on a Subplot

    Saving Plots to File

    matplotlib Configuration

    9.2 Plotting with pandas and seaborn

    Line Plots

    Bar Plots

    Histograms and Density Plots

    Scatter or Point Plots

    Facet Grids and Categorical Data

    9.3 Other Python Visualization Tools

    9.4 Conclusion

    10. Data Aggregation and Group Operations

    10.1 How to Think About Group Operations

    Iterating over Groups

    Selecting a Column or Subset of Columns

    Grouping with Dictionaries and Series

    Grouping with Functions

    Grouping by Index Levels

    10.2 Data Aggregation

    Column-Wise and Multiple Function Application

    Returning Aggregated Data Without Row Indexes

    10.3 Apply: General split-apply-combine

    Suppressing the Group Keys

    Quantile and Bucket Analysis

    Example: Filling Missing Values with Group-Specific Values

    Example: Random Sampling and Permutation

    Example: Group Weighted Average and Correlation

    Example: Group-Wise Linear Regression

    10.4 Group Transforms and “Unwrapped” GroupBys

    10.5 Pivot Tables and Cross-Tabulation

    Cross-Tabulations: Crosstab

    10.6 Conclusion

    11. Time Series

    11.1 Date and Time Data Types and Tools

    Converting Between String and Datetime

    11.2 Time Series Basics

    Indexing, Selection, Subsetting

    Time Series with Duplicate Indices

    11.3 Date Ranges, Frequencies, and Shifting

    Generating Date Ranges

    Frequencies and Date Offsets

    Shifting (Leading and Lagging) Data

    11.4 Time Zone Handling

    Time Zone Localization and Conversion

    Operations with Time Zone-Aware Timestamp Objects

    Operations Between Different Time Zones

    11.5 Periods and Period Arithmetic

    Period Frequency Conversion

    Quarterly Period Frequencies

    Converting Timestamps to Periods (and Back)

    Creating a PeriodIndex from Arrays

    11.6 Resampling and Frequency Conversion

    Downsampling

    Upsampling and Interpolation

    Resampling with Periods

    Grouped Time Resampling

    11.7 Moving Window Functions

    Exponentially Weighted Functions

    Binary Moving Window Functions

    User-Defined Moving Window Functions

    11.8 Conclusion

    12. Introduction to Modeling Libraries in Python

    12.1 Interfacing Between pandas and Model Code

    12.2 Creating Model Descriptions with Patsy

    Data Transformations in Patsy Formulas

    Categorical Data and Patsy

    12.3 Introduction to statsmodels

    Estimating Linear Models

    Estimating Time Series Processes

    12.4 Introduction to scikit-learn

    12.5 Conclusion

    13. Data Analysis Examples

    13.1 Bitly Data from 1.USA.gov

    Counting Time Zones in Pure Python

    Counting Time Zones with pandas

    13.2 MovieLens 1M Dataset

    Measuring Rating Disagreement

    13.3 US Baby Names 1880–2010

    Analyzing Naming Trends

    13.4 USDA Food Database

    13.5 2012 Federal Election Commission Database

    Donation Statistics by Occupation and Employer

    Bucketing Donation Amounts

    Donation Statistics by State

    13.6 Conclusion

    A. Advanced NumPy

    A.1 ndarray Object Internals

    NumPy Data Type Hierarchy

    A.2 Advanced Array Manipulation

    Reshaping Arrays

    C Versus FORTRAN Order

    Concatenating and Splitting Arrays

    Repeating Elements: tile and repeat

    Fancy Indexing Equivalents: take and put

    A.3 Broadcasting

    Broadcasting over Other Axes

    Setting Array Values by Broadcasting

    A.4 Advanced ufunc Usage

    ufunc Instance Methods

    Writing New ufuncs in Python

    A.5 Structured and Record Arrays

    Nested Data Types and Multidimensional Fields

    Why Use Structured Arrays?

    A.6 More About Sorting

    Indirect Sorts: argsort and lexsort

    Alternative Sort Algorithms

    Partially Sorting Arrays

    numpy.searchsorted: Finding Elements in a Sorted Array

    A.7 Writing Fast NumPy Functions with Numba

    Creating Custom numpy.ufunc Objects with Numba

    A.8 Advanced Array Input and Output

    Memory-Mapped Files

    HDF5 and Other Array Storage Options

    A.9 Performance Tips

    The Importance of Contiguous Memory

    B. More on the IPython System

    B.1 Terminal Keyboard Shortcuts

    B.2 About Magic Commands

    The %run Command

    Executing Code from the Clipboard

    B.3 Using the Command History

    Searching and Reusing the Command History

    Input and Output Variables

    B.4 Interacting with the Operating System

    Shell Commands and Aliases

    Directory Bookmark System

    B.5 Software Development Tools

    Interactive Debugger

    Timing Code: %time and %timeit

    Basic Profiling: %prun and %run -p

    Profiling a Function Line by Line

    B.6 Tips for Productive Code Development Using IPython

    Reloading Module Dependencies

    Code Design Tips

    B.7 Advanced IPython Features

    Profiles and Configuration

    B.8 Conclusion


