Using Statistics: Data Analysis on Female Pima Indians and NYC Airbnb

Back in 2003-2007, statistics was one of those majors that no one touched. In fact, I remember graduating in 2007 from the University of Michigan, a school with over 30,000 undergrads, with only about 20 other statistics majors. On graduation day, all the graduates sat in the statistics faculty lounge and received a USB junk drive that probably had 500 MB of space (suuuh cuuuute!). We felt so cool.

During my time in school, I loved using R - I kid you not - I was quite weird. I wouldn't say that I was the best statistician by far, but my computer coding skills in statistical packages were just "muah" (five-finger kiss into an exploding hand bomb). My data analysis was always intuitive, efficient, and clear. My stats classmates (all 5 of them) would be reeling to grab my R Workspace code so they could use it to finish their homework. Fast forward 12 years - data analytics, data science, data-driven decision making, and data blah blah blah are all the rage. Meanwhile, I haven't done any real stats work outside of some Excel pivot tables or a "recording and modifying" of Excel macros (that surprisingly still amaze some people), while my actual statistics skills have gone to shit. As such, I am inspired to re-learn R.

I'm starting at the very beginning and super simple, because frankly, this is just for fun. In this post, I am just going to do some very basic data clean-up and quick initial data look-throughs. I'll make sure to post my R code to download below. Take a look, if you are a nerd like me. Also, if you think I did anything wrong, let me know! I still make mistakes counting sometimes and I'm sure I'll make mistakes here :)

*Beware: Formatting of the outputs is not what I am after here. I'm writing R code to look at statistical outputs and using statistical methodology focused around the practice of regression and analysis of variance. I do throw in a few comments here and there.

Data Set #1: Diabetes and Kidney Disease Study on Adult Female Pima Indians

My first set of data is from the National Institute of Diabetes and Digestive and Kidney Diseases. The institute conducted a study on 768 adult female Pima Indians living near Phoenix. The following variables were recorded:

  • Number of times pregnant
  • Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • Diastolic blood pressure (mmHg)
  • Triceps skinfold thickness (mm)
  • 2-hour serum insulin (mu U/ml)
  • Body mass index (weight in kg/(height in m)^2)
  • Diabetes pedigree function
  • Age (years)
  • A test of whether the patient showed signs of diabetes (coded zero if negative, one if positive)
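
For a first look-through of data like this, a minimal sketch in R might go as follows. The column names and the three rows are made up for illustration (they are not values from the study), and treating zeros as missing measurements is an assumption about how the file codes missing data, so check it against the real file before relying on it.

```r
# Toy stand-in for the Pima data: hypothetical column names, invented rows.
pima <- data.frame(
  pregnant  = c(2, 0, 5),
  glucose   = c(120, 95, 160),
  diastolic = c(70, 0, 82),     # a blood pressure of 0 is not physically possible
  triceps   = c(30, 25, 0),
  insulin   = c(100, 0, 150),
  bmi       = c(31.2, 24.8, 35.1),
  diabetes  = c(0.45, 0.20, 0.80),
  age       = c(34, 22, 47),
  test      = c(1, 0, 1)
)

# Assumption: a zero in these measurements means "not recorded".
measured <- c("glucose", "diastolic", "triceps", "insulin", "bmi")
pima[measured] <- lapply(pima[measured], function(x) replace(x, x == 0, NA))

# Store the test result as a labelled factor instead of 0/1.
pima$test <- factor(pima$test, levels = c(0, 1),
                    labels = c("negative", "positive"))

summary(pima)          # quick look-through, including NA counts per column
colSums(is.na(pima))   # missing values per column
```

Recoding before running summary() matters here: otherwise the zeros quietly drag down the means and minimums of columns where zero is not a plausible value.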

Source Data: UCI Repository of machine learning databases

Edward's R Code: Initial Data Analysis in R for Data Set #1

Data Set #2: New York City Home Listing Data on Airbnb

For my second set of data, I wanted to make sure to look at something closer to home, so I decided to look at Airbnb listings in New York City. I airbnb'd my place a lot to make ends meet, so I'm kinda obsessed with them. There are also a lot of regulatory issues surrounding home sharing in NYC right now, so maybe there are some insights that can be taken away from actual data. I quickly found a website that had some Airbnb data that was updated in March 2018 (I'm not exactly sure if it's real, but this is just for fun, so it doesn't bother me). Sadly, the source of this data didn't include an explanation of the actual variables, so I can only infer what they are from the header names - most are pretty intuitive. The data included these variables:

  • Id
  • Name
  • Host_id
  • Host_name
  • Neighbourhood_group
  • Neighbourhood**
  • Latitude
  • Longitude
  • Room_type
  • Price
  • Minimum_nights
  • Number_of_reviews
  • Last_review
  • Reviews_per_month
  • Calculated_host_listings_count
  • Availability_365

**The way they spelled neighborhood bothered me. LOL. It killed me when I was trying to enter it into R, because I kept spelling it wrong!
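
Since the variables have to be inferred from the header names, a quick structure check right after loading helps. Here is a rough sketch using two invented rows in place of the real file. The values are made up, only a few of the columns are shown, and I'm assuming the headers are lowercase. With the actual download you would pass the file path to read.csv instead of the text argument.

```r
# Two invented listings standing in for the real CSV (not actual Airbnb data).
csv_text <- "id,neighbourhood_group,neighbourhood,room_type,price,minimum_nights,last_review
1,Brooklyn,Williamsburg,Entire home/apt,150,2,2018-02-14
2,Manhattan,Harlem,Private room,80,1,"

# Read blanks as NA so a listing with no reviews does not get an empty string.
airbnb <- read.csv(text = csv_text, stringsAsFactors = FALSE, na.strings = "")

# Convert columns to the types the header names suggest.
airbnb$room_type   <- factor(airbnb$room_type)
airbnb$last_review <- as.Date(airbnb$last_review)   # blank dates stay NA

str(airbnb)                         # confirm the inferred types
table(airbnb$neighbourhood_group)   # listings per neighbourhood group
summary(airbnb$price)               # spread of prices
```

Running str() first is the cheap way to catch a column that loaded as text when you expected a number or a date.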

Source Data:

Edward's R Code: Initial Data Analysis in R for Data Set #2