Skip to content

Pandas

Simple Descriptive Statistics

Before taking the big guns, such as aggregations, I usually try to first get a good grasp of the data inside the data frame. Looking at the head and tail of the data framew is one of the first things you should do:

# check the top of the data frame
df.head()

# check the botto of the data frame
df.tail()

If the data frame is very large you can also grab a random sample of it:

# Fetch an exact number of random lines (make sure that freq=None)
df.sample(n=5)

# Fetch a percentage of random lines (make sure that n=None)
df.sample(freq=0.1)

Last but not least, you can try describe and value_counts. describe will return a new data frame consisting of various basic stats for each of the numeric columns in the data frame (count, mean, std, min, max, 25%, 50% 75%):

df.describe()

#            price  area
#   count   150.0   150.000000
#   mean    943139.5    100.085800
#   std     844677.0    44.566817
#   min     62000.0 26.220000
#   25%     419875.0    70.740000
#   50%     585500.0    93.710000
#   75%     1132500.0   126.697500
#   max     4500000.0   279.500000
  • TODO: Document value_counts

value_counts gets applied applied to a discrete series instead, and is a really easy way of checking how many rows there of each value there are in a data frame:

df["city"].value_counts()

#   München    96
#   Hamburg    24
#   Berlin     16
#   Bremen     14

Encode categorical data

Simple Encoding

There are many times, when you would need to turn your series of categorical data (e.g. city names, professions, car models, etc) into numeric representations. The simplest way to do this using Pandas, is to call factorize on the series you want to get encoded.12

factorize

Encode the object as an enumerated type or categorical variable. This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available as both a top-level function pandas.factorize(), and as a method

df['c_code'] = pd.factorize(df['city'])[0]

print(df)

#    city      c_code
#  München  0
#  Bremen   1
#  Berlin   2
#  Hamburg  3