When I first started learning Python, one of the things that really helped me was a basic introduction to Pandas. I work a lot with data, and Pandas is a library that is crucial for manipulating it. Coming from a SQL background, one thing I was looking to do was select a subset of entries from a Pandas dataframe. Recently I came across a post that serves as a Rosetta Stone between the two, showing a SQL command alongside how to obtain the same result with Pandas.[1]
As an example, here are some SELECT statements translated into Pandas commands:
SELECT * FROM df WHERE Age > 10
df[df['Age']>10]
Here, the goal is to build a new dataframe containing only the rows of df where Age is greater than 10. Pretty straightforward.
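To see what is going on inside the brackets, here is a minimal sketch. The sample data is made up for illustration (only the column names come from the examples in this post). The comparison produces a True/False value per row, and indexing with it keeps the True rows.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the post.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Chen", "Dee"],
    "Country": ["PT", "US", "CN", "IE"],
    "Age": [9, 10, 14, 31],
    "Height": [128, 135, 150, 170],
})

# The comparison yields a boolean Series, one True/False per row.
mask = df["Age"] > 10

# Indexing with that Series keeps only the rows where it is True.
older = df[mask]
print(older)
```

The result is a new dataframe with all of the original columns, which is what `SELECT *` with a `WHERE` clause would give you.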
SELECT Name, Country FROM df WHERE Age > 10
df[df['Age']>10][['Name','Country']]
Here, we’re selecting two fields from a dataframe. The first part sets the condition (where age is greater than 10) and the second part lists the fields, or columns, that we’re looking for.
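As a rough sketch with made-up sample data, the same selection can be written two ways. In the chained form, the columns go in as a list, hence the double brackets. The `.loc` form takes the row condition and the column list in a single step, and is generally the safer choice if you plan to assign values back into the selection.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the post.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Chen", "Dee"],
    "Country": ["PT", "US", "CN", "IE"],
    "Age": [9, 10, 14, 31],
})

# Chained form: filter the rows first, then pick columns with a list.
chained = df[df["Age"] > 10][["Name", "Country"]]

# Single-step form: .loc takes the row condition and the column list together.
selected = df.loc[df["Age"] > 10, ["Name", "Country"]]

print(selected)
```

Both forms return the same two-column dataframe here.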
Something to note is that logical connectors work inside the selection brackets too: use | for OR and & for AND, and wrap each condition in parentheses. That means you can have something like the following:
df[(df['Age']>12) | (df['Height'] > 130)]
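Here is a small sketch of the connectors side by side, again on made-up sample data. The parentheses matter because `|` and `&` bind more tightly than `>` in Python, so without them the expression is parsed in the wrong order. The `~` operator plays the role of SQL's NOT.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the post.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Chen", "Dee"],
    "Age": [9, 10, 14, 31],
    "Height": [128, 135, 150, 170],
})

# OR: like WHERE Age > 12 OR Height > 130
either = df[(df["Age"] > 12) | (df["Height"] > 130)]

# AND: like WHERE Age > 12 AND Height > 130
both = df[(df["Age"] > 12) & (df["Height"] > 130)]

# NOT: like WHERE NOT (Age > 12 OR Height > 130)
neither = df[~((df["Age"] > 12) | (df["Height"] > 130))]

print(either["Name"].tolist())
print(both["Name"].tolist())
print(neither["Name"].tolist())
```

Python's own `and`, `or`, and `not` keywords do not work here, since they try to reduce a whole column to a single True or False.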
Overall, this basic introduction to Pandas will make information selection much more efficient. Data science really is 90% getting the columns in the order that you want them, and 10% actually implementing whatever model you’re hoping to build. Python really can be used throughout the data science pipeline, and I look forward to experimenting with it further.
-
[1] While doing research for this post, I learned that Pandas has a similar section in its own documentation!