Как выбрать несколько столбцов в Pandas (с примерами)
Существует три основных метода, которые вы можете использовать для выбора нескольких столбцов фрейма данных pandas:
Способ 1: выбор столбцов по индексу
Способ 2: выберите столбцы в диапазоне индексов
Способ 3: выберите столбцы по имени
В следующих примерах показано, как использовать каждый метод со следующими пандами DataFrame:
Способ 1: выбор столбцов по индексу
В следующем коде показано, как выбрать столбцы в позициях индекса 0, 1 и 3:
Обратите внимание, что выбраны столбцы в позициях индекса 0, 1 и 3.
Примечание.Первый столбец в кадре данных pandas находится в позиции 0.
Способ 2: выберите столбцы в диапазоне индексов
В следующем коде показано, как выбрать столбцы в диапазоне индексов от 0 до 3:
Обратите внимание, что столбец, расположенный в последнем значении диапазона (3), не будет включен в вывод.
Способ 3: выберите столбцы по имени
Следующий код показывает, как выбрать столбцы по имени:
Дополнительные ресурсы
В следующих руководствах объясняется, как выполнять другие распространенные операции в pandas:
23 эффективных способа разделения фрейма данных Pandas

В части 1 и части 2 мы узнали, как проверять, описывать и резюмировать фрейм данных Pandas. Сегодня мы узнаем, как извлечь подмножество фрейма данных Pandas. Это очень полезно, потому что мы часто хотим выполнять операции с подмножествами наших данных. Существует много разных способов разделения фрейма данных Pandas. Возможно, вам потребуется выбрать определенные столбцы со всеми строками. Иногда вам нужно выбрать определенные строки со всеми столбцами или выбрать строки и столбцы, соответствующие определенному критерию и т. Д.
Все способы разбиения на подмножества можно разделить на 4 категории: Выбор, Нарезка, Индексирование и Фильтрация.

Продолжая читать этот пост, вы узнаете о различиях между этими категориями.
Прежде чем обсуждать какой-либо из методов разделения фрейма данных, стоит различать объект Pandas Series и объект Pandas DataFrame.
Объекты Pandas Series и DataFrame
Серии и DataFrame — две основные структуры данных в Pandas. Просто Serie s похож на один столбец данных, а DataFrame похож на лист с строки и столбцы. Взгляните на следующую диаграмму:

Как видите, Series является одномерным, а DataFrame — двумерным. Если мы объединим два или более объектов серии вместе, мы можем получить объект DataFrame. Давайте посмотрим на фактический вид объекта Series.

Серия состоит из двух компонентов: Одномерные значения данных и Индекс. В индексе есть значимые метки для каждого значения данных. Пользователи могут использовать этот индекс для выбора значений. По умолчанию индекс начинается с 0.
Давайте посмотрим на фактический вид объекта DataFrame.

DataFrame состоит из трех компонентов: Двумерные значения данных, Индекс строки и Индекс столбца . Эти индексы предоставляют понятные метки для строк и столбцов. Пользователи могут использовать эти индексы для выбора строк и столбцов. По умолчанию индексы начинаются с 0.
Теперь мы обсудим различные способы разделения фрейма данных Pandas. Для пояснения я буду использовать «набор данных вина». Вот его часть.

Выбор
Когда мы захватываем весь столбец (столбцы), он называется Выбор. Выбранные столбцы содержат все строки.
Метод 1: выбор одного столбца с использованием имени столбца
Мы можем выбрать один столбец Pandas DataFrame, используя его имя. Если DataFrame упоминается как df, общий синтаксис следующий:
Результатом является серия Pandas, представляющая собой один столбец!

Метод 2: выбор нескольких столбцов с использованием имен столбцов
Мы можем выбрать несколько столбцов Pandas DataFrame, используя его имена. Мы можем определить имена столбцов внутри списка:
Затем мы можем включить этот список в df []. Общий синтаксис:
На этот раз на выходе будет фрейм данных Pandas!

Метод 3: выбор одного столбца с использованием атрибута .loc
Тот же результат в методе 1 можно получить с помощью атрибута .loc, который выбирает данные Pandas по метке (имя столбца).
Метод 4: выбор нескольких столбцов с использованием атрибута .loc
Тот же результат в методе 2 можно получить с помощью атрибута .loc, который выбирает данные Pandas по меткам (именам столбцов).
Общий синтаксис атрибута .loc:
Если меток несколько, их нужно указать внутри списков:
Если мы хотим выбрать все строки или столбцы, это можно сделать с помощью символа:. Самым важным в атрибуте .loc является то, что он выбирает данные Pandas по метке.
Метод 5: выбор одного столбца с помощью атрибута .iloc
Тот же результат в методе 1 можно получить с помощью атрибута .iloc, который выбирает данные Pandas по позиции (индекс столбца).
Переменная алкоголь находится в позиции 0 (первая переменная).
Метод 6: выбор нескольких столбцов с помощью атрибута .iloc
Тот же результат в методе 2 можно получить с помощью атрибута .iloc, который выбирает данные Pandas по позициям (индексы столбцов).
Общий синтаксис атрибута .iloc:
Если меток несколько, их нужно указать внутри списков:
Если мы хотим выбрать все строки или столбцы, это можно сделать с помощью записи :. Самым важным в атрибуте .iloc является то, что он выбирает данные Pandas по позиции с использованием числовых индексов.
Метод 7. Выбор последовательных столбцов с помощью атрибута .iloc (простой способ)
Мы можем выбрать первые 5 столбцов df следующим образом:
Мы также можем использовать следующий простой метод, чтобы получить тот же результат.
Для этого столбцы следует располагать последовательно. Диапазон 0: 5 включает 0 (первый столбец), исключает 5 (шестой столбец) и выбирает каждый столбец между диапазоном.
Метод 8: выбор последнего столбца
Во многих случаях часто бывает полезно выбрать последний столбец. Есть два метода:
Во-первых, мы можем подсчитать количество столбцов во фрейме данных с помощью атрибута .shape.
Последний столбец — 13-й, к которому можно получить доступ через индекс 12. Используя .iloc,

Второй способ намного проще. Здесь нам не нужно знать количество столбцов во фрейме данных.
-1 представляет последний столбец.
Нарезка
Когда мы хотим извлечь определенные строки из DataFrame, это называется нарезкой. Извлеченные строки называются срезами и содержат все столбцы.
Метод 9: выбор одной строки с помощью атрибута .iloc
Самый простой способ извлечь одну строку — использовать индекс строки внутри атрибута .iloc. Общий синтаксис:
Результатом является серия Pandas, содержащая значения строк.

Внешний вид немного сбивает с толку, поскольку на выходе получается серия Pandas. Если вы хотите, чтобы это была сама строка, просто используйте значения индекса внутри списка следующим образом:

Это фрейм данных Pandas, содержащий 1 строку и все столбцы!
Метод 10: выбор нескольких строк с помощью атрибута .iloc
Мы можем извлечь несколько строк из Pandas DataFrame, используя его индексы строк. Мы включаем индексы строк в список:
Затем мы включаем этот список в df.iloc [].
На выходе получается фрейм данных Pandas.

Метод 11: выбор последних нескольких строк
Отрицательные индексы считают строки снизу.

Индексирование
Когда мы объединяем выбор столбца и разрезание строк, это называется Индексирование. Здесь мы можем использовать атрибуты .loc и .iloc фрейма данных Pandas.
Метод 12: выбор одного значения с помощью атрибута .iloc
Если мы укажем одну строку и один столбец, пересечение будет одним значением!

Помните, что мы не можем использовать имена столбцов или строк внутри .iloc []. Могут использоваться только порядковые номера.
Метод 13: выбор одного значения с помощью атрибута .loc
Здесь мы можем использовать имена строк или столбцов внутри .loc []. Также имейте в виду, что в наших данных метки строк такие же, как и индексы строк. Следующий код дает тот же результат, что и в методе 12.
Метод 14: выбор нескольких строк и столбцов с помощью атрибута .iloc
На выходе получается фрейм данных Pandas.

Метод 15: выбор нескольких строк и столбцов с использованием атрибута .loc
На выходе получается фрейм данных Pandas.

Здесь мы можем использовать имена строк или столбцов внутри .loc []. Также имейте в виду, что в наших данных имена строк совпадают с индексами строк.
Метод 16: выбор последовательных строк и столбцов с использованием атрибутов .loc и .iloc (простой способ)
Это легко сделать с помощью записи :. Для этого строки и столбцы следует располагать последовательно.


Фильтрация
Когда мы выбираем строки и столбцы на основе определенных критериев или условий, это называется фильтрацией. Мы также можем комбинировать вышеупомянутые методы с этим.
Метод 17: фильтрация по одному критерию со всеми столбцами
Давайте рассмотрим подмножество наших данных, когда алкоголь ›14.3. Здесь мы выбираем все столбцы, когда алкоголь ›14.3.

Это серия Pandas с логическим типом данных. Мы можем использовать эту серию для получения необходимого подмножества данных.

Метод 18: фильтрация по одному критерию с несколькими столбцами
Давайте рассмотрим подмножество наших данных, когда алкоголь ›14.3. На этот раз мы выбираем только 3 столбца, когда алкоголь ›14,3. Для этого мы можем комбинировать описанный выше метод фильтрации с .loc [].

Метод 19: фильтрация на основе двух критериев с оператором И (тот же столбец)
Давайте возьмем подмножество наших данных, когда алкоголь ›14,3 И алкоголь‹ 14,6. Здесь мы используем два условия и комбинируем их с оператором AND. Каждое условие следует заключить в круглые скобки.

Метод 20: фильтрация по двум критериям с помощью метода between ()
Подобный тип фильтрации, описанный в методе 19, может быть достигнут с помощью метода между ().

Здесь вывод немного отличается, потому что метод между () по умолчанию включает значения нижней границы (14,3) и верхней границы (14,6). Однако мы можем передать inclusive = False, если нам не нужен исчерпывающий выбор.

Это подмножество точно такое же, как подмножество, полученное в методе 19.
Метод 21: фильтрация на основе двух критериев с оператором И (разные столбцы)
Здесь для двух условий используются два разных столбца: алкоголь и оттенок.

Метод 22: фильтрация по двум критериям с оператором ИЛИ
Когда мы используем оператор AND, фильтрация выполняется с учетом выполнения обоих условий. Если мы хотим, чтобы хотя бы одно условие выполнялось, мы можем использовать оператор OR.

Метод 23: фильтрация по минимальному и максимальному значениям
Давайте разберем наши данные на основе минимального и максимального значений переменной алкоголь. Сначала получаем индексы минимума и максимума:
Затем мы используем .iloc [].

Резюме
Это не единственные способы разделения фрейма данных Pandas. Есть еще много чего. Мы можем комбинировать несколько методов для сложного подмножества. Этот пост поможет вам познакомиться с синтаксисом подмножеств. Кроме того, теперь вы знакомы с терминами Выбор, Нарезка, Индексирование и Фильтрация. Также имейте в виду, что для .iloc требуются целочисленные значения (i для целых), а для .loc нужны значения ярлыков.
Это конец сегодняшней публикации. Мои читатели могут подписаться на членство по следующей ссылке, чтобы получить полный доступ ко всем рассказам, которые я пишу, и я получу часть вашего членского взноса.
Большое спасибо за вашу постоянную поддержку! Увидимся в следующей истории. Всем удачи!
Особая благодарность Гансу-Питеру Гаустеру на Unsplash, который предоставил мне красивую обложку для этого сообщения.
How do I select a subset of a DataFrame ?#
How do I select specific columns from a DataFrame ?#
I’m interested in the age of the Titanic passengers.
To select a single column, use square brackets [] with the column name of the column of interest.
Each column in a DataFrame is a Series . As a single column is selected, the returned object is a pandas Series . We can verify this by checking the type of the output:
And have a look at the shape of the output:
DataFrame.shape is an attribute (remember tutorial on reading and writing , do not use parentheses for attributes) of a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional and only the number of rows is returned.
I’m interested in the age and sex of the Titanic passengers.
To select multiple columns, use a list of column names within the selection brackets [] .
The inner square brackets define a Python list with column names, whereas the outer brackets are used to select the data from a pandas DataFrame as seen in the previous example.
The returned data type is a pandas DataFrame:
The selection returned a DataFrame with 891 rows and 2 columns. Remember, a DataFrame is 2-dimensional with both a row and column dimension.
For basic information on indexing, see the user guide section on indexing and selecting data .
How do I filter specific rows from a DataFrame ?#
I’m interested in the passengers older than 35 years.
To select rows based on a conditional expression, use a condition inside the selection brackets [] .
The condition inside the selection brackets titanic["Age"] > 35 checks for which rows the Age column has a value larger than 35:
The output of the conditional expression ( > , but also == , != , < , <= ,… would work) is actually a pandas Series of boolean values (either True or False ) with the same number of rows as the original DataFrame . Such a Series of boolean values can be used to filter the DataFrame by putting it in between the selection brackets [] . Only rows for which the value is True will be selected.
We know from before that the original Titanic DataFrame consists of 891 rows. Let’s have a look at the number of rows which satisfy the condition by checking the shape attribute of the resulting DataFrame above_35 :
I’m interested in the Titanic passengers from cabin class 2 and 3.
Similar to the conditional expression, the isin() conditional function returns a True for each row the values are in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets [] . In this case, the condition inside the selection brackets titanic["Pclass"].isin([2, 3]) checks for which rows the Pclass column is either 2 or 3.
The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an | (or) operator:
When combining multiple conditional statements, each condition must be surrounded by parentheses () . Moreover, you can not use or / and but need to use the or operator | and the and operator & .
See the dedicated section in the user guide about boolean indexing or about the isin function .
I want to work with passenger data for which the age is known.
The notna() conditional function returns a True for each row the values are not a Null value. As such, this can be combined with the selection brackets [] to filter the data table.
You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if the shape has changed:
For more dedicated functions on missing values, see the user guide section about handling missing data .
How do I select specific rows and columns from a DataFrame ?#
I’m interested in the names of the passengers older than 35 years.
In this case, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. The loc / iloc operators are required in front of the selection brackets [] . When using loc / iloc , the part before the comma is the rows you want, and the part after the comma is the columns you want to select.
When using the column names, row labels or a condition expression, use the loc operator in front of the selection brackets [] . For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.
I’m interested in rows 10 till 25 and columns 3 to 5.
Again, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the iloc operator in front of the selection brackets [] .
When selecting specific rows and/or columns with loc or iloc , new values can be assigned to the selected data. For example, to assign the name anonymous to the first 3 elements of the fourth column:
See the user guide section on different choices for indexing to get more insight in the usage of loc and iloc .
REMEMBER
When selecting subsets of data, square brackets [] are used.
Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
Select specific rows and/or columns using loc when using the row and column names.
Select specific rows and/or columns using iloc when using the positions in the table.
You can assign new values to a selection based on loc / iloc .
A full overview of indexing is provided in the user guide pages on indexing and selecting data .
Part 1: Selection with [ ] , .loc and .iloc
![]()
This is the beginning of a four-part series on how to select subsets of data from a pandas DataFrame or Series. Pandas offers a wide variety of options for subset selection which necessitates multiple articles. This series is broken down into the following four topics.
Learn More
Master Data Analysis with Python is an extremely comprehensive text with over 80 chapters and 500 exercises to help you become an expert.
Assumptions before we begin
These series of articles assume you have no knowledge of pandas, but that you understand the fundamentals of the Python programming language. It also assumes that you have installed pandas on your machine.
The easiest way to get pandas along with Python and the rest of the main scientific computing libraries is to install the Miniconda distribution (follow the link for a comprehensive tutorial).
If you have no knowledge of Python then I suggest completing an introductory book like Master the Fundamentals of Python cover to cover.
The importance of making subset selections
You might be wondering why there need to be so many articles on selecting subsets of data. This topic is extremely important to pandas and it’s unfortunate that it is fairly complicated because subset selection happens frequently during an actual analysis. Because you are frequently making subset selections, you need to master it in order to make your life with pandas easier.
Always reference the documentation
The material in this article is also covered in the official pandas documentation on Indexing and Selecting Data. I highly recommend that you read that part of the documentation along with this tutorial. In fact, the documentation is one of the primary means for mastering pandas. I wrote a step-by-step article, How to Learn Pandas, which gives suggestions on how to use the documentation as you master pandas.
The anatomy of a DataFrame and a Series
The pandas library has two primary containers of data, the DataFrame and the Series. You will spend nearly all your time working with both of the objects when you use pandas. The DataFrame is used more than the Series, so let’s take a look at an image of it first.
This image comes with some added illustrations to highlight its components. At first glance, the DataFrame looks like any other two-dimensional table of data that you have seen. It has rows and it has columns. Technically, there are three main components of the DataFrame.
The three components of a DataFrame
A DataFrame is composed of three different components, the index, columns, and the data. The data is also known as the values.
The index represents the sequence of values on the far left-hand side of the DataFrame. All the values in the index are in bold font. Each individual value of the index is called a label. Sometimes the index is referred to as the row labels. In the example above, the row labels are not very interesting and are just the integers beginning from 0 up to n-1, where n is the number of rows in the table. Pandas defaults DataFrames with this simple index.
The columns are the sequence of values at the very top of the DataFrame. They are also in bold font. Each individual value of the columns is called a column, but can also be referred to as column name or column label.
Everything else not in bold font is the data or values. You will sometimes hear DataFrames referred to as tabular data. This is just another name for a rectangular table data with rows and columns.
Axis and axes
It is also common terminology to refer to the rows or columns as an axis. Collectively, we call them axes. So, a row is an axis and a column is another axis.
The word axis appears as a parameter in many DataFrame methods. Pandas allows you to choose the direction of how the method will work with this parameter. This has nothing to do with subset selection so you can just ignore it for now.
Each row has a label and each column has a label
The main takeaway from the DataFrame anatomy is that each row has a label and each column has a label. These labels are used to refer to specific rows or columns in the DataFrame. It’s the same as how humans use names to refer to specific people.
What is subset selection?
Before we start doing subset selection, it might be good to define what it is. Subset selection is simply selecting particular rows and columns of data from a DataFrame (or Series). This could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns.
Example selecting some columns and all rows
Let’s see some images of subset selection. We will first look at a sample DataFrame with fake data.
Let’s say we want to select just the columns color , age , and height but keep all the rows.
Our final DataFrame would look like this:
Example selecting some rows and all columns
We can also make selections that select just some of the rows. Let’s select the rows with labels Aaron and Dean along with all of the columns:
Our final DataFrame would like:
Example selecting some rows and some columns
Let’s combine the selections from above and select the columns color , age , and height for only the rows with labels Aaron and Dean .
Our final DataFrame would look like this:
Pandas dual references: by label and by integer location
We already mentioned that each row and each column have a specific label that can be used to reference them. This is displayed in bold font in the DataFrame.
But, what hasn’t been mentioned, is that each row and column may be referenced by an integer as well. I call this integer location. The integer location begins at 0 and ends at n-1 for each row and column. Take a look above at our sample DataFrame one more time.
The rows with labels Aaron and Dean can also be referenced by their respective integer locations 2 and 4. Similarly, the columns color , age and height can be referenced by their integer locations 1, 3, and 4.
The documentation refers to integer location as position. I don’t particularly like this terminology as its not as explicit as integer location. The key thing term here is INTEGER.
What’s the difference between indexing and selecting subsets of data?
The documentation uses the term indexing frequently. This term is essentially just a one-word phrase to say ‘subset selection’. I prefer the term subset selection as, again, it is more descriptive of what is actually happening. Indexing is also the term used in the official Python documentation.
Focusing only on [] , .loc , and .iloc
There are many ways to select subsets of data, but in this article we will only cover the usage of the square brackets ( [] ), .loc and .iloc . Collectively, they are called the indexers. These are by far the most common ways to select data. A different part of this Series will discuss a few methods that can be used to make subset selections.
If you have a DataFrame, df , your subset selection will look something like the following:
A real subset selection will have something inside of the square brackets. All selections in this article will take place inside of those square brackets.
Notice that the square brackets also follow .loc and .iloc . All indexing in Python happens inside of these square brackets.
A term for just those square brackets
The term indexing operator is used to refer to the square brackets following an object. The .loc and .iloc indexers also use the indexing operator to make selections. I will use the term just the indexing operator to refer to df[] . This will distinguish it from df.loc[] and df.iloc[] .
Read in data into a DataFrame with read_csv
Let’s begin using pandas to read in a DataFrame, and from there, use the indexing operator by itself to select subsets of data. All the data for these tutorials are in the data directory.
We will use the read_csv function to read in data into a DataFrame. We pass the path to the file as the first argument to the function. We will also use the index_col parameter to select the first column of data as the index (more on this later).
Extracting the individual DataFrame components
Earlier, we mentioned the three components of the DataFrame. The index, columns and data (values). We can extract each of these components into their own variables. Let’s do that and then inspect them:
Data types of the components
Let’s output the type of each component to understand exactly what kind of object they are.
Understanding these types
Interestingly, both the index and the columns are the same type. They are both a pandas Index object. This object is quite powerful in itself, but for now you can just think of it as a sequence of labels for either the rows or the columns.
The values are a NumPy ndarray , which stands for n-dimensional array, and is the primary container of data in the NumPy library. Pandas is built directly on top of NumPy and it's this array that is responsible for the bulk of the workload.
Beginning with just the indexing operator on DataFrames
We will begin our journey of selecting subsets by using just the indexing operator on a DataFrame. Its main purpose is to select a single column or multiple columns of data.
Selecting a single column as a Series
To select a single column of data, simply put the name of the column in-between the brackets. Let’s select the food column:
Anatomy of a Series
Selecting a single column of data returns the other pandas data container, the Series. A Series is a one-dimensional sequence of labeled data. There are two main components of a Series, the index and the data(or values). There are NO columns in a Series.
The visual display of a Series is just plain text, as opposed to the nicely styled table for DataFrames. The sequence of person names on the left is the index. The sequence of food items on the right is the values.
You will also notice two extra pieces of data on the bottom of the Series. The name of the Series becomes the old-column name. You will also see the data type or dtype of the Series. You can ignore both these items for now.
Selecting multiple columns with just the indexing operator
It’s possible to select multiple columns with just the indexing operator by passing it a list of column names. Let’s select color , food , and score :
Selecting multiple columns returns a DataFrame
Selecting multiple columns returns a DataFrame. You can actually select a single column as a DataFrame with a one-item list:
Although, this resembles the Series from above, it is technically a DataFrame, a different object.
Column order doesn’t matter
When selecting multiple columns, you can select them in any order that you choose. It doesn’t have to be the same order as the original DataFrame. For instance, let’s select height and color .
Exceptions
There are a couple common exceptions that arise when doing selections with just the indexing operator.
- If you misspell a word, you will get a KeyError
- If you forgot to use a list to contain multiple columns you will also get a KeyError
Summary of just the indexing operator
- Its primary purpose is to select columns by the column names
- Select a single column as a Series by passing the column name directly to it: df['col_name']
- Select multiple columns as a DataFrame by passing a list to it: df[['col_name1', 'col_name2']]
- You actually can select rows with it, but this will not be shown here as it is confusing and not used often.
Getting started with .loc
The .loc indexer selects data in a different way than just the indexing operator. It can select subsets of rows or columns. It can also simultaneously select subsets of rows and columns. Most importantly, it only selects data by the LABEL of the rows and columns.
Select a single row as a Series with .loc
The .loc indexer will return a single row as a Series when given a single row label. Let's select the row for Niko .
We now have a Series, where the old column names are now the index labels. The name of the Series has become the old index label, Niko in this case.
Select multiple rows as a DataFrame with .loc
To select multiple rows, put all the row labels you want to select in a list and pass that to .loc . Let's select Niko and Penelope .
Use slice notation to select a range of rows with .loc
It is possible to ‘slice’ the rows of a DataFrame with .loc by using slice notation. Slice notation uses a colon to separate start, stop and step values. For instance we can select all the rows from Niko through Dean like this:
.loc includes the last value with slice notation
Notice that the row labeled with Dean was kept. In other data containers such as Python lists, the last value is excluded.
Other slices
You can use slice notation similarly to how you use it with lists. Let’s slice from the beginning through Aaron :
Slice from Niko to Christina stepping by 2:
Slice from Dean to the end:
Selecting rows and columns simultaneously with .loc
Unlike just the indexing operator, it is possible to select rows and columns simultaneously with .loc . You do it by separating your row and column selections by a comma. It will look something like this:
Select two rows and three columns
For instance, if we wanted to select the rows Dean and Cornelia along with the columns age , state and score we would do this:
Use any combination of selections for either row or columns for .loc
Row or column selections can be any of the following as we have already seen:
- A single label
- A list of labels
- A slice with labels
We can use any of these three for either row or column selections with .loc . Let's see some examples.
Let’s select two rows and a single column:
Select a slice of rows and a list of columns:
Select a single row and a single column. This returns a scalar value.
Select a slice of rows and columns
Selecting all of the rows and some columns
It is possible to select all of the rows by using a single colon. You can then select columns as normal:
You can also use this notation to select all of the columns:
But, it isn’t necessary as we have seen, so you can leave out that last colon:
Assign row and column selections to variables
It might be easier to assign row and column selections to variables before you use .loc . This is useful if you are selecting many rows or columns:
Summary of .loc
- Only uses labels
- Can select rows and columns simultaneously
- Selection can be a single label, a list of labels or a slice of labels
- Put a comma between row and column selections
If you are enjoying this article, consider purchasing the All Access Pass which includes all my current and future material for one low price.
Getting started with .iloc
The .iloc indexer is very similar to .loc but only uses integer locations to make its selections. The word .iloc itself stands for integer location so that should help with remember what it does.
Selecting a single row with .iloc
By passing a single integer to .iloc , it will select one row as a Series:
Selecting multiple rows with .iloc
Use a list of integers to select multiple rows:
Use slice notation to select a range of rows with .iloc
Slice notation works just like a list in this instance and is exclusive of the last element
Select 3rd position until end:
Select 3rd position to end by 2:
Master Python, Data Science and Machine Learning
Immerse yourself in my comprehensive path for mastering data science and machine learning with Python. Purchase the All Access Pass to get lifetime access to all current and future courses. Some of the courses it contains:
-
— A comprehensive introduction to Python (300+ pages, 150+ exercises) — The most comprehensive course available to learn pandas. (800+ pages and 350+ exercises) — A deep dive into doing machine learning with scikit-learn constantly updated to showcase the latest and greatest tools. (300+ pages)
Selecting rows and columns simultaneously with .iloc
Just like with .iloc any combination of a single integer, lists of integers or slices can be used to select rows and columns simultaneously. Just remember to separate the selections with a comma.
Select two rows and two columns:
Select a slice of the rows and two columns:
Select slices for both
Select a single row and column
Select all the rows and a single column
Deprecation of .ix
Early in the development of pandas, there existed another indexer, ix . This indexer was capable of selecting both by label and by integer location. While it was versatile, it caused lots of confusion because it's not explicit. Sometimes integers can also be labels for rows or columns. Thus there were instances where it was ambiguous.
You can still call .ix , but it has been deprecated, so please never use it.
Selecting subsets of Series
We can also, of course, do subset selection with a Series. Earlier I recommended using just the indexing operator for column selection on a DataFrame. Since Series do not have columns, I suggest using only .loc and .iloc . You can use just the indexing operator, but its ambiguous as it can take both labels and integers. I will come back to this at the end of the tutorial.
Typically, you will create a Series by selecting a single column from a DataFrame. Let’s select the food column:
Series selection with .loc
Series selection with .loc is quite simple, since we are only dealing with a single dimension. You can again use a single row label, a list of row labels or a slice of row labels to make your selection. Let's see several examples.
Let’s select a single value:
Select three different values. This returns a Series:
Slice from Niko to Christina — is inclusive of last index
Slice from Penelope to the end:
Select a single value in a list which returns a Series
Series selection with .iloc
Series subset selection with .iloc happens similarly to .loc except it uses integer location. You can use a single integer, a list of integers or a slice of integers. Let's see some examples.
Select a single value:
Use a list of integers to select multiple values:
Use a slice — is exclusive of last integer
Comparison to Python lists and dictionaries
It may be helpful to compare pandas ability to make selections by label and integer location to that of Python lists and dictionaries.
Python lists allow for selection of data only through integer location. You can use a single integer or slice notation to make the selection but NOT a list of integers.
Let’s see examples of subset selection of lists using integers:
Selection by label with Python dictionaries
All values in each dictionary are labeled by a key. We use this key to make single selections. Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.
Pandas has power of lists and dictionaries
DataFrames and Series are able to make selections with integers like a list and with labels like a dictionary.
Extra Topics
There are a few more items that are important and belong in this tutorial and will be mentioned now.
Using just the indexing operator to select rows from a DataFrame — Confusing!
Above, I used just the indexing operator to select a column or columns from a DataFrame. But, it can also be used to select rows using a slice. This behavior is very confusing in my opinion. The entire operation changes completely when a slice is passed.
Let’s use an integer slice as our first example:
To add to this confusion, you can slice by labels as well.
I recommend not doing this!
This feature is not deprecated and completely up to you whether you wish to use it. But, I highly prefer not to select rows in this manner as can be ambiguous, especially if you have integers in your index.
Using .iloc and .loc is explicit and clearly tells the person reading the code what is going to happen. Let's rewrite the above using .iloc and .loc .
Cannot simultaneously select rows and columns with []
An exception will be raised if you try and select rows and columns simultaneously with just the indexing operator. You must use .loc or .iloc to do so.
Using just the indexing operator to select rows from a Series — Confusing!
You can also use just the indexing operator with a Series. Again, this is confusing because it can accept integers or labels. Let’s see some examples
Since Series don’t have columns you can use a single label and list of labels to make selections as well
Again, I recommend against doing this and always use .iloc or .loc
Importing data without choosing an index column
We imported data by choosing the first column to be the index with the index_col parameter of the read_csv function. This is not typically how most DataFrames are read into pandas.
Usually, all the columns in the csv file become DataFrame columns. Pandas will use the integers 0 to n-1 as the labels. See the example data below with a slightly different dataset:
The default RangeIndex
If you don’t specify a column to be the index when first reading in the data, pandas will use the integers 0 to n-1 as the index. This technically creates a RangeIndex object. Let's take a look at it.
This object is similar to Python range objects. Let's create one:
Converting both of these objects to a list produces the exact same thing:
For now, it’s not at all important that you have a RangeIndex . Selections from it happen just the same with .loc and .iloc . Let's look at some examples.
There is a subtle difference when using a slice. .iloc excludes the last value, while .loc includes it:
Setting an index from a column after reading in data
It is common to see pandas code that reads in a DataFrame with a RangeIndex and then sets the index to be one of the columns. This is typically done with the set_index method:
The index has a name
Notice that this DataFrame does not look exactly like our first one from the very top of this tutorial. Directly above the index is the bold-faced word Names . This is technically the name of the index. Our original DataFrame had no name for its index. You can ignore this small detail for now. Subset selections will happen in the same fashion.
DataFrame column selection with dot notation
Pandas allows you to select a single column as a Series by using dot notation. This is also referred to as attribute access. You simply place the name of the column without quotes following a dot and the DataFrame like this:
Pros and cons when selecting columns by attribute access
The best benefit of selecting columns like this is that you get help when chaining methods after selection. For instance, if you place another dot after the column name and press tab, a list of all the Series methods will appear in a pop-up menu. It will look like this:
This help disappears when you use just the indexing operator:
The biggest drawback is that you cannot select columns that have spaces or other characters that are not valid as Python identifiers (variable names).
Selecting the same column twice?
This is rather peculiar, but you can actually select the same column more than once:
Summary of Part 1
We covered an incredible amount of ground. Let’s summarize all the main points:
- Before learning pandas, ensure you have the fundamentals of Python
- Always refer to the documentation when learning new pandas operations
- The DataFrame and the Series are the containers of data
- A DataFrame is two-dimensional, tabular data
- A Series is a single dimension of data
- The three components of a DataFrame are the index, the columns and the data (or values)
- Each row and column of the DataFrame is referenced by both a label and an integer location
- There are three primary ways to select subsets from a DataFrame — [] , .loc and .iloc
- I use the term just the indexing operator to refer to [] immediately following a DataFrame/Series
- Just the indexing operator’s primary purpose is to select a column or columns from a DataFrame
- Using a single column name to just the indexing operator returns a single column of data as a Series
- Passing multiple columns in a list to just the indexing operator returns a DataFrame
- A Series has two components, the index and the data (values). It has no columns
- .loc makes selections only by label
- .loc can simultaneously select rows and columns
- .loc can make selections with either a single label, a list of labels, or a slice of labels
- .loc makes row selections first followed by column selections: df.loc[row_selection, col_selection]
- .iloc is analogous to .loc but uses only integer location to refer to rows or columns.
- .ix is deprecated and should never be used
- .loc and .iloc work the same for Series except they only select based on the index as there are no columns
- Pandas combines the power of python lists (selection via integer location) and dictionaries (selection by label)
- You can use just the indexing operator to select rows from a DataFrame, but I recommend against this and instead sticking with the explicit .loc and .iloc
- Normally data is imported without setting an index. Use the set_index method to use a column as an index.
- You can select a single column as a Series from a DataFrame with dot notation
Way more to the story
This is only part 1 of the series, so there is much more to cover on how to select subsets of data in pandas. Some of the explanations in this part will be expanded to include other possibilities.
Master Python, Data Science and Machine Learning
Immerse yourself in my comprehensive path for mastering data science and machine learning with Python. Purchase the All Access Pass to get lifetime access to all current and future courses. Some of the courses it contains: