Kaggle is a great source of data: you can find a dataset for almost any kind of practice project. Downloading a Kaggle dataset by hand, moving it into your working directory, and extracting it can be a hassle, though. So today I will guide you through downloading any Kaggle dataset directly from your Jupyter or Colab notebook.
We will first set up Colab and then look at Jupyter Notebook. To download any Kaggle dataset, you must have a Kaggle account.
Get Kaggle API Token
The first step is to download your “API token”. Open your Kaggle account settings and look for the section called ‘API’.
There are two buttons there: ‘Create New API Token’ and ‘Expire API Token’. The second one invalidates any token you created earlier. If you already have a token, click ‘Expire API Token’ first and then ‘Create New API Token’; otherwise, click ‘Create New API Token’ directly. This downloads a file called ‘kaggle.json’, which contains your username and API key.
{"username":"Your username","key":"2f4997fa1d8e4f56ad8eb7659aaf1c31"}
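Before uploading the file anywhere, it can help to confirm that it really has the two fields shown above. Here is a minimal sketch of such a check; the function name `load_kaggle_token` and the `REQUIRED_FIELDS` constant are my own illustrative names, not part of the Kaggle package.

```python
import json

# The two fields shown in the sample kaggle.json above.
REQUIRED_FIELDS = ("username", "key")


def load_kaggle_token(path="kaggle.json"):
    """Load the token file and check that both fields are non-empty strings.

    Hypothetical helper for illustration; the Kaggle client does its own checks.
    """
    with open(path, encoding="utf-8") as fh:
        token = json.load(fh)
    if not isinstance(token, dict):
        raise ValueError("kaggle.json should contain a JSON object")
    missing = [
        name for name in REQUIRED_FIELDS
        if not isinstance(token.get(name), str) or not token[name]
    ]
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {', '.join(missing)}")
    return token
```

Calling `load_kaggle_token("kaggle.json")` returns the parsed dictionary, or raises `ValueError` if a field is missing or empty.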
Setup Colab
Once you have your token, the next step is to open Colab and log in with your Google account; if you don’t have one, please create one. Then follow these steps.
Install Kaggle
!pip install kaggle
Upload the downloaded JSON file
from google.colab import files
files.upload()  # opens a file picker; choose the kaggle.json you downloaded
Now we have to move the kaggle.json file into the .kaggle folder in the home directory so that the Kaggle client can find your credentials whenever it makes a request.
# Make a directory named .kaggle and copy the kaggle.json file there.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
# Restrict the file's permissions so that only you can read and write it
!chmod 600 ~/.kaggle/kaggle.json
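If you would rather do the same three steps from Python instead of shell commands, the following is a rough equivalent using only the standard library. The function name `install_kaggle_token` and its `home` parameter are assumptions of mine, added so the destination can be overridden; by default it targets your real home directory.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(source="kaggle.json", home=None):
    """Copy the token into <home>/.kaggle and make it readable only by its owner.

    Hypothetical helper mirroring the mkdir / cp / chmod commands above.
    """
    home_dir = Path(home) if home is not None else Path.home()
    config_dir = home_dir / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)   # like: mkdir -p ~/.kaggle
    target = config_dir / "kaggle.json"
    shutil.copyfile(source, target)                 # like: cp kaggle.json ~/.kaggle/
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)   # like: chmod 600 (POSIX systems)
    return target
```

On Windows, `os.chmod` can only toggle the read-only flag, so the permission step mainly matters on Linux and macOS (which includes Colab).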
Get Kaggle dataset API command
Everything is now set up; we just need the API command for the dataset you want to download. Open Kaggle and find the dataset you want to work with (I am using the Netflix dataset for this demonstration). Click the three dots to the right of the ‘New Notebook’ button and choose ‘Copy API command’. Then run the copied command in a notebook cell, prefixed with ‘!’:
# !<your dataset API command>
!kaggle datasets download -d shivamb/netflix-shows  # here I am using the Netflix dataset
This downloads the dataset as a .zip file. Let’s extract it with the following code:
from zipfile import ZipFile
file_name = 'netflix-shows.zip'  # must match the exact name of the downloaded archive
with ZipFile(file_name, 'r') as zf:
    zf.extractall()  # extract into the current working directory
print('Done')
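If you don’t know in advance which files the archive contains, a small variation on the extraction step can report them. This is only a sketch: `extract_dataset` is a hypothetical helper name, and it assumes the data files you care about end in `.csv`.

```python
from pathlib import Path
from zipfile import ZipFile


def extract_dataset(archive, dest="."):
    """Extract the archive into dest and return the paths of the CSV files it held.

    Hypothetical helper; assumes the dataset's data files use the .csv extension.
    """
    dest_dir = Path(dest)
    with ZipFile(archive, "r") as zf:
        zf.extractall(dest_dir)
        names = zf.namelist()
    return sorted(dest_dir / name for name in names if name.lower().endswith(".csv"))
```

For the Netflix example, `extract_dataset('netflix-shows.zip')` would extract into the current directory and return the list of CSV paths, which you can then pass on to your loading code.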
And that’s it. You can now start working with the data by reading it with pandas:
import pandas as pd
data = pd.read_csv('netflix_titles.csv')
Jupyter notebook
Every step is the same for Jupyter Notebook; the only difference is how the commands are run. We will work in the Command Prompt, so open it and follow these steps.
Install kaggle
pip install kaggle
Now we have to create a .kaggle folder in your home directory, which is where the Kaggle client looks for your credentials. If the Command Prompt does not start in your home directory, switch to it with the ‘cd’ command followed by the path to your home directory:
cd C:\users\buggyprogrammer
Create .kaggle folder
mkdir .kaggle
Now go to the Downloads folder, or wherever the JSON file was downloaded, and move the file into the .kaggle folder. You can do this manually by copying the file in File Explorer and pasting it into your .kaggle folder, or with these commands:
Go to the download directory
cd C:\users\buggyprogrammer\downloads
Move the JSON file
move kaggle.json C:\users\buggyprogrammer\.kaggle
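To double-check from Python that the file ended up in the right place, you could run something like the sketch below in a notebook cell. `find_kaggle_token` is an illustrative name of my own; it only looks in the `.kaggle` folder under the home directory, as described above.

```python
from pathlib import Path


def find_kaggle_token(home=None):
    """Return the path to <home>/.kaggle/kaggle.json, or None if it is not there.

    Hypothetical helper; `home` defaults to the current user's home directory.
    """
    home_dir = Path(home) if home is not None else Path.home()
    token = home_dir / ".kaggle" / "kaggle.json"
    return token if token.is_file() else None
```

If `find_kaggle_token()` returns `None`, the move step above most likely did not put the file where expected.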
Now we are all set. Go to your Jupyter Notebook/Lab and follow these steps to download the dataset:
import kaggle
!kaggle datasets download -d shivamb/netflix-shows
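The ‘!’ prefix only works inside a notebook. In a plain Python script, one option is to call the same command through `subprocess`. The sketch below does that; the names `archive_name` and `download_dataset` are hypothetical, and the rule that the archive is named after the part of the dataset identifier following the slash is an assumption based on the Netflix example in this post.

```python
import subprocess


def archive_name(dataset):
    """Turn 'owner/dataset-name' into 'dataset-name.zip'.

    Assumption: the archive is named after the dataset, as in the example above.
    """
    owner, sep, name = dataset.partition("/")
    if not sep or not owner or not name or "/" in name:
        raise ValueError(f"expected 'owner/dataset-name', got {dataset!r}")
    return f"{name}.zip"


def download_dataset(dataset):
    """Run the same CLI command as above and return the expected archive name."""
    # check=True raises CalledProcessError if the kaggle command fails.
    subprocess.run(["kaggle", "datasets", "download", "-d", dataset], check=True)
    return archive_name(dataset)
```

For example, `download_dataset("shivamb/netflix-shows")` would run the download and return `'netflix-shows.zip'`, ready for the extraction step.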
Extract the archive
from zipfile import ZipFile
file_name = 'netflix-shows.zip'  # must match the exact name of the downloaded archive
with ZipFile(file_name, 'r') as zf:
    zf.extractall()  # extract into the current working directory
print('Done')
And that’s it. You can now start working with the data by reading it with pandas:
import pandas as pd
data = pd.read_csv('netflix_titles.csv')
Congrats, 😀🎉
Data Scientist with 3+ years of experience in building data-intensive applications in diverse industries. Proficient in predictive modeling, computer vision, natural language processing, data visualization etc. Aside from being a data scientist, I am also a blogger and photographer.