Pandas >> How to Read various data in Pandas
Table of Contents
In this article, we will discuss how to read various type of data in Pandas
including pickle, clipboard, excel, json, HTML, XML, parquet files.
How to read pickle file
We can use read_pickle
method of pandas
and specify the path of pickle file to read pickle file.
# import pandas module
import pandas as pd
# use read_pickle method, and return pandas DataFrame object
df = pd.read_pickle("path/to/sample.pkl")
How to read content from clipboard
We can use read_clipboard
of pandas
to read content from clipboard.
We can use sep
option to specify delimiter and the default setting is one or more whitespace characters.
# import pandas module
import pandas as pd
# use sep option to specifying delimiter
df = pd.read_clipboard(sep=';')
How to read excel file
We can use read_excel
of pandas
to read content from Excel.
# import pandas module
import pandas as pd
# We can specify sheet name by specifying `sheet_name` option.
# If we don't want to have the index, we can specify `index_col=None`.
# If we want to treat the first line as header, we can use `header=0`.
df = pd.read_excel('path/to/sample.xlsx', sheet_name='Sheet1', index_col=None, header=0)
# If there is no header, we can use `header=None`
df = pd.read_excel('path/to/sample.xlsx', sheet_name='Sheet1', index_col=None, header=None)
How to read json file
json file can store unstructured data, we can use read_json
to read json file into pandas.
orient
option is used to indicate JSON string format:
orient='split'
: JSON string is dict like{index -> [index], columns -> [columns], data -> [values]}
orient='records'
: JSON string is list like[{column -> value}, ... , {column -> value}]
orient='index'
: JSON string is dict like{index -> {column -> value}}
orient='columns'
: JSON string is dict like{column -> {index -> value}}
orient='values'
: JSON string is just the values array
# import pandas module
import pandas as pd
df = pd.read_json('path/to/sample.json', orient='split')
How to read HTML file
HTML file can be read into pandas using read_html
method of pandas.
read_html
will return a list of DataFrames.
# import pandas module
import pandas as pd
# `match` is used to find the expected table and read it
# default settings is `.+` (match any non-empty string)
# If we have a table containing "John", we can specify `match='John'` to find it.
list_of_df = pd.read_html('str, path object or file-like object', match='John')
print(len(list_of_df))
How to read XML file
We can use read_xml
of pandas
to read content from XML file.
# import pandas module
import pandas as pd
xml = '''<?xml version='1.0' encoding='utf-8'?>
<data>
<row id="1" name="Kevin" score="80"/>
<row id="2" name="John"/>
<row id="3" name="Jack" score="90"/>
</data>'''
# We can specify xpath to find wanted nodes. default ‘./*’
df = pd.read_xml(xml, xpath=".//row")
Result
id | name | score | |
---|---|---|---|
0 | 1 | Kevin | 80 |
1 | 2 | John | NaN |
2 | 3 | Jack | 90 |
How to read parquet file
We can use read_parquet
of pandas
to read content from parquet file.
Apache Parquet is a free and open-source column-oriented data storage format of the
Apache Hadoop
ecosystem.
# import pandas module
import pandas as pd
# engine option is used to specify library of parquet.
# {‘auto’, ‘pyarrow’, ‘fastparquet’}, default ‘auto’
df = pd.read_xml('path/to/sample.parquet', engine="pyarrow")