Pandas >> How to Read various data in Pandas

2021-10-11 Pandas

Table of Contents

In this article, we will discuss how to read various type of data in Pandas including pickle, clipboard, excel, json, HTML, XML, parquet files.

[Pandas] How to read various data

How to read pickle file

We can use read_pickle method of pandas and specify the path of pickle file to read pickle file.

# import pandas module
import pandas as pd

# use read_pickle method, and return pandas DataFrame object
df = pd.read_pickle("path/to/sample.pkl")

How to read content from clipboard

We can use read_clipboard of pandas to read content from clipboard. We can use sep option to specify delimiter and the default setting is one or more whitespace characters.

# import pandas module
import pandas as pd

# use sep option to specifying delimiter
df = pd.read_clipboard(sep=';')


How to read excel file

We can use read_excel of pandas to read content from Excel.

# import pandas module
import pandas as pd

# We can specify sheet name by specifying `sheet_name` option.
# If we don't want to have the index, we can specify `index_col=None`.
# If we want to treat the first line as header, we can use `header=0`.
df = pd.read_excel('path/to/sample.xlsx', sheet_name='Sheet1', index_col=None, header=0)  

# If there is no header, we can use `header=None`
df = pd.read_excel('path/to/sample.xlsx', sheet_name='Sheet1', index_col=None, header=None)


How to read json file

json file can store unstructured data, we can use read_json to read json file into pandas.

orient option is used to indicate JSON string format:

  • orient='split': JSON string is dict like {index -> [index], columns -> [columns], data -> [values]}
  • orient='records': JSON string is list like [{column -> value}, ... , {column -> value}]
  • orient='index': JSON string is dict like {index -> {column -> value}}
  • orient='columns': JSON string is dict like {column -> {index -> value}}
  • orient='values': JSON string is just the values array
# import pandas module
import pandas as pd

df = pd.read_json('path/to/sample.json', orient='split')  

How to read HTML file

HTML file can be read into pandas using read_html method of pandas. read_html will return a list of DataFrames.

# import pandas module
import pandas as pd

# `match` is used to find the expected table and read it
# default settings is `.+` (match any non-empty string)
# If we have a table containing "John", we can specify `match='John'` to find it.
list_of_df = pd.read_html('str, path object or file-like object', match='John')  

print(len(list_of_df))

How to read XML file

We can use read_xml of pandas to read content from XML file.

The syntax of xpath

# import pandas module
import pandas as pd

xml = '''<?xml version='1.0' encoding='utf-8'?>
<data>
  <row id="1" name="Kevin" score="80"/>
  <row id="2" name="John"/>
  <row id="3" name="Jack" score="90"/>
</data>'''

# We can specify xpath to find wanted nodes. default ‘./*’
df = pd.read_xml(xml, xpath=".//row")

Result

id name score
0 1 Kevin 80
1 2 John NaN
2 3 Jack 90

How to read parquet file

We can use read_parquet of pandas to read content from parquet file.

Apache Parquet is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem.

# import pandas module
import pandas as pd

# engine option is used to specify library of parquet.
# {‘auto’, ‘pyarrow’, ‘fastparquet’}, default ‘auto’
df = pd.read_xml('path/to/sample.parquet', engine="pyarrow")

Subscribe and be the FIRST reader of our latest articles

* indicates required

Contact us