load_dataset

load_dataset(dataset='small_table', tbl_type='polars')

Load a dataset hosted in the library as the specified table type.

The Pointblank library includes several datasets that can be loaded using the load_dataset() function. Each dataset can be loaded as a Polars DataFrame, a Pandas DataFrame, or a DuckDB table (returned as an Ibis table). These datasets are used throughout the documentation’s examples to demonstrate the functionality of the library. They’re also useful for experimenting with the library and trying out different validation scenarios.

Parameters

dataset : Literal['small_table', 'game_revenue', 'nycflights', 'global_sales'] = 'small_table': The name of the dataset to load. Current options are "small_table", "game_revenue", "nycflights", and "global_sales".
tbl_type : Literal['polars', 'pandas', 'duckdb'] = 'polars': The type of table to generate from the dataset. The named options are "polars", "pandas", and "duckdb".
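Because both parameters accept only a fixed set of strings, it can help to check user-supplied values before calling load_dataset(). The following is a minimal sketch, not part of Pointblank: the helper name check_load_args and the two type aliases are hypothetical, and the allowed values are copied from the parameter descriptions above.

```python
from typing import Literal, get_args

# Allowed values, copied from the documented parameter types (aliases are hypothetical).
DatasetName = Literal["small_table", "game_revenue", "nycflights", "global_sales"]
TableType = Literal["polars", "pandas", "duckdb"]


def check_load_args(dataset: str = "small_table", tbl_type: str = "polars") -> tuple:
    """Hypothetical helper: reject values outside the documented options."""
    if dataset not in get_args(DatasetName):
        raise ValueError(
            f"unknown dataset {dataset!r}; expected one of {get_args(DatasetName)}"
        )
    if tbl_type not in get_args(TableType):
        raise ValueError(
            f"unknown tbl_type {tbl_type!r}; expected one of {get_args(TableType)}"
        )
    # The defaults mirror those of load_dataset(): "small_table" and "polars".
    return dataset, tbl_type
```

The returned pair could then be passed on, for example as pb.load_dataset(dataset=name, tbl_type=kind).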

Returns

FrameT | Any: The dataset for the Validate object. Depending on tbl_type=, this is a Polars DataFrame, a Pandas DataFrame, or an Ibis table backed by DuckDB.
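As a rough illustration of handing the returned table to a Validate object, the sketch below wraps one check in a function. The helper name validate_small_table is hypothetical, and Validate, col_vals_gt(), and interrogate() come from the wider Pointblank API rather than from this page, so treat their use here as an assumption to confirm against their own reference entries. The import sits inside the function so that defining it does not require Pointblank to be installed.

```python
def validate_small_table():
    """Hypothetical helper: load "small_table" and run one validation step."""
    import pointblank as pb  # imported lazily; requires pointblank and polars

    # tbl_type defaults to "polars", so this returns a Polars DataFrame.
    small_table = pb.load_dataset(dataset="small_table")

    # Assumed usage of the Validate API: check that column `d` exceeds 100.
    return (
        pb.Validate(data=small_table)
        .col_vals_gt(columns="d", value=100)
        .interrogate()
    )
```

Calling validate_small_table() in a notebook should display the validation report for that single step.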

Included Datasets

There are four included datasets that can be loaded using the load_dataset() function:

"small_table": A small dataset with 13 rows and 8 columns. This dataset is useful for testing and demonstration purposes.
"game_revenue": A dataset with 2000 rows and 11 columns. Provides revenue data for a game development company. For the particular game, there are records of player sessions, the items they purchased, ads viewed, and the revenue generated.
"nycflights": A dataset with 336,776 rows and 18 columns. This dataset provides information about flights departing from New York City airports (JFK, LGA, or EWR) in 2013.
"global_sales": A dataset with 50,000 rows and 20 columns. Provides information about global sales of products across different regions and countries.
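The row and column counts listed above can serve as a quick sanity check after loading. This is a minimal sketch: the dictionary and the helper matches_documented_shape are hypothetical names, and the check relies on the .shape attribute that Polars and Pandas DataFrames expose as (rows, columns). It is not intended for the Ibis table returned with tbl_type="duckdb".

```python
# (rows, columns) for each dataset, as stated in the descriptions above.
DOCUMENTED_SHAPES = {
    "small_table": (13, 8),
    "game_revenue": (2000, 11),
    "nycflights": (336776, 18),
    "global_sales": (50000, 20),
}


def matches_documented_shape(table, dataset: str) -> bool:
    """Hypothetical helper: compare a DataFrame's shape with the documented one."""
    # Polars and Pandas DataFrames both report shape as (n_rows, n_columns).
    return tuple(table.shape) == DOCUMENTED_SHAPES[dataset]
```

For instance, matches_documented_shape(pb.load_dataset("game_revenue", tbl_type="pandas"), "game_revenue") compares the loaded DataFrame against the documented 2,000 rows and 11 columns.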

Supported DataFrame Types

The tbl_type= parameter can be set to one of the following:

"polars": A Polars DataFrame.
"pandas": A Pandas DataFrame.
"duckdb": An Ibis table for a DuckDB database.
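Since the returned object differs by tbl_type=, downstream code sometimes needs to confirm which kind of table it received. The sketch below does this by inspecting the top-level package of the object's class. The names EXPECTED_PACKAGE and came_from are hypothetical, and the mapping assumes that Polars and Pandas DataFrames are defined under the polars and pandas packages and that Ibis tables are defined under ibis.

```python
# Assumed mapping from tbl_type to the package that defines the returned class.
EXPECTED_PACKAGE = {"polars": "polars", "pandas": "pandas", "duckdb": "ibis"}


def came_from(table, tbl_type: str) -> bool:
    """Hypothetical helper: does `table` look like the type documented for tbl_type?"""
    # type(table).__module__ is a dotted path such as "pandas.core.frame";
    # only the first component (the top-level package) is compared.
    top_level = type(table).__module__.split(".")[0]
    return top_level == EXPECTED_PACKAGE[tbl_type]
```

This avoids importing all three libraries just to run isinstance() checks, at the cost of depending on where each library defines its classes.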

Examples

Load the "small_table" dataset as a Polars DataFrame by calling load_dataset() with dataset="small_table" and tbl_type="polars":

import pointblank as pb

small_table = pb.load_dataset(dataset="small_table", tbl_type="polars")

pb.preview(small_table)

	date_time Datetime	date Date	a Int64	b String	c Int64	d Float64	e Boolean	f String
Polars | Rows: 13 | Columns: 8
1	2016-01-04 11:00:00	2016-01-04	2	1-bcd-345	3	3423.29	True	high
2	2016-01-04 00:32:00	2016-01-04	3	5-egh-163	8	9999.99	True	low
3	2016-01-05 13:32:00	2016-01-05	6	8-kdg-938	3	2343.23	True	high
4	2016-01-06 17:23:00	2016-01-06	2	5-jdo-903	None	3892.4	False	mid
5	2016-01-09 12:36:00	2016-01-09	8	3-ldm-038	7	283.94	True	low
9	2016-01-20 04:30:00	2016-01-20	3	5-bce-642	9	837.93	False	high
10	2016-01-20 04:30:00	2016-01-20	3	5-bce-642	9	837.93	False	high
11	2016-01-26 20:07:00	2016-01-26	4	2-dmx-010	7	833.98	True	low
12	2016-01-28 02:51:00	2016-01-28	2	7-dmx-010	8	108.34	False	low
13	2016-01-30 11:23:00	2016-01-30	1	3-dka-303	None	2230.09	True	high

Note that the "small_table" dataset is returned here as a Polars DataFrame, and that the preview() function displays the table in an HTML viewing environment.

The "game_revenue" dataset can be loaded as a Pandas DataFrame by specifying the dataset name and setting tbl_type="pandas":

game_revenue = pb.load_dataset(dataset="game_revenue", tbl_type="pandas")

pb.preview(game_revenue)

	player_id object	session_id object	session_start datetime64[ns, UTC]	time datetime64[ns, UTC]	item_type object	item_name object	item_revenue float64	session_duration float64	start_day datetime64[ns]	acquisition object	country object
Pandas | Rows: 2,000 | Columns: 11
1	ECPANOIXLZHF896	ECPANOIXLZHF896-eol2j8bs	2015-01-01 01:31:03+00:00	2015-01-01 01:31:27+00:00	iap	offer2	8.99	16.3	2015-01-01 00:00:00	google	Germany
2	ECPANOIXLZHF896	ECPANOIXLZHF896-eol2j8bs	2015-01-01 01:31:03+00:00	2015-01-01 01:36:57+00:00	iap	gems3	22.49	16.3	2015-01-01 00:00:00	google	Germany
3	ECPANOIXLZHF896	ECPANOIXLZHF896-eol2j8bs	2015-01-01 01:31:03+00:00	2015-01-01 01:37:45+00:00	iap	gold7	107.99	16.3	2015-01-01 00:00:00	google	Germany
4	ECPANOIXLZHF896	ECPANOIXLZHF896-eol2j8bs	2015-01-01 01:31:03+00:00	2015-01-01 01:42:33+00:00	ad	ad_20sec	0.76	16.3	2015-01-01 00:00:00	google	Germany
5	ECPANOIXLZHF896	ECPANOIXLZHF896-hdu9jkls	2015-01-01 11:50:02+00:00	2015-01-01 11:55:20+00:00	ad	ad_5sec	0.03	35.2	2015-01-01 00:00:00	google	Germany
1996	NAOJRDMCSEBI281	NAOJRDMCSEBI281-j2vs9ilp	2015-01-21 01:57:50+00:00	2015-01-21 02:02:50+00:00	ad	ad_survey	1.332	25.8	2015-01-11 00:00:00	organic	Norway
1997	NAOJRDMCSEBI281	NAOJRDMCSEBI281-j2vs9ilp	2015-01-21 01:57:50+00:00	2015-01-21 02:22:14+00:00	ad	ad_survey	1.35	25.8	2015-01-11 00:00:00	organic	Norway
1998	RMOSWHJGELCI675	RMOSWHJGELCI675-vbhcsmtr	2015-01-21 02:39:48+00:00	2015-01-21 02:40:00+00:00	ad	ad_5sec	0.03	8.4	2015-01-10 00:00:00	other_campaign	France
1999	RMOSWHJGELCI675	RMOSWHJGELCI675-vbhcsmtr	2015-01-21 02:39:48+00:00	2015-01-21 02:47:12+00:00	iap	offer5	26.09	8.4	2015-01-10 00:00:00	other_campaign	France
2000	GJCXNTWEBIPQ369	GJCXNTWEBIPQ369-9elq67md	2015-01-21 03:59:23+00:00	2015-01-21 04:06:29+00:00	ad	ad_5sec	0.12	18.5	2015-01-14 00:00:00	organic	United States

The "game_revenue" dataset is closer to real-world data, with a mix of data types. At 2,000 rows and 11 columns, it’s significantly larger than the "small_table" dataset.

The "nycflights" dataset can be loaded as a DuckDB table by specifying the dataset name and setting tbl_type="duckdb":

nycflights = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")

pb.preview(nycflights)

	year int64	month int64	day int64	dep_time int64	sched_dep_time int64	dep_delay int64	arr_time int64	sched_arr_time int64	arr_delay int64	carrier string	flight int64	tailnum string	origin string	dest string	air_time int64	distance int64	hour int64	minute int64
DuckDB | Rows: 336,776 | Columns: 18
1	2013	1	1	517	515	2	830	819	11	UA	1545	N14228	EWR	IAH	227	1400	5	15
2	2013	1	1	533	529	4	850	830	20	UA	1714	N24211	LGA	IAH	227	1416	5	29
3	2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40
4	2013	1	1	544	545	-1	1004	1022	-18	B6	725	N804JB	JFK	BQN	183	1576	5	45
5	2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0
336772	2013	9	30	NULL	1455	NULL	NULL	1634	NULL	9E	3393	NULL	JFK	DCA	NULL	213	14	55
336773	2013	9	30	NULL	2200	NULL	NULL	2312	NULL	9E	3525	NULL	LGA	SYR	NULL	198	22	0
336774	2013	9	30	NULL	1210	NULL	NULL	1330	NULL	MQ	3461	N535MQ	LGA	BNA	NULL	764	12	10
336775	2013	9	30	NULL	1159	NULL	NULL	1344	NULL	MQ	3572	N511MQ	LGA	CLE	NULL	419	11	59
336776	2013	9	30	NULL	840	NULL	NULL	1020	NULL	MQ	3531	N839MQ	LGA	RDU	NULL	431	8	40

The "nycflights" dataset is a large dataset with 336,776 rows and 18 columns. It contains real-world data about flights that departed from New York City airports in 2013.

Finally, the "global_sales" dataset can be loaded as a Polars DataFrame by specifying only the dataset name. Since tbl_type= is set to "polars" by default, we don’t need to specify it:

global_sales = pb.load_dataset(dataset="global_sales")

pb.preview(global_sales)

	product_id String	product_category String	customer_id String	customer_segment String	region String	country String	city String	timestamp Datetime	quarter String	month Int64	year Int64	price Float64	quantity Int64	status String	email String	revenue Float64	tax Float64	total Float64	payment_method String	sales_channel String
Polars | Rows: 50,000 | Columns: 20
1	98b70df0	Manufacturing	cf3b13c7	Government	Asia Pacific	Australia	Melbourne	2021-12-25 19:00:00	2021-Q4	12	2021	186.0	7	returned	user1651@test.org	1302.0	127.45	1429.45	Apple Pay	Partner
2	9d09fef5	Manufacturing	08b5db12	Consumer	Europe	France	Nice	2022-06-12 17:25:00	2022-Q2	6	2022	137.03	8	returned	user5200@company.io	1096.24	222.52	1318.76	PayPal	Distributor
3	8ac6b077	Retail	41079b2e	Consumer	Europe	France	Toulouse	2023-05-06 09:09:00	2023-Q2	5	2023	330.08	4	shipped	user9180@mockdata.com	1320.32	260.89	1581.21	PayPal	Phone
4	13d2df9d	Healthcare	b421eece	Consumer	North America	USA	Miami	2023-10-11 16:53:00	2023-Q4	10	2023	420.09	3	shipped	user1636@example.com	1260.27	103.99	1364.26	Bank Transfer	Phone
5	98b70df0	Manufacturing	5906a04f	SMB	North America	Canada	Calgary	2022-05-05 01:53:00	2022-Q2	5	2022	187.77	3	delivered	user9971@mockdata.com	563.31	75.73	639.04	Credit Card	Phone
49996	53a36468	Finance	966a8bbe	Government	Asia Pacific	Australia	Melbourne	2023-11-04 14:45:00	2023-Q4	11	2023	198.18	1	pending	user8593@test.org	198.18	18.3	216.48	Google Pay	Partner
49997	a42fd1ff	Healthcare	ff8933e4	SMB	Asia Pacific	Japan	Kyoto	2023-04-27 17:27:00	2023-Q2	4	2023	419.72	2	returned	user5448@company.io	839.44	90.49	929.93	Google Pay	Partner
49998	bbf158d2	Technology	f0c0af3f	Enterprise	North America	USA	Los Angeles	2021-04-24 23:15:00	2021-Q2	4	2021	302.52	1	pending	user1463@test.org	302.52	21.68	324.2	Bank Transfer	Online
49999	2a0866de	Healthcare	5b27ba59	SMB	Europe	France	Nice	2023-12-30 19:44:00	2023-Q4	12	2023	433.82	5	pending	user4167@test.org	2169.1	448.87	2617.97	Credit Card	Online
50000	6260f67c	Technology	482c1d84	Consumer	Asia Pacific	Japan	Kyoto	2021-12-05 09:49:00	2021-Q4	12	2021	400.31	8	returned	user4238@example.com	3202.48	339.84	3542.32	Apple Pay	Distributor

The "global_sales" dataset is a large dataset with 50,000 rows and 20 columns. Each record describes the sale of a particular product to a customer located in one of three global regions: North America, Europe, or Asia Pacific.