torchtable.dataset module

Module contents

class torchtable.dataset.TabularDataset(examples: Dict[Union[str, List[str]], Union[T, Iterable[T]]], fields: torchtable.dataset.core.FieldDict, train=True)

Bases: torch.utils.data.dataset.Dataset

A dataset for tabular data.

Parameters:
  • fields – A dictionary mapping one or more columns in the raw data to one or more Field objects. To specify multiple columns as input, use a tuple of column names as the key. To map a single column to multiple fields, use a list of fields as the value. Each field will be mapped to a single entry in the processed dataset.
  • train – Whether this dataset is the training set. This affects whether the fields will fit the given data.
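
The effect of the train flag can be illustrated with a minimal sketch. It does not use torchtable: ToyCategorize and process are hypothetical stand-ins, written on the assumption that a field learns its state (here a vocabulary) only when train=True and otherwise reuses what it has already learned.

```python
class ToyCategorize:
    """Hypothetical stand-in for a field: maps category labels to integer ids."""

    def __init__(self):
        self.vocab = {}

    def fit(self, values):
        # Learn the vocabulary; id 0 is reserved for values not seen during fit.
        self.vocab = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        return [self.vocab.get(v, 0) for v in values]


def process(values, field, train=True):
    # Assumption: only the training set is allowed to update the field's state.
    if train:
        field.fit(values)
    return field.transform(values)


field = ToyCategorize()
train_ids = process(["N", "Y", "N"], field, train=True)
val_ids = process(["Y", "maybe"], field, train=False)
print(train_ids)  # [1, 2, 1]
print(val_ids)    # [2, 0]
```

In this toy model the validation value "maybe" falls back to id 0 because the vocabulary was fitted on the training values only.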

classmethod from_csv(fname: str, fields: Dict[Union[str, List[str]], Union[torchtable.field.core.Field, torchtable.field.core.FieldCollection, Collection[torchtable.field.core.Field]]], train=True, csv_read_params: dict = {}) → torchtable.dataset.core.TabularDataset

Initialize a dataset from a csv file. See documentation on TabularDataset.from_df for more details on arguments.

Parameters:
  • csv_read_params – Keyword arguments to pass to the pd.read_csv method.
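
As a rough illustration of how csv_read_params might be forwarded, the sketch below passes a dictionary of keyword arguments straight to pd.read_csv. The read_frame helper is hypothetical and is not part of torchtable; sep and nrows are ordinary pd.read_csv options, and the in-memory text stands in for a file on disk.

```python
import io

import pandas as pd


def read_frame(fname, csv_read_params=None):
    # Hypothetical helper: forward the keyword arguments to pd.read_csv.
    # Copying into a fresh dict avoids sharing one mutable default between calls.
    params = dict(csv_read_params or {})
    return pd.read_csv(fname, **params)


# An in-memory buffer stands in for a CSV file on disk.
raw = io.StringIO("card_id;price\nC_1;1.2\nC_2;3.4\nC_3;5.6\n")
df = read_frame(raw, csv_read_params={"sep": ";", "nrows": 2})
print(list(df.columns))  # ['card_id', 'price']
print(len(df))           # 2
```

Going by the signature above, the same dictionary would be passed as the csv_read_params argument of TabularDataset.from_csv.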

classmethod from_df(df: pandas.core.frame.DataFrame, fields: Dict[Union[str, List[str]], Union[torchtable.field.core.Field, torchtable.field.core.FieldCollection, Collection[torchtable.field.core.Field]]], train=True) → torchtable.dataset.core.TabularDataset

Initialize a dataset from a pandas dataframe.

Parameters:
  • df – pandas dataframe to initialize from
  • fields – Dictionary mapping from a column identifier to a field or fields. The key can be a single column name or a tuple of multiple column names. The column(s) specified by the key will be passed to the transform method of the field(s). The value can be a single field, a list/tuple of fields, or a field.FieldCollection. In general, each example in the dataset mirrors the structure of the fields passed. For instance, if you pass multiple fields for a key, the example will hold multiple outputs for that key, structured as a list. If you want a flat dictionary for the example, consider using the flatten attribute of the field.FieldCollection class (see the field.FieldCollection documentation for more details).
  • train – Whether this dataset is the training set. This affects whether the fields will fit the given data.
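
The way an example mirrors the structure of the fields dictionary can be sketched in plain Python. The build_example function below is a hypothetical toy model, not torchtable's implementation: plain callables play the role of fields, a tuple key selects several columns, and a list value yields a list of outputs.

```python
def build_example(row, fields):
    """Toy model: apply each transform to the column(s) named by its key."""
    example = {}
    for key, spec in fields.items():
        # A tuple key selects several columns; a string key selects one.
        if isinstance(key, tuple):
            value = {col: row[col] for col in key}
        else:
            value = row[key]
        # A list/tuple of transforms produces a list of outputs for that key.
        if isinstance(spec, (list, tuple)):
            example[key] = [fn(value) for fn in spec]
        else:
            example[key] = spec(value)
    return example


row = {"authorized_flag": "N", "card_id": "C_42", "price": 1.5}
fields = {
    "price": float,
    "card_id": [len, lambda s: s[0]],
    ("authorized_flag", "price"):
        lambda cols: (cols["authorized_flag"] == "N") * cols["price"],
}
example = build_example(row, fields)
print(example["card_id"])                     # [4, 'C']
print(example[("authorized_flag", "price")])  # 1.5
```

The keys of the result are exactly the keys of fields, so a tuple key in the input appears as a tuple key in the output.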

Example

>>> ds = TabularDataset.from_df(df, fields={
...     "authorized_flag": CategoricalField(handle_unk=False), # standard field
...     "card_id": [CategoricalField(handle_unk=True),
...                 Field(LambdaOperator(lambda x: x.str[0]) > Categorize())], # multiple fields and custom fields
...     "price": NumericField(fill_missing=None, normalization=None, is_target=True), # target field
...     ("authorized_flag", "price"): Field(LambdaOperator(
...             lambda x: (x["authorized_flag"] == "N").astype("int") * x["price"])), # multiple column field
... })
>>> ds[0]
{"authorized_flag": 0,
 "card_id": [1, 0],
 "price": 1.2,
 ("authorized_flag", "price"): 0.}

classmethod from_dfs(train_df: pandas.core.frame.DataFrame, val_df: pandas.core.frame.DataFrame = None, test_df: pandas.core.frame.DataFrame = None, fields: Dict[Union[str, List[str]], Union[torchtable.field.core.Field, torchtable.field.core.FieldCollection, Collection[torchtable.field.core.Field]]] = None) → Iterable[torchtable.dataset.core.TabularDataset]

Generates datasets from train, val, and test dataframes.

Example

>>> trn, val, test = TabularDataset.from_dfs(train_df, val_df=val_df, test_df=test_df, fields={
...     "a": NumericField(), "b": CategoricalField(),
... })