torchtable.dataset module
Module contents
class torchtable.dataset.TabularDataset(examples: Dict[Union[str, List[str]], Union[T, Iterable[T]]], fields: torchtable.dataset.core.FieldDict, train=True)
Bases: torch.utils.data.dataset.Dataset
A dataset for tabular data.
Parameters:
- fields – A dictionary mapping from a column/columns in the raw data to a Field/Fields. To specify multiple columns as input, use a tuple of column names. To map a single column to multiple fields, use a list of fields. Each field will be mapped to a single entry in the processed dataset.
- train – Whether this dataset is the training set. This affects whether the fields will fit the given data.
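The effect of the train flag can be illustrated with a minimal, self-contained sketch. It assumes that fields are fit only when train is true and are applied unchanged otherwise; ToyCategorize and process are hypothetical names invented for this illustration and are not part of torchtable.

```python
from typing import Dict, Hashable, Iterable, List


class ToyCategorize:
    """Hypothetical stand-in for a field (not torchtable code).

    Maps category values to integer ids; id 0 is reserved for unseen values.
    """

    def __init__(self) -> None:
        self.vocab: Dict[Hashable, int] = {}

    def fit(self, column: Iterable[Hashable]) -> "ToyCategorize":
        # Learn the vocabulary from the data, assigning ids from 1 upward.
        for value in column:
            if value not in self.vocab:
                self.vocab[value] = len(self.vocab) + 1
        return self

    def transform(self, column: Iterable[Hashable]) -> List[int]:
        # Apply the learned vocabulary without modifying it.
        return [self.vocab.get(value, 0) for value in column]


def process(column: List[Hashable], field: ToyCategorize, train: bool = True) -> List[int]:
    """Assumed behaviour: fit only on training data, then transform."""
    if train:
        field.fit(column)
    return field.transform(column)


field = ToyCategorize()
train_ids = process(["Y", "N", "Y"], field, train=True)   # fits, then transforms
val_ids = process(["N", "maybe"], field, train=False)     # transforms only
print(train_ids, val_ids)
```

Under this assumption, values that appear only in the non-training data ("maybe" here) never enter the vocabulary, so statistics learned by the fields come from the training set alone.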
classmethod from_csv(fname: str, fields: Dict[Union[str, List[str]], Union[torchtable.field.core.Field, torchtable.field.core.FieldCollection, Collection[torchtable.field.core.Field]]], train=True, csv_read_params: dict = {}) → torchtable.dataset.core.TabularDataset
Initialize a dataset from a CSV file. See the documentation on TabularDataset.from_df for more details on the arguments.
Parameters:
- csv_read_params – Keyword arguments to pass to the pd.read_csv method.
classmethod from_df(df: pandas.core.frame.DataFrame, fields: Dict[Union[str, List[str]], Union[torchtable.field.core.Field, torchtable.field.core.FieldCollection, Collection[torchtable.field.core.Field]]], train=True) → torchtable.dataset.core.TabularDataset
Initialize a dataset from a pandas dataframe.
Parameters:
- df – The pandas dataframe to initialize from.
- fields – Dictionary mapping from a column identifier to a field or fields. The key can be a single column name or a tuple of multiple column names. The column(s) specified by the key are passed to the transform method of the field(s). The value can be a single field, a list/tuple of fields, or a field.FieldCollection. In general, each example in the dataset mirrors the structure of the fields passed. For instance, if you pass multiple fields for a certain key, the example will also have multiple outputs for that key, structured as a list. If you want a flat dictionary for the example, consider using the flatten attribute of the field.FieldCollection class (see the field.FieldCollection documentation for more details).
- train – Whether this dataset is the training set. This affects whether the fields will fit the given data.
Example
>>> ds = TabularDataset.from_df(df, fields={
...     "authorized_flag": CategoricalField(handle_unk=False),  # standard field
...     "card_id": [CategoricalField(handle_unk=True),
...                 Field(LambdaOperator(lambda x: x.str[0]) > Categorize())],  # multiple fields and custom fields
...     "price": NumericField(fill_missing=None, normalization=None, is_target=True),  # target field
...     ("authorized_flag", "price"): Field(LambdaOperator(
...         lambda x: (x["authorized_flag"] == "N").astype("int") * x["price"])),  # multiple column field
... })
>>> ds[0]
{"authorized_flag": 0, "card_id": [1, 0], "price": 1.2, ("authorized_flag", "price"): 0.}
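To make the "example mirrors the structure of the fields" rule concrete, here is a small library-free sketch. It is not torchtable's implementation: build_examples and ToyField are hypothetical names, fields are modelled as plain functions over Python lists, and a multi-column field receives its columns as separate arguments rather than as a dataframe.

```python
from typing import Any, Callable, Dict, List, Tuple, Union

Column = List[Any]
Key = Union[str, Tuple[str, ...]]
ToyField = Callable[..., Column]  # hypothetical stand-in for a fitted field


def build_examples(
    columns: Dict[str, Column],
    fields: Dict[Key, Union[ToyField, List[ToyField]]],
) -> List[Dict[Key, Any]]:
    """Apply each field to its column(s) and assemble one dict per row."""
    n_rows = len(next(iter(columns.values())))
    examples: List[Dict[Key, Any]] = [{} for _ in range(n_rows)]
    for key, value in fields.items():
        # A tuple key selects several columns; a string key selects one.
        names = key if isinstance(key, tuple) else (key,)
        inputs = [columns[name] for name in names]
        if isinstance(value, (list, tuple)):
            # Several fields for one key: each row gets a list of outputs.
            outputs = [field(*inputs) for field in value]
            for i in range(n_rows):
                examples[i][key] = [out[i] for out in outputs]
        else:
            # A single field: each row gets a single output.
            output = value(*inputs)
            for i in range(n_rows):
                examples[i][key] = output[i]
    return examples


columns = {"flag": ["N", "Y"], "price": [1.5, 2.0]}
fields = {
    "price": lambda price: [p * 2 for p in price],
    "flag": [
        lambda flag: [int(f == "Y") for f in flag],
        lambda flag: [f.lower() for f in flag],
    ],
    ("flag", "price"): lambda flag, price: [
        p if f == "N" else 0.0 for f, p in zip(flag, price)
    ],
}
examples = build_examples(columns, fields)
print(examples[0])
```

In this sketch the "flag" key, which maps to a list of two fields, produces a two-element list in every example, while the tuple key is kept as the key of the multi-column output, matching the shape of ds[0] shown above.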
classmethod from_dfs(train_df: pandas.core.frame.DataFrame, val_df: pandas.core.frame.DataFrame = None, test_df: pandas.core.frame.DataFrame = None, fields: Dict[Union[str, List[str]], Union[torchtable.field.core.Field, torchtable.field.core.FieldCollection, Collection[torchtable.field.core.Field]]] = None) → Iterable[torchtable.dataset.core.TabularDataset]
Generates datasets from train, val, and test dataframes.
Example
>>> trn, val, test = TabularDataset.from_dfs(train_df, val_df=val_df, test_df=test_df, fields={
...     "a": NumericField(), "b": CategoricalField(),
... })