gators.feature_generation_str package#

Module contents#

class gators.feature_generation_str.CharacterStatistics[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates character-level statistical features from string columns.

Counts occurrences of various character types (digits, letters, spaces, etc.); these counts are particularly useful for tree-based models to identify patterns in text data.

Parameters:
  • subset (Optional[List[str]], default=None) – List of string columns to extract features from. If None, all string columns will be used.

  • features (List[str], default=["n_digits", "n_letters", "n_uppercase", "n_lowercase", "n_spaces", "n_special"]) –

    Character statistics to generate. Options:

    • “n_digits”: Count of digit characters (0-9)

    • “n_letters”: Count of alphabetic characters (a-z, A-Z)

    • “n_uppercase”: Count of uppercase letters

    • “n_lowercase”: Count of lowercase letters

    • “n_spaces”: Count of space characters

    • “n_special”: Count of special characters (punctuation, symbols)

    • “n_unique_chars”: Count of unique characters

    • “ratio_uppercase”: Ratio of uppercase to total letters

    • “ratio_digits”: Ratio of digits to total length

    • “ratio_special”: Ratio of special chars to total length

  • drop_columns (bool, default=False) – Whether to drop the original string columns after feature extraction.

Examples

>>> from gators.feature_generation_str import CharacterStatistics
>>> import polars as pl
>>> X = pl.DataFrame({
...     'text': ['Hello123', 'WORLD!!!', 'Test 99', ''],
...     'email': ['user@test.com', 'ADMIN@SITE.ORG', 'test', None]
... })

Example 1: Basic character counts

>>> transformer = CharacterStatistics(
...     subset=['text'],
...     features=['n_digits', 'n_letters', 'n_uppercase']
... )
>>> result = transformer.fit_transform(X)
>>> print(result)
shape: (4, 5)
┌──────────┬────────────────┬────────────────┬─────────────────┬───────────────────┐
│ text     ┆ email          ┆ text__n_digits ┆ text__n_letters ┆ text__n_uppercase │
│ ---      ┆ ---            ┆ ---            ┆ ---             ┆ ---               │
│ str      ┆ str            ┆ i64            ┆ i64             ┆ i64               │
├──────────┼────────────────┼────────────────┼─────────────────┼───────────────────┤
│ Hello123 ┆ user@test.com  ┆ 3              ┆ 5               ┆ 1                 │
│ WORLD!!! ┆ ADMIN@SITE.ORG ┆ 0              ┆ 5               ┆ 5                 │
│ Test 99  ┆ test           ┆ 2              ┆ 4               ┆ 1                 │
│          ┆ null           ┆ 0              ┆ 0               ┆ 0                 │
└──────────┴────────────────┴────────────────┴─────────────────┴───────────────────┘

Example 2: Ratio features

>>> transformer = CharacterStatistics(
...     subset=['text', 'email'],
...     features=['ratio_uppercase', 'ratio_digits', 'ratio_special']
... )
>>> result = transformer.fit_transform(X)
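The ratio features are simple arithmetic over the character counts; a minimal pure-Python sketch (the exact handling of empty strings and letter-free values inside the transformer is an assumption):

```python
def ratio_digits(s):
    """Fraction of all characters that are digits; 0.0 for empty strings (assumed)."""
    if not s:
        return 0.0
    return sum(c.isdigit() for c in s) / len(s)

def ratio_uppercase(s):
    """Fraction of letters that are uppercase; 0.0 when there are no letters (assumed)."""
    letters = [c for c in s if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)

print(ratio_digits("Hello123"))    # 3 digits / 8 chars = 0.375
print(ratio_uppercase("Test 99"))  # 1 uppercase / 4 letters = 0.25
```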

Example 3: All features with drop_columns

>>> transformer = CharacterStatistics(
...     subset=['text'],
...     features=['n_digits', 'n_letters', 'n_spaces', 'n_special', 'n_unique_chars'],
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
fit(X, y=None)[source]#

Fit the transformer by identifying string columns if not specified.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

CharacterStatistics

transform(X)[source]#

Transform the input DataFrame by creating character statistics features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with character statistics features.

Return type:

DataFrame

class gators.feature_generation_str.CombineFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Combines specific string/categorical columns to create composite key features (UID-like).

Unlike InteractionFeatures which generates all combinations, this transformer only creates the specific column combinations you provide, making it more efficient for creating unique identifiers or specific interaction features.

Parameters:
  • column_groups (List[List[str]]) – List of column groups to combine. Each group is a list of column names that will be concatenated together. Example: [[‘cat1’, ‘cat2’], [‘cat1’, ‘addr1’]]

  • separator (str, default='_') – String to use as separator when combining column values.

  • drop_columns (bool, default=False) – Whether to drop the original columns after creating combinations.

  • new_column_names (Optional[List[str]], default=None) – List of custom names for the combined columns. If None, uses default naming pattern where columns are joined with ‘__’ (e.g., ‘cat1__cat2’). Must have same length as column_groups.

Examples

>>> from gators.feature_generation_str import CombineFeatures
>>> import polars as pl
>>> X = {
...     'cat1': ['A', 'A', 'B', 'B', 'A'],
...     'cat2': ['X', 'Y', 'X', 'Y', 'X'],
...     'addr1': ['US', 'US', 'UK', 'UK', 'CA'],
...     'amount': [100, 200, 150, 300, 250]
... }
>>> X = pl.DataFrame(X)

Example 1: Basic combination

>>> transformer = CombineFeatures(
...     column_groups=[['cat1', 'cat2']]
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (5, 5)
┌───────┬───────┬───────┬────────┬──────────────┐
│ cat1  ┆ cat2  ┆ addr1 ┆ amount ┆ cat1__cat2   │
│ ---   ┆ ---   ┆ ---   ┆ ---    ┆ ---          │
│ str   ┆ str   ┆ str   ┆ i64    ┆ str          │
╞═══════╪═══════╪═══════╪════════╪══════════════╡
│ A     ┆ X     ┆ US    ┆ 100    ┆ A_X          │
│ A     ┆ Y     ┆ US    ┆ 200    ┆ A_Y          │
│ B     ┆ X     ┆ UK    ┆ 150    ┆ B_X          │
│ B     ┆ Y     ┆ UK    ┆ 300    ┆ B_Y          │
│ A     ┆ X     ┆ CA    ┆ 250    ┆ A_X          │
└───────┴───────┴───────┴────────┴──────────────┘

Example 2: Multiple combinations

>>> transformer = CombineFeatures(
...     column_groups=[['cat1', 'cat2'], ['cat1', 'addr1']]
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (5, 6)
┌───────┬───────┬───────┬────────┬──────────────┬───────────────┐
│ cat1  ┆ cat2  ┆ addr1 ┆ amount ┆ cat1__cat2   ┆ cat1__addr1   │
│ ---   ┆ ---   ┆ ---   ┆ ---    ┆ ---          ┆ ---           │
│ str   ┆ str   ┆ str   ┆ i64    ┆ str          ┆ str           │
╞═══════╪═══════╪═══════╪════════╪══════════════╪═══════════════╡
│ A     ┆ X     ┆ US    ┆ 100    ┆ A_X          ┆ A_US          │
│ A     ┆ Y     ┆ US    ┆ 200    ┆ A_Y          ┆ A_US          │
│ B     ┆ X     ┆ UK    ┆ 150    ┆ B_X          ┆ B_UK          │
│ B     ┆ Y     ┆ UK    ┆ 300    ┆ B_Y          ┆ B_UK          │
│ A     ┆ X     ┆ CA    ┆ 250    ┆ A_X          ┆ A_CA          │
└───────┴───────┴───────┴────────┴──────────────┴───────────────┘

Example 3: Custom separator and column names

>>> transformer = CombineFeatures(
...     column_groups=[['cat1', 'cat2', 'addr1']],
...     separator='|',
...     new_column_names=['uid']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (5, 5)
┌───────┬───────┬───────┬────────┬──────────┐
│ cat1  ┆ cat2  ┆ addr1 ┆ amount ┆ uid      │
│ ---   ┆ ---   ┆ ---   ┆ ---    ┆ ---      │
│ str   ┆ str   ┆ str   ┆ i64    ┆ str      │
╞═══════╪═══════╪═══════╪════════╪══════════╡
│ A     ┆ X     ┆ US    ┆ 100    ┆ A|X|US   │
│ A     ┆ Y     ┆ US    ┆ 200    ┆ A|Y|US   │
│ B     ┆ X     ┆ UK    ┆ 150    ┆ B|X|UK   │
│ B     ┆ Y     ┆ UK    ┆ 300    ┆ B|Y|UK   │
│ A     ┆ X     ┆ CA    ┆ 250    ┆ A|X|CA   │
└───────┴───────┴───────┴────────┴──────────┘

Example 4: Creating UIDs for fraud detection

>>> # Create composite keys for unique user identification
>>> transformer = CombineFeatures(
...     column_groups=[
...         ['cat1', 'cat2', 'card3'],  # Card combination
...         ['cat1', 'addr1'],            # Card + address
...         ['email_domain', 'cat1']      # Email + card
...     ],
...     new_column_names=['card_uid', 'card_addr_uid', 'email_card_uid']
... )
>>> # Can then use these UIDs for frequency encoding or groupby operations

Notes

  • Null values are converted to string “null” before concatenation

  • All column values are cast to string before combining

  • Useful for creating unique identifiers (UIDs) for user/card tracking

  • More efficient than InteractionFeatures when you know exactly which combinations you need
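The notes above amount to a row-wise string concatenation; a pure-Python sketch of the documented behavior (cast to string, null → "null", join with the separator):

```python
def combine(rows, columns, separator="_"):
    """Concatenate the given columns row by row; nulls become the string 'null'."""
    return [
        separator.join(
            "null" if row[col] is None else str(row[col]) for col in columns
        )
        for row in rows
    ]

rows = [
    {"cat1": "A", "cat2": "X"},
    {"cat1": None, "cat2": "Y"},
]
print(combine(rows, ["cat1", "cat2"]))  # ['A_X', 'null_Y']
```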

fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

CombineFeatures

transform(X)[source]#

Transform the input DataFrame by combining categorical columns.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with combined categorical features.

Return type:

DataFrame

class gators.feature_generation_str.Contains[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates Boolean columns indicating if substrings are contained within the original column values.

Examples

Create an instance of the Contains class:

>>> import polars as pl
>>> from gators.feature_generation_str import Contains
>>> contains_dict = {'col1': ['sub1', 'sub2'], 'col2': ['sub3']}
>>> transformer = Contains(contains_dict=contains_dict)

Fit the transformer:

>>> X = pl.DataFrame({"col1": ["sub1 here", None, "sub2 here"],
...                   "col2": [None, "sub3 inside", "no match"]})
>>> transformer.fit(X)

Transform the dataframe:

>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (3, 5)
╭─────────────┬───────────────┬──────────────┬──────────────┬──────────────╮
│ col1        │ col2          │ col1__sub1   │ col1__sub2   │ col2__sub3   │
├─────────────┼───────────────┼──────────────┼──────────────┼──────────────┤
│ sub1 here   │ None          │ True         │ False        │ None         │
├─────────────┼───────────────┼──────────────┼──────────────┼──────────────┤
│ None        │ sub3 inside   │ None         │ None         │ True         │
├─────────────┼───────────────┼──────────────┼──────────────┼──────────────┤
│ sub2 here   │ no match      │ False        │ True         │ False        │
╰─────────────┴───────────────┴──────────────┴──────────────┴──────────────╯
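The null propagation visible in the table above (null input → null feature rather than False) can be sketched in plain Python, assuming the transformer keeps nulls as shown:

```python
def contains(value, substring):
    """Null in -> null out; otherwise a plain substring membership test."""
    if value is None:
        return None
    return substring in value

print(contains("sub1 here", "sub1"))  # True
print(contains(None, "sub1"))         # None
```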
fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Contains

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.Endswith[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates Boolean features indicating if substrings are at the end of the original column values.

Examples

Create an instance of the Endswith class:

>>> import polars as pl
>>> from gators.feature_generation_str import Endswith
>>> endswith_dict = {'col1': ['end1', 'end2'], 'col2': ['end3']}
>>> transformer = Endswith(endswith_dict=endswith_dict)

Fit the transformer:

>>> X = pl.DataFrame({"col1": ["this end1", None, "that end2"],
...                   "col2": [None, "one end3", "another no end"]})
>>> transformer.fit(X)

Transform the dataframe:

>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (3, 5)
╭─────────────┬───────────────┬─────────────────────┬─────────────────────┬─────────────────────╮
│ col1        │ col2          │ col1__endswith_end1 │ col1__endswith_end2 │ col2__endswith_end3 │
├─────────────┼───────────────┼─────────────────────┼─────────────────────┼─────────────────────┤
│ this end1   │ None          │ True                │ False               │ None                │
├─────────────┼───────────────┼─────────────────────┼─────────────────────┼─────────────────────┤
│ None        │ one end3      │ None                │ None                │ True                │
├─────────────┼───────────────┼─────────────────────┼─────────────────────┼─────────────────────┤
│ that end2   │ another no end│ False               │ True                │ False               │
╰─────────────┴───────────────┴─────────────────────┴─────────────────────┴─────────────────────╯
fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Endswith

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.ExtractSubstring[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Extracts specified substring components from string columns.

fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

ExtractSubstring

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.InteractionFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates interaction features for categorical variables.

Parameters:
  • subset (Optional[List[str]], default=None) – List of columns to consider for interaction.

  • degree (conint(gt=1), default=2) – Degree of interaction terms.

Examples

>>> import polars as pl
>>> from gators.feature_generation_str import InteractionFeatures
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['x', 'x', 'y', 'y', 'x'],
...     'C': ['red', 'blue', 'green', 'blue', 'red']
... })
>>> # Interaction with default parameters (degree=2)
>>> interaction_features = InteractionFeatures()
>>> interaction_features.fit(X)
>>> transformed_X = interaction_features.transform(X)
>>> print(transformed_X)
shape: (5, 6)
┌─────┬─────┬───────┬───────┬───────────┬─────────┐
│ A   │ B   │ C     │ A__B  │ A__C      │ B__C    │
│ --- │ --- │ ---   │ ---   │ ---       │ ---     │
│ str │ str │ str   │ str   │ str       │ str     │
├─────┼─────┼───────┼───────┼───────────┼─────────┤
│ cat │ x   │ red   │ cat_x │ cat_red   │ x_red   │
│ dog │ x   │ blue  │ dog_x │ dog_blue  │ x_blue  │
│ cat │ y   │ green │ cat_y │ cat_green │ y_green │
│ dog │ y   │ blue  │ dog_y │ dog_blue  │ y_blue  │
│ cat │ x   │ red   │ cat_x │ cat_red   │ x_red   │
└─────┴─────┴───────┴───────┴───────────┴─────────┘
>>> # Interaction with degree=3
>>> interaction_features = InteractionFeatures(degree=3)
>>> interaction_features.fit(X)
>>> transformed_X = interaction_features.transform(X)
>>> print(transformed_X)
shape: (5, 7)
┌─────┬─────┬──────┬─────────────┬──────────────┬───────────────┬───────────────┐
│ A   │ B   │ C    │ A__B        │ A__C         │ B__C          │ A__B__C       │
│ --- │ --- │ ---  │ ---         │ ---          │ ---           │ ---           │
│ str │ str │ str  │ str         │ str          │ str           │ str           │
├─────┼─────┼──────┼─────────────┼──────────────┼───────────────┼───────────────┤
│ cat │ x   │ red  │ cat_x       │ cat_red      │ x_red         │ cat_x_red     │
│ dog │ x   │ blue │ dog_x       │ dog_blue     │ x_blue        │ dog_x_blue    │
│ cat │ y   │ green│ cat_y       │ cat_green    │ y_green       │ cat_y_green   │
│ dog │ y   │ blue │ dog_y       │ dog_blue     │ y_blue        │ dog_y_blue    │
│ cat │ x   │ red  │ cat_x       │ cat_red      │ x_red         │ cat_x_red     │
└─────┴─────┴──────┴─────────────┴──────────────┴───────────────┴───────────────┘
>>> # Interaction restricted to a subset of columns
>>> interaction_features = InteractionFeatures(subset=['A', 'B'])
>>> interaction_features.fit(X)
>>> transformed_X = interaction_features.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬─────┬──────┬─────────────┐
│ A   │ B   │ C    │ A__B        │
│ --- │ --- │ ---  │ ---         │
│ str │ str │ str  │ str         │
├─────┼─────┼──────┼─────────────┤
│ cat │ x   │ red  │ cat_x       │
│ dog │ x   │ blue │ dog_x       │
│ cat │ y   │ green│ cat_y       │
│ dog │ y   │ blue │ dog_y       │
│ cat │ x   │ red  │ cat_x       │
└─────┴─────┴──────┴─────────────┘
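The generated column set corresponds to all column combinations of sizes 2 through degree; the naming logic can be sketched with itertools:

```python
from itertools import combinations

def interaction_names(columns, degree=2):
    """Names of all interaction columns: groups of size 2..degree, joined by '__'."""
    return [
        "__".join(group)
        for size in range(2, degree + 1)
        for group in combinations(columns, size)
    ]

print(interaction_names(["A", "B", "C"], degree=3))
# ['A__B', 'A__C', 'B__C', 'A__B__C'] -- matching the degree=3 example above
```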
fit(X, y=None)[source]#

Fit the transformer by identifying categorical columns if not specified.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

InteractionFeatures

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.Length[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates features based on the length of the variables.

Parameters:

subset (Optional[List[str]], default=None) – List of columns to calculate length for.

Examples

>>> import polars as pl
>>> from gators.feature_generation_str import Length
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['yes', 'no', 'yes', 'no', 'yes'],
...     'C': ['quick', 'brown', 'fox', 'jumps', 'over']
... })
>>> # Calculate lengths with default parameters
>>> encoder = Length()
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 6)
┌─────┬──────┬───────┬──────────┬──────────┬───────────┐
│ A   │ B    │ C     │ A__length│ B__length│ C__length │
│ --- │ ---  │ ---   │ ---      │ ---      │ ---       │
│ str │ str  │ str   │ i64      │ i64      │ i64       │
├─────┼──────┼───────┼──────────┼──────────┼───────────┤
│ cat │ yes  │ quick │ 3        │ 3        │ 5         │
│ dog │ no   │ brown │ 3        │ 2        │ 5         │
│ cat │ yes  │ fox   │ 3        │ 3        │ 3         │
│ dog │ no   │ jumps │ 3        │ 2        │ 5         │
│ cat │ yes  │ over  │ 3        │ 3        │ 4         │
└─────┴──────┴───────┴──────────┴──────────┴───────────┘
>>> # Calculate lengths with columns as a subset
>>> encoder = Length(subset=['B'])
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬──────┬───────┬──────────┐
│ A   │ B    │ C     │ B__length│
│ --- │ ---  │ ---   │ ---      │
│ str │ str  │ str   │ i64      │
├─────┼──────┼───────┼──────────┤
│ cat │ yes  │ quick │ 3        │
│ dog │ no   │ brown │ 2        │
│ cat │ yes  │ fox   │ 3        │
│ dog │ no   │ jumps │ 2        │
│ cat │ yes  │ over  │ 3        │
└─────┴──────┴───────┴──────────┘
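One subtlety worth noting: "length" can mean characters or bytes, and the two differ for non-ASCII text (the outputs above are all ASCII, where they coincide; which definition the transformer uses is not stated here). In plain Python:

```python
s = "café"
print(len(s))                  # 4 characters
print(len(s.encode("utf-8")))  # 5 bytes: 'é' encodes to two bytes in UTF-8
```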
fit(X, y=None)[source]#

Fit the transformer by identifying categorical columns and generating column mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Length

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.Lower[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Converts string and Boolean columns to lowercase.

Examples

Create an instance of the Lower class:

>>> import polars as pl
>>> from gators.feature_generation_str import Lower
>>> lower = Lower(subset=["col1"], drop_columns=False)

Fit the transformer:

>>> X = pl.DataFrame({"col1": ["Hello", "WORLD"], "col2": [True, False]})
>>> lower.fit(X)

Transform the dataframe:

>>> transformed_X = lower.transform(X)
>>> print(transformed_X)
shape: (2, 3)
┌───────┬───────┬─────────────┐
│ col1  │ col2  │ col1__lower │
├───────┼───────┼─────────────┤
│ Hello │ True  │ hello       │
│ WORLD │ False │ world       │
└───────┴───────┴─────────────┘

If drop_columns is True, the original columns are dropped:

>>> lower.drop_columns = True
>>> transformed_X = lower.transform(X)
>>> print(transformed_X)
shape: (2, 1)
┌─────────────┐
│ col1__lower │
├─────────────┤
│ hello       │
│ world       │
└─────────────┘
fit(X, y=None)[source]#

Fit the transformer by identifying categorical columns and generating column mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Lower

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.NGram[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Extracts character or word n-grams from string columns.

Creates count features for the most common n-grams, useful for tree-based models to capture local text patterns and sequences.

Parameters:
  • subset (Optional[List[str]], default=None) – List of string columns to extract n-grams from. If None, all string columns will be used.

  • n (int, default=2) – Size of n-grams to extract (e.g., 2 for bigrams, 3 for trigrams).

  • ngram_type (str, default="char") –

    Type of n-grams to extract:

    • “char”: Character-level n-grams (e.g., “ab”, “bc” from “abc”)

    • “word”: Word-level n-grams (e.g., “hello world” as a bigram)

  • max_features (int, default=10) – Maximum number of most common n-grams to extract per column. Top-k n-grams are selected during fit() based on frequency.

  • min_count (int, default=1) – Minimum number of occurrences for an n-gram to be included.

  • drop_columns (bool, default=False) – Whether to drop the original string columns after feature extraction.

Examples

>>> from gators.feature_generation_str import NGram
>>> import polars as pl
>>> X = pl.DataFrame({
...     'text': ['hello world', 'hello there', 'world peace', None],
...     'desc': ['test data', 'test case', 'data case', '']
... })

Example 1: Character bigrams

>>> transformer = NGram(
...     subset=['text'],
...     n=2,
...     ngram_type='char',
...     max_features=5
... )
>>> result = transformer.fit_transform(X)
>>> print(result.columns)
['text', 'desc', 'text__ng_he', 'text__ng_el', 'text__ng_ll', 'text__ng_lo', 'text__ng_o_']
# Features for top-5 character bigrams: 'he', 'el', 'll', 'lo', 'o '

Example 2: Word bigrams

>>> transformer = NGram(
...     subset=['text'],
...     n=2,
...     ngram_type='word',
...     max_features=3
... )
>>> result = transformer.fit_transform(X)
# Features for word bigrams: 'hello world', 'hello there', 'world peace'

Example 3: Character trigrams with min_count

>>> transformer = NGram(
...     subset=['desc'],
...     n=3,
...     ngram_type='char',
...     max_features=5,
...     min_count=2
... )
>>> result = transformer.fit_transform(X)
# Only trigrams appearing at least 2 times are considered

Example 4: Multiple columns with drop

>>> transformer = NGram(
...     subset=['text', 'desc'],
...     n=2,
...     ngram_type='char',
...     max_features=10,
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
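The fit-time selection of the top-k character n-grams can be sketched in plain Python (the transformer's tie-breaking among equally frequent n-grams is an assumption; `Counter.most_common` keeps first-seen order for ties):

```python
from collections import Counter

def top_char_ngrams(texts, n=2, max_features=5, min_count=1):
    """Count character n-grams across non-null strings and keep the most common."""
    counts = Counter()
    for text in texts:
        if not text:
            continue  # skip nulls and empty strings
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [ng for ng, c in counts.most_common() if c >= min_count][:max_features]

texts = ["hello world", "hello there", "world peace", None]
print(top_char_ngrams(texts, n=2, max_features=5))
# ['he', 'el', 'll', 'lo', 'o '] -- the top-5 bigrams listed in Example 1
```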
fit(X, y=None)[source]#

Fit the transformer by identifying top-k n-grams for each column.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

NGram

transform(X)[source]#

Transform the input DataFrame by creating n-gram count features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with n-gram count features.

Return type:

DataFrame

class gators.feature_generation_str.Occurrences[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Counts occurrences of specific substrings or characters in string columns.

Creates count features showing how many times a substring appears, useful for tree-based models to split on frequency patterns.

Parameters:
  • subset (Optional[List[str]], default=None) – List of string columns to extract features from. If None, all string columns will be used.

  • substrings (Dict[str, List[str]]) – Dictionary mapping column names to lists of substrings to count. For example: {“description”: [“error”, “warning”, “success”]} will create 3 count features for the “description” column.

  • case_sensitive (bool, default=False) – Whether substring matching should be case sensitive.

  • drop_columns (bool, default=False) – Whether to drop the original string columns after feature extraction.

Examples

>>> from gators.feature_generation_str import Occurrences
>>> import polars as pl
>>> X = pl.DataFrame({
...     'log': ['Error: invalid input', 'Success: completed', 'Error: timeout Error', None],
...     'tags': ['#python #ml #data', '#python #java', '#ml #python', '']
... })

Example 1: Count specific keywords

>>> transformer = Occurrences(
...     subset=['log'],
...     substrings={'log': ['error', 'success', 'timeout']}
... )
>>> result = transformer.fit_transform(X)
>>> print(result)
shape: (4, 5)
┌───────────────────────────┬────────────────────┬─────────────┬────────────────┬────────────────┐
│ log                       ┆ tags               ┆ log__error  ┆ log__success   ┆ log__timeout   │
│ ---                       ┆ ---                ┆ ---         ┆ ---            ┆ ---            │
│ str                       ┆ str                ┆ i64         ┆ i64            ┆ i64            │
├───────────────────────────┼────────────────────┼─────────────┼────────────────┼────────────────┤
│ Error: invalid input      ┆ #python #ml #data  ┆ 1           ┆ 0              ┆ 0              │
│ Success: completed        ┆ #python #java      ┆ 0           ┆ 1              ┆ 0              │
│ Error: timeout Error      ┆ #ml #python        ┆ 2           ┆ 0              ┆ 1              │
│ null                      ┆                    ┆ 0           ┆ 0              ┆ 0              │
└───────────────────────────┴────────────────────┴─────────────┴────────────────┴────────────────┘

Example 2: Count hashtags (case sensitive)

>>> transformer = Occurrences(
...     subset=['tags'],
...     substrings={'tags': ['#python', '#ml', '#java', '#data']},
...     case_sensitive=True
... )
>>> result = transformer.fit_transform(X)

Example 3: Multiple columns with drop

>>> transformer = Occurrences(
...     subset=['log', 'tags'],
...     substrings={
...         'log': ['error', 'success'],
...         'tags': ['#python', '#ml']
...     },
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
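Case-insensitive counting reduces to `str.count` on lowercased text; a minimal sketch (treating null text as a count of 0, consistent with the table in Example 1):

```python
def count_occurrences(text, substring, case_sensitive=False):
    """Non-overlapping occurrence count; null text counts as 0."""
    if text is None:
        return 0
    if not case_sensitive:
        text, substring = text.lower(), substring.lower()
    return text.count(substring)

print(count_occurrences("Error: timeout Error", "error"))                       # 2
print(count_occurrences("Error: timeout Error", "error", case_sensitive=True))  # 0
```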
fit(X, y=None)[source]#

Fit the transformer by identifying string columns if not specified.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Occurrences

transform(X)[source]#

Transform the input DataFrame by counting substring occurrences.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with substring count features.

Return type:

DataFrame

class gators.feature_generation_str.PatternDetector[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Detects common patterns in string columns (emails, URLs, phone numbers, etc.).

Creates boolean features indicating whether strings match common patterns, useful for tree-based models to branch on data format and validity.

Parameters:
  • subset (Optional[List[str]], default=None) – List of string columns to extract features from. If None, all string columns will be used.

  • patterns (List[str], default=["is_numeric", "is_email", "is_url", "is_phone"]) –

    Patterns to detect. Options:

    • “is_numeric”: Contains only digits (possibly with decimal/negative)

    • “is_email”: Matches email pattern (basic check)

    • “is_url”: Matches URL pattern (http/https)

    • “is_phone”: Matches phone number pattern

    • “is_alphanumeric”: Contains only letters and digits

    • “is_alpha”: Contains only letters

    • “has_http”: Contains http:// or https://

    • “has_www”: Contains www.

    • “has_at”: Contains @ symbol

  • drop_columns (bool, default=False) – Whether to drop the original string columns after feature extraction.

Examples

>>> from gators.feature_generation_str import PatternDetector
>>> import polars as pl
>>> X = pl.DataFrame({
...     'contact': ['user@test.com', 'https://site.com', '555-1234', 'Hello World', None],
...     'code': ['ABC123', '999', 'test@email', 'XYZ', '']
... })

Example 1: Email and URL detection

>>> transformer = PatternDetector(
...     subset=['contact'],
...     patterns=['is_email', 'is_url', 'is_phone']
... )
>>> result = transformer.fit_transform(X)
>>> print(result)
shape: (5, 5)
┌────────────────────┬─────────┬───────────────────┬──────────────────┬───────────────────┐
│ contact            ┆ code    ┆ contact__is_email ┆ contact__is_url  ┆ contact__is_phone │
│ ---                ┆ ---     ┆ ---               ┆ ---              ┆ ---               │
│ str                ┆ str     ┆ bool              ┆ bool             ┆ bool              │
├────────────────────┼─────────┼───────────────────┼──────────────────┼───────────────────┤
│ user@test.com      ┆ ABC123  ┆ true              ┆ false            ┆ false             │
│ https://site.com   ┆ 999     ┆ false             ┆ true             ┆ false             │
│ 555-1234           ┆ test@e… ┆ false             ┆ false            ┆ true              │
│ Hello World        ┆ XYZ     ┆ false             ┆ false            ┆ false             │
│ null               ┆         ┆ false             ┆ false            ┆ false             │
└────────────────────┴─────────┴───────────────────┴──────────────────┴───────────────────┘

Example 2: Numeric and alphanumeric detection

>>> transformer = PatternDetector(
...     subset=['code'],
...     patterns=['is_numeric', 'is_alphanumeric', 'is_alpha']
... )
>>> result = transformer.fit_transform(X)

Example 3: URL component detection

>>> transformer = PatternDetector(
...     subset=['contact'],
...     patterns=['has_http', 'has_www', 'has_at'],
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
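These checks are regular-expression matches at heart; a rough sketch of the idea — the patterns below are illustrative assumptions, not the transformer's actual regexes:

```python
import re

# Illustrative patterns only; the transformer's real regexes may differ.
PATTERNS = {
    "is_email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "is_url": re.compile(r"^https?://\S+$"),
    "is_numeric": re.compile(r"^-?\d+(\.\d+)?$"),
}

def detect(value, name):
    """False for nulls; otherwise whether the value matches the named pattern."""
    return value is not None and bool(PATTERNS[name].match(value))

print(detect("user@test.com", "is_email"))   # True
print(detect("https://site.com", "is_url"))  # True
print(detect(None, "is_email"))              # False
```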
fit(X, y=None)[source]#

Fit the transformer by identifying string columns if not specified.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

PatternDetector

transform(X)[source]#

Transform the input DataFrame by creating pattern detection features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with pattern detection features.

Return type:

DataFrame

class gators.feature_generation_str.Split[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates features by splitting columns into multiple columns based on a delimiter.

The number of split columns is specified by the max_splits parameter to ensure consistent column structure between training and new data.

Parameters:
  • subset (List[str]) – List of column names to split.

  • by (str) – Delimiter to split the columns by.

  • max_splits (PositiveInt) – Maximum number of split columns to create. If a value has more splits, extra splits are truncated. If fewer, remaining columns are filled with empty strings.

  • drop_columns (bool, optional) – Whether to drop the original columns after splitting, by default True.

Examples

>>> from gators.feature_generation_str import Split
>>> import polars as pl
>>> X = {'full_name': ['John Doe', 'Jane Smith Williams', 'Alice Johnson']}
>>> X = pl.DataFrame(X)

Example 1: Split with max_splits=3 and drop_columns=True (default)

>>> transformer = Split(subset=['full_name'], by=' ', max_splits=3, drop_columns=True)
>>> transformer.fit(X)
Split(subset=['full_name'], by=' ', max_splits=3, drop_columns=True)
>>> result = transformer.transform(X)
>>> result
shape: (3, 3)
┌─────────────────────┬─────────────────────┬─────────────────────┐
│ full_name__split_0  │ full_name__split_1  │ full_name__split_2  │
│         str         │         str         │         str         │
├─────────────────────┼─────────────────────┼─────────────────────┤
│        John         │         Doe         │                     │
│        Jane         │        Smith        │      Williams       │
│       Alice         │       Johnson       │                     │
└─────────────────────┴─────────────────────┴─────────────────────┘

Example 2: Split with max_splits=2 and drop_columns=False

>>> transformer = Split(subset=['full_name'], by=' ', max_splits=2, drop_columns=False)
>>> transformer.fit(X)
Split(subset=['full_name'], by=' ', max_splits=2, drop_columns=False)
>>> result = transformer.transform(X)
>>> result
shape: (3, 3)
┌──────────────────────┬─────────────────────┬─────────────────────┐
│      full_name       │ full_name__split_0  │ full_name__split_1  │
│         str          │         str         │         str         │
├──────────────────────┼─────────────────────┼─────────────────────┤
│      John Doe        │        John         │         Doe         │
│  Jane Smith Williams │        Jane         │        Smith        │
│   Alice Johnson      │       Alice         │       Johnson       │
└──────────────────────┴─────────────────────┴─────────────────────┘
fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Split

transform(X)[source]#

Transform the input DataFrame by splitting columns.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.SplitExtract[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates features by splitting columns and extracting a specific part by index.

Parameters:
  • subset (List[str]) – List of column names to split and extract from.

  • by (str) – Delimiter to split the columns by.

  • n (NonNegativeInt) – Index of the part to extract from the split (0-indexed).

  • drop_columns (bool, optional) – Whether to drop the original columns after splitting, by default True.
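The extraction logic can be sketched per value (a hypothetical helper for illustration; in particular, returning '' when the index is out of range is an assumption, not documented gators behavior):

```python
def split_extract(value, by, n):
    """Return the n-th part of the split (0-indexed), or '' when the
    index is out of range (out-of-range behavior is an assumption)."""
    parts = value.split(by)
    return parts[n] if n < len(parts) else ""

split_extract("Jane Smith", " ", 1)  # 'Smith'
split_extract("Alice", " ", 1)       # ''
```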

Examples

>>> from gators.feature_generation_str import SplitExtract
>>> import polars as pl
>>> X = {'full_name': ['John Doe', 'Jane Smith', 'Alice Johnson']}
>>> X = pl.DataFrame(X)

Example 1: Extract first part (n=0)

>>> transformer = SplitExtract(subset=['full_name'], by=' ', n=0, drop_columns=True)
>>> transformer.fit(X)
SplitExtract(subset=['full_name'], by=' ', n=0, drop_columns=True)
>>> result = transformer.transform(X)
>>> result
shape: (3, 1)
┌────────────────────┐
│ full_name__split_0 │
│         str        │
├────────────────────┤
│        John        │
│        Jane        │
│       Alice        │
└────────────────────┘

Example 2: Extract second part (n=1)

>>> transformer = SplitExtract(subset=['full_name'], by=' ', n=1, drop_columns=False)
>>> transformer.fit(X)
SplitExtract(subset=['full_name'], by=' ', n=1, drop_columns=False)
>>> result = transformer.transform(X)
>>> result
shape: (3, 2)
┌──────────────────┬────────────────────┐
│    full_name     │ full_name__split_1 │
│       str        │         str        │
├──────────────────┼────────────────────┤
│     John Doe     │        Doe         │
│    Jane Smith    │       Smith        │
│   Alice Johnson  │      Johnson       │
└──────────────────┴────────────────────┘
fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

SplitExtract

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.Startswith[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates Boolean features to indicate if strings in the original columns start with specified substrings.
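The per-value logic behind these Boolean features can be sketched as follows (`startswith_flags` is a hypothetical helper; null propagation shown here matches the example below):

```python
def startswith_flags(value, prefixes):
    """One Boolean flag per prefix; None inputs yield None flags so that
    missing values stay distinguishable from non-matches."""
    if value is None:
        return {f"startswith_{p}": None for p in prefixes}
    return {f"startswith_{p}": value.startswith(p) for p in prefixes}

startswith_flags("pre1_sample", ["pre1", "pre2"])
# {'startswith_pre1': True, 'startswith_pre2': False}
```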

Examples

Create an instance of the Startswith class:

>>> import polars as pl
>>> from gators.feature_generation_str import Startswith
>>> startswith_dict = {'col1': ['pre1', 'pre2'], 'col2': ['pre3']}
>>> transformer = Startswith(startswith_dict=startswith_dict)

Fit the transformer:

>>> X = pl.DataFrame({"col1": ["pre1_sample", None, "pre2_sample"],
...                   "col2": [None, "pre3_sample", "no_match"]})
>>> transformer.fit(X)

Transform the DataFrame:

>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (3, 5)
┌─────────────┬─────────────┬───────────────────────┬───────────────────────┬───────────────────────┐
│ col1        ┆ col2        ┆ col1__startswith_pre1 ┆ col1__startswith_pre2 ┆ col2__startswith_pre3 │
│ ---         ┆ ---         ┆ ---                   ┆ ---                   ┆ ---                   │
│ str         ┆ str         ┆ bool                  ┆ bool                  ┆ bool                  │
├─────────────┼─────────────┼───────────────────────┼───────────────────────┼───────────────────────┤
│ pre1_sample ┆ null        ┆ true                  ┆ false                 ┆ null                  │
│ null        ┆ pre3_sample ┆ null                  ┆ null                  ┆ true                  │
│ pre2_sample ┆ no_match    ┆ false                 ┆ true                  ┆ false                 │
└─────────────┴─────────────┴───────────────────────┴───────────────────────┴───────────────────────┘
fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Startswith

transform(X)[source]#

Transform the input DataFrame by creating startswith Boolean features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.Upper[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Converts string and Boolean columns to uppercase.

Parameters:
  • subset (Optional[List[str]], default=None) – List of columns to convert to uppercase.

  • drop_columns (bool, default=True) – Whether to drop original columns after transformation.
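The column-naming and drop behavior can be sketched row-wise in plain Python (a simplified stand-in for the columnar implementation; `upper_features` is a hypothetical helper):

```python
def upper_features(row, subset, drop_columns=True):
    """Add a '<col>__upper' entry for each selected column; keep the
    originals only when drop_columns is False. Non-string values pass
    through unchanged."""
    out = {} if drop_columns else dict(row)
    for col in subset:
        val = row[col]
        out[f"{col}__upper"] = val.upper() if isinstance(val, str) else val
    return out

upper_features({"A": "cat", "B": "yes"}, subset=["B"], drop_columns=False)
# {'A': 'cat', 'B': 'yes', 'B__upper': 'YES'}
```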

Examples

>>> import polars as pl
>>> from gators.feature_generation_str import Upper
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['yes', 'no', 'yes', 'no', 'yes'],
...     'C': ['quick', 'brown', 'fox', 'jumps', 'over']
... })
>>> # Transform to uppercase with default parameters (drop_columns=True)
>>> encoder = Upper()
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌──────────┬──────────┬──────────┐
│ A__upper │ B__upper │ C__upper │
│ ---      │ ---      │ ---      │
│ str      │ str      │ str      │
├──────────┼──────────┼──────────┤
│ CAT      │ YES      │ QUICK    │
│ DOG      │ NO       │ BROWN    │
│ CAT      │ YES      │ FOX      │
│ DOG      │ NO       │ JUMPS    │
│ CAT      │ YES      │ OVER     │
└──────────┴──────────┴──────────┘
>>> # Transform to uppercase with drop_columns=False
>>> encoder = Upper(drop_columns=False)
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 6)
┌─────┬──────┬───────┬──────────┬──────────┬──────────┐
│ A   │ B    │ C     │ A__upper │ B__upper │ C__upper │
│ --- │ ---  │ ---   │ ---      │ ---      │ ---      │
│ str │ str  │ str   │ str      │ str      │ str      │
├─────┼──────┼───────┼──────────┼──────────┼──────────┤
│ cat │ yes  │ quick │ CAT      │ YES      │ QUICK    │
│ dog │ no   │ brown │ DOG      │ NO       │ BROWN    │
│ cat │ yes  │ fox   │ CAT      │ YES      │ FOX      │
│ dog │ no   │ jumps │ DOG      │ NO       │ JUMPS    │
│ cat │ yes  │ over  │ CAT      │ YES      │ OVER     │
└─────┴──────┴───────┴──────────┴──────────┴──────────┘
>>> # Transform to uppercase with columns as a subset
>>> encoder = Upper(subset=['B'])
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬──────┬───────┬──────────┐
│ A   │ B    │ C     │ B__upper │
│ --- │ ---  │ ---   │ ---      │
│ str │ str  │ str   │ str      │
├─────┼──────┼───────┼──────────┤
│ cat │ yes  │ quick │ YES      │
│ dog │ no   │ brown │ NO       │
│ cat │ yes  │ fox   │ YES      │
│ dog │ no   │ jumps │ NO       │
│ cat │ yes  │ over  │ YES      │
└─────┴──────┴───────┴──────────┘
fit(X, y=None)[source]#

Fit the transformer by identifying the string and Boolean columns to convert if no subset is specified.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Upper

transform(X)[source]#

Transform the input DataFrame by converting the selected columns to uppercase.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame