gators.feature_generation_str package#

Module contents#

class gators.feature_generation_str.CharacterStatistics[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates character-level statistical features from string columns.

Counts occurrences of various character types (digits, letters, spaces, etc.); these counts are particularly useful for tree-based models to identify patterns in text data.

Parameters:
  • subset (Optional[List[str]], default=None) – List of string columns to extract features from. If None, all string columns will be used.

  • features (List[str], default=["n_digits", "n_letters", "n_uppercase", "n_lowercase", "n_spaces", "n_special"]) –

    Character statistics to generate. Options:

    • “n_digits”: Count of digit characters (0-9)

    • “n_letters”: Count of alphabetic characters (a-z, A-Z)

    • “n_uppercase”: Count of uppercase letters

    • “n_lowercase”: Count of lowercase letters

    • “n_spaces”: Count of space characters

    • “n_special”: Count of special characters (punctuation, symbols)

    • “n_unique_chars”: Count of unique characters

    • “ratio_uppercase”: Ratio of uppercase to total letters

    • “ratio_digits”: Ratio of digits to total length

    • “ratio_special”: Ratio of special chars to total length

  • drop_columns (bool, default=False) – Whether to drop the original string columns after feature extraction.

Examples

>>> from gators.feature_generation_str import CharacterStatistics
>>> import polars as pl
>>> X = pl.DataFrame({
...     'text': ['Hello123', 'WORLD!!!', 'Test 99', ''],
...     'email': ['user@test.com', 'ADMIN@SITE.ORG', 'test', None]
... })

Example 1: Basic character counts

>>> transformer = CharacterStatistics(
...     subset=['text'],
...     features=['n_digits', 'n_letters', 'n_uppercase']
... )
>>> result = transformer.fit_transform(X)
>>> print(result)
shape: (4, 5)
┌──────────┬────────────────┬────────────────┬─────────────────┬───────────────────┐
│ text     ┆ email          ┆ text__n_digits ┆ text__n_letters ┆ text__n_uppercase │
│ ---      ┆ ---            ┆ ---            ┆ ---             ┆ ---               │
│ str      ┆ str            ┆ i64            ┆ i64             ┆ i64               │
├──────────┼────────────────┼────────────────┼─────────────────┼───────────────────┤
│ Hello123 ┆ user@test.com  ┆ 3              ┆ 5               ┆ 1                 │
│ WORLD!!! ┆ ADMIN@SITE.ORG ┆ 0              ┆ 5               ┆ 5                 │
│ Test 99  ┆ test           ┆ 2              ┆ 4               ┆ 1                 │
│          ┆ null           ┆ 0              ┆ 0               ┆ 0                 │
└──────────┴────────────────┴────────────────┴─────────────────┴───────────────────┘

Example 2: Ratio features

>>> transformer = CharacterStatistics(
...     subset=['text', 'email'],
...     features=['ratio_uppercase', 'ratio_digits', 'ratio_special']
... )
>>> result = transformer.fit_transform(X)
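The ratio features are simple arithmetic over the character counts; a minimal pure-Python sketch (the exact handling of empty strings and letter-free values inside the transformer is an assumption):

```python
def ratio_digits(s):
    """Fraction of all characters that are digits; 0.0 for empty strings (assumed)."""
    if not s:
        return 0.0
    return sum(c.isdigit() for c in s) / len(s)

def ratio_uppercase(s):
    """Fraction of letters that are uppercase; 0.0 when there are no letters (assumed)."""
    letters = [c for c in s if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)

print(ratio_digits("Hello123"))    # 3 digits / 8 chars = 0.375
print(ratio_uppercase("Test 99"))  # 1 uppercase / 4 letters = 0.25
```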

Example 3: All features with drop_columns

>>> transformer = CharacterStatistics(
...     subset=['text'],
...     features=['n_digits', 'n_letters', 'n_spaces', 'n_special', 'n_unique_chars'],
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
fit(X, y=None)[source]#

Fit the transformer by identifying string columns if not specified.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

CharacterStatistics

transform(X)[source]#

Transform the input DataFrame by creating character statistics features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with character statistics features.

Return type:

DataFrame

class gators.feature_generation_str.CombineFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Combines specific string/categorical columns to create composite key features (UID-like).

Unlike InteractionFeatures which generates all combinations, this transformer only creates the specific column combinations you provide, making it more efficient for creating unique identifiers or specific interaction features.

Parameters:
  • column_groups (List[List[str]]) – List of column groups to combine. Each group is a list of column names that will be concatenated together. Example: [[‘cat1’, ‘cat2’], [‘cat1’, ‘addr1’]]

  • separator (str, default='_') – String to use as separator when combining column values.

  • drop_columns (bool, default=False) – Whether to drop the original columns after creating combinations.

  • new_column_names (Optional[List[str]], default=None) – List of custom names for the combined columns. If None, uses default naming pattern where columns are joined with ‘__’ (e.g., ‘cat1__cat2’). Must have same length as column_groups.

Examples

>>> from gators.feature_generation_str import CombineFeatures
>>> import polars as pl
>>> X = {
...     'cat1': ['A', 'A', 'B', 'B', 'A'],
...     'cat2': ['X', 'Y', 'X', 'Y', 'X'],
...     'addr1': ['US', 'US', 'UK', 'UK', 'CA'],
...     'amount': [100, 200, 150, 300, 250]
... }
>>> X = pl.DataFrame(X)

Example 1: Basic combination

>>> transformer = CombineFeatures(
...     column_groups=[['cat1', 'cat2']]
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (5, 5)
┌───────┬───────┬───────┬────────┬──────────────┐
│ cat1  ┆ cat2  ┆ addr1 ┆ amount ┆ cat1__cat2   │
│ ---   ┆ ---   ┆ ---   ┆ ---    ┆ ---          │
│ str   ┆ str   ┆ str   ┆ i64    ┆ str          │
╞═══════╪═══════╪═══════╪════════╪══════════════╡
│ A     ┆ X     ┆ US    ┆ 100    ┆ A_X          │
│ A     ┆ Y     ┆ US    ┆ 200    ┆ A_Y          │
│ B     ┆ X     ┆ UK    ┆ 150    ┆ B_X          │
│ B     ┆ Y     ┆ UK    ┆ 300    ┆ B_Y          │
│ A     ┆ X     ┆ CA    ┆ 250    ┆ A_X          │
└───────┴───────┴───────┴────────┴──────────────┘

Example 2: Multiple combinations

>>> transformer = CombineFeatures(
...     column_groups=[['cat1', 'cat2'], ['cat1', 'addr1']]
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (5, 6)
┌───────┬───────┬───────┬────────┬──────────────┬───────────────┐
│ cat1  ┆ cat2  ┆ addr1 ┆ amount ┆ cat1__cat2   ┆ cat1__addr1   │
│ ---   ┆ ---   ┆ ---   ┆ ---    ┆ ---          ┆ ---           │
│ str   ┆ str   ┆ str   ┆ i64    ┆ str          ┆ str           │
╞═══════╪═══════╪═══════╪════════╪══════════════╪═══════════════╡
│ A     ┆ X     ┆ US    ┆ 100    ┆ A_X          ┆ A_US          │
│ A     ┆ Y     ┆ US    ┆ 200    ┆ A_Y          ┆ A_US          │
│ B     ┆ X     ┆ UK    ┆ 150    ┆ B_X          ┆ B_UK          │
│ B     ┆ Y     ┆ UK    ┆ 300    ┆ B_Y          ┆ B_UK          │
│ A     ┆ X     ┆ CA    ┆ 250    ┆ A_X          ┆ A_CA          │
└───────┴───────┴───────┴────────┴──────────────┴───────────────┘

Example 3: Custom separator and column names

>>> transformer = CombineFeatures(
...     column_groups=[['cat1', 'cat2', 'addr1']],
...     separator='|',
...     new_column_names=['uid']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (5, 5)
┌───────┬───────┬───────┬────────┬──────────┐
│ cat1  ┆ cat2  ┆ addr1 ┆ amount ┆ uid      │
│ ---   ┆ ---   ┆ ---   ┆ ---    ┆ ---      │
│ str   ┆ str   ┆ str   ┆ i64    ┆ str      │
╞═══════╪═══════╪═══════╪════════╪══════════╡
│ A     ┆ X     ┆ US    ┆ 100    ┆ A|X|US   │
│ A     ┆ Y     ┆ US    ┆ 200    ┆ A|Y|US   │
│ B     ┆ X     ┆ UK    ┆ 150    ┆ B|X|UK   │
│ B     ┆ Y     ┆ UK    ┆ 300    ┆ B|Y|UK   │
│ A     ┆ X     ┆ CA    ┆ 250    ┆ A|X|CA   │
└───────┴───────┴───────┴────────┴──────────┘

Example 4: Creating UIDs for fraud detection

>>> # Create composite keys for unique user identification
>>> transformer = CombineFeatures(
...     column_groups=[
...         ['cat1', 'cat2', 'card3'],  # Card combination
...         ['cat1', 'addr1'],            # Card + address
...         ['email_domain', 'cat1']      # Email + card
...     ],
...     new_column_names=['card_uid', 'card_addr_uid', 'email_card_uid']
... )
>>> # Can then use these UIDs for frequency encoding or groupby operations

Notes

  • Null values are converted to string “null” before concatenation

  • All column values are cast to string before combining

  • Useful for creating unique identifiers (UIDs) for user/card tracking

  • More efficient than InteractionFeatures when you know exactly which combinations you need
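The notes above amount to a row-wise string concatenation; a pure-Python sketch of the documented behavior (cast to string, null → "null", join with the separator):

```python
def combine(rows, columns, separator="_"):
    """Concatenate the given columns row by row; nulls become the string 'null'."""
    return [
        separator.join(
            "null" if row[col] is None else str(row[col]) for col in columns
        )
        for row in rows
    ]

rows = [
    {"cat1": "A", "cat2": "X"},
    {"cat1": None, "cat2": "Y"},
]
print(combine(rows, ["cat1", "cat2"]))  # ['A_X', 'null_Y']
```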

fit(X, y=None)[source]#

Fit the transformer by generating column name mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

CombineFeatures

transform(X)[source]#

Transform the input DataFrame by combining categorical columns.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with combined categorical features.

Return type:

DataFrame

class gators.feature_generation_str.Contains[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates Boolean columns indicating if substrings are contained within the original column values.

Examples

Create an instance of the Contains class:

>>> import polars as pl
>>> from gators.feature_generation_str import Contains
>>> contains_dict = {'col1': ['sub1', 'sub2'], 'col2': ['sub3']}
>>> transformer = Contains(contains_dict=contains_dict)

Fit the transformer:

>>> X = pl.DataFrame({"col1": ["sub1 here", None, "sub2 here"],
...                   "col2": [None, "sub3 inside", "no match"]})
>>> transformer.fit(X)

Transform the dataframe:

>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (3, 5)
╭─────────────┬───────────────┬──────────────┬──────────────┬──────────────╮
│ col1        │ col2          │ col1__sub1   │ col1__sub2   │ col2__sub3   │
├─────────────┼───────────────┼──────────────┼──────────────┼──────────────┤
│ sub1 here   │ None          │ True         │ False        │ None         │
├─────────────┼───────────────┼──────────────┼──────────────┼──────────────┤
│ None        │ sub3 inside   │ None         │ None         │ True         │
├─────────────┼───────────────┼──────────────┼──────────────┼──────────────┤
│ sub2 here   │ no match      │ False        │ True         │ False        │
╰─────────────┴───────────────┴──────────────┴──────────────┴──────────────╯
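The null propagation visible in the table above (null input → null feature rather than False) can be sketched in plain Python, assuming the transformer keeps nulls as shown:

```python
def contains(value, substring):
    """Null in -> null out; otherwise a plain substring membership test."""
    if value is None:
        return None
    return substring in value

print(contains("sub1 here", "sub1"))  # True
print(contains(None, "sub1"))         # None
```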
fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Contains

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.Endswith[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates Boolean features indicating if substrings are at the end of the original column values.

Examples

Create an instance of the Endswith class:

>>> import polars as pl
>>> from gators.feature_generation_str import Endswith
>>> endswith_dict = {'col1': ['end1', 'end2'], 'col2': ['end3']}
>>> transformer = Endswith(endswith_dict=endswith_dict)

Fit the transformer:

>>> X = pl.DataFrame({"col1": ["this end1", None, "that end2"],
...                   "col2": [None, "one end3", "another no end"]})
>>> transformer.fit(X)

Transform the dataframe:

>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (3, 5)
╭─────────────┬───────────────┬─────────────────────┬─────────────────────┬─────────────────────╮
│ col1        │ col2          │ col1__endswith_end1 │ col1__endswith_end2 │ col2__endswith_end3 │
├─────────────┼───────────────┼─────────────────────┼─────────────────────┼─────────────────────┤
│ this end1   │ None          │ True                │ False               │ None                │
├─────────────┼───────────────┼─────────────────────┼─────────────────────┼─────────────────────┤
│ None        │ one end3      │ None                │ None                │ True                │
├─────────────┼───────────────┼─────────────────────┼─────────────────────┼─────────────────────┤
│ that end2   │ another no end│ False               │ True                │ False               │
╰─────────────┴───────────────┴─────────────────────┴─────────────────────┴─────────────────────╯
fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Endswith

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.ExtractSubstring[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Extracts specified substring components from string columns.

fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

ExtractSubstring

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.InteractionFeatures[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates interaction features for categorical variables.

Parameters:
  • subset (Optional[List[str]], default=None) – List of columns to consider for interaction.

  • degree (conint(gt=1), default=2) – Degree of interaction terms.

Examples

>>> import polars as pl
>>> from gators.feature_generation_str import InteractionFeatures
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['x', 'x', 'y', 'y', 'x'],
...     'C': ['red', 'blue', 'green', 'blue', 'red']
... })
>>> # Interaction with default parameters (degree=2)
>>> interaction_features = InteractionFeatures()
>>> interaction_features.fit(X)
>>> transformed_X = interaction_features.transform(X)
>>> print(transformed_X)
shape: (5, 6)
┌─────┬─────┬───────┬───────┬───────────┬─────────┐
│ A   │ B   │ C     │ A__B  │ A__C      │ B__C    │
│ --- │ --- │ ---   │ ---   │ ---       │ ---     │
│ str │ str │ str   │ str   │ str       │ str     │
├─────┼─────┼───────┼───────┼───────────┼─────────┤
│ cat │ x   │ red   │ cat_x │ cat_red   │ x_red   │
│ dog │ x   │ blue  │ dog_x │ dog_blue  │ x_blue  │
│ cat │ y   │ green │ cat_y │ cat_green │ y_green │
│ dog │ y   │ blue  │ dog_y │ dog_blue  │ y_blue  │
│ cat │ x   │ red   │ cat_x │ cat_red   │ x_red   │
└─────┴─────┴───────┴───────┴───────────┴─────────┘
>>> # Interaction with degree=3
>>> interaction_features = InteractionFeatures(degree=3)
>>> interaction_features.fit(X)
>>> transformed_X = interaction_features.transform(X)
>>> print(transformed_X)
shape: (5, 7)
┌─────┬─────┬──────┬─────────────┬──────────────┬───────────────┬───────────────┐
│ A   │ B   │ C    │ A__B        │ A__C         │ B__C          │ A__B__C       │
│ --- │ --- │ ---  │ ---         │ ---          │ ---           │ ---           │
│ str │ str │ str  │ str         │ str          │ str           │ str           │
├─────┼─────┼──────┼─────────────┼──────────────┼───────────────┼───────────────┤
│ cat │ x   │ red  │ cat_x       │ cat_red      │ x_red         │ cat_x_red     │
│ dog │ x   │ blue │ dog_x       │ dog_blue     │ x_blue        │ dog_x_blue    │
│ cat │ y   │ green│ cat_y       │ cat_green    │ y_green       │ cat_y_green   │
│ dog │ y   │ blue │ dog_y       │ dog_blue     │ y_blue        │ dog_y_blue    │
│ cat │ x   │ red  │ cat_x       │ cat_red      │ x_red         │ cat_x_red     │
└─────┴─────┴──────┴─────────────┴──────────────┴───────────────┴───────────────┘
>>> # Interaction restricted to a subset of columns
>>> interaction_features = InteractionFeatures(subset=['A', 'B'])
>>> interaction_features.fit(X)
>>> transformed_X = interaction_features.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬─────┬──────┬─────────────┐
│ A   │ B   │ C    │ A__B        │
│ --- │ --- │ ---  │ ---         │
│ str │ str │ str  │ str         │
├─────┼─────┼──────┼─────────────┤
│ cat │ x   │ red  │ cat_x       │
│ dog │ x   │ blue │ dog_x       │
│ cat │ y   │ green│ cat_y       │
│ dog │ y   │ blue │ dog_y       │
│ cat │ x   │ red  │ cat_x       │
└─────┴─────┴──────┴─────────────┘
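The generated column set corresponds to all column combinations of sizes 2 through degree; the naming logic can be sketched with itertools:

```python
from itertools import combinations

def interaction_names(columns, degree=2):
    """Names of all interaction columns: groups of size 2..degree, joined by '__'."""
    return [
        "__".join(group)
        for size in range(2, degree + 1)
        for group in combinations(columns, size)
    ]

print(interaction_names(["A", "B", "C"], degree=3))
# ['A__B', 'A__C', 'B__C', 'A__B__C'] -- matching the degree=3 example above
```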
fit(X, y=None)[source]#

Fit the transformer by identifying categorical columns if not specified.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

InteractionFeatures

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.Length[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates features based on the length of the variables.

Parameters:

subset (Optional[List[str]], default=None) – List of columns to calculate length for.

Examples

>>> import polars as pl
>>> from gators.feature_generation_str import Length
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['yes', 'no', 'yes', 'no', 'yes'],
...     'C': ['quick', 'brown', 'fox', 'jumps', 'over']
... })
>>> # Calculate lengths with default parameters
>>> encoder = Length()
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 6)
┌─────┬──────┬───────┬──────────┬──────────┬───────────┐
│ A   │ B    │ C     │ A__length│ B__length│ C__length │
│ --- │ ---  │ ---   │ ---      │ ---      │ ---       │
│ str │ str  │ str   │ i64      │ i64      │ i64       │
├─────┼──────┼───────┼──────────┼──────────┼───────────┤
│ cat │ yes  │ quick │ 3        │ 3        │ 5         │
│ dog │ no   │ brown │ 3        │ 2        │ 5         │
│ cat │ yes  │ fox   │ 3        │ 3        │ 3         │
│ dog │ no   │ jumps │ 3        │ 2        │ 5         │
│ cat │ yes  │ over  │ 3        │ 3        │ 4         │
└─────┴──────┴───────┴──────────┴──────────┴───────────┘
>>> # Calculate lengths with columns as a subset
>>> encoder = Length(subset=['B'])
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬──────┬───────┬──────────┐
│ A   │ B    │ C     │ B__length│
│ --- │ ---  │ ---   │ ---      │
│ str │ str  │ str   │ i64      │
├─────┼──────┼───────┼──────────┤
│ cat │ yes  │ quick │ 3        │
│ dog │ no   │ brown │ 2        │
│ cat │ yes  │ fox   │ 3        │
│ dog │ no   │ jumps │ 2        │
│ cat │ yes  │ over  │ 3        │
└─────┴──────┴───────┴──────────┘
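One subtlety worth noting: "length" can mean characters or bytes, and the two differ for non-ASCII text (the outputs above are all ASCII, where they coincide; which definition the transformer uses is not stated here). In plain Python:

```python
s = "café"
print(len(s))                  # 4 characters
print(len(s.encode("utf-8")))  # 5 bytes: 'é' encodes to two bytes in UTF-8
```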
fit(X, y=None)[source]#

Fit the transformer by identifying categorical columns and generating column mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Length

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.Lower[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Converts string and Boolean columns to lowercase.

Examples

Create an instance of the Lower class:

>>> import polars as pl
>>> from gators.feature_generation_str import Lower
>>> lower = Lower(subset=["col1"], drop_columns=False)

Fit the transformer:

>>> X = pl.DataFrame({"col1": ["Hello", "WORLD"], "col2": [True, False]})
>>> lower.fit(X)

Transform the dataframe:

>>> transformed_X = lower.transform(X)
>>> print(transformed_X)
shape: (2, 3)
┌───────┬───────┬─────────────┐
│ col1  │ col2  │ col1__lower │
├───────┼───────┼─────────────┤
│ Hello │ True  │ hello       │
│ WORLD │ False │ world       │
└───────┴───────┴─────────────┘

If drop_columns is True, the original columns are dropped:

>>> lower.drop_columns = True
>>> transformed_X = lower.transform(X)
>>> print(transformed_X)
shape: (2, 1)
┌─────────────┐
│ col1__lower │
├─────────────┤
│ hello       │
│ world       │
└─────────────┘
fit(X, y=None)[source]#

Fit the transformer by identifying categorical columns and generating column mappings.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Lower

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.NGram[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Extracts character or word n-grams from string columns.

Creates count features for the most common n-grams, useful for tree-based models to capture local text patterns and sequences.

Parameters:
  • subset (Optional[List[str]], default=None) – List of string columns to extract n-grams from. If None, all string columns will be used.

  • n (int, default=2) – Size of n-grams to extract (e.g., 2 for bigrams, 3 for trigrams).

  • ngram_type (str, default="char") –

    Type of n-grams to extract:

    • “char”: Character-level n-grams (e.g., “ab”, “bc” from “abc”)

    • “word”: Word-level n-grams (e.g., “hello world” as a bigram)

  • max_features (int, default=10) – Maximum number of most common n-grams to extract per column. Top-k n-grams are selected during fit() based on frequency.

  • min_count (int, default=1) – Minimum number of occurrences for an n-gram to be included.

  • drop_columns (bool, default=False) – Whether to drop the original string columns after feature extraction.

Examples

>>> from gators.feature_generation_str import NGram
>>> import polars as pl
>>> X = pl.DataFrame({
...     'text': ['hello world', 'hello there', 'world peace', None],
...     'desc': ['test data', 'test case', 'data case', '']
... })

Example 1: Character bigrams

>>> transformer = NGram(
...     subset=['text'],
...     n=2,
...     ngram_type='char',
...     max_features=5
... )
>>> result = transformer.fit_transform(X)
>>> print(result.columns)
['text', 'desc', 'text__ng_he', 'text__ng_el', 'text__ng_ll', 'text__ng_lo', 'text__ng_o_']
# Features for top-5 character bigrams: 'he', 'el', 'll', 'lo', 'o '

Example 2: Word bigrams

>>> transformer = NGram(
...     subset=['text'],
...     n=2,
...     ngram_type='word',
...     max_features=3
... )
>>> result = transformer.fit_transform(X)
# Features for word bigrams: 'hello world', 'hello there', 'world peace'

Example 3: Character trigrams with min_count

>>> transformer = NGram(
...     subset=['desc'],
...     n=3,
...     ngram_type='char',
...     max_features=5,
...     min_count=2
... )
>>> result = transformer.fit_transform(X)
# Only trigrams appearing at least 2 times are considered

Example 4: Multiple columns with drop

>>> transformer = NGram(
...     subset=['text', 'desc'],
...     n=2,
...     ngram_type='char',
...     max_features=10,
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
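The fit-time selection of the top-k character n-grams can be sketched in plain Python (the transformer's tie-breaking among equally frequent n-grams is an assumption; `Counter.most_common` keeps first-seen order for ties):

```python
from collections import Counter

def top_char_ngrams(texts, n=2, max_features=5, min_count=1):
    """Count character n-grams across non-null strings and keep the most common."""
    counts = Counter()
    for text in texts:
        if not text:
            continue  # skip nulls and empty strings
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [ng for ng, c in counts.most_common() if c >= min_count][:max_features]

texts = ["hello world", "hello there", "world peace", None]
print(top_char_ngrams(texts, n=2, max_features=5))
# ['he', 'el', 'll', 'lo', 'o '] -- the top-5 bigrams listed in Example 1
```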
fit(X, y=None)[source]#

Fit the transformer by identifying top-k n-grams for each column.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

NGram

transform(X)[source]#

Transform the input DataFrame by creating n-gram count features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with n-gram count features.

Return type:

DataFrame

class gators.feature_generation_str.Occurrences[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Counts occurrences of specific substrings or characters in string columns.

Creates count features showing how many times a substring appears, useful for tree-based models to split on frequency patterns.

Parameters:
  • subset (Optional[List[str]], default=None) – List of string columns to extract features from. If None, all string columns will be used.

  • substrings (Dict[str, List[str]]) – Dictionary mapping column names to lists of substrings to count. For example: {“description”: [“error”, “warning”, “success”]} will create 3 count features for the “description” column.

  • case_sensitive (bool, default=False) – Whether substring matching should be case sensitive.

  • drop_columns (bool, default=False) – Whether to drop the original string columns after feature extraction.

Examples

>>> from gators.feature_generation_str import Occurrences
>>> import polars as pl
>>> X = pl.DataFrame({
...     'log': ['Error: invalid input', 'Success: completed', 'Error: timeout Error', None],
...     'tags': ['#python #ml #data', '#python #java', '#ml #python', '']
... })

Example 1: Count specific keywords

>>> transformer = Occurrences(
...     subset=['log'],
...     substrings={'log': ['error', 'success', 'timeout']}
... )
>>> result = transformer.fit_transform(X)
>>> print(result)
shape: (4, 5)
┌───────────────────────────┬────────────────────┬─────────────┬────────────────┬────────────────┐
│ log                       ┆ tags               ┆ log__error  ┆ log__success   ┆ log__timeout   │
│ ---                       ┆ ---                ┆ ---         ┆ ---            ┆ ---            │
│ str                       ┆ str                ┆ i64         ┆ i64            ┆ i64            │
├───────────────────────────┼────────────────────┼─────────────┼────────────────┼────────────────┤
│ Error: invalid input      ┆ #python #ml #data  ┆ 1           ┆ 0              ┆ 0              │
│ Success: completed        ┆ #python #java      ┆ 0           ┆ 1              ┆ 0              │
│ Error: timeout Error      ┆ #ml #python        ┆ 2           ┆ 0              ┆ 1              │
│ null                      ┆                    ┆ 0           ┆ 0              ┆ 0              │
└───────────────────────────┴────────────────────┴─────────────┴────────────────┴────────────────┘

Example 2: Count hashtags (case sensitive)

>>> transformer = Occurrences(
...     subset=['tags'],
...     substrings={'tags': ['#python', '#ml', '#java', '#data']},
...     case_sensitive=True
... )
>>> result = transformer.fit_transform(X)

Example 3: Multiple columns with drop

>>> transformer = Occurrences(
...     subset=['log', 'tags'],
...     substrings={
...         'log': ['error', 'success'],
...         'tags': ['#python', '#ml']
...     },
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
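Case-insensitive counting reduces to `str.count` on lowercased text; a minimal sketch (treating null text as a count of 0, consistent with the table in Example 1):

```python
def count_occurrences(text, substring, case_sensitive=False):
    """Non-overlapping occurrence count; null text counts as 0."""
    if text is None:
        return 0
    if not case_sensitive:
        text, substring = text.lower(), substring.lower()
    return text.count(substring)

print(count_occurrences("Error: timeout Error", "error"))                       # 2
print(count_occurrences("Error: timeout Error", "error", case_sensitive=True))  # 0
```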
fit(X, y=None)[source]#

Fit the transformer by identifying string columns if not specified.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Occurrences

transform(X)[source]#

Transform the input DataFrame by counting substring occurrences.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with substring count features.

Return type:

DataFrame

class gators.feature_generation_str.PatternDetector[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Detects common patterns in string columns (emails, URLs, phone numbers, etc.).

Creates boolean features indicating whether strings match common patterns, useful for tree-based models to branch on data format and validity.

Parameters:
  • subset (Optional[List[str]], default=None) – List of string columns to extract features from. If None, all string columns will be used.

  • patterns (List[str], default=["is_numeric", "is_email", "is_url", "is_phone"]) –

    Patterns to detect. Options:

    • “is_numeric”: Contains only digits (possibly with decimal/negative)

    • “is_email”: Matches email pattern (basic check)

    • “is_url”: Matches URL pattern (http/https)

    • “is_phone”: Matches phone number pattern

    • “is_alphanumeric”: Contains only letters and digits

    • “is_alpha”: Contains only letters

    • “has_http”: Contains http:// or https://

    • “has_www”: Contains www.

    • “has_at”: Contains @ symbol

  • drop_columns (bool, default=False) – Whether to drop the original string columns after feature extraction.

Examples

>>> from gators.feature_generation_str import PatternDetector
>>> import polars as pl
>>> X = pl.DataFrame({
...     'contact': ['user@test.com', 'https://site.com', '555-1234', 'Hello World', None],
...     'code': ['ABC123', '999', 'test@email', 'XYZ', '']
... })

Example 1: Email and URL detection

>>> transformer = PatternDetector(
...     subset=['contact'],
...     patterns=['is_email', 'is_url', 'is_phone']
... )
>>> result = transformer.fit_transform(X)
>>> print(result)
shape: (5, 5)
┌────────────────────┬─────────┬───────────────────┬──────────────────┬───────────────────┐
│ contact            ┆ code    ┆ contact__is_email ┆ contact__is_url  ┆ contact__is_phone │
│ ---                ┆ ---     ┆ ---               ┆ ---              ┆ ---               │
│ str                ┆ str     ┆ bool              ┆ bool             ┆ bool              │
├────────────────────┼─────────┼───────────────────┼──────────────────┼───────────────────┤
│ user@test.com      ┆ ABC123  ┆ true              ┆ false            ┆ false             │
│ https://site.com   ┆ 999     ┆ false             ┆ true             ┆ false             │
│ 555-1234           ┆ test@e… ┆ false             ┆ false            ┆ true              │
│ Hello World        ┆ XYZ     ┆ false             ┆ false            ┆ false             │
│ null               ┆         ┆ false             ┆ false            ┆ false             │
└────────────────────┴─────────┴───────────────────┴──────────────────┴───────────────────┘

Example 2: Numeric and alphanumeric detection

>>> transformer = PatternDetector(
...     subset=['code'],
...     patterns=['is_numeric', 'is_alphanumeric', 'is_alpha']
... )
>>> result = transformer.fit_transform(X)

Example 3: URL component detection

>>> transformer = PatternDetector(
...     subset=['contact'],
...     patterns=['has_http', 'has_www', 'has_at'],
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
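These checks are regular-expression matches at heart; a rough sketch of the idea — the patterns below are illustrative assumptions, not the transformer's actual regexes:

```python
import re

# Illustrative patterns only; the transformer's real regexes may differ.
PATTERNS = {
    "is_email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "is_url": re.compile(r"^https?://\S+$"),
    "is_numeric": re.compile(r"^-?\d+(\.\d+)?$"),
}

def detect(value, name):
    """False for nulls; otherwise whether the value matches the named pattern."""
    return value is not None and bool(PATTERNS[name].match(value))

print(detect("user@test.com", "is_email"))   # True
print(detect("https://site.com", "is_url"))  # True
print(detect(None, "is_email"))              # False
```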
fit(X, y=None)[source]#

Fit the transformer by identifying string columns if not specified.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

PatternDetector

transform(X)[source]#

Transform the input DataFrame by creating pattern detection features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame with pattern detection features.

Return type:

DataFrame

class gators.feature_generation_str.Split[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates features by splitting columns into multiple columns based on a delimiter.

The number of split columns is specified by the max_splits parameter to ensure consistent column structure between training and new data.

Parameters:
  • subset (List[str]) – List of column names to split.

  • by (str) – Delimiter to split the columns by.

  • max_splits (PositiveInt) – Maximum number of split columns to create. If a value has more splits, extra splits are truncated. If fewer, remaining columns are filled with empty strings.

  • drop_columns (bool, optional) – Whether to drop the original columns after splitting, by default True.

Examples

>>> from gators.feature_generation_str import Split
>>> import polars as pl
>>> X = {'full_name': ['John Doe', 'Jane Smith Williams', 'Alice Johnson']}
>>> X = pl.DataFrame(X)

Example 1: Split with max_splits=3 and drop_columns=True (default)

>>> transformer = Split(subset=['full_name'], by=' ', max_splits=3, drop_columns=True)
>>> transformer.fit(X)
Split(subset=['full_name'], by=' ', max_splits=3, drop_columns=True)
>>> result = transformer.transform(X)
>>> result
shape: (3, 3)
┌─────────────────────┬─────────────────────┬─────────────────────┐
│ full_name__split_0  │ full_name__split_1  │ full_name__split_2  │
│         str         │         str         │         str         │
├─────────────────────┼─────────────────────┼─────────────────────┤
│        John         │         Doe         │                     │
│        Jane         │        Smith        │      Williams       │
│       Alice         │       Johnson       │                     │
└─────────────────────┴─────────────────────┴─────────────────────┘

Example 2: Split with max_splits=2 and drop_columns=False

>>> transformer = Split(subset=['full_name'], by=' ', max_splits=2, drop_columns=False)
>>> transformer.fit(X)
Split(subset=['full_name'], by=' ', max_splits=2, drop_columns=False)
>>> result = transformer.transform(X)
>>> result
shape: (3, 3)
┌──────────────────────┬─────────────────────┬─────────────────────┐
│      full_name       │ full_name__split_0  │ full_name__split_1  │
│         str          │         str         │         str         │
├──────────────────────┼─────────────────────┼─────────────────────┤
│      John Doe        │        John         │         Doe         │
│  Jane Smith Williams │        Jane         │        Smith        │
│   Alice Johnson      │       Alice         │       Johnson       │
└──────────────────────┴─────────────────────┴─────────────────────┘
fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Split

transform(X)[source]#

Transform the input DataFrame by splitting columns.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.SplitExtract[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates features by splitting columns and extracting a specific part by index.

Parameters:
  • subset (List[str]) – List of column names to split and extract from.

  • by (str) – Delimiter to split the columns by.

  • n (NonNegativeInt) – Index of the part to extract from the split (0-indexed).

  • drop_columns (bool, optional) – Whether to drop the original columns after splitting, by default True.
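The extraction logic can be sketched per value (a hypothetical helper for illustration; in particular, returning '' when the index is out of range is an assumption, not documented gators behavior):

```python
def split_extract(value, by, n):
    """Return the n-th part of the split (0-indexed), or '' when the
    index is out of range (out-of-range behavior is an assumption)."""
    parts = value.split(by)
    return parts[n] if n < len(parts) else ""

split_extract("Jane Smith", " ", 1)  # 'Smith'
split_extract("Alice", " ", 1)       # ''
```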

Examples

>>> from gators.feature_generation_str import SplitExtract
>>> import polars as pl
>>> X = {'full_name': ['John Doe', 'Jane Smith', 'Alice Johnson']}
>>> X = pl.DataFrame(X)

Example 1: Extract first part (n=0)

>>> transformer = SplitExtract(subset=['full_name'], by=' ', n=0, drop_columns=True)
>>> transformer.fit(X)
SplitExtract(subset=['full_name'], by=' ', n=0, drop_columns=True)
>>> result = transformer.transform(X)
>>> result
shape: (3, 1)
┌────────────────────┐
│ full_name__split_0 │
│         str        │
├────────────────────┤
│        John        │
│        Jane        │
│       Alice        │
└────────────────────┘

Example 2: Extract second part (n=1)

>>> transformer = SplitExtract(subset=['full_name'], by=' ', n=1, drop_columns=False)
>>> transformer.fit(X)
SplitExtract(subset=['full_name'], by=' ', n=1, drop_columns=False)
>>> result = transformer.transform(X)
>>> result
shape: (3, 2)
┌──────────────────┬────────────────────┐
│    full_name     │ full_name__split_1 │
│       str        │         str        │
├──────────────────┼────────────────────┤
│     John Doe     │        Doe         │
│    Jane Smith    │       Smith        │
│   Alice Johnson  │      Johnson       │
└──────────────────┴────────────────────┘
fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

SplitExtract

transform(X)[source]#

Transform the input DataFrame by extracting specified components.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.Startswith[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Generates Boolean features to indicate if strings in the original columns start with specified substrings.
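The per-value logic behind these Boolean features can be sketched as follows (`startswith_flags` is a hypothetical helper; null propagation shown here matches the example below):

```python
def startswith_flags(value, prefixes):
    """One Boolean flag per prefix; None inputs yield None flags so that
    missing values stay distinguishable from non-matches."""
    if value is None:
        return {f"startswith_{p}": None for p in prefixes}
    return {f"startswith_{p}": value.startswith(p) for p in prefixes}

startswith_flags("pre1_sample", ["pre1", "pre2"])
# {'startswith_pre1': True, 'startswith_pre2': False}
```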

Examples

Create an instance of the Startswith class:

>>> import polars as pl
>>> from gators.feature_generation_str import Startswith
>>> startswith_dict = {'col1': ['pre1', 'pre2'], 'col2': ['pre3']}
>>> transformer = Startswith(startswith_dict=startswith_dict)

Fit the transformer:

>>> X = pl.DataFrame({"col1": ["pre1_sample", None, "pre2_sample"],
...                   "col2": [None, "pre3_sample", "no_match"]})
>>> transformer.fit(X)

Transform the DataFrame:

>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (3, 5)
┌─────────────┬─────────────┬───────────────────────┬───────────────────────┬───────────────────────┐
│ col1        ┆ col2        ┆ col1__startswith_pre1 ┆ col1__startswith_pre2 ┆ col2__startswith_pre3 │
│ ---         ┆ ---         ┆ ---                   ┆ ---                   ┆ ---                   │
│ str         ┆ str         ┆ bool                  ┆ bool                  ┆ bool                  │
├─────────────┼─────────────┼───────────────────────┼───────────────────────┼───────────────────────┤
│ pre1_sample ┆ null        ┆ true                  ┆ false                 ┆ null                  │
│ null        ┆ pre3_sample ┆ null                  ┆ null                  ┆ true                  │
│ pre2_sample ┆ no_match    ┆ false                 ┆ true                  ┆ false                 │
└─────────────┴─────────────┴───────────────────────┴───────────────────────┴───────────────────────┘
fit(X, y=None)[source]#

Fit the transformer (no-op, but required for sklearn compatibility).

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Startswith

transform(X)[source]#

Transform the input DataFrame by creating startswith Boolean features.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame

class gators.feature_generation_str.Upper[source]#

Bases: BaseModel, BaseEstimator, TransformerMixin

Converts string and Boolean columns to uppercase.

Parameters:
  • subset (Optional[List[str]], default=None) – List of columns to convert to uppercase.

  • drop_columns (bool, default=True) – Whether to drop original columns after transformation.
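The column-naming and drop behavior can be sketched row-wise in plain Python (a simplified stand-in for the columnar implementation; `upper_features` is a hypothetical helper):

```python
def upper_features(row, subset, drop_columns=True):
    """Add a '<col>__upper' entry for each selected column; keep the
    originals only when drop_columns is False. Non-string values pass
    through unchanged."""
    out = {} if drop_columns else dict(row)
    for col in subset:
        val = row[col]
        out[f"{col}__upper"] = val.upper() if isinstance(val, str) else val
    return out

upper_features({"A": "cat", "B": "yes"}, subset=["B"], drop_columns=False)
# {'A': 'cat', 'B': 'yes', 'B__upper': 'YES'}
```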

Examples

>>> import polars as pl
>>> from gators.feature_generation_str import Upper
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['yes', 'no', 'yes', 'no', 'yes'],
...     'C': ['quick', 'brown', 'fox', 'jumps', 'over']
... })
>>> # Transform to uppercase with default parameters (drop_columns=True)
>>> encoder = Upper()
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌──────────┬──────────┬──────────┐
│ A__upper │ B__upper │ C__upper │
│ ---      │ ---      │ ---      │
│ str      │ str      │ str      │
├──────────┼──────────┼──────────┤
│ CAT      │ YES      │ QUICK    │
│ DOG      │ NO       │ BROWN    │
│ CAT      │ YES      │ FOX      │
│ DOG      │ NO       │ JUMPS    │
│ CAT      │ YES      │ OVER     │
└──────────┴──────────┴──────────┘
>>> # Transform to uppercase with drop_columns=False
>>> encoder = Upper(drop_columns=False)
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 6)
┌─────┬──────┬───────┬──────────┬──────────┬──────────┐
│ A   │ B    │ C     │ A__upper │ B__upper │ C__upper │
│ --- │ ---  │ ---   │ ---      │ ---      │ ---      │
│ str │ str  │ str   │ str      │ str      │ str      │
├─────┼──────┼───────┼──────────┼──────────┼──────────┤
│ cat │ yes  │ quick │ CAT      │ YES      │ QUICK    │
│ dog │ no   │ brown │ DOG      │ NO       │ BROWN    │
│ cat │ yes  │ fox   │ CAT      │ YES      │ FOX      │
│ dog │ no   │ jumps │ DOG      │ NO       │ JUMPS    │
│ cat │ yes  │ over  │ CAT      │ YES      │ OVER     │
└─────┴──────┴───────┴──────────┴──────────┴──────────┘
>>> # Transform to uppercase with columns as a subset
>>> encoder = Upper(subset=['B'])
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬──────┬───────┬──────────┐
│ A   │ B    │ C     │ B__upper │
│ --- │ ---  │ ---   │ ---      │
│ str │ str  │ str   │ str      │
├─────┼──────┼───────┼──────────┤
│ cat │ yes  │ quick │ YES      │
│ dog │ no   │ brown │ NO       │
│ cat │ yes  │ fox   │ YES      │
│ dog │ no   │ jumps │ NO       │
│ cat │ yes  │ over  │ YES      │
└─────┴──────┴───────┴──────────┘
fit(X, y=None)[source]#

Fit the transformer by identifying the string and Boolean columns to convert if no subset is specified.

Parameters:
  • X (DataFrame) – Input DataFrame.

  • y (Series | None) – Target variable. Not used, present here for compatibility.

Returns:

Fitted transformer instance.

Return type:

Upper

transform(X)[source]#

Transform the input DataFrame by converting the selected columns to uppercase.

Parameters:

X (DataFrame) – Input DataFrame to transform.

Returns:

Transformed DataFrame.

Return type:

DataFrame