gators.feature_generation_str package
Module contents
- class gators.feature_generation_str.CharacterStatistics
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates character-level statistical features from string columns.
Counts various character types (digits, letters, spaces, etc.) which are particularly useful for tree-based models to identify patterns in text data.
- Parameters:
subset (Optional[List[str]], default=None) – List of string columns to extract features from. If None, all string columns will be used.
features (List[str], default=["n_digits", "n_letters", "n_uppercase", "n_lowercase", "n_spaces", "n_special"]) –
Character statistics to generate. Options:
"n_digits": Count of digit characters (0-9)
"n_letters": Count of alphabetic characters (a-z, A-Z)
"n_uppercase": Count of uppercase letters
"n_lowercase": Count of lowercase letters
"n_spaces": Count of space characters
"n_special": Count of special characters (punctuation, symbols)
"n_unique_chars": Count of unique characters
"ratio_uppercase": Ratio of uppercase letters to total letters
"ratio_digits": Ratio of digits to total length
"ratio_special": Ratio of special characters to total length
drop_columns (bool, default=False) – Whether to drop the original string columns after feature extraction.
Examples
>>> from gators.feature_generation_str import CharacterStatistics
>>> import polars as pl
>>> X = pl.DataFrame({
...     'text': ['Hello123', 'WORLD!!!', 'Test 99', ''],
...     'email': ['user@test.com', 'ADMIN@SITE.ORG', 'test', None]
... })
Example 1: Basic character counts
>>> transformer = CharacterStatistics(
...     subset=['text'],
...     features=['n_digits', 'n_letters', 'n_uppercase']
... )
>>> result = transformer.fit_transform(X)
>>> print(result)
shape: (4, 5)
┌──────────┬────────────────┬───────────────┬─────────────────┬───────────────────┐
│ text     ┆ email          ┆ text__n_digit ┆ text__n_letters ┆ text__n_uppercase │
│ ---      ┆ ---            ┆ s             ┆ ---             ┆ ---               │
│ str      ┆ str            ┆ ---           ┆ i64             ┆ i64               │
│          ┆                ┆ i64           ┆                 ┆                   │
├──────────┼────────────────┼───────────────┼─────────────────┼───────────────────┤
│ Hello123 ┆ user@test.com  ┆ 3             ┆ 5               ┆ 1                 │
│ WORLD!!! ┆ ADMIN@SITE.ORG ┆ 0             ┆ 5               ┆ 5                 │
│ Test 99  ┆ test           ┆ 2             ┆ 4               ┆ 1                 │
│          ┆ null           ┆ 0             ┆ 0               ┆ 0                 │
└──────────┴────────────────┴───────────────┴─────────────────┴───────────────────┘
Example 2: Ratio features
>>> transformer = CharacterStatistics(
...     subset=['text', 'email'],
...     features=['ratio_uppercase', 'ratio_digits', 'ratio_special']
... )
>>> result = transformer.fit_transform(X)
Example 3: All features with drop_columns
>>> transformer = CharacterStatistics(
...     subset=['text'],
...     features=['n_digits', 'n_letters', 'n_spaces', 'n_special', 'n_unique_chars'],
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
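The counts in Example 1 can be reproduced without the library. The following is a minimal pure-Python sketch, not the gators implementation: the helper name `character_counts` is hypothetical, and treating a null value as an empty string is an assumption.

```python
import string


def character_counts(value):
    """Count ASCII digits, letters and uppercase letters in one string.

    Hypothetical helper; None is treated as an empty string (assumption).
    """
    text = value or ""
    return {
        "n_digits": sum(ch in string.digits for ch in text),
        "n_letters": sum(ch in string.ascii_letters for ch in text),
        "n_uppercase": sum(ch in string.ascii_uppercase for ch in text),
    }


rows = ["Hello123", "WORLD!!!", "Test 99", ""]
stats = [character_counts(row) for row in rows]
for row, stat in zip(rows, stats):
    print(repr(row), stat)
```

For 'Hello123' this yields 3 digits, 5 letters and 1 uppercase letter, which agrees with the first row of the table in Example 1.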
- class gators.feature_generation_str.CombineFeatures
Bases: BaseModel, BaseEstimator, TransformerMixin
Combines specific string/categorical columns to create composite key features (UID-like).
Unlike InteractionFeatures which generates all combinations, this transformer only creates the specific column combinations you provide, making it more efficient for creating unique identifiers or specific interaction features.
- Parameters:
column_groups (List[List[str]]) – List of column groups to combine. Each group is a list of column names that will be concatenated together. Example: [['cat1', 'cat2'], ['cat1', 'addr1']]
separator (str, default='_') – String to use as separator when combining column values.
drop_columns (bool, default=False) – Whether to drop the original columns after creating combinations.
new_column_names (Optional[List[str]], default=None) – List of custom names for the combined columns. If None, uses the default naming pattern where column names are joined with '__' (e.g., 'cat1__cat2'). Must have the same length as column_groups.
Examples
>>> from gators.feature_generation_str import CombineFeatures
>>> import polars as pl
>>> X = {
...     'cat1': ['A', 'A', 'B', 'B', 'A'],
...     'cat2': ['X', 'Y', 'X', 'Y', 'X'],
...     'addr1': ['US', 'US', 'UK', 'UK', 'CA'],
...     'amount': [100, 200, 150, 300, 250]
... }
>>> X = pl.DataFrame(X)
Example 1: Basic combination
>>> transformer = CombineFeatures(
...     column_groups=[['cat1', 'cat2']]
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (5, 5)
┌──────┬──────┬───────┬────────┬────────────┐
│ cat1 ┆ cat2 ┆ addr1 ┆ amount ┆ cat1__cat2 │
│ ---  ┆ ---  ┆ ---   ┆ ---    ┆ ---        │
│ str  ┆ str  ┆ str   ┆ i64    ┆ str        │
╞══════╪══════╪═══════╪════════╪════════════╡
│ A    ┆ X    ┆ US    ┆ 100    ┆ A_X        │
│ A    ┆ Y    ┆ US    ┆ 200    ┆ A_Y        │
│ B    ┆ X    ┆ UK    ┆ 150    ┆ B_X        │
│ B    ┆ Y    ┆ UK    ┆ 300    ┆ B_Y        │
│ A    ┆ X    ┆ CA    ┆ 250    ┆ A_X        │
└──────┴──────┴───────┴────────┴────────────┘
Example 2: Multiple combinations
>>> transformer = CombineFeatures(
...     column_groups=[['cat1', 'cat2'], ['cat1', 'addr1']]
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (5, 6)
┌──────┬──────┬───────┬────────┬────────────┬─────────────┐
│ cat1 ┆ cat2 ┆ addr1 ┆ amount ┆ cat1__cat2 ┆ cat1__addr1 │
│ ---  ┆ ---  ┆ ---   ┆ ---    ┆ ---        ┆ ---         │
│ str  ┆ str  ┆ str   ┆ i64    ┆ str        ┆ str         │
╞══════╪══════╪═══════╪════════╪════════════╪═════════════╡
│ A    ┆ X    ┆ US    ┆ 100    ┆ A_X        ┆ A_US        │
│ A    ┆ Y    ┆ US    ┆ 200    ┆ A_Y        ┆ A_US        │
│ B    ┆ X    ┆ UK    ┆ 150    ┆ B_X        ┆ B_UK        │
│ B    ┆ Y    ┆ UK    ┆ 300    ┆ B_Y        ┆ B_UK        │
│ A    ┆ X    ┆ CA    ┆ 250    ┆ A_X        ┆ A_CA        │
└──────┴──────┴───────┴────────┴────────────┴─────────────┘
Example 3: Custom separator and column names
>>> transformer = CombineFeatures(
...     column_groups=[['cat1', 'cat2', 'addr1']],
...     separator='|',
...     new_column_names=['uid']
... )
>>> result = transformer.fit_transform(X)
>>> result
shape: (5, 5)
┌──────┬──────┬───────┬────────┬────────┐
│ cat1 ┆ cat2 ┆ addr1 ┆ amount ┆ uid    │
│ ---  ┆ ---  ┆ ---   ┆ ---    ┆ ---    │
│ str  ┆ str  ┆ str   ┆ i64    ┆ str    │
╞══════╪══════╪═══════╪════════╪════════╡
│ A    ┆ X    ┆ US    ┆ 100    ┆ A|X|US │
│ A    ┆ Y    ┆ US    ┆ 200    ┆ A|Y|US │
│ B    ┆ X    ┆ UK    ┆ 150    ┆ B|X|UK │
│ B    ┆ Y    ┆ UK    ┆ 300    ┆ B|Y|UK │
│ A    ┆ X    ┆ CA    ┆ 250    ┆ A|X|CA │
└──────┴──────┴───────┴────────┴────────┘
Example 4: Creating UIDs for fraud detection
>>> # Create composite keys for unique user identification
>>> transformer = CombineFeatures(
...     column_groups=[
...         ['cat1', 'cat2', 'card3'],  # Card combination
...         ['cat1', 'addr1'],  # Card + address
...         ['email_domain', 'cat1']  # Email + card
...     ],
...     new_column_names=['card_uid', 'card_addr_uid', 'email_card_uid']
... )
>>> # Can then use these UIDs for frequency encoding or groupby operations
Notes
Null values are converted to the string "null" before concatenation
All column values are cast to string before combining
Useful for creating unique identifiers (UIDs) for user/card tracking
More efficient than InteractionFeatures when you know exactly which combinations you need
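The first two notes (nulls become the string "null", every value is cast to string) can be illustrated with a small pure-Python sketch. It is only an illustration of the documented behaviour, not the library code; the helper name `combine` and the sample rows are hypothetical.

```python
def combine(row, columns, separator="_"):
    """Join the values of `columns` in one row into a composite key.

    Hypothetical helper: None becomes the string "null" and every other
    value is cast to str before joining, as described in the notes.
    """
    parts = ["null" if row[col] is None else str(row[col]) for col in columns]
    return separator.join(parts)


rows = [
    {"cat1": "A", "cat2": "X", "addr1": "US", "amount": 100},
    {"cat1": "B", "cat2": None, "addr1": "UK", "amount": 150},
]
uids = [combine(row, ["cat1", "cat2", "addr1"], separator="|") for row in rows]
mixed = combine(rows[0], ["cat1", "amount"])
print(uids, mixed)
```

The second row shows the null handling: the missing cat2 value appears as the literal text null inside the key.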
- class gators.feature_generation_str.Contains
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates Boolean columns indicating whether the given substrings occur in the original column values.
Examples
Create an instance of the Contains class:
>>> import polars as pl
>>> from gators.feature_generation_str import Contains
>>> contains_dict = {'col1': ['sub1', 'sub2'], 'col2': ['sub3']}
>>> transformer = Contains(contains_dict=contains_dict)
Fit the transformer:
>>> X = pl.DataFrame({"col1": ["sub1 here", None, "sub2 here"],
...                   "col2": [None, "sub3 inside", "no match"]})
>>> transformer.fit(X)
Transform the dataframe:
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (3, 5)
╭───────────┬─────────────┬────────────┬────────────┬────────────╮
│ col1      │ col2        │ col1__sub1 │ col1__sub2 │ col2__sub3 │
├───────────┼─────────────┼────────────┼────────────┼────────────┤
│ sub1 here │ None        │ True       │ False      │ None       │
├───────────┼─────────────┼────────────┼────────────┼────────────┤
│ None      │ sub3 inside │ None       │ None       │ True       │
├───────────┼─────────────┼────────────┼────────────┼────────────┤
│ sub2 here │ no match    │ False      │ True       │ False      │
╰───────────┴─────────────┴────────────┴────────────┴────────────╯
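The table above shows null inputs producing null flags rather than False. A minimal pure-Python sketch of that rule follows; the helper name `contains_flags` is hypothetical and the code only mirrors the behaviour visible in the example, not the library internals.

```python
def contains_flags(value, substrings):
    """Return one flag per substring; a null input propagates as None.

    Hypothetical helper mirroring the example table above.
    """
    if value is None:
        return {sub: None for sub in substrings}
    return {sub: sub in value for sub in substrings}


col1 = ["sub1 here", None, "sub2 here"]
flags = [contains_flags(value, ["sub1", "sub2"]) for value in col1]
for value, flag in zip(col1, flags):
    print(repr(value), flag)
```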
- class gators.feature_generation_str.Endswith
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates Boolean features indicating whether the original column values end with the given substrings.
Examples
Create an instance of the Endswith class:
>>> import polars as pl
>>> from gators.feature_generation_str import Endswith
>>> endswith_dict = {'col1': ['end1', 'end2'], 'col2': ['end3']}
>>> transformer = Endswith(endswith_dict=endswith_dict)
Fit the transformer:
>>> X = pl.DataFrame({"col1": ["this end1", None, "that end2"],
...                   "col2": [None, "one end3", "another no end"]})
>>> transformer.fit(X)
Transform the dataframe:
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (3, 5)
╭───────────┬────────────────┬─────────────────────┬─────────────────────┬─────────────────────╮
│ col1      │ col2           │ col1__endswith_end1 │ col1__endswith_end2 │ col2__endswith_end3 │
├───────────┼────────────────┼─────────────────────┼─────────────────────┼─────────────────────┤
│ this end1 │ None           │ True                │ False               │ None                │
├───────────┼────────────────┼─────────────────────┼─────────────────────┼─────────────────────┤
│ None      │ one end3       │ None                │ None                │ True                │
├───────────┼────────────────┼─────────────────────┼─────────────────────┼─────────────────────┤
│ that end2 │ another no end │ False               │ True                │ False               │
╰───────────┴────────────────┴─────────────────────┴─────────────────────┴─────────────────────╯
- class gators.feature_generation_str.ExtractSubstring
Bases: BaseModel, BaseEstimator, TransformerMixin
- class gators.feature_generation_str.InteractionFeatures
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates interaction features for categorical variables.
- Parameters:
subset (Optional[List[str]], default=None) – List of columns to consider for interaction.
degree (conint(gt=1), default=2) – Degree of interaction terms.
Examples
>>> import polars as pl
>>> from gators.feature_generation_str import InteractionFeatures
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['x', 'x', 'y', 'y', 'x'],
...     'C': ['red', 'blue', 'green', 'blue', 'red']
... })
>>> # Interaction with default parameters (degree=2)
>>> interaction_features = InteractionFeatures()
>>> interaction_features.fit(X)
>>> transformed_X = interaction_features.transform(X)
>>> print(transformed_X)
shape: (5, 6)
┌─────┬─────┬───────┬───────┬───────────┬─────────┐
│ A   │ B   │ C     │ A__B  │ A__C      │ B__C    │
│ --- │ --- │ ---   │ ---   │ ---       │ ---     │
│ str │ str │ str   │ str   │ str       │ str     │
├─────┼─────┼───────┼───────┼───────────┼─────────┤
│ cat │ x   │ red   │ cat_x │ cat_red   │ x_red   │
│ dog │ x   │ blue  │ dog_x │ dog_blue  │ x_blue  │
│ cat │ y   │ green │ cat_y │ cat_green │ y_green │
│ dog │ y   │ blue  │ dog_y │ dog_blue  │ y_blue  │
│ cat │ x   │ red   │ cat_x │ cat_red   │ x_red   │
└─────┴─────┴───────┴───────┴───────────┴─────────┘
>>> # Interaction with degree=3
>>> interaction_features = InteractionFeatures(degree=3)
>>> interaction_features.fit(X)
>>> transformed_X = interaction_features.transform(X)
>>> print(transformed_X)
shape: (5, 7)
┌─────┬─────┬───────┬───────┬───────────┬─────────┬─────────────┐
│ A   │ B   │ C     │ A__B  │ A__C      │ B__C    │ A__B__C     │
│ --- │ --- │ ---   │ ---   │ ---       │ ---     │ ---         │
│ str │ str │ str   │ str   │ str       │ str     │ str         │
├─────┼─────┼───────┼───────┼───────────┼─────────┼─────────────┤
│ cat │ x   │ red   │ cat_x │ cat_red   │ x_red   │ cat_x_red   │
│ dog │ x   │ blue  │ dog_x │ dog_blue  │ x_blue  │ dog_x_blue  │
│ cat │ y   │ green │ cat_y │ cat_green │ y_green │ cat_y_green │
│ dog │ y   │ blue  │ dog_y │ dog_blue  │ y_blue  │ dog_y_blue  │
│ cat │ x   │ red   │ cat_x │ cat_red   │ x_red   │ cat_x_red   │
└─────┴─────┴───────┴───────┴───────────┴─────────┴─────────────┘
>>> # Interaction restricted to a subset of columns
>>> interaction_features = InteractionFeatures(subset=['A', 'B'])
>>> interaction_features.fit(X)
>>> transformed_X = interaction_features.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬─────┬───────┬───────┐
│ A   │ B   │ C     │ A__B  │
│ --- │ --- │ ---   │ ---   │
│ str │ str │ str   │ str   │
├─────┼─────┼───────┼───────┤
│ cat │ x   │ red   │ cat_x │
│ dog │ x   │ blue  │ dog_x │
│ cat │ y   │ green │ cat_y │
│ dog │ y   │ blue  │ dog_y │
│ cat │ x   │ red   │ cat_x │
└─────┴─────┴───────┴───────┘
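The column set in the degree=3 example (all pairs followed by the triple) is what `itertools.combinations` yields for sizes 2 through the degree. The sketch below is an assumption about how such names could be enumerated, not the library's code; `interaction_names` and `interaction_value` are hypothetical helpers.

```python
from itertools import combinations


def interaction_names(columns, degree=2):
    """List interaction column names for group sizes 2..degree (hypothetical helper)."""
    names = []
    for size in range(2, degree + 1):
        for group in combinations(columns, size):
            names.append("__".join(group))
    return names


def interaction_value(row, group, separator="_"):
    """Concatenate one row's values for the columns in `group`."""
    return separator.join(row[col] for col in group)


names = interaction_names(["A", "B", "C"], degree=3)
row = {"A": "cat", "B": "y", "C": "green"}
value = interaction_value(row, ["A", "B", "C"])
print(names, value)
```

Because the number of combinations grows quickly with the number of columns and the degree, restricting `subset` keeps the output manageable.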
- class gators.feature_generation_str.Length
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates features containing the string length of each value.
- Parameters:
subset (Optional[List[str]], default=None) – List of columns to calculate length for.
Examples
>>> import polars as pl
>>> from gators.feature_generation_str import Length
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['yes', 'no', 'yes', 'no', 'yes'],
...     'C': ['quick', 'brown', 'fox', 'jumps', 'over']
... })
>>> # Calculate lengths with default parameters
>>> encoder = Length()
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 6)
┌─────┬─────┬───────┬───────────┬───────────┬───────────┐
│ A   │ B   │ C     │ A__length │ B__length │ C__length │
│ --- │ --- │ ---   │ ---       │ ---       │ ---       │
│ str │ str │ str   │ i64       │ i64       │ i64       │
├─────┼─────┼───────┼───────────┼───────────┼───────────┤
│ cat │ yes │ quick │ 3         │ 3         │ 5         │
│ dog │ no  │ brown │ 3         │ 2         │ 5         │
│ cat │ yes │ fox   │ 3         │ 3         │ 3         │
│ dog │ no  │ jumps │ 3         │ 2         │ 5         │
│ cat │ yes │ over  │ 3         │ 3         │ 4         │
└─────┴─────┴───────┴───────────┴───────────┴───────────┘
>>> # Calculate lengths for a subset of columns
>>> encoder = Length(subset=['B'])
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬─────┬───────┬───────────┐
│ A   │ B   │ C     │ B__length │
│ --- │ --- │ ---   │ ---       │
│ str │ str │ str   │ i64       │
├─────┼─────┼───────┼───────────┤
│ cat │ yes │ quick │ 3         │
│ dog │ no  │ brown │ 2         │
│ cat │ yes │ fox   │ 3         │
│ dog │ no  │ jumps │ 2         │
│ cat │ yes │ over  │ 3         │
└─────┴─────┴───────┴───────────┘
- class gators.feature_generation_str.Lower
Bases: BaseModel, BaseEstimator, TransformerMixin
Converts string and Boolean columns to lowercase.
Examples
Create an instance of the Lower class:
>>> import polars as pl
>>> from gators.feature_generation_str import Lower
>>> lower = Lower(subset=["col1"], drop_columns=False)
Fit the transformer:
>>> X = pl.DataFrame({"col1": ["Hello", "WORLD"], "col2": [True, False]})
>>> lower.fit(X)
Transform the dataframe:
>>> transformed_X = lower.transform(X)
>>> print(transformed_X)
shape: (2, 3)
┌───────┬───────┬─────────────┐
│ col1  │ col2  │ col1__lower │
├───────┼───────┼─────────────┤
│ Hello │ True  │ hello       │
│ WORLD │ False │ world       │
└───────┴───────┴─────────────┘
If drop_columns is True, the original columns are dropped:
>>> lower.drop_columns = True
>>> transformed_X = lower.transform(X)
>>> print(transformed_X)
shape: (2, 1)
┌─────────────┐
│ col1__lower │
├─────────────┤
│ hello       │
│ world       │
└─────────────┘
- class gators.feature_generation_str.NGram
Bases: BaseModel, BaseEstimator, TransformerMixin
Extracts character or word n-grams from string columns.
Creates count features for the most common n-grams, useful for tree-based models to capture local text patterns and sequences.
- Parameters:
subset (Optional[List[str]], default=None) – List of string columns to extract n-grams from. If None, all string columns will be used.
n (int, default=2) – Size of n-grams to extract (e.g., 2 for bigrams, 3 for trigrams).
ngram_type (str, default="char") –
Type of n-grams to extract:
"char": Character-level n-grams (e.g., "ab", "bc" from "abc")
"word": Word-level n-grams (e.g., "hello world" as a bigram)
max_features (int, default=10) – Maximum number of most common n-grams to extract per column. Top-k n-grams are selected during fit() based on frequency.
min_count (int, default=1) – Minimum number of occurrences for an n-gram to be included.
drop_columns (bool, default=False) – Whether to drop the original string columns after feature extraction.
Examples
>>> from gators.feature_generation_str import NGram
>>> import polars as pl
>>> X = pl.DataFrame({
...     'text': ['hello world', 'hello there', 'world peace', None],
...     'desc': ['test data', 'test case', 'data case', '']
... })
Example 1: Character bigrams
>>> transformer = NGram(
...     subset=['text'],
...     n=2,
...     ngram_type='char',
...     max_features=5
... )
>>> result = transformer.fit_transform(X)
>>> print(result.columns)
['text', 'desc', 'text__ng_he', 'text__ng_el', 'text__ng_ll', 'text__ng_lo', 'text__ng_o_']
# Features for top-5 character bigrams: 'he', 'el', 'll', 'lo', 'o '
Example 2: Word bigrams
>>> transformer = NGram(
...     subset=['text'],
...     n=2,
...     ngram_type='word',
...     max_features=3
... )
>>> result = transformer.fit_transform(X)
# Features for word bigrams: 'hello world', 'hello there', 'world peace'
Example 3: Character trigrams with min_count
>>> transformer = NGram(
...     subset=['desc'],
...     n=3,
...     ngram_type='char',
...     max_features=5,
...     min_count=2
... )
>>> result = transformer.fit_transform(X)
# Only trigrams appearing at least 2 times are considered
Example 4: Multiple columns with drop
>>> transformer = NGram(
...     subset=['text', 'desc'],
...     n=2,
...     ngram_type='char',
...     max_features=10,
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
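The fit/transform split described for max_features (select the top-k n-grams during fit, then count them per row) can be sketched with `collections.Counter`. This is a hedged illustration, not the gators implementation: the helper names are hypothetical, and counting total occurrences across the column, skipping null or empty values, and breaking ties by first appearance are all assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def top_ngrams(values, n=2, max_features=10, min_count=1):
    """Pick the most frequent n-grams over a column (the 'fit' step).

    Assumptions: null/empty values are skipped, counts are total
    occurrences, and ties keep first-seen order (stable sort).
    """
    counts = Counter()
    for value in values:
        if value:
            counts.update(char_ngrams(value, n))
    kept = [(gram, count) for gram, count in counts.items() if count >= min_count]
    kept.sort(key=lambda item: -item[1])
    return [gram for gram, _ in kept[:max_features]]


def ngram_counts(value, vocabulary, n=2):
    """Count each vocabulary n-gram in one value (the 'transform' step)."""
    counts = Counter(char_ngrams(value or "", n))
    return [counts[gram] for gram in vocabulary]


texts = ["hello world", "hello there", "world peace", None]
vocabulary = top_ngrams(texts, n=2, max_features=5)
row_counts = ngram_counts("hello there", vocabulary)
print(vocabulary, row_counts)
```

Under these assumptions the selected vocabulary for the 'text' column is 'he', 'el', 'll', 'lo', 'o ', the same bigrams named in Example 1.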
- class gators.feature_generation_str.Occurrences
Bases: BaseModel, BaseEstimator, TransformerMixin
Counts occurrences of specific substrings or characters in string columns.
Creates count features showing how many times a substring appears, useful for tree-based models to split on frequency patterns.
- Parameters:
subset (Optional[List[str]], default=None) – List of string columns to extract features from. If None, all string columns will be used.
substrings (Dict[str, List[str]]) – Dictionary mapping column names to lists of substrings to count. For example, {"description": ["error", "warning", "success"]} creates 3 count features for the "description" column.
case_sensitive (bool, default=False) – Whether substring matching should be case sensitive.
drop_columns (bool, default=False) – Whether to drop the original string columns after feature extraction.
Examples
>>> from gators.feature_generation_str import Occurrences
>>> import polars as pl
>>> X = pl.DataFrame({
...     'log': ['Error: invalid input', 'Success: completed', 'Error: timeout Error', None],
...     'tags': ['#python #ml #data', '#python #java', '#ml #python', '']
... })
Example 1: Count specific keywords
>>> transformer = Occurrences(
...     subset=['log'],
...     substrings={'log': ['error', 'success', 'timeout']}
... )
>>> result = transformer.fit_transform(X)
>>> print(result)
shape: (4, 5)
┌──────────────────────┬───────────────────┬────────────┬──────────────┬──────────────┐
│ log                  ┆ tags              ┆ log__error ┆ log__success ┆ log__timeout │
│ ---                  ┆ ---               ┆ ---        ┆ ---          ┆ ---          │
│ str                  ┆ str               ┆ i64        ┆ i64          ┆ i64          │
├──────────────────────┼───────────────────┼────────────┼──────────────┼──────────────┤
│ Error: invalid input ┆ #python #ml #data ┆ 1          ┆ 0            ┆ 0            │
│ Success: completed   ┆ #python #java     ┆ 0          ┆ 1            ┆ 0            │
│ Error: timeout Error ┆ #ml #python       ┆ 2          ┆ 0            ┆ 1            │
│ null                 ┆                   ┆ 0          ┆ 0            ┆ 0            │
└──────────────────────┴───────────────────┴────────────┴──────────────┴──────────────┘
Example 2: Count hashtags (case sensitive)
>>> transformer = Occurrences(
...     subset=['tags'],
...     substrings={'tags': ['#python', '#ml', '#java', '#data']},
...     case_sensitive=True
... )
>>> result = transformer.fit_transform(X)
Example 3: Multiple columns with drop
>>> transformer = Occurrences(
...     subset=['log', 'tags'],
...     substrings={
...         'log': ['error', 'success'],
...         'tags': ['#python', '#ml']
...     },
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
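The effect of case_sensitive on the counts in Example 1 can be seen in a short pure-Python sketch. It assumes non-overlapping matches (the behaviour of `str.count`) and a count of 0 for null values, as the table in Example 1 shows; `count_occurrences` is a hypothetical helper, not part of gators.

```python
def count_occurrences(value, substrings, case_sensitive=False):
    """Count non-overlapping occurrences of each substring in one value.

    Hypothetical helper; a null value yields 0 for every substring.
    """
    if value is None:
        return {sub: 0 for sub in substrings}
    haystack = value if case_sensitive else value.lower()
    return {
        sub: haystack.count(sub if case_sensitive else sub.lower())
        for sub in substrings
    }


keywords = ["error", "success", "timeout"]
insensitive = count_occurrences("Error: timeout Error", keywords)
sensitive = count_occurrences("Error: timeout Error", keywords, case_sensitive=True)
print(insensitive, sensitive)
```

With case-insensitive matching both 'Error' tokens are counted; with case_sensitive=True the lowercase keyword 'error' no longer matches them.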
- class gators.feature_generation_str.PatternDetector
Bases: BaseModel, BaseEstimator, TransformerMixin
Detects common patterns in string columns (emails, URLs, phone numbers, etc.).
Creates boolean features indicating whether strings match common patterns, useful for tree-based models to branch on data format and validity.
- Parameters:
subset (Optional[List[str]], default=None) – List of string columns to extract features from. If None, all string columns will be used.
patterns (List[str], default=["is_numeric", "is_email", "is_url", "is_phone"]) –
Patterns to detect. Options:
"is_numeric": Contains only digits (possibly with decimal/negative)
"is_email": Matches email pattern (basic check)
"is_url": Matches URL pattern (http/https)
"is_phone": Matches phone number pattern
"is_alphanumeric": Contains only letters and digits
"is_alpha": Contains only letters
"has_www": Contains www.
"has_at": Contains @ symbol
drop_columns (bool, default=False) – Whether to drop the original string columns after feature extraction.
Examples
>>> from gators.feature_generation_str import PatternDetector
>>> import polars as pl
>>> X = pl.DataFrame({
...     'contact': ['user@test.com', 'https://site.com', '555-1234', 'Hello World', None],
...     'code': ['ABC123', '999', 'test@email', 'XYZ', '']
... })
Example 1: Email and URL detection
>>> transformer = PatternDetector(
...     subset=['contact'],
...     patterns=['is_email', 'is_url', 'is_phone']
... )
>>> result = transformer.fit_transform(X)
>>> print(result)
shape: (5, 5)
┌──────────────────┬─────────┬───────────────────┬─────────────────┬───────────────────┐
│ contact          ┆ code    ┆ contact__is_email ┆ contact__is_url ┆ contact__is_phone │
│ ---              ┆ ---     ┆ ---               ┆ ---             ┆ ---               │
│ str              ┆ str     ┆ bool              ┆ bool            ┆ bool              │
├──────────────────┼─────────┼───────────────────┼─────────────────┼───────────────────┤
│ user@test.com    ┆ ABC123  ┆ true              ┆ false           ┆ false             │
│ https://site.com ┆ 999     ┆ false             ┆ true            ┆ false             │
│ 555-1234         ┆ test@e… ┆ false             ┆ false           ┆ true              │
│ Hello World      ┆ XYZ     ┆ false             ┆ false           ┆ false             │
│ null             ┆         ┆ false             ┆ false           ┆ false             │
└──────────────────┴─────────┴───────────────────┴─────────────────┴───────────────────┘
Example 2: Numeric and alphanumeric detection
>>> transformer = PatternDetector(
...     subset=['code'],
...     patterns=['is_numeric', 'is_alphanumeric', 'is_alpha']
... )
>>> result = transformer.fit_transform(X)
Example 3: URL component detection
>>> transformer = PatternDetector(
...     subset=['contact'],
...     patterns=['has_http', 'has_www', 'has_at'],
...     drop_columns=True
... )
>>> result = transformer.fit_transform(X)
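The documentation does not state which regular expressions back each pattern. As a rough sketch only, the three flags from Example 1 could be approximated with the `re` module as below; the regexes, the `PATTERNS` mapping and the `detect` helper are all illustrative assumptions chosen to reproduce that table, not the expressions gators uses.

```python
import re

# Illustrative patterns only (assumptions, not the library's regexes).
PATTERNS = {
    "is_email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "is_url": re.compile(r"https?://\S+"),
    "is_phone": re.compile(r"\+?[\d\s().-]{7,}"),
}


def detect(value, names):
    """Return one boolean per pattern name; null values map to False."""
    if value is None:
        return {name: False for name in names}
    return {name: PATTERNS[name].fullmatch(value) is not None for name in names}


contacts = ["user@test.com", "https://site.com", "555-1234", "Hello World", None]
names = ["is_email", "is_url", "is_phone"]
detected = [detect(value, names) for value in contacts]
for value, flags in zip(contacts, detected):
    print(repr(value), flags)
```

Using `fullmatch` means the whole value has to conform to the pattern, which suits flags that describe the format of a field rather than the presence of a fragment.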
- class gators.feature_generation_str.Split
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates features by splitting columns into multiple columns based on a delimiter.
The number of split columns is specified by the max_splits parameter to ensure consistent column structure between training and new data.
- Parameters:
subset (List[str]) – List of column names to split.
by (str) – Delimiter to split the columns by.
max_splits (PositiveInt) – Maximum number of split columns to create. If a value has more splits, extra splits are truncated. If fewer, remaining columns are filled with empty strings.
drop_columns (bool, optional) – Whether to drop the original columns after splitting, by default True.
Examples
>>> from gators.feature_generation_str import Split
>>> import polars as pl
>>> X = {'full_name': ['John Doe', 'Jane Smith Williams', 'Alice Johnson']}
>>> X = pl.DataFrame(X)
Example 1: Split with max_splits=3 and drop_columns=True (default)
>>> transformer = Split(subset=['full_name'], by=' ', max_splits=3, drop_columns=True)
>>> transformer.fit(X)
Split(subset=['full_name'], by=' ', max_splits=3, drop_columns=True)
>>> result = transformer.transform(X)
>>> result
shape: (3, 3)
┌────────────────────┬────────────────────┬────────────────────┐
│ full_name__split_0 │ full_name__split_1 │ full_name__split_2 │
│ str                │ str                │ str                │
├────────────────────┼────────────────────┼────────────────────┤
│ John               │ Doe                │                    │
│ Jane               │ Smith              │ Williams           │
│ Alice              │ Johnson            │                    │
└────────────────────┴────────────────────┴────────────────────┘
Example 2: Split with max_splits=2 and drop_columns=False
>>> transformer = Split(subset=['full_name'], by=' ', max_splits=2, drop_columns=False)
>>> transformer.fit(X)
Split(subset=['full_name'], by=' ', max_splits=2, drop_columns=False)
>>> result = transformer.transform(X)
>>> result
shape: (3, 3)
┌─────────────────────┬────────────────────┬────────────────────┐
│ full_name           │ full_name__split_0 │ full_name__split_1 │
│ str                 │ str                │ str                │
├─────────────────────┼────────────────────┼────────────────────┤
│ John Doe            │ John               │ Doe                │
│ Jane Smith Williams │ Jane               │ Smith              │
│ Alice Johnson       │ Alice              │ Johnson            │
└─────────────────────┴────────────────────┴────────────────────┘
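The truncate-or-pad rule for max_splits, as described in the parameter list and shown in both examples, amounts to the following pure-Python sketch. `split_fixed` is a hypothetical helper written for illustration and says nothing about how the library handles null values.

```python
def split_fixed(value, by, max_splits):
    """Split `value` on `by` and return exactly `max_splits` parts.

    Extra parts are dropped; missing parts are filled with empty strings.
    """
    parts = value.split(by)[:max_splits]
    return parts + [""] * (max_splits - len(parts))


names = ["John Doe", "Jane Smith Williams", "Alice Johnson"]
three = [split_fixed(name, " ", 3) for name in names]
two = [split_fixed(name, " ", 2) for name in names]
print(three)
print(two)
```

Fixing the number of output parts is what keeps the column layout identical between training data and new data, regardless of how many delimiters a given value contains.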
- class gators.feature_generation_str.SplitExtract
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates features by splitting columns and extracting a specific part by index.
- Parameters:
Examples
>>> from gators.feature_generation_str import SplitExtract
>>> import polars as pl
>>> X = {'full_name': ['John Doe', 'Jane Smith', 'Alice Johnson']}
>>> X = pl.DataFrame(X)
Example 1: Extract first part (n=0)
>>> transformer = SplitExtract(subset=['full_name'], by=' ', n=0, drop_columns=True)
>>> transformer.fit(X)
SplitExtract(subset=['full_name'], by=' ', n=0, drop_columns=True)
>>> result = transformer.transform(X)
>>> result
shape: (3, 1)
┌────────────────────┐
│ full_name__split_0 │
│ str                │
├────────────────────┤
│ John               │
│ Jane               │
│ Alice              │
└────────────────────┘
Example 2: Extract second part (n=1)
>>> transformer = SplitExtract(subset=['full_name'], by=' ', n=1, drop_columns=False)
>>> transformer.fit(X)
SplitExtract(subset=['full_name'], by=' ', n=1, drop_columns=False)
>>> result = transformer.transform(X)
>>> result
shape: (3, 2)
┌───────────────┬────────────────────┐
│ full_name     │ full_name__split_1 │
│ str           │ str                │
├───────────────┼────────────────────┤
│ John Doe      │ Doe                │
│ Jane Smith    │ Smith              │
│ Alice Johnson │ Johnson            │
└───────────────┴────────────────────┘
- class gators.feature_generation_str.Startswith
Bases: BaseModel, BaseEstimator, TransformerMixin
Generates Boolean features indicating whether strings in the original columns start with the specified substrings.
Examples
Create an instance of the Startswith class:
>>> import polars as pl
>>> from gators.feature_generation_str import Startswith
>>> startswith_dict = {'col1': ['pre1', 'pre2'], 'col2': ['pre3']}
>>> transformer = Startswith(startswith_dict=startswith_dict)
Fit the transformer:
>>> X = pl.DataFrame({"col1": ["pre1_sample", None, "pre2_sample"],
...                   "col2": [None, "pre3_sample", "no_match"]})
>>> transformer.fit(X)
Transform the DataFrame:
>>> transformed_X = transformer.transform(X)
>>> print(transformed_X)
shape: (3, 5)
┌─────────────┬─────────────┬───────────────────────┬───────────────────────┬───────────────────────┐
│ col1        │ col2        │ col1__startswith_pre1 │ col1__startswith_pre2 │ col2__startswith_pre3 │
├─────────────┼─────────────┼───────────────────────┼───────────────────────┼───────────────────────┤
│ pre1_sample │ None        │ True                  │ False                 │ None                  │
├─────────────┼─────────────┼───────────────────────┼───────────────────────┼───────────────────────┤
│ None        │ pre3_sample │ None                  │ None                  │ True                  │
├─────────────┼─────────────┼───────────────────────┼───────────────────────┼───────────────────────┤
│ pre2_sample │ no_match    │ False                 │ True                  │ False                 │
└─────────────┴─────────────┴───────────────────────┴───────────────────────┴───────────────────────┘
- class gators.feature_generation_str.Upper
Bases: BaseModel, BaseEstimator, TransformerMixin
Converts string and Boolean columns to uppercase.
- Parameters:
Examples
>>> import polars as pl
>>> from gators.feature_generation_str import Upper
>>> # Sample data
>>> X = pl.DataFrame({
...     'A': ['cat', 'dog', 'cat', 'dog', 'cat'],
...     'B': ['yes', 'no', 'yes', 'no', 'yes'],
...     'C': ['quick', 'brown', 'fox', 'jumps', 'over']
... })
>>> # Transform to uppercase with default parameters (drop_columns=True)
>>> encoder = Upper()
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 3)
┌──────────┬──────────┬──────────┐
│ A__upper │ B__upper │ C__upper │
│ ---      │ ---      │ ---      │
│ str      │ str      │ str      │
├──────────┼──────────┼──────────┤
│ CAT      │ YES      │ QUICK    │
│ DOG      │ NO       │ BROWN    │
│ CAT      │ YES      │ FOX      │
│ DOG      │ NO       │ JUMPS    │
│ CAT      │ YES      │ OVER     │
└──────────┴──────────┴──────────┘
>>> # Transform to uppercase with drop_columns=False
>>> encoder = Upper(drop_columns=False)
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 6)
┌─────┬─────┬───────┬──────────┬──────────┬──────────┐
│ A   │ B   │ C     │ A__upper │ B__upper │ C__upper │
│ --- │ --- │ ---   │ ---      │ ---      │ ---      │
│ str │ str │ str   │ str      │ str      │ str      │
├─────┼─────┼───────┼──────────┼──────────┼──────────┤
│ cat │ yes │ quick │ CAT      │ YES      │ QUICK    │
│ dog │ no  │ brown │ DOG      │ NO       │ BROWN    │
│ cat │ yes │ fox   │ CAT      │ YES      │ FOX      │
│ dog │ no  │ jumps │ DOG      │ NO       │ JUMPS    │
│ cat │ yes │ over  │ CAT      │ YES      │ OVER     │
└─────┴─────┴───────┴──────────┴──────────┴──────────┘
>>> # Transform to uppercase for a subset of columns
>>> encoder = Upper(subset=['B'])
>>> encoder.fit(X)
>>> transformed_X = encoder.transform(X)
>>> print(transformed_X)
shape: (5, 4)
┌─────┬─────┬───────┬──────────┐
│ A   │ B   │ C     │ B__upper │
│ --- │ --- │ ---   │ ---      │
│ str │ str │ str   │ str      │
├─────┼─────┼───────┼──────────┤
│ cat │ yes │ quick │ YES      │
│ dog │ no  │ brown │ NO       │
│ cat │ yes │ fox   │ YES      │
│ dog │ no  │ jumps │ NO       │
│ cat │ yes │ over  │ YES      │
└─────┴─────┴───────┴──────────┘