matchbench.model.entity_matching package¶
Submodules¶
matchbench.model.entity_matching.deepmatcher module¶
matchbench.model.entity_matching.ditto module¶
- class matchbench.model.entity_matching.ditto.DKInjector(name)¶
Bases:
object
Inject domain knowledge into the data entry pairs.
- name¶
The injector name.
- Type:
str
- initialize()¶
- transform(entry)¶
- transform_file(input_fn, overwrite=False)¶
Transform all lines of a tsv file. Run the knowledge injector. If the output already exists, just return the file name.
- Parameters:
input_fn (str) – The input file name.
overwrite (bool, optional) – If true, overwrite any cached output.
- Returns:
the output file name
- Return type:
str
- class matchbench.model.entity_matching.ditto.Ditto(summarize=False, dk=None, da='drop_col', max_length=128, size=None, batch_size=32, device='cuda', pretrain_plm_path=None, tokenizer=None, encodermodel=None, alpha_aug=0.8)¶
Bases:
EMModel
The class for Ditto approach
- Class attributes:
Data/Feature generation:
summarize (bool): Whether the input sequence is summarized by retaining only the high TF-IDF tokens.
dk (str): Whether to inject domain knowledge into the input sequences.
da (str): Whether to train the Ditto model with MixDA; supported operators are "del", "swap", "drop_col", "append_col", "all".
max_length (int): Max sequence length.
batch_size (int): The batch size.
size (int): Maximum number of examples used for training.
Model architecture / Loss:
alpha_aug (double): The parameter for np.random.beta.
Others:
device (str): Device where data is stored and the model runs.
plm_path (str): The local pretrained language model path.
tokenizer (transformers.Tokenizer): User-defined PLM tokenizer.
encodermodel (transformers.PretrainedModel): User-defined PLM model.
A minimal instantiation sketch is shown below.
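Example (a minimal instantiation sketch; the argument values are illustrative, not the only supported configuration):
>>> from matchbench.model.entity_matching.ditto import Ditto
>>> # da selects the MixDA operator; "drop_col" is one of the documented options
>>> model = Ditto(summarize=False, dk=None, da='drop_col', max_length=128,
...               batch_size=32, device='cuda', alpha_aug=0.8)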
- calculate_loss(prediction, label)¶
Calculate the loss of the batch.
- Parameters:
prediction (torch.tensor) – The result predicted by the Ditto model.
label (torch.tensor) – The true label.
- Returns:
loss
- Return type:
torch.tensor
- forward(x1, x2=None)¶
Encode the left, right, and the concatenation of left+right.
- Parameters:
x1 (LongTensor) – A batch of IDs.
x2 (LongTensor, optional, defaults to None) – A batch of IDs (augmented).
- Returns:
Binary prediction.
- Return type:
Tensor
- load_source_target(data_src, data_tgt)¶
Prepare source and target data.
- Parameters:
data_src (datasets.dataset_dict.DatasetDict) – Source dataset.
data_tgt (datasets.dataset_dict.DatasetDict) – Target dataset.
- predict(dataloader, split='test', threshold=-1)¶
Predict the results.
- Parameters:
dataloader (torch.utils.data.DataLoader) – Dataloader for prediction.
split (str, optional, defaults to “test”) – A string indicating which split the dataset is.
threshold (float, optional, defaults to -1) – The threshold on the 0-class.
- Returns:
List : The true labels.
List : The results predicted by the Ditto model.
- Return type:
List
- prepare_dataloader(dataset, split, batch_size=32)¶
Prepare dataloaders for training and evaluation.
- Parameters:
dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.
split (str, optional, defaults to “train”) – A string indicating which split the dataset is.
batch_size (int, optional, defaults to 32) – Batch size for training or evaluation.
- Returns:
Return dataloader for train/valid/test.
- Return type:
tuple of torch.utils.data.DataLoader
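Example (a hedged end-to-end sketch using only the methods documented above; model is the Ditto instance from the earlier sketch, and data_src, data_tgt and test_pairs are hypothetical datasets objects):
>>> model.load_source_target(data_src, data_tgt)
>>> # prepare_dataloader may return one DataLoader or a tuple of them, depending on the split
>>> test_loader = model.prepare_dataloader(test_pairs, split='test', batch_size=32)
>>> labels, preds = model.predict(test_loader, split='test', threshold=-1)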
- run_step(batch, device=device(type='cuda'))¶
The whole process of training one batch (step).
- Parameters:
batch (tuple of torch.tensor) – The input batch.
device (str, optional, defaults to "cuda") – cuda or cpu.
- Returns:
loss
- Return type:
torch.tensor
- class matchbench.model.entity_matching.ditto.DittoDataLoader(data, lm='roberta', max_len=256, size=None, da=None)¶
Bases:
Dataset
Ditto dataset
- Class Attributes:
data (List of str): Data.
lm (str): The name of the pretrained language model.
max_len (int): Max sequence length.
size (int): Maximum number of examples used for training.
da (str): Whether to train the Ditto model with MixDA; supported operators are "del", "swap", "drop_col", "append_col", "all".
- static pad(batch)¶
Merge a list of dataset items into a train/test batch.
- Parameters:
batch (List of tuple) – A list of dataset items.
- Returns:
LongTensor: x1 of shape (batch_size, seq_len).
LongTensor: x2 of shape (batch_size, seq_len). Elements of x1 and x2 are padded to the same length.
LongTensor: a batch of labels of shape (batch_size,).
- Return type:
LongTensor
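Example (a sketch of batching serialized pairs; DittoDataLoader is a torch Dataset despite its name, and lines is a hypothetical list of serialized "left \t right \t label" strings):
>>> from torch.utils.data import DataLoader
>>> from matchbench.model.entity_matching.ditto import DittoDataLoader
>>> ds = DittoDataLoader(lines, lm='roberta', max_len=256)
>>> loader = DataLoader(ds, batch_size=32, collate_fn=DittoDataLoader.pad)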
- class matchbench.model.entity_matching.ditto.GeneralDKInjector(name)¶
Bases:
DKInjector
The domain-knowledge injector for publication and business data.
- initialize()¶
Initialize spacy
- transform(entry)¶
Transform a data entry. Use NER to recognize the relevant named entities and mark them in the sequence. Normalize the numbers into the same format.
- Parameters:
entry (str) – The serialized data entry.
- Returns:
the transformed entry
- Return type:
str
- class matchbench.model.entity_matching.ditto.ProductDKInjector(name)¶
Bases:
DKInjector
The domain-knowledge injector for product data.
- initialize()¶
Initialize spacy
- transform(entry)¶
Transform a data entry. Use NER to recognize the product-related named entities and mark them in the sequence. Normalize the numbers into the same format.
- Parameters:
entry (str) – The serialized data entry.
- Returns:
the transformed entry
- Return type:
str
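Example (a sketch of running a domain-knowledge injector over a serialized tsv file; the injector name and "train.tsv" are hypothetical, and initialize() is called explicitly in case the constructor does not do so):
>>> from matchbench.model.entity_matching.ditto import ProductDKInjector
>>> injector = ProductDKInjector('product')
>>> injector.initialize()
>>> out_fn = injector.transform_file('train.tsv', overwrite=False)  # returns the output file name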
- class matchbench.model.entity_matching.ditto.Summarizer(data, lm)¶
Bases:
object
Summarize a data entry pair to a length up to the max sequence length.
- Parameters:
task_config (Dictionary) – The task configuration.
lm (str) – The language model (bert, albert, or distilbert).
- config¶
the task configuration
- Type:
Dictionary
- tokenizer¶
a tokenizer from the huggingface library
- Type:
Tokenizer
- build_index()¶
Build the idf index. Store the index and vocabulary in self.idf and self.vocab.
- get_len(word)¶
Return the sentence_piece length of a token.
- transform(row, max_len=128)¶
Summarize one single example. Only retain tokens with the highest TF-IDF scores.
- Parameters:
row (str) – A matching example of two data entries and a binary label, separated by tab.
max_len (int, optional) – The maximum sequence length to summarize to.
- Returns:
the summarized example
- Return type:
str
- transform_file(input_fn, max_len=256, overwrite=False)¶
Summarize all lines of a tsv file. Run the summarizer. If the output already exists, just return the file name.
- Parameters:
input_fn (str) – The input file name.
max_len (int, optional) – The max sequence length.
overwrite (bool, optional) – If true, overwrite any cached output.
- Returns:
the output file name
- Return type:
str
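Example (a hedged sketch of TF-IDF summarization for one serialized pair; corpus and row are hypothetical, and build_index() is called explicitly in case the constructor does not build the idf index itself):
>>> from matchbench.model.entity_matching.ditto import Summarizer
>>> summarizer = Summarizer(corpus, lm='distilbert')
>>> summarizer.build_index()
>>> short_row = summarizer.transform(row, max_len=128)  # row: "left \t right \t label"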
matchbench.model.entity_matching.jointbert module¶
- matchbench.model.entity_matching.jointbert.BCEWithLogitsLoss(output, target, pos_neg_ratio=None)¶
- class matchbench.model.entity_matching.jointbert.BertDataLoaderJoint(batch_size, file, valid_file=None, valid_batch_size=None, shuffle=True, validation_split=-1, num_workers=0, tokenizer_name='bert-base-uncased', max_length=512, mlm=False)¶
Bases:
DataLoader
DataLoader class for the JointBert approach.
- Class Attributes:
batch_size (int): The batch size.
file (pandas.DataFrame): The dataset.
valid_file (List): The valid ids.
valid_batch_size (int): The valid batch size.
shuffle (bool, defaults to True): If shuffle=True, a RandomSampler is used internally, which permutes the indices of all samples.
validation_split (int, defaults to -1): Whether the dataset needs to be split to obtain a validation set.
num_workers (int): The number of subprocesses the DataLoader instance uses to load data.
tokenizer_name (str): The name of the PLM tokenizer.
max_length (int): Max sequence length.
mlm (bool): Whether to use masked-LM training when preparing masked tokens.
- dataset: Dataset[T_co]¶
- sampler: Sampler¶
- split_validation()¶
Get the validation DataLoader.
- Returns:
The validation DataLoader.
- Return type:
torch.utils.data.DataLoader
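Example (a hedged sketch of building the JointBert loaders; train_df is a hypothetical pandas.DataFrame of serialized pairs):
>>> from matchbench.model.entity_matching.jointbert import BertDataLoaderJoint
>>> train_loader = BertDataLoaderJoint(batch_size=16, file=train_df, shuffle=True,
...                                    tokenizer_name='bert-base-uncased', max_length=256)
>>> valid_loader = train_loader.split_validation()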
- class matchbench.model.entity_matching.jointbert.DataCollatorWithPadding(tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], padding: Union[bool, str, PaddingStrategy] = True, max_length: Optional[int] = None, pad_to_multiple_of: Optional[int] = None, mlm: bool = False, mlm_probability: float = 0.15)¶
Bases:
object
Data collator that will dynamically pad the inputs received.
- Class Attributes:
tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast): The tokenizer used for encoding the data.
padding (bool, str or PaddingStrategy, optional, defaults to True): Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among:
True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
'max_length': Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
max_length (int, optional): Maximum length of the returned list and optionally padding length (see above).
pad_to_multiple_of (int, optional): If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
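Example (a sketch that assumes the collator is callable like its transformers counterpart; features is a hypothetical list of tokenized examples, i.e. dicts with input_ids and attention_mask):
>>> from transformers import AutoTokenizer
>>> from matchbench.model.entity_matching.jointbert import DataCollatorWithPadding
>>> tok = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> collator = DataCollatorWithPadding(tokenizer=tok, padding=True, max_length=512)
>>> batch = collator(features)  # dynamically pads to the longest sequence in the batch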
- class matchbench.model.entity_matching.jointbert.JointBert(pos_neg_ratio=8, device='cuda', weight_decay=0.01, num_classes_multi=1013, num_classes=1, freeze_bert=False)¶
Bases:
EMModel
The class for JointBert approach
- Class attributes:
Data/Feature generation:
pos_neg_ratio (int): The ratio of negative to positive examples.
Model architecture / Loss:
weight_decay (float): Weight decay.
num_classes_multi (int): The output size of the entity identifier classification layer.
num_classes (int): The output size of the matching classification layer.
Others:
device (str): Device where data is stored and the model runs.
freeze_bert (bool): Whether to freeze the weights of the BERT model.
A minimal instantiation sketch is shown below.
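Example (a minimal instantiation sketch using the documented defaults; the values are illustrative):
>>> from matchbench.model.entity_matching.jointbert import JointBert
>>> model = JointBert(pos_neg_ratio=8, device='cuda', weight_decay=0.01,
...                   num_classes_multi=1013, num_classes=1, freeze_bert=False)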
- calc_weights_label(trainloader)¶
Calculate weights for the entity identifier loss function.
- Parameters:
trainloader (BertDataLoaderJoint) – Train data loader.
- forward(seq, token_ids, attn_masks)¶
- Parameters:
seq (Tensor) – The sequences.
token_ids (Tensor) – Tensor of shape [B, T] containing token ids of sequences.
attn_masks (Tensor) – Tensor of shape [B, T] containing attention masks to be used to avoid contribution of PAD tokens.
- Returns:
Tensor : The probability of the positive class (match).
Tensor : The entity identifier of the left entity descriptions.
Tensor : The entity identifier of the right entity descriptions.
- Return type:
Tensor
- pair_generator(tableA, tableB, train, valid, test)¶
Generate train/valid/test data loaders.
- Parameters:
data_src (datasets.dataset_dict.DatasetDict) – Source dataset.
data_tgt (datasets.dataset_dict.DatasetDict) – Target dataset.
train (datasets.arrow_dataset.Dataset) – Train pairs.
valid (datasets.arrow_dataset.Dataset) – Valid pairs.
test (datasets.arrow_dataset.Dataset) – Test pairs.
- Returns:
BertDataLoaderJoint : Train data loader.
torch.utils.data.DataLoader : Valid data loader.
BertDataLoaderJoint : Test data loader.
- Return type:
BertDataLoaderJoint
- predict(dataloader, split='test')¶
Predict the results.
- Parameters:
dataloader (torch.utils.data.DataLoader) – Dataloader for prediction.
split (str, optional, defaults to “test”) – A string indicating which split the dataset is.
threshold (float, optional, defaults to -1) – The threshold on the 0-class.
- Returns:
List : The true labels.
List : The results predicted by the JointBert model.
- Return type:
List
- prepare_dataloader(dataset, split, batch_size=32)¶
Prepare dataloaders for training and evaluation.
- Parameters:
dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.
split (str, optional, defaults to “train”) – A string indicating which split the dataset is.
batch_size (int, optional, defaults to 32) – Batch size for training or evaluation.
- Returns:
Return dataloader for train/valid/test.
- Return type:
tuple of torch.utils.data.DataLoader
- run_step(batch, device)¶
The whole process of training one batch (step).
- Parameters:
batch (tuple of torch.tensor) – The input batch.
device (str, optional, defaults to "cuda") – cuda or cpu.
- Returns:
loss
- Return type:
torch.tensor
- class matchbench.model.entity_matching.jointbert.JointBertPrep(data, tokenizer_name, max_length)¶
Bases:
Dataset
Prepare data for JointBert.
- Class Attributes:
data (pandas.DataFrame): The data to prepare.
tokenizer_name (str): The name of the PLM tokenizer.
max_length (int): Max sequence length.
- matchbench.model.entity_matching.jointbert.assign_clusterid(identifier, cluster_id_dict, cluster_id_amount)¶
- matchbench.model.entity_matching.jointbert.cluster_id_process(tableA, tableB, train, valid, test)¶
Get the cluster ids for entities in the train/valid/test datasets.
- Parameters:
data_src (datasets.dataset_dict.DatasetDict) – Source dataset.
data_tgt (datasets.dataset_dict.DatasetDict) – Target dataset.
train (datasets.arrow_dataset.Dataset) – Train pairs.
valid (datasets.arrow_dataset.Dataset) – Valid pairs.
test (datasets.arrow_dataset.Dataset) – Test pairs.
- Returns:
pandas.DataFrame : Train dataset with cluster ids.
List of str : Pair ids for the valid dataset.
pandas.DataFrame : Test dataset with cluster ids.
- Return type:
pandas.DataFrame
- matchbench.model.entity_matching.jointbert.get_cluster_id(pairs, cluster_id_dict, cluster_id_amount, left_df, right_df)¶
- matchbench.model.entity_matching.jointbert.get_encoder_deepmatcher(train, test)¶
- matchbench.model.entity_matching.jointbert.process_to_bert(dataset, cutting_func=None, multi_encoder=None)¶
matchbench.model.entity_matching.robem module¶
- class matchbench.model.entity_matching.robem.ASLSingleLabel(gamma_pos=0, gamma_neg=4, eps: float = 0.1, reduction='mean')¶
Bases:
Module
- forward(inputs, target)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class matchbench.model.entity_matching.robem.AugMode(value)¶
Bases:
Enum
An enumeration.
- RANDOM_DEL = 0¶
- REMOVE_CHAR_LESS_THAN = 2¶
- REMOVE_STOP_WORD = 1¶
- ROB_ALL = 3¶
- ROB_SFL = 5¶
- ROB_SWAP = 4¶
- class matchbench.model.entity_matching.robem.BasicAug(mode: AugMode = AugMode.RANDOM_DEL, **kwargs)¶
Bases:
object
- augment(text, num=1)¶
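Example (a sketch of text augmentation; the behaviour of each AugMode operator is assumed from its name, and the input string is illustrative):
>>> from matchbench.model.entity_matching.robem import AugMode, BasicAug
>>> aug = BasicAug(mode=AugMode.RANDOM_DEL)
>>> variants = aug.augment("canon eos 5d mark iv dslr body", num=2)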
- class matchbench.model.entity_matching.robem.EmDataset(datasrc, datatgt, match, mode='train', transform=None, device=None, sentence_size=256, aug_size=1)¶
Bases:
Dataset
Robem Dataset.
- Class attributes:
datasrc (pandas.DataFrame): Source dataset.
datatgt (pandas.DataFrame): Target dataset.
match (datasets.arrow_dataset.Dataset): The matching relationship for pairs.
mode (str): Mode for data preparation; the options are "train", "valid", "test".
transform (List of tokenizer): The tokenizers used in data preparation, such as transformers.BertTokenizer and transformers.RobertaTokenizer.
device (str): Device where data is stored and the model runs.
sentence_size (int): Max sentence size.
aug_size (int): The number of augmented examples generated per data item.
- input_serializer(left, left_full, right, right_full)¶
- class matchbench.model.entity_matching.robem.Highway(input_size)¶
Bases:
Module
- forward(input)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class matchbench.model.entity_matching.robem.RobEM(device='cuda', dataprep=None, pretrained_lm='roberta-base', sent_size=256, deep_classifier=False, loss_type='wce', neg_weight=0.2, pos_weight=0.8, amp_scaler=True, addsep=False)¶
Bases:
EMModel
The class for Robem approach
- Class attributes:
Data/Feature generation:
dataprep (torch.utils.data.Dataset or a subclass): The Dataset used for RobEM data preparation.
sent_size (int): Max sequence length.
Model architecture / Loss:
deep_classifier (bool): Whether to enable the deep classifier.
loss_type (str): The name of the loss function for the RobEM approach.
neg_weight (float): WCE & ASL loss weight for non-match samples.
pos_weight (float): WCE & ASL loss weight for match samples.
amp_scaler (bool): Whether to use automatic mixed precision.
addsep (bool): Whether to add the attribute separator as a special token to the tokenizer and model (LM).
Others:
device (str): Device where data is stored and the model runs.
pretrained_lm (str): The name of the PLM.
A minimal instantiation sketch is shown below.
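Example (a minimal instantiation sketch; the argument values shown are the documented defaults):
>>> from matchbench.model.entity_matching.robem import RobEM
>>> model = RobEM(device='cuda', pretrained_lm='roberta-base', sent_size=256,
...               loss_type='wce', neg_weight=0.2, pos_weight=0.8, amp_scaler=True)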
- context_forward(x)¶
- context_similarity_layers(deep=False)¶
- forward(x1, x2, concat)¶
Forward function of RobEM.
- Parameters:
x1 (Tensor) – The left input.
x2 (Tensor) – The right input.
concat (Tensor) – The concatenation of left and right.
- Returns:
Binary prediction.
- Return type:
Tensor
- get_lm()¶
- get_lm_class()¶
- get_lm_dim()¶
- static get_tokenizers(lm, add_special_token=True)¶
- has_type_token()¶
- load_source_target(data_src, data_tgt)¶
Prepare source and target data.
- Parameters:
data_src (datasets.dataset_dict.DatasetDict) – Source dataset.
data_tgt (datasets.dataset_dict.DatasetDict) – Target dataset.
- predict(dataloader, split='test', threshold=-1)¶
Predict the results.
- Parameters:
dataloader (torch.utils.data.DataLoader) – Dataloader for prediction.
split (str, optional, defaults to “test”) – A string indicating which split the dataset is.
threshold (float, optional, defaults to -1) – The threshold on the 0-class.
- Returns:
List : The true labels.
List : The results predicted by the RobEM model.
- Return type:
List
- prepare_dataloader(dataset, split, batch_size=32)¶
Prepare dataloaders for training and evaluation.
- Parameters:
dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.
split (str, optional, defaults to “train”) – A string indicating which split the dataset is.
batch_size (int, optional, defaults to 32) – Batch size for training or evaluation.
- Returns:
Return dataloader for train/valid/test.
- Return type:
tuple of torch.utils.data.DataLoader
- reset_weights()¶
- resize_embedding(module, new_len)¶
- run_step(batch, device=device(type='cuda'))¶
The whole process of training one batch (step).
- Parameters:
batch (tuple of torch.tensor) – The input batch.
device (str, optional, defaults to "cuda") – cuda or cpu.
- Returns:
loss
- Return type:
torch.tensor
- class matchbench.model.entity_matching.robem.RobertaClassificationHead(input_size)¶
Bases:
Module
Head for sentence-level classification tasks.
- forward(x)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class matchbench.model.entity_matching.robem.RobustAugmenter(size, cols, sep_key=' [SEP] ')¶
Bases:
object
- class matchbench.model.entity_matching.robem.SimpleClassifier(hidden_size)¶
Bases:
Module
- forward(enc)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class matchbench.model.entity_matching.robem.Summarizer¶
Bases:
object
Class for summarizer.
- summarize(text, max_len=256)¶
Summarize text to a length up to the max sequence length.
- Parameters:
text (str) – The text to summarize.
max_len (int) – Max sequence length.
- Returns:
The result after summarizing.
- Return type:
str
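Example (a sketch of truncating a long serialized entry; the input string is illustrative):
>>> from matchbench.model.entity_matching.robem import Summarizer
>>> short_text = Summarizer().summarize("COL title VAL canon eos 5d mark iv COL brand VAL canon", max_len=256)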
- matchbench.model.entity_matching.robem.cosine_similarity(x1, x2, dim=1, eps=1e-08) Tensor ¶
Returns cosine similarity between x1 and x2, computed along dim.
\[\text{similarity} = \dfrac{x_1 \cdot x_2}{\max(\Vert x_1 \Vert _2 \cdot \Vert x_2 \Vert _2, \epsilon)}\]
- Parameters:
x1 (Tensor) – First input.
x2 (Tensor) – Second input.
dim (int, optional) – Dimension along which cosine similarity is computed. Default: 1
eps (float, optional) – Small value to avoid division by zero. Default: 1e-8
- Shape:
Input: \((\ast_1, D, \ast_2)\) where D is at position dim.
Output: \((\ast_1, \ast_2)\) where 1 is at position dim.
Example:
>>> input1 = torch.randn(100, 128)
>>> input2 = torch.randn(100, 128)
>>> output = F.cosine_similarity(input1, input2)
>>> print(output)
- matchbench.model.entity_matching.robem.set_to_device(x, device)¶