matchbench.model.entity_matching package

Submodules

matchbench.model.entity_matching.deepmatcher module

matchbench.model.entity_matching.ditto module

class matchbench.model.entity_matching.ditto.DKInjector(name)

Bases: object

Inject domain knowledge into the data entry pairs.

name

the injector name

Type:

str

initialize()
transform(entry)
transform_file(input_fn, overwrite=False)

Transform all lines of a tsv file by running the knowledge injector. If the output already exists, just return the file name.

Parameters:
  • input_fn (str) – The input file name.

  • overwrite (bool, optional) – If true, overwrite any cached output.

Returns:

the output file name

Return type:

str

class matchbench.model.entity_matching.ditto.Ditto(summarize=False, dk=None, da='drop_col', max_length=128, size=None, batch_size=32, device='cuda', pretrain_plm_path=None, tokenizer=None, encodermodel=None, alpha_aug=0.8)

Bases: EMModel

The class for Ditto approach

Class attributes:

Data/Feature generation

  • summarize (bool): Whether the input sequence is summarized by retaining only the high TF-IDF tokens.

  • dk (str): Whether to inject domain knowledge into the input sequences.

  • da (str): Whether to train the Ditto model with MixDA; supported operators are “del”, “swap”, “drop_col”, “append_col”, “all”.

  • max_length (int): Max sequence length.

  • batch_size (int): The batch size.

  • size (int): Maximum number of training examples.

Model architecture / Loss

  • alpha_aug (double): The parameter for np.random.beta.

Others

  • device (str): Device where data is stored and the model runs.

  • pretrain_plm_path (str): The local pretrained language model path.

  • tokenizer (transformers.Tokenizer): User-defined PLM tokenizer.

  • encodermodel (transformers.PretrainedModel): User-defined PLM model.
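Example (a minimal construction sketch; the pretrained-model path is a placeholder and the remaining arguments keep their documented defaults):

>>> model = Ditto(summarize=False, dk=None, da='drop_col', max_length=128,
...               batch_size=32, device='cuda', pretrain_plm_path='roberta-base')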

calculate_loss(prediction, label)

Calculate the loss of the batch.

Parameters:
  • prediction (torch.tensor) – The result predicted by the Ditto model.

  • label (torch.tensor) – The true label.

Returns:

loss

Return type:

torch.tensor

forward(x1, x2=None)

Encode the left, the right, and the concatenation of left+right.

Parameters:
  • x1 (LongTensor) – A batch of ID’s.

  • x2 (LongTensor, optional, defaults to None) – A batch of ID’s (augmented).

Returns:

Binary prediction.

Return type:

Tensor

load_source_target(data_src, data_tgt)

Prepare source and target data.

Parameters:
  • data_src (datasets.dataset_dict.DatasetDict) – Source dataset.

  • data_tgt (datasets.dataset_dict.DatasetDict) – Target dataset.

predict(dataloader, split='test', threshold=-1)

Predict the results

Parameters:
  • dataloader (torch.utils.data.DataLoader) – Dataloader for prediction.

  • split (str, optional, defaults to “test”) – A string indicating which split the dataset is.

  • threshold (float, optional, defaults to -1) – The threshold on the 0-class

Returns:

The true labels. List: The results predicted by the Ditto model.

Return type:

List

prepare_dataloader(dataset, split, batch_size=32)

Prepare dataloaders for training and evaluation.

Parameters:
  • dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.

  • split (str, optional, defaults to “train”) – A string indicating which split the dataset is.

  • batch_size (int, optional, defaults to 32) – Batch_size for training or evaluating.

Returns:

Return dataloader for train/valid/test.

Return type:

tuple of torch.utils.data.DataLoader

run_step(batch, device=device(type='cuda'))

The whole process of training one batch (step).

Parameters:
  • batch (tuple of torch.tensor) – The training batch.

  • device (str, optional, defaults to “cuda”) – cuda or cpu.

Returns:

loss

Return type:

torch.tensor

training: bool
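Example (a hedged end-to-end sketch of the Ditto workflow using the methods above; data_src, data_tgt, train_pairs, and test_pairs are placeholder DatasetDict / Dataset objects, the optimizer setup follows standard PyTorch practice rather than anything mandated by MatchBench, and the sketch assumes prepare_dataloader returns a single loader per call):

>>> model = Ditto(pretrain_plm_path='roberta-base', device='cuda')    # placeholder PLM path
>>> model.load_source_target(data_src, data_tgt)                      # source/target DatasetDicts
>>> train_loader = model.prepare_dataloader(train_pairs, split='train', batch_size=32)
>>> test_loader = model.prepare_dataloader(test_pairs, split='test', batch_size=32)
>>> optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
>>> for batch in train_loader:                                        # one epoch, no scheduler shown
...     optimizer.zero_grad()
...     loss = model.run_step(batch, device='cuda')
...     loss.backward()
...     optimizer.step()
>>> labels, preds = model.predict(test_loader, split='test')          # two lists, see predict()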
class matchbench.model.entity_matching.ditto.DittoDataLoader(data, lm='roberta', max_len=256, size=None, da=None)

Bases: Dataset

Ditto dataset

Class Attributes:

  • data (List of str): The data.

  • lm (str): The name of the pretrained language model.

  • max_len (int): Max sequence length.

  • size (int): Maximum number of training examples.

  • da (str): Whether to train the Ditto model with MixDA; supported operators are “del”, “swap”, “drop_col”, “append_col”, “all”.

static pad(batch)

Merge a list of dataset items into a train/test batch.

Parameters:
  • batch (List of tuple) – A list of dataset items.

Returns:

x1 of shape (batch_size, seq_len). LongTensor: x2 of shape (batch_size, seq_len); elements of x1 and x2 are padded to the same length. LongTensor: a batch of labels of shape (batch_size,).

Return type:

LongTensor
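Example (a small sketch; `lines` is a placeholder list of serialized entity-pair strings in Ditto’s tab-separated format, and the static pad method is used as the collate function of a standard PyTorch DataLoader):

>>> ds = DittoDataLoader(lines, lm='roberta', max_len=256, da='drop_col')
>>> loader = torch.utils.data.DataLoader(ds, batch_size=32, collate_fn=DittoDataLoader.pad)
>>> x1, x2, labels = next(iter(loader))     # assumes MixDA is enabled so the batch unpacks into x1, x2, labels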

class matchbench.model.entity_matching.ditto.GeneralDKInjector(name)

Bases: DKInjector

The domain-knowledge injector for publication and business data.

initialize()

Initialize spaCy.

transform(entry)

Transform a data entry. Use NER to recognize the named entities and mark them in the sequence. Normalize the numbers into the same format.

Parameters:
  • entry (str) – The serialized data entry.

Returns:

the transformed entry

Return type:

str
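Example (a usage sketch; the injector name and the tsv path are placeholders, and whether initialize() must be called explicitly or is already invoked by the constructor is an assumption):

>>> injector = GeneralDKInjector('general')
>>> injector.initialize()                                   # loads spaCy
>>> out_fn = injector.transform_file('train.tsv', overwrite=True)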

class matchbench.model.entity_matching.ditto.ProductDKInjector(name)

Bases: DKInjector

The domain-knowledge injector for product data.

initialize()

Initialize spaCy.

transform(entry)

Transform a data entry. Use NER to recognize the product-related named entities and mark them in the sequence. Normalize the numbers into the same format.

Parameters:
  • entry (str) – The serialized data entry.

Returns:

the transformed entry

Return type:

str

class matchbench.model.entity_matching.ditto.Summarizer(data, lm)

Bases: object

Summarize a data entry pair into a length up to the max sequence length.

Parameters:
  • task_config (Dictionary) – The task configuration.

  • lm (str) – The language model (bert, albert, or distilbert).

config

the task configuration

Type:

Dictionary

tokenizer

a tokenizer from the huggingface library

Type:

Tokenizer

build_index()

Build the idf index. Store the index and vocabulary in self.idf and self.vocab.

get_len(word)

Return the sentence_piece length of a token.

transform(row, max_len=128)

Summarize one single example, retaining only the tokens with the highest TF-IDF scores.

Parameters:
  • row (str) – A matching example of two data entries and a binary label, separated by tabs.

  • max_len (int, optional) – The maximum sequence length to summarize to.

Returns:

the summarized example

Return type:

str

transform_file(input_fn, max_len=256, overwrite=False)

Summarize all lines of a tsv file by running the summarizer. If the output already exists, just return the file name.

Parameters:
  • input_fn (str) – The input file name.

  • max_len (int, optional) – The max sequence length.

  • overwrite (bool, optional) – If true, overwrite any cached output.

Returns:

the output file name

Return type:

str
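Example (a hedged usage sketch; `data` stands in for the constructor’s first argument as described above, the tsv path is illustrative, and whether build_index() must be called explicitly or is already invoked by the constructor is an assumption):

>>> summ = Summarizer(data, lm='roberta')
>>> summ.build_index()                                      # populates self.idf and self.vocab
>>> out_fn = summ.transform_file('train.tsv', max_len=256)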

matchbench.model.entity_matching.jointbert module

matchbench.model.entity_matching.jointbert.BCEWithLogitsLoss(output, target, pos_neg_ratio=None)
class matchbench.model.entity_matching.jointbert.BertDataLoaderJoint(batch_size, file, valid_file=None, valid_batch_size=None, shuffle=True, validation_split=-1, num_workers=0, tokenizer_name='bert-base-uncased', max_length=512, mlm=False)

Bases: DataLoader

DataLoader class for JointBert approach. Class Attributes:

  • batch_size (int): The batch size.

  • file (pandas.DataFrame): The dataset.

  • valid_file (List): The valid ids.

  • valid_batch_size (int): The valid batch size.

  • shuffle (bool, defaults to True): If shuffle=True, the RandomSampler is used internally, which just permutes the indices of all samples.

  • validation_split (int, defaults to -1): Whether to split the dataset to obtain a validation set.

  • num_workers (int): The number of subprocesses the DataLoader instance will use to load data.

  • tokenizer_name (str): The name of the PLM tokenizer.

  • max_length (int): Max sequence length.

  • mlm (bool): Whether to use masked-LM training when preparing masked tokens.

batch_size: Optional[int]
dataset: Dataset[T_co]
drop_last: bool
num_workers: int
pin_memory: bool
prefetch_factor: int
sampler: Sampler
split_validation()

Get the valid DataLoader.

Returns:

The valid DataLoader.

Return type:

torch.utils.data.DataLoader

timeout: float
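Example (a construction sketch; `train_df` is a placeholder pandas.DataFrame of serialized pairs, and the note on split_validation() is an assumption based on the defaults above):

>>> loader = BertDataLoaderJoint(batch_size=16, file=train_df,
...                              tokenizer_name='bert-base-uncased', max_length=512)
>>> valid_loader = loader.split_validation()   # meaningful only when validation_split or valid_file is set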
class matchbench.model.entity_matching.jointbert.DataCollatorWithPadding(tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], padding: Union[bool, str, PaddingStrategy] = True, max_length: Optional[int] = None, pad_to_multiple_of: Optional[int] = None, mlm: bool = False, mlm_probability: float = 0.15)

Bases: object

Data collator that will dynamically pad the inputs received. Class Attributes:

  • tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast): The tokenizer used for encoding the data.

  • padding (bool, str or PaddingStrategy, optional, defaults to True): Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:

    • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).

    • 'max_length': Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.

    • False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).

  • max_length (int, optional): Maximum length of the returned list and optionally padding length (see above).

  • pad_to_multiple_of (int, optional): If set, will pad the sequence to a multiple of the provided value. This is especially useful for enabling Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).

mask_tokens(inputs: Tensor) Tuple[Tensor, Tensor]

Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.

max_length: Optional[int] = None
mlm: bool = False
mlm_probability: float = 0.15
pad_to_multiple_of: Optional[int] = None
padding: Union[bool, str, PaddingStrategy] = True
tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
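Example (a usage sketch, assuming the collator is callable as a collate_fn in the usual Hugging Face style; `encoded_dataset` is a placeholder dataset of tokenized examples):

>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> collator = DataCollatorWithPadding(tokenizer=tok, padding=True, max_length=512, mlm=False)
>>> loader = torch.utils.data.DataLoader(encoded_dataset, batch_size=16, collate_fn=collator)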
class matchbench.model.entity_matching.jointbert.JointBert(pos_neg_ratio=8, device='cuda', weight_decay=0.01, num_classes_multi=1013, num_classes=1, freeze_bert=False)

Bases: EMModel

The class for JointBert approach

Class attributes:

Data/Feature generation

  • pos_neg_ratio (int): The ratio of negative to positive samples.

Model architecture / Loss

  • weight_decay (float): Weight decay.

  • num_classes_multi (int): The output size of the entity-identifier classification layer.

  • num_classes (int): The output size of the matching classification layer.

Others

  • device (str): Device where data is stored and the model runs.

  • freeze_bert (bool): Whether to freeze the weights of the BERT model.

calc_weights_label(trainloader)

Calculate weights for the entity-identifier loss function.

Parameters:
  • trainloader (BertDataLoaderJoint) – Train data loader.

forward(seq, token_ids, attn_masks)
Parameters:
  • seq (Tensor) – The sequences.

  • token_ids (Tensor) – Tensor of shape [B, T] containing token ids of sequences.

  • attn_masks (Tensor) – Tensor of shape [B, T] containing attention masks used to avoid the contribution of PAD tokens.

Returns:

The probability of the positive class (match). Tensor: The entity identifier of the left entity descriptions. Tensor: The entity identifier of the right entity descriptions.

Return type:

Tensor

pair_generator(tableA, tableB, train, valid, test)

Generate the train/valid/test data loaders.

Parameters:
  • tableA – Source table.

  • tableB – Target table.

  • train (datasets.arrow_dataset.Dataset) – Train pairs.

  • valid (datasets.arrow_dataset.Dataset) – Valid pairs.

  • test (datasets.arrow_dataset.Dataset) – Test pairs.

Returns:

Train data loader. torch.utils.data.DataLoader: Valid data loader. BertDataLoaderJoint: Test data loader.

Return type:

BertDataLoaderJoint

predict(dataloader, split='test')

Predict the results

Parameters:
  • dataloader (torch.utils.data.DataLoader) – Dataloader for prediction.

  • split (str, optional, defaults to “test”) – A string indicating which split the dataset is.


Returns:

The true labels. List: The results predicted by the JointBert model.

Return type:

List

prepare_dataloader(dataset, split, batch_size=32)

Prepare dataloaders for training and evaluation.

Parameters:
  • dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.

  • split (str, optional, defaults to “train”) – A string indicating which split the dataset is.

  • batch_size (int, optional, defaults to 32) – Batch_size for training or evaluating.

Returns:

Return dataloader for train/valid/test.

Return type:

tuple of torch.utils.data.DataLoader

run_step(batch, device)

The whole process of training one batch (step).

Parameters:
  • batch (tuple of torch.tensor) – The training batch.

  • device (str, optional, defaults to “cuda”) – cuda or cpu.

Returns:

loss

Return type:

torch.tensor

training: bool
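Example (a hedged end-to-end sketch of the JointBert workflow; tableA, tableB, train, valid, and test are placeholder tables/pair sets, the optimizer setup is ordinary PyTorch practice, and whether calc_weights_label must be called before training is an assumption):

>>> model = JointBert(pos_neg_ratio=8, num_classes_multi=1013, device='cuda')
>>> train_loader, valid_loader, test_loader = model.pair_generator(tableA, tableB, train, valid, test)
>>> model.calc_weights_label(train_loader)                   # weights for the entity-identifier loss
>>> optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
>>> for batch in train_loader:
...     optimizer.zero_grad()
...     loss = model.run_step(batch, device='cuda')
...     loss.backward()
...     optimizer.step()
>>> labels, preds = model.predict(test_loader, split='test')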
class matchbench.model.entity_matching.jointbert.JointBertPrep(data, tokenizer_name, max_length)

Bases: Dataset

Prepare data for JointBert. Class Attributes:

  • data (pandas.DataFrame): The data to prepare.

  • tokenizer_name (str): The name of the PLM tokenizer.

  • max_length (int): Max sequence length.

matchbench.model.entity_matching.jointbert.assign_clusterid(identifier, cluster_id_dict, cluster_id_amount)
matchbench.model.entity_matching.jointbert.cluster_id_process(tableA, tableB, train, valid, test)

Get the cluster ids for entities in the train/valid/test datasets.

Parameters:
  • tableA – Source table.

  • tableB – Target table.

  • train (datasets.arrow_dataset.Dataset) – Train pairs.

  • valid (datasets.arrow_dataset.Dataset) – Valid pairs.

  • test (datasets.arrow_dataset.Dataset) – Test pairs.

Returns:

Train dataset with cluster ids. List of str: Pair ids for the valid dataset. pandas.DataFrame: Test dataset with cluster ids.

Return type:

pandas.DataFrame

matchbench.model.entity_matching.jointbert.get_cluster_id(pairs, cluster_id_dict, cluster_id_amount, left_df, right_df)
matchbench.model.entity_matching.jointbert.get_encoder_deepmatcher(train, test)
matchbench.model.entity_matching.jointbert.process_to_bert(dataset, cutting_func=None, multi_encoder=None)

matchbench.model.entity_matching.robem module

class matchbench.model.entity_matching.robem.ASLSingleLabel(gamma_pos=0, gamma_neg=4, eps: float = 0.1, reduction='mean')

Bases: Module

forward(inputs, target)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class matchbench.model.entity_matching.robem.AugMode(value)

Bases: Enum

An enumeration.

RANDOM_DEL = 0
REMOVE_CHAR_LESS_THAN = 2
REMOVE_STOP_WORD = 1
ROB_ALL = 3
ROB_SFL = 5
ROB_SWAP = 4
class matchbench.model.entity_matching.robem.BasicAug(mode: AugMode = AugMode.RANDOM_DEL, **kwargs)

Bases: object

augment(text, num=1)
class matchbench.model.entity_matching.robem.EmDataset(datasrc, datatgt, match, mode='train', transform=None, device=None, sentence_size=256, aug_size=1)

Bases: Dataset

Robem Dataset. Class attributes:

  • datasrc (pandas.DataFrame): Source dataset.

  • datatgt (pandas.DataFrame): Target dataset.

  • match (datasets.arrow_dataset.Dataset): The matching relationship for pairs.

  • mode (str): Mode for data preparation; the options are train, valid, and test.

  • transform (List of tokenizer): The tokenizers used in data preparation, such as transformers.BertTokenizer and transformers.RobertaTokenizer.

  • device (str): Device where data is stored and the model runs.

  • sentence_size (int): Max sentence size.

  • aug_size (int): The number of augmented examples generated per data item.

input_serializer(left, left_full, right, right_full)
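Example (a construction sketch; datasrc, datatgt, and match are placeholder source/target tables and a pair set, and passing the tokenizers returned by RobEM.get_tokenizers as the transform argument is an assumption not confirmed by the docstrings above):

>>> tokenizers = RobEM.get_tokenizers('roberta-base', add_special_token=True)
>>> train_ds = EmDataset(datasrc, datatgt, match, mode='train', transform=tokenizers,
...                      device='cuda', sentence_size=256, aug_size=1)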
class matchbench.model.entity_matching.robem.Highway(input_size)

Bases: Module

forward(input)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class matchbench.model.entity_matching.robem.RobEM(device='cuda', dataprep=None, pretrained_lm='roberta-base', sent_size=256, deep_classifier=False, loss_type='wce', neg_weight=0.2, pos_weight=0.8, amp_scaler=True, addsep=False)

Bases: EMModel

The class for Robem approach

Class attributes:

Data/Feature generation

  • dataprep (torch.utils.data.Dataset or a subclass): The Dataset used for RobEM data preparation.

  • sent_size (int): Max sequence length.

Model architecture / Loss

  • deep_classifier (bool): Whether to enable the deep classifier.

  • loss_type (str): The name of the loss function for the RobEM approach.

  • neg_weight (float): WCE & ASL loss weight for non-match samples.

  • pos_weight (float): WCE & ASL loss weight for match samples.

  • amp_scaler (bool): Whether to use automatic mixed precision.

  • addsep (bool): Whether to add the attribute separator as a special token to the tokenizer and model (LM).

Others

  • device (str): Device where data is stored and the model runs.

  • pretrained_lm (str): The name of the PLM.
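Example (a construction sketch using the documented defaults; the weighted-cross-entropy weights are illustrative):

>>> model = RobEM(device='cuda', pretrained_lm='roberta-base', sent_size=256,
...               deep_classifier=False, loss_type='wce', neg_weight=0.2, pos_weight=0.8,
...               amp_scaler=True, addsep=False)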

context_forward(x)
context_similarity_layers(deep=False)
forward(x1, x2, concat)

Forward function of RobEM.

Parameters:
  • x1 (Tensor) – The left.

  • x2 (Tensor) – The right.

  • concat (Tensor) – The concatenation of left and right.

Returns:

Binary prediction.

Return type:

Tensor

get_lm()
get_lm_class()
get_lm_dim()
static get_tokenizers(lm, add_special_token=True)
has_type_token()
load_source_target(data_src, data_tgt)

Prepare source and target data.

Parameters:
  • data_src (datasets.dataset_dict.DatasetDict) – Source dataset.

  • data_tgt (datasets.dataset_dict.DatasetDict) – Target dataset.

predict(dataloader, split='test', threshold=-1)

Predict the results.

Parameters:
  • dataloader (torch.utils.data.DataLoader) – Dataloader for prediction.

  • split (str, optional, defaults to “test”) – A string indicating which split the dataset is.

  • threshold (float, optional, defaults to -1) – The threshold on the 0-class

Returns:

The true labels. List: The results predicted by the RobEM model.

Return type:

List

prepare_dataloader(dataset, split, batch_size=32)

Prepare dataloaders for training and evaluation.

Parameters:
  • dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.

  • split (str, optional, defaults to “train”) – A string indicating which split the dataset is.

  • batch_size (int, optional, defaults to 32) – Batch_size for training or evaluating.

Returns:

Return dataloader for train/valid/test.

Return type:

tuple of torch.utils.data.DataLoader

reset_weights()
resize_embedding(module, new_len)
run_step(batch, device=device(type='cuda'))

The whole process of training one batch (step).

Parameters:
  • batch (tuple of torch.tensor) – The training batch.

  • device (str, optional, defaults to “cuda”) – cuda or cpu.

Returns:

loss

Return type:

torch.tensor

training: bool
class matchbench.model.entity_matching.robem.RobertaClassificationHead(input_size)

Bases: Module

Head for sentence-level classification tasks.

forward(x)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class matchbench.model.entity_matching.robem.RobustAugmenter(size, cols, sep_key=' [SEP] ')

Bases: object

augment_sent(text, op: AugMode = AugMode.ROB_ALL, fixed_shuffler=True, force_swap=False, attr_key=' ATTR ', disable_shuffle=False, disable_swap=False)
class matchbench.model.entity_matching.robem.SimpleClassifier(hidden_size)

Bases: Module

forward(enc)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class matchbench.model.entity_matching.robem.Summarizer

Bases: object

Class for summarizer.

summarize(text, max_len=256)

Summarize text into a length up to the max sequence length.

Parameters:
  • text (str) – The text to summarize.

  • max_len (int) – Max sequence length.

Returns:

The result after summarizing.

Return type:

str
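Example (a small usage sketch; the input text is a placeholder):

>>> summ = Summarizer()
>>> short = summ.summarize('long serialized entity description ...', max_len=256)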

matchbench.model.entity_matching.robem.cosine_similarity(x1, x2, dim=1, eps=1e-08) Tensor

Returns cosine similarity between x1 and x2, computed along dim.

\[\text{similarity} = \dfrac{x_1 \cdot x_2}{\max(\Vert x_1 \Vert _2 \cdot \Vert x_2 \Vert _2, \epsilon)}\]
Parameters:
  • x1 (Tensor) – First input.

  • x2 (Tensor) – Second input (of size matching x1).

  • dim (int, optional) – Dimension of vectors. Default: 1

  • eps (float, optional) – Small value to avoid division by zero. Default: 1e-8

Shape:
  • Input: \((\ast_1, D, \ast_2)\) where D is at position dim.

  • Output: \((\ast_1, \ast_2)\) where 1 is at position dim.

Example:

>>> input1 = torch.randn(100, 128)
>>> input2 = torch.randn(100, 128)
>>> output = F.cosine_similarity(input1, input2)
>>> print(output)
matchbench.model.entity_matching.robem.set_to_device(x, device)

matchbench.model.entity_matching.rotom module

Module contents