matchbench.model.column_type_annotation package

Submodules

matchbench.model.column_type_annotation.doduo module

class matchbench.model.column_type_annotation.doduo.Doduo(max_length: int = 8, label_num: int = 78, multi_label: bool = False, hidden_dropout_probability: float = 0.5, pretrained_model_name: str = 'bert-base-uncased', loss_function=<class 'torch.nn.modules.loss.CrossEntropyLoss'>)

Bases: CTAModel

forward(data: Tensor, index: Tensor)

Forward pass. self.encoder maps the PLM serialization to a feature vector, and self.matcher maps the feature vector to logits. For PLMs the forward pass runs encode followed by match; for PWEs it runs match only.

label_mapping = {}
predict(batch, device=device(type='cpu'))

Parameters:
  • batch – a batch of data.

Returns:

The prediction for the batch.

prepare_dataloader(dataset, batch_size=16, split='train')

Build a DataLoader for the given split. For PLMs the dataset is serialized; for PWEs it is encoded.

run_step(batch, device=device(type='cpu'))

Parameters:
  • batch – a batch of data.

Returns:

The training loss for the batch.

tokenize(tables: List[DataFrame])
training: bool
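
For orientation, a minimal end-to-end sketch inferred from the signatures above. train_split and test_split are placeholders for MatchBench column-type-annotation dataset splits, and the optimizer and learning rate are assumptions rather than part of the documented API.

import torch
from matchbench.model.column_type_annotation.doduo import Doduo

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Doduo(pretrained_model_name="bert-base-uncased").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed optimizer / learning rate

train_loader = model.prepare_dataloader(train_split, batch_size=16, split="train")
for batch in train_loader:
    optimizer.zero_grad()
    loss = model.run_step(batch, device=device)  # one training step, returns the loss
    loss.backward()
    optimizer.step()

test_loader = model.prepare_dataloader(test_split, batch_size=16, split="test")
predictions = [model.predict(batch, device=device) for batch in test_loader]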

matchbench.model.column_type_annotation.sato module

class matchbench.model.column_type_annotation.sato.CRF(label_num, batch_first)

Bases: Module

decode(emissions: Tensor, mask: Optional[ByteTensor] = None) → List[List[int]]

Find the most likely tag sequence using the Viterbi algorithm.

Parameters:
  • emissions (~torch.Tensor) – Emission score tensor of size (seq_length, batch_size, label_num) if batch_first is False, (batch_size, seq_length, label_num) otherwise.

  • mask (~torch.ByteTensor) – Mask tensor of size (seq_length, batch_size) if batch_first is False, (batch_size, seq_length) otherwise.

Returns:

List of list containing the best tag sequence for each batch.
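
For illustration, a minimal sketch of decoding random emission scores. The label_num, batch_size, and seq_length values are arbitrary, batch_first=True is assumed, and the transition parameters are left at their default initialization.

import torch
from matchbench.model.column_type_annotation.sato import CRF

label_num, batch_size, seq_length = 78, 2, 6
crf = CRF(label_num, batch_first=True)
emissions = torch.randn(batch_size, seq_length, label_num)
mask = torch.ones(batch_size, seq_length, dtype=torch.uint8)  # 1 = real position, 0 = padding
best_tags = crf.decode(emissions, mask=mask)
# best_tags is a list of batch_size lists, each holding seq_length tag indices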

forward(emissions: Tensor, tags: LongTensor, mask: Optional[ByteTensor] = None, reduction: str = 'sum') → Tensor

Compute the conditional log likelihood of a sequence of tags given emission scores.

Parameters:
  • emissions (~torch.Tensor) – Emission score tensor of size (seq_length, batch_size, label_num) if batch_first is False, (batch_size, seq_length, label_num) otherwise.

  • tags (~torch.LongTensor) – Sequence of tags tensor of size (seq_length, batch_size) if batch_first is False, (batch_size, seq_length) otherwise.

  • mask (~torch.ByteTensor) – Mask tensor of size (seq_length, batch_size) if batch_first is False, (batch_size, seq_length) otherwise.

  • reduction – Specifies the reduction to apply to the output: none|sum|mean|token_mean. none: no reduction will be applied. sum: the output will be summed over batches. mean: the output will be averaged over batches. token_mean: the output will be averaged over tokens.

Returns:

The log likelihood. This will have size (batch_size,) if reduction is none, () otherwise.

Return type:

~torch.Tensor
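
The negative of this log likelihood is the usual CRF training loss. A minimal sketch under the same assumptions as the decode example above, repeated here so it runs on its own:

import torch
from matchbench.model.column_type_annotation.sato import CRF

label_num, batch_size, seq_length = 78, 2, 6
crf = CRF(label_num, batch_first=True)
emissions = torch.randn(batch_size, seq_length, label_num, requires_grad=True)
tags = torch.randint(0, label_num, (batch_size, seq_length))
mask = torch.ones(batch_size, seq_length, dtype=torch.uint8)

log_likelihood = crf(emissions, tags, mask=mask, reduction="mean")  # averaged over batches
loss = -log_likelihood
loss.backward()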

init_transition(tables)
reset_parameters()
training: bool

class matchbench.model.column_type_annotation.sato.Sato(fixed_sherlock_params: bool = True, batch_first: bool = True, sherlock_model: ~matchbench.model.column_type_annotation.sherlock.Sherlock = <Sherlock built with default arguments>)

Bases: CTAModel

The default sherlock_model is a Sherlock constructed with its default arguments: four _FeatureEncoder branches mapping the character (960 → 300), word (200 → 200), paragraph (400 → 400), and global (27 → 27) feature groups, whose outputs are concatenated into a 927-dimensional vector and passed through a 927 → 500 → 500 → 78 classification head. Each block applies BatchNorm1d to its input and Dropout(p=0.5) after its first ReLU, and CrossEntropyLoss is the loss function.

label_mapping = {}
parameters()

Returns an iterator over module parameters.

This is typically passed to an optimizer.

Parameters:

recurse (bool) – if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter – module parameter

Example:

>>> for param in model.parameters():
...     print(type(param), param.size())
<class 'torch.Tensor'> torch.Size([20])
<class 'torch.Tensor'> torch.Size([20, 1, 5, 5])
predict(batch, device=device(type='cpu'))

Parameters:
  • batch – a batch of data.

Returns:

The prediction for the batch.

prepare_dataloader(dataset, batch_size=10, split='train')

Build a DataLoader for the given split. For PLMs the dataset is serialized; for PWEs it is encoded.

run_step(batch, reduction='mean', device=device(type='cpu'))

Parameters:
  • batch – a batch of data.

Returns:

The training loss for the batch.

training: bool
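
A minimal training-loop sketch based only on the signatures above. Sato couples a Sherlock column encoder (the sherlock_model argument) with the CRF defined in this module; train_split and test_split are placeholder dataset splits, and the optimizer and learning rate are assumptions.

import torch
from matchbench.model.column_type_annotation.sato import Sato

device = torch.device("cpu")
model = Sato(fixed_sherlock_params=True)  # presumably keeps the Sherlock encoder's parameters fixed
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)  # assumed optimizer / learning rate

train_loader = model.prepare_dataloader(train_split, batch_size=10, split="train")
for batch in train_loader:
    optimizer.zero_grad()
    loss = model.run_step(batch, reduction="mean", device=device)
    loss.backward()
    optimizer.step()

test_loader = model.prepare_dataloader(test_split, batch_size=10, split="test")
predictions = [model.predict(batch, device=device) for batch in test_loader]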

matchbench.model.column_type_annotation.sherlock module

class matchbench.model.column_type_annotation.sherlock.Sherlock(multi_label=False, loss_function=<class 'torch.nn.modules.loss.CrossEntropyLoss'>, process_num: int = 4, row_num: int = 1000, word_embedding_dim: int = 50, paragraph_embedding_dim: int = 400, paragraph_negative_sample: int = 3, paragraph_train_epoch: int = 20, paragraph_min_count: int = 2, label_num: int = 78, embedding_dim: int = 500, dropout_ratio: float = 0.5, topic_dim: ~typing.Optional[int] = None, lda_batch_size: int = 5000, lad_minimum_probability: float = 0.0, lda_long_threshold: int = 0, lda_numeric_rep: str = 'directstr', max_col_count: int = 6, feature_types: ~typing.List[str] = ['charactor', 'word', 'paragraph', 'global'])

Bases: CTAModel

forward(data: Dict[str, Tensor])

Forward pass. self.encoder maps the PLM serialization to a feature vector, and self.matcher maps the feature vector to logits. For PLMs the forward pass runs encode followed by match; for PWEs it runs match only.

label_mapping = {}
predict(batch, device=device(type='cpu'))

Parameters:
  • batch – a batch of data.

Returns:

The prediction for the batch.

prepare_dataloader(dataset, batch_size=16, split='train')

Build a DataLoader for the given split. For PLMs the dataset is serialized; for PWEs it is encoded.

run_step(batch, device=device(type='cpu'))

Parameters:
  • batch – a batch of data.

Returns:

The training loss for the batch.

serialize(dataset)

For PLMs, convert each pair into a PLM serialization. For pretrained word embeddings, no serialization is needed.

training: bool
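
A sketch of a direct forward call. The per-group feature dimensions (960 character, 200 word, 400 paragraph, 27 global) are taken from the default Sherlock architecture shown in the Sato signature above; the dictionary keys are assumptions mirroring the feature_types argument, and the spelling "charactor" follows the library's own identifier.

import torch
from matchbench.model.column_type_annotation.sherlock import Sherlock

model = Sherlock(label_num=78)
batch_size = 4
data = {
    "charactor": torch.randn(batch_size, 960),  # character-level features (key name assumed)
    "word": torch.randn(batch_size, 200),       # word embedding features
    "paragraph": torch.randn(batch_size, 400),  # paragraph embedding features
    "global": torch.randn(batch_size, 27),      # global column statistics
}
logits = model(data)  # expected shape: (batch_size, label_num)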

Module contents