matchbench.model.column_type_annotation package

Submodules

matchbench.model.column_type_annotation.doduo module

class matchbench.model.column_type_annotation.doduo.Doduo(max_length: int = 8, label_num: int = 78, multi_label: bool = False, hidden_dropout_probability: float = 0.5, pretrained_model_name: str = 'bert-base-uncased', loss_function=<class 'torch.nn.modules.loss.CrossEntropyLoss'>)

Bases: CTAModel

forward(data: Tensor, index: Tensor)

Forward pass. self.encoder maps the PLM serialization to a feature vector, and self.matcher maps the feature vector to logits. For PLMs the forward pass runs encode followed by match; for PWEs it runs match only.

label_mapping = {}
predict(batch, device=device(type='cpu'))

Parameters:
  • batch – a batch of data.

Returns:

The prediction for the batch.

prepare_dataloader(dataset, batch_size=16, split='train')

Build a DataLoader for the given split. For PLMs the dataset is serialized; for PWEs it is encoded.

run_step(batch, device=device(type='cpu'))

Parameters:
  • batch – a batch of data.

Returns:

The training loss for the batch.

tokenize(tables: List[DataFrame])
training: bool
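
For orientation, a minimal end-to-end sketch inferred from the signatures above. train_split and test_split are placeholders for MatchBench column-type-annotation dataset splits, and the optimizer and learning rate are assumptions rather than part of the documented API.

import torch
from matchbench.model.column_type_annotation.doduo import Doduo

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Doduo(pretrained_model_name="bert-base-uncased").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed optimizer / learning rate

train_loader = model.prepare_dataloader(train_split, batch_size=16, split="train")
for batch in train_loader:
    optimizer.zero_grad()
    loss = model.run_step(batch, device=device)  # one training step, returns the loss
    loss.backward()
    optimizer.step()

test_loader = model.prepare_dataloader(test_split, batch_size=16, split="test")
predictions = [model.predict(batch, device=device) for batch in test_loader]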

matchbench.model.column_type_annotation.sato module

class matchbench.model.column_type_annotation.sato.CRF(label_num, batch_first)

Bases: Module

decode(emissions: Tensor, mask: Optional[ByteTensor] = None) → List[List[int]]

Find the most likely tag sequence using the Viterbi algorithm.

Parameters:
  • emissions (~torch.Tensor) – Emission score tensor of size (seq_length, batch_size, label_num) if batch_first is False, (batch_size, seq_length, label_num) otherwise.

  • mask (~torch.ByteTensor) – Mask tensor of size (seq_length, batch_size) if batch_first is False, (batch_size, seq_length) otherwise.

Returns:

List of list containing the best tag sequence for each batch.
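
For illustration, a minimal sketch of decoding random emission scores. The label_num, batch_size, and seq_length values are arbitrary, batch_first=True is assumed, and the transition parameters are left at their default initialization.

import torch
from matchbench.model.column_type_annotation.sato import CRF

label_num, batch_size, seq_length = 78, 2, 6
crf = CRF(label_num, batch_first=True)
emissions = torch.randn(batch_size, seq_length, label_num)
mask = torch.ones(batch_size, seq_length, dtype=torch.uint8)  # 1 = real position, 0 = padding
best_tags = crf.decode(emissions, mask=mask)
# best_tags is a list of batch_size lists, each holding seq_length tag indices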

forward(emissions: Tensor, tags: LongTensor, mask: Optional[ByteTensor] = None, reduction: str = 'sum') → Tensor

Compute the conditional log likelihood of a sequence of tags given emission scores.

Parameters:
  • emissions (~torch.Tensor) – Emission score tensor of size (seq_length, batch_size, label_num) if batch_first is False, (batch_size, seq_length, label_num) otherwise.

  • tags (~torch.LongTensor) – Sequence of tags tensor of size (seq_length, batch_size) if batch_first is False, (batch_size, seq_length) otherwise.

  • mask (~torch.ByteTensor) – Mask tensor of size (seq_length, batch_size) if batch_first is False, (batch_size, seq_length) otherwise.

  • reduction – Specifies the reduction to apply to the output: none|sum|mean|token_mean. none: no reduction will be applied. sum: the output will be summed over batches. mean: the output will be averaged over batches. token_mean: the output will be averaged over tokens.

Returns:

The log likelihood. This will have size (batch_size,) if reduction is none, () otherwise.

Return type:

~torch.Tensor
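
The negative of this log likelihood is the usual CRF training loss. A minimal sketch under the same assumptions as the decode example above, repeated here so it runs on its own:

import torch
from matchbench.model.column_type_annotation.sato import CRF

label_num, batch_size, seq_length = 78, 2, 6
crf = CRF(label_num, batch_first=True)
emissions = torch.randn(batch_size, seq_length, label_num, requires_grad=True)
tags = torch.randint(0, label_num, (batch_size, seq_length))
mask = torch.ones(batch_size, seq_length, dtype=torch.uint8)

log_likelihood = crf(emissions, tags, mask=mask, reduction="mean")  # averaged over batches
loss = -log_likelihood
loss.backward()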

init_transition(tables)
reset_parameters()
training: bool

class matchbench.model.column_type_annotation.sato.Sato(fixed_sherlock_params: bool = True, batch_first: bool = True, sherlock_model: ~matchbench.model.column_type_annotation.sherlock.Sherlock = <Sherlock built with default arguments>)

Bases: CTAModel

The default sherlock_model is a Sherlock constructed with its default arguments: four _FeatureEncoder branches mapping the character (960 → 300), word (200 → 200), paragraph (400 → 400), and global (27 → 27) feature groups, whose outputs are concatenated into a 927-dimensional vector and passed through a 927 → 500 → 500 → 78 classification head. Each block applies BatchNorm1d to its input and Dropout(p=0.5) after its first ReLU, and CrossEntropyLoss is the loss function.

label_mapping = {}
parameters()

Returns an iterator over module parameters.

This is typically passed to an optimizer.

Parameters:

recurse (bool) – if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.

Yields:

Parameter – module parameter

Example:

>>> for param in model.parameters():
...     print(type(param), param.size())
<class 'torch.Tensor'> torch.Size([20])
<class 'torch.Tensor'> torch.Size([20, 1, 5, 5])
predict(batch, device=device(type='cpu'))

Parameters:
  • batch – a batch of data.

Returns:

The prediction for the batch.

prepare_dataloader(dataset, batch_size=10, split='train')

Build a DataLoader for the given split. For PLMs the dataset is serialized; for PWEs it is encoded.

run_step(batch, reduction='mean', device=device(type='cpu'))

Parameters:
  • batch – a batch of data.

Returns:

The training loss for the batch.

training: bool
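
A minimal training-loop sketch based only on the signatures above. Sato couples a Sherlock column encoder (the sherlock_model argument) with the CRF defined in this module; train_split and test_split are placeholder dataset splits, and the optimizer and learning rate are assumptions.

import torch
from matchbench.model.column_type_annotation.sato import Sato

device = torch.device("cpu")
model = Sato(fixed_sherlock_params=True)  # presumably keeps the Sherlock encoder's parameters fixed
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)  # assumed optimizer / learning rate

train_loader = model.prepare_dataloader(train_split, batch_size=10, split="train")
for batch in train_loader:
    optimizer.zero_grad()
    loss = model.run_step(batch, reduction="mean", device=device)
    loss.backward()
    optimizer.step()

test_loader = model.prepare_dataloader(test_split, batch_size=10, split="test")
predictions = [model.predict(batch, device=device) for batch in test_loader]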

matchbench.model.column_type_annotation.sherlock module

class matchbench.model.column_type_annotation.sherlock.Sherlock(multi_label=False, loss_function=<class 'torch.nn.modules.loss.CrossEntropyLoss'>, process_num: int = 4, row_num: int = 1000, word_embedding_dim: int = 50, paragraph_embedding_dim: int = 400, paragraph_negative_sample: int = 3, paragraph_train_epoch: int = 20, paragraph_min_count: int = 2, label_num: int = 78, embedding_dim: int = 500, dropout_ratio: float = 0.5, topic_dim: ~typing.Optional[int] = None, lda_batch_size: int = 5000, lad_minimum_probability: float = 0.0, lda_long_threshold: int = 0, lda_numeric_rep: str = 'directstr', max_col_count: int = 6, feature_types: ~typing.List[str] = ['charactor', 'word', 'paragraph', 'global'])

Bases: CTAModel

forward(data: Dict[str, Tensor])

Forward pass. self.encoder maps the PLM serialization to a feature vector, and self.matcher maps the feature vector to logits. For PLMs the forward pass runs encode followed by match; for PWEs it runs match only.

label_mapping = {}
predict(batch, device=device(type='cpu'))

Parameters:
  • batch – a batch of data.

Returns:

The prediction for the batch.

prepare_dataloader(dataset, batch_size=16, split='train')

Build a DataLoader for the given split. For PLMs the dataset is serialized; for PWEs it is encoded.

run_step(batch, device=device(type='cpu'))

Parameters:
  • batch – a batch of data.

Returns:

The training loss for the batch.

serialize(dataset)

For PLMs, convert each pair into a PLM serialization. For pretrained word embeddings, no serialization is needed.

training: bool
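
A sketch of a direct forward call. The per-group feature dimensions (960 character, 200 word, 400 paragraph, 27 global) are taken from the default Sherlock architecture shown in the Sato signature above; the dictionary keys are assumptions mirroring the feature_types argument, and the spelling "charactor" follows the library's own identifier.

import torch
from matchbench.model.column_type_annotation.sherlock import Sherlock

model = Sherlock(label_num=78)
batch_size = 4
data = {
    "charactor": torch.randn(batch_size, 960),  # character-level features (key name assumed)
    "word": torch.randn(batch_size, 200),       # word embedding features
    "paragraph": torch.randn(batch_size, 400),  # paragraph embedding features
    "global": torch.randn(batch_size, 27),      # global column statistics
}
logits = model(data)  # expected shape: (batch_size, label_num)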

Module contents