matchbench.model.column_type_annotation package¶
Submodules¶
matchbench.model.column_type_annotation.doduo module¶
- class matchbench.model.column_type_annotation.doduo.Doduo(max_length: int = 8, label_num: int = 78, multi_label: bool = False, hidden_dropout_probability: float = 0.5, pretrained_model_name: str = 'bert-base-uncased', loss_function=<class 'torch.nn.modules.loss.CrossEntropyLoss'>)¶
Bases:
CTAModel
- forward(data: Tensor, index: Tensor)¶
self.encoder maps a PLM serialization to a feature vector, and self.matcher maps the feature vector to logits. For PLMs, forward performs encode followed by match; for pretrained word embeddings (PWEs), it performs match only.
- label_mapping = {}¶
- predict(batch, device=device(type='cpu'))¶
- Parameters:
batch – batch data
- Returns:
prediction
- prepare_dataloader(dataset, batch_size=16, split='train')¶
Prepares a dataloader for the given split: for PLMs, the dataset is serialized; for pretrained word embeddings (PWEs), it is encoded.
- run_step(batch, device=device(type='cpu'))¶
- Parameters:
batch – batch data
- Returns:
loss
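A minimal training and inference sketch for Doduo, not a verbatim matchbench recipe: the dataset object, the AdamW optimizer, the learning rate, and the availability of the standard torch.nn.Module parameters() via CTAModel are assumptions; only the methods documented above are used.
Example:
>>> import torch
>>> from matchbench.model.column_type_annotation.doduo import Doduo
>>> model = Doduo(label_num=78)                                   # defaults as in the signature above
>>> optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)    # illustrative optimizer choice
>>> train_loader = model.prepare_dataloader(dataset, batch_size=16, split='train')  # `dataset` is assumed
>>> for batch in train_loader:
...     loss = model.run_step(batch, device=torch.device('cpu'))  # returns the training loss
...     loss.backward()
...     optimizer.step()
...     optimizer.zero_grad()
>>> predictions = [model.predict(batch) for batch in model.prepare_dataloader(dataset, split='test')]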
matchbench.model.column_type_annotation.sato module¶
- class matchbench.model.column_type_annotation.sato.CRF(label_num, batch_first)¶
Bases:
Module
- decode(emissions: Tensor, mask: Optional[ByteTensor] = None) → List[List[int]]¶
Find the most likely tag sequence using Viterbi algorithm.
- Parameters:
emissions (~torch.Tensor) – Emission score tensor of size (seq_length, batch_size, label_num) if batch_first is False, (batch_size, seq_length, label_num) otherwise.
mask (~torch.ByteTensor) – Mask tensor of size (seq_length, batch_size) if batch_first is False, (batch_size, seq_length) otherwise.
- Returns:
List of lists containing the best tag sequence for each batch element.
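A minimal sketch of decode under the shapes described above, with batch_first=True; the random emission scores and the mask of ones are placeholders for real model outputs.
Example:
>>> import torch
>>> from matchbench.model.column_type_annotation.sato import CRF
>>> crf = CRF(label_num=78, batch_first=True)
>>> emissions = torch.randn(2, 5, 78)                 # (batch_size, seq_length, label_num)
>>> mask = torch.ones(2, 5, dtype=torch.uint8)        # (batch_size, seq_length)
>>> best_tags = crf.decode(emissions, mask=mask)      # one tag sequence per batch element
>>> len(best_tags)
2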
- forward(emissions: Tensor, tags: LongTensor, mask: Optional[ByteTensor] = None, reduction: str = 'sum') → Tensor¶
Compute the conditional log likelihood of a sequence of tags given emission scores.
- Parameters:
emissions (~torch.Tensor) – Emission score tensor of size (seq_length, batch_size, label_num) if batch_first is False, (batch_size, seq_length, label_num) otherwise.
tags (~torch.LongTensor) – Sequence of tags tensor of size (seq_length, batch_size) if batch_first is False, (batch_size, seq_length) otherwise.
mask (~torch.ByteTensor) – Mask tensor of size (seq_length, batch_size) if batch_first is False, (batch_size, seq_length) otherwise.
reduction – Specifies the reduction to apply to the output: none | sum | mean | token_mean. none: no reduction will be applied. sum: the output will be summed over batches. mean: the output will be averaged over batches. token_mean: the output will be averaged over tokens.
- Returns:
The log likelihood. This will have size (batch_size,) if reduction is none, () otherwise.
- Return type:
~torch.Tensor
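A minimal sketch of forward used as a training objective: since it returns a log likelihood, the value to minimize is its negation. Shapes and the reduction argument follow the description above; the random emissions and tags are placeholders.
Example:
>>> import torch
>>> from matchbench.model.column_type_annotation.sato import CRF
>>> crf = CRF(label_num=78, batch_first=True)
>>> emissions = torch.randn(2, 5, 78, requires_grad=True)    # (batch_size, seq_length, label_num)
>>> tags = torch.randint(0, 78, (2, 5), dtype=torch.long)    # gold tag indices per position
>>> log_likelihood = crf(emissions, tags, reduction='mean')  # calls forward
>>> loss = -log_likelihood                                   # maximize likelihood = minimize its negation
>>> loss.backward()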
- init_transition(tables)¶
- reset_parameters()¶
- class matchbench.model.column_type_annotation.sato.Sato(fixed_sherlock_params: bool = True, batch_first: bool = True, sherlock_model: ~matchbench.model.column_type_annotation.sherlock.Sherlock = Sherlock( (loss): CrossEntropyLoss() (FeatureEncoder_charactor): _FeatureEncoder( (bn1): BatchNorm1d(960, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (linear1): Linear(in_features=960, out_features=300, bias=True) (relu1): ReLU() (dp1): Dropout(p=0.5, inplace=False) (linear2): Linear(in_features=300, out_features=300, bias=True) (relu2): ReLU() ) (FeatureEncoder_word): _FeatureEncoder( (bn1): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (linear1): Linear(in_features=200, out_features=200, bias=True) (relu1): ReLU() (dp1): Dropout(p=0.5, inplace=False) (linear2): Linear(in_features=200, out_features=200, bias=True) (relu2): ReLU() ) (FeatureEncoder_paragraph): _FeatureEncoder( (bn1): BatchNorm1d(400, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (linear1): Linear(in_features=400, out_features=400, bias=True) (relu1): ReLU() (dp1): Dropout(p=0.5, inplace=False) (linear2): Linear(in_features=400, out_features=400, bias=True) (relu2): ReLU() ) (FeatureEncoder_global): _FeatureEncoder( (bn1): BatchNorm1d(27, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (linear1): Linear(in_features=27, out_features=27, bias=True) (relu1): ReLU() (dp1): Dropout(p=0.5, inplace=False) (linear2): Linear(in_features=27, out_features=27, bias=True) (relu2): ReLU() ) (bn1): BatchNorm1d(927, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (linear1): Linear(in_features=927, out_features=500, bias=True) (relu1): ReLU() (dp1): Dropout(p=0.5, inplace=False) (linear2): Linear(in_features=500, out_features=500, bias=True) (relu2): ReLU() (linear3): Linear(in_features=500, out_features=78, bias=True) ))¶
Bases:
CTAModel
- label_mapping = {}¶
- parameters()¶
Returns an iterator over module parameters.
This is typically passed to an optimizer.
- Parameters:
recurse (bool) – if True, then yields parameters of this module and all submodules. Otherwise, yields only parameters that are direct members of this module.
- Yields:
Parameter – module parameter
Example:
>>> for param in model.parameters():
>>>     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)
- predict(batch, device=device(type='cpu'))¶
- Parameters:
batch – batch data
- Returns:
prediction
- prepare_dataloader(dataset, batch_size=10, split='train')¶
Prepares a dataloader for the given split: for PLMs, the dataset is serialized; for pretrained word embeddings (PWEs), it is encoded.
- run_step(batch, reduction='mean', device=device(type='cpu'))¶
- Parameters:
batch – batch data
- Returns:
loss
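As with Doduo, a minimal training sketch under the assumption of a matchbench-compatible dataset object; only parameters(), prepare_dataloader(), and run_step() documented above are used, and the optimizer and learning rate are illustrative choices.
Example:
>>> import torch
>>> from matchbench.model.column_type_annotation.sato import Sato
>>> model = Sato(fixed_sherlock_params=True, batch_first=True)
>>> optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # illustrative optimizer choice
>>> train_loader = model.prepare_dataloader(dataset, batch_size=10, split='train')  # `dataset` is assumed
>>> for batch in train_loader:
...     loss = model.run_step(batch, reduction='mean', device=torch.device('cpu'))
...     loss.backward()
...     optimizer.step()
...     optimizer.zero_grad()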
matchbench.model.column_type_annotation.sherlock module¶
- class matchbench.model.column_type_annotation.sherlock.Sherlock(multi_label=False, loss_function=<class 'torch.nn.modules.loss.CrossEntropyLoss'>, process_num: int = 4, row_num: int = 1000, word_embedding_dim: int = 50, paragraph_embedding_dim: int = 400, paragraph_negative_sample: int = 3, paragraph_train_epoch: int = 20, paragraph_min_count: int = 2, label_num: int = 78, embedding_dim: int = 500, dropout_ratio: float = 0.5, topic_dim: ~typing.Optional[int] = None, lda_batch_size: int = 5000, lad_minimum_probability: float = 0.0, lda_long_threshold: int = 0, lda_numeric_rep: str = 'directstr', max_col_count: int = 6, feature_types: ~typing.List[str] = ['charactor', 'word', 'paragraph', 'global'])¶
Bases:
CTAModel
- forward(data: Dict[str, Tensor])¶
self.encoder maps a PLM serialization to a feature vector, and self.matcher maps the feature vector to logits. For PLMs, forward performs encode followed by match; for pretrained word embeddings (PWEs), it performs match only.
- label_mapping = {}¶
- predict(batch, device=device(type='cpu'))¶
- Parameters:
batch – batch data
- Returns:
prediction
- prepare_dataloader(dataset, batch_size=16, split='train')¶
Prepares a dataloader for the given split: for PLMs, the dataset is serialized; for pretrained word embeddings (PWEs), it is encoded.
- run_step(batch, device=device(type='cpu'))¶
- Parameters:
batch – batch data
- Returns:
loss
- serialize(dataset)¶
For PLMs, converts each pair into a PLM serialization; for pretrained word embeddings, no serialization is needed.
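A minimal inference sketch for Sherlock, again assuming a matchbench-compatible dataset object; since Sherlock is a pretrained-word-embedding model, prepare_dataloader performs the feature encoding and no explicit serialize() call is needed.
Example:
>>> import torch
>>> from matchbench.model.column_type_annotation.sherlock import Sherlock
>>> model = Sherlock(label_num=78)                    # defaults as in the signature above
>>> test_loader = model.prepare_dataloader(dataset, batch_size=16, split='test')  # `dataset` is assumed
>>> predictions = [model.predict(batch, device=torch.device('cpu')) for batch in test_loader]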