matchbench.model.entity_alignment package¶
Submodules¶
matchbench.model.entity_alignment.Dual_AMN module¶
- class matchbench.model.entity_alignment.Dual_AMN.Dual_AMN(triple_size=353543, node_size=39654, new_node_size=39654, rel_size=4224, nearest_sample_num=128, alpha=0.1, beta=0.1, gamma=1.0, depth=2, node_hidden=128, rel_hidden=128, dropout_rate=0.3, ind_dropout_rate=0.3, device=device(type='cuda', index=0))¶
Bases:
EAModel
The class for the Dual_AMN approach.
- Class attributes:
Data/Feature generation: triple_size (int): The size of triples in the KG. node_size (int): The size of nodes in the KG. new_node_size (int): The size of nodes calculated from all triples. rel_size (int): The size of relations in the KG. nearest_sample_num (int): The number of candidates generated during negative sampling.
Model architecture / Loss: alpha (float): The value of alpha. beta (float): The value of beta. gamma (float): The value of gamma. depth (int): The layer depth of NR_GraphAttention. node_hidden (int): The dimension of node representations. rel_hidden (int): The dimension of relation representations. dropout_rate (float): The dropout probability passed to nn.Dropout.
Others: device (torch.device, optional, defaults to “cuda”): cuda or cpu.
- encode(batch=None)¶
Convert a batch of entity token ids to a batch of entity vector embeddings. In the default scenario, Dual_AMN encodes all entities in the KG.
- Parameters:
batch (a dict of torch.tensor) – Keys: input_ids and attention_mask.
- Returns:
The entity embedding.
- Return type:
torch.tensor
- get_emb(loader, device=device(type='cuda'))¶
Convert a list of entity token ids to a list of embeddings after the encoder.
- Parameters:
loader (torch.utils.data.DataLoader) –
device (torch.device, optional, defaults to “cuda”) – cuda or cpu.
- Returns:
the output embeddings of the encoder.
- Return type:
torch.tensor
- load_source_target(dataset_src, dataset_tgt, stage=1)¶
Prepare source and target data and initialize the Dual_AMN model.
- Parameters:
dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.
dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.
- predict(dataloader, device=device(type='cuda'))¶
Predict the closest entities for each entity in the test set.
- Parameters:
dataloader (torch.utils.data.DataLoader) –
- Returns:
The result.
- Return type:
numpy.array
- prepare_dataloader(dataset, split='train', batch_size=128, stage=None, mid_file_dir=None)¶
- For split train, prepare the dataloaders for training.
For split valid and test, prepare the dataloaders only for evaluation.
- Parameters:
dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.
split (str, optional, defaults to “train”) – A string indicating which split the dataset is.
stage (int, optional, defaults to 1) – Dual_AMN has only one training stage.
batch_size (int, optional, defaults to 128) – Batch size for the following negative sampling or training.
mid_file_dir (str, optional, defaults to “middle_file/”) – The directory where intermediate processed files are stored.
- Returns:
For split train, return torch.utils.data.DataLoader. For split valid and test, return a tuple of torch.utils.data.DataLoader.
- run_step(train_pairs: Tensor, device)¶
The whole process of training one batch (step).
- Parameters:
train_pairs (torch.tensor) –
device (str, optional, defaults to “cuda”) – cuda or cpu.
- Returns:
loss
- Return type:
torch.tensor
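Putting the methods above together, a typical Dual_AMN run looks roughly like the sketch below. This is a minimal illustration: src_dataset, tgt_dataset, train_pairs and test_pairs are placeholders for pre-loaded datasets.arrow_dataset.Dataset objects, the epoch count is arbitrary, and the matchbench trainer that normally drives these calls may wire them differently.

```python
import torch
from matchbench.model.entity_alignment.Dual_AMN import Dual_AMN

# Illustrative sketch only; dataset variables are assumed to be pre-loaded.
device = torch.device("cuda")
model = Dual_AMN(device=device)
model.load_source_target(src_dataset, tgt_dataset)

train_loader = model.prepare_dataloader(train_pairs, split="train")
test_loader = model.prepare_dataloader(test_pairs, split="test")

for epoch in range(20):
    for batch in train_loader:
        loss = model.run_step(batch, device)  # one training step per batch

result = model.predict(test_loader, device=device)
```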
- class matchbench.model.entity_alignment.Dual_AMN.NR_GraphAttention(node_size, rel_size, triple_size, node_dim, depth=1, attn_heads=1, attn_heads_reduction='concat', use_bias=False)¶
Bases:
Module
The class for the NR_GraphAttention encoder unit (shared with the RREA approach).
- Class attributes:
node_size (int): The size of nodes in the KG. rel_size (int): The size of relations in the KG. triple_size (int): The size of triples in the KG. node_dim (int): The node representation dimension of NR_GraphAttention. depth (int): The layer depth of NR_GraphAttention. attn_heads (int): The number of attention heads in NR_GraphAttention. attn_heads_reduction (str): The reduction method for attention heads. use_bias (bool): Whether the model uses bias.
- forward(inputs)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
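This note recurs for every forward method below. The distinction is plain PyTorch behavior, easiest to see in a tiny self-contained snippet (independent of matchbench):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
layer.register_forward_hook(lambda module, inputs, output: print("hook fired"))

x = torch.randn(1, 4)
layer(x)           # prints "hook fired": __call__ runs registered hooks
layer.forward(x)   # silent: calling forward directly bypasses the hooks
```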
matchbench.model.entity_alignment.LargeEA module¶
matchbench.model.entity_alignment.bertint module¶
- class matchbench.model.entity_alignment.bertint.Basic_Bert_Unit_model(plm_path, basic_input_dim, basic_output_dim, dropout=0.1, encoder_model=None)¶
Bases:
Module
The class for the BertInt approach's Basic_Bert_Unit_model encoder unit.
- Class attributes:
plm_path (str): The local pretrained language model path. basic_input_dim (int): The input dimension of Basic_Bert_Unit_model. basic_output_dim (int): The output dimension of Basic_Bert_Unit_model. dropout (float): The dropout probability passed to nn.Dropout. encoder_model (transformers.PreTrainedModel): User-defined PLM model.
- forward(batch)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class matchbench.model.entity_alignment.bertint.BertInt(if_neg_sample=True, if_neg_sample_2=False, max_seq_length=128, nearest_sample_num=128, test_topk=50, basic_input_dim=768, basic_output_dim=300, mlp_input_dim=85, mlp_hidden_dim=11, dropout=0.1, margin_1=3, margin_2=1, plm_path=None, tokenizer=None, encodermodel=None)¶
Bases:
EAModel
The class for the BertInt approach.
- Class attributes:
Data/Feature generation: if_neg_sample (bool): Whether the approach needs negative sampling. max_seq_length (int): The max length of tokens fed into PLMs. get_emb_batch (int): The batch size when getting the PLM embedding of all the entities. nearest_sample_num (int): The number of candidates generated during negative sampling. test_topk (int): The maximum number of top-ranked candidates considered when testing.
Model architecture / Loss: basic_input_dim (int): The input dimension of Basic_Bert_Unit_model. basic_output_dim (int): The output dimension of Basic_Bert_Unit_model. mlp_input_dim (int): The input dimension of the MLP. mlp_hidden_dim (int): The output dimension of the MLP. dropout (float): The dropout probability passed to nn.Dropout. margin_1 (int): The margin in MarginRankingLoss in stage 1. margin_2 (int): The margin in MarginRankingLoss in stage 2.
Others: plm_path (str): The local pretrained language model path. tokenizer (transformers.Tokenizer): User-defined PLM tokenizer. encodermodel (transformers.PreTrainedModel): User-defined PLM model.
- calculate_loss(pos_score, neg_score, label)¶
Calculate the loss of the batch.
- Parameters:
pos_score (torch.tensor) –
neg_score (torch.tensor) –
label (torch.tensor) –
- Returns:
loss
- Return type:
torch.tensor
- chuliyixia(train, test)¶
- compute_metric_stage_2(prediction)¶
Calculate the hits in stage 2.
- Parameters:
prediction (torch.tensor) – The top indexes of each testing entity.
- Returns:
hits@1
- Return type:
float
- encode(batch)¶
Convert a batch of entity token ids to a batch of entity vector embeddings.
- Parameters:
batch (a dict of torch.tensor) – Keys: input_ids and attention_mask.
- Returns:
The entity embedding.
- Return type:
torch.tensor
- forward(batch, device=device(type='cuda'))¶
Convert a batch of four kinds of entity token ids to a batch of positive/negative scores.
- Parameters:
batch (a dict of torch.tensor) – Keys: pos1, pos2, neg1, neg2.
- Returns:
positive/negative scores.
- Return type:
tuple of torch.tensor
- load_source_target(dataset_src, dataset_tgt)¶
Prepare source and target data.
- Parameters:
dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.
dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.
- matcher(feature)¶
Applied in stage 2: compute the final score from the relational features.
- Parameters:
feature (torch.tensor) – Relational features.
- Returns:
The final result.
- Return type:
torch.tensor
- negative_sample(block_loaders, device=device(type='cuda'), batch_size=24)¶
Negative sampling.
- Parameters:
block_loaders (tuple of torch.utils.data.DataLoader) – The four loaders described in prepare_dataloader.
device (torch.device, optional, defaults to “cuda”) – cuda or cpu.
batch_size (int, optional, defaults to 24) – batch_size for the following pairwise training.
- Returns:
dataloader for the following pairwise training.
- Return type:
torch.utils.data.DataLoader
- predict_stage_2(dataloader, device=device(type='cuda'))¶
Predict the result in stage 2, overriding predict_stage_2 in base_model.
- Parameters:
dataloader (torch.utils.data.DataLoader) – Dataloader for prediction.
- Returns:
The result.
- Return type:
List
- prepare_dataloader(dataset, split='train', stage=1, batch_size=128, mid_file_dir='bertint_middle_file/')¶
- In stage 1, for split train, prepare the dataloaders for negative sampling.
For split valid and test, prepare the dataloaders only for evaluation. In stage 2, prepare dataloaders for training and evaluation.
- Parameters:
dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.
split (str, optional, defaults to “train”) – A string indicating which split the dataset is.
stage (int, optional, defaults to 1) – BertInt has two training stages. An integer indicating which stage the training process is in.
batch_size (int, optional, defaults to 128) – Batch size for the following negative sampling or training.
mid_file_dir (str, optional, defaults to “bertint_middle_file/”) – The directory where intermediate processed files are stored.
- Returns:
In stage 1, for split train, return two train loaders and two all-entities loaders. For split valid and test, return two valid/test loaders. In stage 2, return a RelationalDataloader for split train and a torch.utils.data.DataLoader for the others.
- Return type:
tuple of torch.utils.data.DataLoader
- run_step(batch, stage=1, device=device(type='cuda'))¶
The whole process of training one batch (step).
- Parameters:
batch (tuple of torch.tensor) –
stage (int, optional, defaults to 1) –
device (str, optional, defaults to “cuda”) – cuda or cpu.
- Returns:
loss
- Return type:
torch.tensor
- serialize(des)¶
Convert entity descriptions to token ids which can be directly fed into PLMs.
- Parameters:
des (dict) – Key: entity name. Value: entity description.
- Returns:
Key: entity id. Value: entity token ids.
- Return type:
dict
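As a rough illustration of what serialize produces, a Hugging Face tokenizer maps each description to input_ids and attention_mask keyed by entity id. The sketch below is an assumption about the general shape of the output, not BertInt's exact implementation; the entity-to-id mapping is hypothetical.

```python
from transformers import AutoTokenizer

# Hypothetical stand-in for the PLM tokenizer configured on the model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")

des = {"Paris": "Paris is the capital of France."}  # entity name -> description
ent2id = {"Paris": 0}                               # assumed name-to-id mapping

serialized = {
    ent2id[name]: tokenizer(text, max_length=128,
                            truncation=True, padding="max_length")
    for name, text in des.items()
}
# serialized[0]["input_ids"] / serialized[0]["attention_mask"] can then be
# batched and fed into the PLM.
```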
- class matchbench.model.entity_alignment.bertint.MLP(mlp_input_dim, mlp_hidden_dim)¶
Bases:
Module
The class for the BertInt approach's MLP matcher unit.
- Class attributes:
mlp_input_dim (int): The input dimension of the MLP. mlp_hidden_dim (int): The output dimension of the MLP.
- forward(features)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class matchbench.model.entity_alignment.bertint.PairwiseDataset(train_tups, ent2data)¶
Bases:
Dataset
The class of Dataset used for training in stage 1.
- Class attributes:
train_tups (list): Training index pairs. ent2data (dict): Mapping from entity id to entity token data.
- class matchbench.model.entity_alignment.bertint.RelationalDataloader(train_ill, train_candidate, entpair2f_idx, all_features, neg_num, batch_size)¶
Bases:
object
The dataloader class used for training in stage 2.
- train_pair_index_gene()¶
Generate training data (entity indexes).
- matchbench.model.entity_alignment.bertint.all_entity_pairs_gene(candidate_dict_list, ill_pair_list)¶
- matchbench.model.entity_alignment.bertint.attributeValue_emb_gene(l_set, model, Tokenizer, batch_size, max_length, device=device(type='cuda'))¶
Generate attributeValue embeddings with the basic BERT unit.
- matchbench.model.entity_alignment.bertint.attributeView_interaction_F_gene(ent_pairs, value_emb_list, ent2valueids, value_pad_id, kernel_num=21, batch_size=512, device=device(type='cuda'))¶
Attribute-View Interaction: use Dual Aggregation and Attribute-View Interaction to generate similarity features between entity pairs. Returns entity pairs and the features between them.
- matchbench.model.entity_alignment.bertint.batch_dual_aggregation_feature_gene(batch_sim_matrix, mus, sigmas, attn_ne1, attn_ne2)¶
Dual Aggregation: [similarity matrix -> feature].
- Parameters:
batch_sim_matrix – [B, ne1, ne2]
mus – [1, 1, k (kernel_num)]
sigmas – [1, 1, k]
attn_ne1 – [B, ne1, 1]
attn_ne2 – [B, ne2, 1]
- Returns:
feature: [B, kernel_num * 2]
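The shape hints above suggest KNRM-style Gaussian kernel pooling. A minimal dense sketch under that assumption (not necessarily the library's exact pooling):

```python
import torch

def dual_aggregation(batch_sim_matrix, mus, sigmas, attn_ne1, attn_ne2):
    # batch_sim_matrix: [B, ne1, ne2]; mus/sigmas: [1, 1, k]; attn_ne*: [B, ne*, 1]
    s = batch_sim_matrix.unsqueeze(-1)                    # [B, ne1, ne2, 1]
    k = torch.exp(-((s - mus) ** 2) / (2 * sigmas ** 2))  # [B, ne1, ne2, k]

    # Pool kernel responses over each side, weighted by entity attention.
    # log1p is used here for numerical safety; the exact pooling may differ.
    feat_1 = (torch.log1p(k.sum(dim=2)) * attn_ne1).sum(dim=1)  # [B, k]
    feat_2 = (torch.log1p(k.sum(dim=1)) * attn_ne2).sum(dim=1)  # [B, k]
    return torch.cat([feat_1, feat_2], dim=-1)                  # [B, 2k]
```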
- matchbench.model.entity_alignment.bertint.candidate_generate(ents1, ents2, ent_emb, candidate_topk=50, bs=32, device=device(type='cuda'))¶
Return a dict mapping each entity to its candidates (entities likely to be aligned).
- matchbench.model.entity_alignment.bertint.clean_attribute_data(dataset_src, dataset_tgt, mid_file_dir='bertint_middle_file/')¶
- matchbench.model.entity_alignment.bertint.desornameView_interaction_F_gene(ent_pairs, e_emb_list, batch_size=512, device=device(type='cuda'))¶
- matchbench.model.entity_alignment.bertint.dump_other_data(train_ents1, train_ents2, test_ents1, test_ents2, ent2data, mid_file_dir)¶
- matchbench.model.entity_alignment.bertint.ent2attributeValues_gene(entid_list, att_datas, max_length, pad_value=None)¶
Get the attribute values of each entity. Return a dict mapping each entity to its (padded) attribute values.
- matchbench.model.entity_alignment.bertint.get_attributeValue_embedding(model, batch_size=256, mid_file_dir='bertint_middle_file/', device=device(type='cuda'))¶
- matchbench.model.entity_alignment.bertint.get_attributeView_interaction_feature(model, mid_file_dir='bertint_middle_file/', device=device(type='cuda'))¶
- matchbench.model.entity_alignment.bertint.get_attribute_value_type(value, value_type)¶
- matchbench.model.entity_alignment.bertint.get_entity_embedding(model, batch_size=256, mid_file_dir='bertint_middle_file/', device=device(type='cuda'))¶
- matchbench.model.entity_alignment.bertint.get_neighView_and_desView_interaction_feature(model, dataset_src, dataset_tgt, batch_size=256, mid_file_dir='bertint_middle_file/', device=device(type='cuda'))¶
- matchbench.model.entity_alignment.bertint.get_tokens_of_value(vaule_list, Tokenizer, max_length)¶
- matchbench.model.entity_alignment.bertint.kernel_mus(n_kernels)¶
- matchbench.model.entity_alignment.bertint.kernel_sigmas(n_kernels)¶
- matchbench.model.entity_alignment.bertint.neigh_ent_dict_gene(rel_triples, max_length, pad_id=None)¶
Get the one-hop neighbors of each entity. Return a dict mapping each entity to its (padded) neighbors.
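A minimal sketch of how such a padded one-hop neighbor dictionary can be built (illustrative only; the pad id and truncation policy are assumptions):

```python
from collections import defaultdict

def build_neigh_dict(rel_triples, max_length, pad_id):
    # rel_triples: iterable of (head, relation, tail) id triples.
    neigh = defaultdict(list)
    for h, _, t in rel_triples:
        neigh[h].append(t)
        neigh[t].append(h)  # one-hop neighbors in both directions

    # Truncate to max_length, then pad shorter lists with pad_id.
    return {e: (ns[:max_length] + [pad_id] * (max_length - len(ns)))[:max_length]
            for e, ns in neigh.items()}

print(build_neigh_dict([(0, 10, 1), (0, 11, 2), (3, 10, 0)], max_length=4, pad_id=-1))
# {0: [1, 2, 3, -1], 1: [0, -1, -1, -1], 2: [0, -1, -1, -1], 3: [0, -1, -1, -1]}
```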
- matchbench.model.entity_alignment.bertint.neighborView_interaction_F_gene(ent_pairs, ent_emb_list, neigh_dict, ent_pad_id, kernel_num=21, batch_size=512, device=device(type='cuda'))¶
Neighbor-View Interaction: use Dual Aggregation and Neighbor-View Interaction to generate similarity features between entity pairs. Returns entity pairs and the features between them.
- matchbench.model.entity_alignment.bertint.padding_to_longest(token_list, Tokenizer)¶
- matchbench.model.entity_alignment.bertint.read_att_data(dataset)¶
Load the attribute triples file.
- matchbench.model.entity_alignment.bertint.read_attribute_datas(kg1_att_file_name, kg2_att_file_name, entity_list, entity2index, add_name_as_attTriples=True)¶
Return a list of attribute triples: [(entity_id, attribute, attributeValue, type of attributeValue)].
- matchbench.model.entity_alignment.bertint.remove_one_to_N_att_data_by_threshold(ori_keep_data, ori_remove_data, one2N_threshold)¶
Filter noisy attribute triples based on a threshold.
- matchbench.model.entity_alignment.bertint.sort_a(data_list)¶
Sort.
- matchbench.model.entity_alignment.bertint.test_read_emb(ent_emb, train_ill, test_ill, bs=128, candidate_topk=50)¶
matchbench.model.entity_alignment.kecg module¶
- class matchbench.model.entity_alignment.kecg.GAT(n_units, n_heads, dropout, attn_dropout, instance_normalization, diag)¶
Bases:
Module
- forward(x, adj)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class matchbench.model.entity_alignment.kecg.MultiHeadGraphAttention(n_head, f_in, f_out, attn_dropout, diag=True, init=None, bias=False)¶
Bases:
Module
- forward(input, adj)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class matchbench.model.entity_alignment.kecg.SpecialSpmm¶
Bases:
Module
- forward(indices, values, shape, b)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class matchbench.model.entity_alignment.kecg.SpecialSpmmFunction¶
Bases:
Function
Special function for the sparse-region backpropagation layer only.
- static backward(ctx, grad_output)¶
Defines a formula for differentiating the operation.
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by as many outputs as forward() returned, and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input.
The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.
- static forward(ctx, indices, values, shape, b)¶
Performs the operation.
This function is to be overridden by all subclasses.
It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).
The context can be used to store tensors that can be then retrieved during the backward pass.
matchbench.model.entity_alignment.lightea module¶
- class matchbench.model.entity_alignment.lightea.LightEA(ent_dim=1024, depth=2, top_k=500, predict_epochs=10, using_name_features=True)¶
Bases:
EAModel
The class for the LightEA approach.
- Parameters:
ent_dim (int, optional, defaults to 1024) –
depth (int, optional, defaults to 2) –
top_k (int, optional, defaults to 500) –
predict_epochs (int, optional, defaults to 10) –
using_name_features (bool, optional, defaults to True) –
- batch_sparse_matmul(sparse_tensor, dense_tensor, batch_size=128, save_mem=False)¶
- compute_metric(result, stage=1)¶
Calculate the hits.
- Parameters:
result (torch.tensor) – The top indexes of each testing entity.
stage (int, optional, defaults to 1) –
- Returns:
hits@1
- Return type:
float
- get_features(train_pair, extra_feature=None)¶
- load_source_target(dataset_src, dataset_tgt, mid_file_dir='middle_file/', has_mid_files=False)¶
Prepare source and target data.
- Parameters:
dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.
dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.
mid_file_dir (str, optional, defaults to “middle_file/”) – The directory where intermediate processed files are stored.
has_mid_files (bool) – Whether the middle files have already been generated.
- predict(dataset, stage=1, train_dataloader=None)¶
- For the LightEA approach, the training and testing processes are implemented together and cannot be separated.
- Parameters:
dataset (datasets.arrow_dataset.Dataset) – Train/test pairs.
stage (int, optional, defaults to 1) – Unused in LightEA.
- Returns:
The hits@1 score.
- Return type:
float
- prepare_dataloader(dataset, split='test', batch_size=32, stage=1)¶
- For split train, prepare the pairs for training.
For split test, prepare the pairs for testing.
- Parameters:
dataset (datasets.arrow_dataset.Dataset) – Train or test pairs.
split (str, optional, defaults to “test”) –
batch_size (int, optional, defaults to 32) –
stage (int, optional, defaults to 1) –
- Returns:
Train or test pairs.
- Return type:
np.array
- random_projection(x, out_dim)¶
- segment_sum(data, segment_ids)¶
- sparse_sinkhorn_sims(left, right, features, top_k=500, iteration=15, mode='test', epoch=0)¶
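The name sparse_sinkhorn_sims points to Sinkhorn normalization over candidate similarities. A dense toy version of the Sinkhorn iteration, with an assumed temperature hyper-parameter and without the sparse top-k machinery, looks like this:

```python
import torch

def sinkhorn(sims, iteration=15, temperature=0.05):
    # sims: [n, n] dense similarity matrix; temperature is an assumed
    # hyper-parameter controlling the sharpness of the matching.
    p = torch.exp(sims / temperature)
    for _ in range(iteration):
        p = p / p.sum(dim=1, keepdim=True)  # row normalization
        p = p / p.sum(dim=0, keepdim=True)  # column normalization
    return p  # approximately doubly-stochastic matching scores
```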
- test(test_pair, features, top_k=500, iteration=15)¶
- matchbench.model.entity_alignment.lightea.load_aligned_pair(file_path, ratio=0.3)¶
- matchbench.model.entity_alignment.lightea.load_graph(path)¶
- matchbench.model.entity_alignment.lightea.load_name_features(dataset_src, dataset_tgt, vector_path, mode='word-level', mid_file_dir='middle_file/')¶
matchbench.model.entity_alignment.rrea module¶
- class matchbench.model.entity_alignment.rrea.NR_GraphAttention(node_size, rel_size, triple_size, node_dim, depth=1, attn_heads=1, attn_heads_reduction='concat', use_bias=False, activation='relu')¶
Bases:
Module
The class for the RREA approach's NR_GraphAttention encoder unit.
- Class attributes:
node_size (int): The size of nodes in the KG. rel_size (int): The size of relations in the KG. triple_size (int): The size of triples in the KG. node_dim (int): The node representation dimension of NR_GraphAttention. depth (int): The layer depth of NR_GraphAttention. attn_heads (int): The number of attention heads in NR_GraphAttention. attn_heads_reduction (str): The reduction method for attention heads. use_bias (bool): Whether the model uses bias. activation (str): The activation function of NR_GraphAttention.
- forward(inputs)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class matchbench.model.entity_alignment.rrea.RREA(alpha=0.1, beta=0.1, gamma=3, depth=2, node_hidden=100, rel_hidden=100, dropout_rate=0.3, triple_size=353543, node_size=39654, new_node_size=39654, rel_size=4224, if_neg_sample=True, nearest_sample_num=128, device=device(type='cuda', index=0))¶
Bases:
EAModel
The class for the RREA approach.
- Class attributes:
Data/Feature generation: triple_size (int): The size of triples in the KG. node_size (int): The size of nodes in the KG. new_node_size (int): The size of nodes calculated from all triples. rel_size (int): The size of relations in the KG. if_neg_sample (bool): Whether the approach needs negative sampling. nearest_sample_num (int): The number of candidates generated during negative sampling.
Model architecture / Loss: alpha (float): The value of alpha. beta (float): The value of beta. gamma (float): The value of gamma. depth (int): The layer depth of NR_GraphAttention. node_hidden (int): The dimension of node representations. rel_hidden (int): The dimension of relation representations. dropout_rate (float): The dropout probability passed to nn.Dropout.
Others: device (torch.device, optional, defaults to “cuda”): cuda or cpu.
- encode(batch=None)¶
Convert a batch of entity token ids to a batch of entity vector embeddings. In the default scenario, RREA encodes all entities in the KG.
- Parameters:
batch (a dict of torch.tensor) – Keys: input_ids and attention_mask.
- Returns:
The entity embedding.
- Return type:
torch.tensor
- get_emb(loader, device=device(type='cuda'))¶
Convert a list of entity token ids to a list of embeddings after the encoder.
- Parameters:
loader (torch.utils.data.DataLoader) –
device (torch.device, optional, defaults to “cuda”) – cuda or cpu.
- Returns:
the output embeddings of the encoder.
- Return type:
torch.tensor
- load_source_target(dataset_src, dataset_tgt, stage=1)¶
Prepare source and target data and initialize the RREA model.
- Parameters:
dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.
dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.
- negative_sample(train_pair, batch_size=24, device=None)¶
Negative sampling.
- Parameters:
train_pair (np.array) – numpy array described in prepare_dataloader.
batch_size (int, optional, defaults to 24) – batch_size for the following pairwise training.
device (torch.device, optional, defaults to “cuda”) – cuda or cpu.
- Returns:
dataloader for the following pairwise training.
- Return type:
torch.utils.data.DataLoader
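The nearest-neighbor flavor of negative sampling suggested by nearest_sample_num in the class attributes can be sketched as follows; this is an illustration of hard-negative mining, not RREA's exact routine:

```python
import numpy as np

def nearest_negative_sample(train_pair, emb, nearest_sample_num=128):
    # train_pair: [n, 2] aligned (src, tgt) entity ids; emb: [num_ents, d].
    emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
    sims = emb[train_pair[:, 0]] @ emb.T                          # [n, num_ents]
    sims[np.arange(len(train_pair)), train_pair[:, 1]] = -np.inf  # mask gold match
    # The most similar non-aligned entities serve as hard negatives.
    return np.argsort(-sims, axis=-1)[:, :nearest_sample_num]     # [n, k]
```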
- predict(dataloader, device=device(type='cuda'), stage=None, train_dataloader=None)¶
Predict the result, overriding predict in base_model.
- Parameters:
dataloader (torch.utils.data.DataLoader) – Dataloader for prediction.
device (torch.device, optional, defaults to “cuda”) – cuda or cpu.
- Returns:
The result.
- Return type:
numpy.array
- prepare_dataloader(dataset, split='train', batch_size=128, stage=None, mid_file_dir=None)¶
- For split train, prepare the dataloaders for training.
For split valid and test, prepare the dataloaders only for evaluation.
- Parameters:
dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.
split (str, optional, defaults to “train”) – A string indicating which split the dataset is.
stage (int, optional, defaults to 1) – RREA has only one training stage.
batch_size (int, optional, defaults to 128) – Batch size for the following negative sampling or training.
mid_file_dir (str, optional, defaults to “middle_file/”) – The directory where intermediate processed files are stored.
- Returns:
For split train, return np.array. For split valid and test, return a tuple of torch.utils.data.DataLoader.
- run_step(train_pairs: Tensor, device)¶
The whole process of training one batch (step).
- Parameters:
train_pairs (torch.tensor) –
device (str, optional, defaults to “cuda”) – cuda or cpu.
- Returns:
loss
- Return type:
torch.tensor
matchbench.model.entity_alignment.sdea module¶
- class matchbench.model.entity_alignment.sdea.Basic_Bert_Unit_model(plm_path='/home/wangp/SDEA-main/pre_trained_models/bert-base-multilingual-uncased', basic_input_dim=768, basic_output_dim=300, dropout=0.1, encoder_model=None)¶
Bases:
Module
The class for the SDEA approach's Basic_Bert_Unit_model unit.
- Parameters:
plm_path (str) – The local pretrained language model path.
basic_input_dim (int) – The input dimension of Basic_Bert_Unit_model.
basic_output_dim (int) – The output dimension of Basic_Bert_Unit_model.
dropout (float) – The dropout probability passed to nn.Dropout.
encoder_model (transformers.PretrainedModel) – User defined plm model.
- bert_model¶
The encoder model used in SDEA.
- Type:
transformers.PretrainedModel
- forward(batch)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class matchbench.model.entity_alignment.sdea.BertDataLoader(dataset)¶
Bases:
object
This class mainly serves to tokenize words and strings into token ids.
- load_data(line_solver)¶
- static load_freq(dataset)¶
- static load_saved_data(dataset)¶
- run()¶
- save_data(datas)¶
- matchbench.model.entity_alignment.sdea.DBPpreprocess(dataset_att)¶
- class matchbench.model.entity_alignment.sdea.Dataset(no, middle_file)¶
Bases:
object
The class recording the suffixes of middle files.
- static outputs_csv(name, no, path)¶
- static outputs_python(name, no, path)¶
- static outputs_tab(name, no, path)¶
- class matchbench.model.entity_alignment.sdea.GRUAttnNet(embed_dim, hidden_dim, hidden_layers, dropout=0, device: device = 'cpu')¶
Bases:
Module
- attn_net_with_w(rnn_out, rnn_hn, neighbor_mask: Tensor, x)¶
- Parameters:
rnn_out – [batch_size, seq_len, n_hidden * 2]
rnn_hn – [batch_size, num_layers * num_directions, n_hidden]
- Returns:
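From the shape hints above, attn_net_with_w computes attention over the GRU outputs with the final hidden state as the query, masked by neighbor_mask. A hedged, self-contained sketch of that pattern (names and exact projections are assumptions):

```python
import torch
import torch.nn.functional as F

def attn_over_rnn(rnn_out, query, neighbor_mask):
    # rnn_out: [B, seq_len, h]; query: [B, h] (e.g. the final hidden state);
    # neighbor_mask: [B, seq_len] with 1 for valid positions, 0 for padding.
    scores = torch.bmm(rnn_out, query.unsqueeze(-1)).squeeze(-1)    # [B, seq_len]
    scores = scores.masked_fill(neighbor_mask == 0, float("-inf"))  # drop padding
    weights = F.softmax(scores, dim=-1)                             # [B, seq_len]
    return torch.bmm(weights.unsqueeze(1), rnn_out).squeeze(1)      # [B, h]
```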
- build_model(dropout)¶
- forward(x, neighbor_mask)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class matchbench.model.entity_alignment.sdea.Highway(dim, device: device)¶
Bases:
Module
- forward(x)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- init_Linear(in_fea, out_fea, bias)¶
- class matchbench.model.entity_alignment.sdea.KBStore(dataset=None, dataset_path: Optional[Dataset] = None, relation=1)¶
Bases:
object
This class generates all the middle files used, including relation tokens, property tokens, etc.
- add_word_level_blocks(entity_id, words)¶
- get_property_table_line(lines)¶
- load(file_type: OEAFileType) → None¶
- load_entities()¶
- load_facts()¶
- load_kb_from_saved()¶
- load_literals()¶
- load_properties()¶
- load_relations()¶
- save_base_info()¶
- save_datas()¶
- save_facts()¶
- save_property_table()¶
- save_seq_form(dicts, header)¶
- class matchbench.model.entity_alignment.sdea.OEAFileType(value)¶
Bases:
Enum
The triple types.
- attr = 0¶
- rel = 1¶
- ttl_full = 2¶
- class matchbench.model.entity_alignment.sdea.PairwiseDataset(train_tups, ent2data1, ent2data2)¶
Bases:
Dataset
- class matchbench.model.entity_alignment.sdea.RelationDataset(train_tups_r, fs1: KBStore, fs2: KBStore, ent2embed1: Tensor, ent2embed2: Tensor)¶
Bases:
Dataset
- static get_matrix(neighbors, nm_len, pad_idx)¶
- class matchbench.model.entity_alignment.sdea.RelationModel(rel_count1, rel_count2, all_embed1_size, all_embed2_size, score_distance_level=2, margin=1, device=device(type='cuda'))¶
Bases:
Module
The class for the SDEA approach's RelationModel unit.
- Parameters:
rel_count1 (int) – The relations count in source graph.
rel_count2 (int) – The relations count in target graph.
all_embed1_size (int) – The embedding size, i.e., the number of all entities in the source graph.
all_embed2_size (int) – The embedding size, i.e., the number of all entities in the target graph.
score_distance_level (int) – The distance level in pairwise distance.
margin (int) – The margin in MarginRankingLoss.
device (torch.device) – cuda or cpu.
- rnn¶
Relation embedding module.
- Type:
nn
- combiner¶
Combiner module.
- Type:
nn
- ent_embedding1¶
Entity embedding module.
- ent_embedding2¶
Entity embedding module.
- case_study(batch, rel_embedding: Embedding, all_embed, mode)¶
- forward(pe1s, pe2s, ne1s, ne2s, bpn1s, bpn2s, bnn1s, bnn2s, bpr1s, bpr2s, bnr1s, bnr2s)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- get_emb(batch, rel_embedding: Embedding, all_embed, mode)¶
- get_ent_embedding(all_embed1, all_embed2)¶
- static get_neighbors_batch(batch_facts, pad_idx, pad_idxr=None)¶
- get_rel_embeds(batch_neighbors, batch_relations, batch_ent: Tensor, rel_embedding, all_embed: Embedding)¶
- class matchbench.model.entity_alignment.sdea.RelationValidDataset(ents, fs: KBStore, all_embeds, batch_size)¶
Bases:
object
- class matchbench.model.entity_alignment.sdea.SDEA(if_neg_sample=True, max_seq_len=128, nearest_sample_num=128, basic_input_dim=768, basic_output_dim=300, dropout=0.1, margin=1, score_distance_level=2, plm_path=None, tokenizer=None, encodermodel=None)¶
Bases:
EAModel
The class for the SDEA approach.
- Parameters:
if_neg_sample (bool) – Whether the approach needs negative sampling.
max_seq_len (int) – The max length of tokens fed into PLMs.
nearest_sample_num (int) – The number of candidates generated during negative sampling.
Model architecture / Loss –
basic_input_dim (int) – The input dimension of Basic_Bert_Unit_model.
basic_output_dim (int) – The output dimension of Basic_Bert_Unit_model.
dropout (float) – The dropout probability passed to nn.Dropout.
margin (int) – The margin in MarginRankingLoss in stage 1.
Others –
plm_path (str) – The local pretrained language model path.
tokenizer (transformers.Tokenizer) – User-defined PLM tokenizer.
encodermodel (transformers.PreTrainedModel) – User-defined PLM model.
- all_embed1s¶
All entity embeddings.
- Type:
tensor
- all_embed2s¶
All entity embeddings.
- Type:
tensor
- a(all_embed1s_p, all_embed2s_p, device=None)¶
- calculate_loss(pos_score, neg_score, label)¶
Calculate the loss of the batch.
- Parameters:
pos_score (torch.tensor) –
neg_score (torch.tensor) –
label (torch.label) –
- Returns:
loss
- Return type:
torch.tensor
- class_name_str()¶
- encode(batch)¶
Convert a batch of entity token ids to a batch of entity vector embeddings.
- Parameters:
batch (a dict of torch.tensor) – Keys: input_ids and attention_mask.
- Returns:
The entity embedding.
- Return type:
torch.tensor
- forward(batch, device=device(type='cuda'))¶
Convert a batch of four kinds of entity token ids to a batch of positive/negative scores.
- Parameters:
batch (a dict of torch.tensor) – Keys: pos1, pos2, neg1, neg2.
- Returns:
positive/negative scores.
- Return type:
tuple of torch.tensor
- load_source_target(dataset_src, dataset_tgt, mid_file_dir='middle_file/', has_mid_files=False)¶
Prepare source and target data.
- Parameters:
dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.
dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.
mid_file_dir (str, optional, defaults to “middle_file/”) – The directory where intermediate processed files are stored.
has_mid_files (bool) – Whether the middle files have already been generated.
- negative_sample(block_loaders, device=device(type='cuda'), batch_size=24, stage=1)¶
Negative sampling.
- Parameters:
block_loaders (tuple of torch.utils.data.DataLoader) – The four loaders described in prepare_dataloader.
device (torch.device, optional, defaults to “cuda”) – cuda or cpu.
batch_size (int, optional, defaults to 24) – batch_size for the following pairwise training.
stage (int, optional, defaults to 1) – An integer indicating which stage the training process is in.
- Returns:
dataloader for the following pairwise training.
- Return type:
torch.utils.data.DataLoader
- oea_truth_line(line)¶
- prepare_dataloader(dataset, split='train', batch_size=128, device=device(type='cuda'), stage=1)¶
- For split train, prepare the dataloaders for negative sampling.
For split valid and test, prepare the dataloaders only for evaluation.
- Parameters:
dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.
split (str, optional, defaults to “train”) – A string indicating which split the dataset is.
stage (int, optional, defaults to 1) – SDEA has two training stages. An integer indicating which stage the training process is in.
batch_size (int, optional, defaults to 128) – Batch size for the following negative sampling.
- Returns:
For split train, return two train loaders and two all entities loaders. For split valid and test, return two valid/test loaders.
- Return type:
tuple of torch.utils.data.DataLoader
- static reduce_tokens(tids, max_len=200)¶
- run_step(batch, stage=1, device=device(type='cuda'))¶
The whole process of training one batch (step).
- Parameters:
batch (tuple of torch.tensor) –
stage (int, optional, defaults to 1) –
device (str, optional, defaults to “cuda”) – cuda or cpu.
- Returns:
loss
- Return type:
torch.tensor
- matchbench.model.entity_alignment.sdea.compress_uri(uri, step=1)¶
- matchbench.model.entity_alignment.sdea.oea_attr_line(fact, step=1)¶
- matchbench.model.entity_alignment.sdea.oea_rel_line(fact)¶
- matchbench.model.entity_alignment.sdea.save_list(l, file)¶
- matchbench.model.entity_alignment.sdea.stripSquareBrackets(s)¶
- matchbench.model.entity_alignment.sdea.strip_square_brackets(s)¶
- matchbench.model.entity_alignment.sdea.text_to_word_sequence(text, filters='!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')¶
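This signature mirrors the classic Keras text_to_word_sequence helper. A minimal reference equivalent, offered as an assumption about its behavior rather than the library's exact code:

```python
def text_to_word_sequence(text,
                          filters='!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                          lower=True, split=' '):
    # Strip filter characters, lower-case, and split on the separator.
    if lower:
        text = text.lower()
    table = str.maketrans({c: split for c in filters})
    return [w for w in text.translate(table).split(split) if w]

print(text_to_word_sequence("Entity Alignment, across KGs!"))
# ['entity', 'alignment', 'across', 'kgs']
```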
- matchbench.model.entity_alignment.sdea.ttl_no_compress_line(line)¶
matchbench.model.entity_alignment.seu module¶
- class matchbench.model.entity_alignment.seu.SEU(depth=2, mode='hungarian')¶
Bases:
EAModel
The class for the SEU approach.
- Parameters:
depth (int, optional, defaults to 2) –
mode (str, optional, defaults to “hungarian”) –
- cal_sims(test_pair, feature)¶
- compute_metric(result)¶
Calculate the hits.
- Parameters:
result (torch.tensor) – The top indexes of each testing entity.
- Returns:
hits@1
- Return type:
float
- load_source_target(dataset_src, dataset_tgt, mid_file_dir='middle_file/', has_mid_files=False)¶
Prepare source and target data.
- Parameters:
dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.
dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.
mid_file_dir (str, optional, defaults to “middle_file/”) – The directory where intermediate processed files are stored.
has_mid_files (bool) – Whether the middle files have already been generated.
- predict(dataset, stage=1, train_dataloader=None)¶
- For the SEU approach, the training and testing processes are implemented together and cannot be separated.
- Parameters:
dataset (datasets.arrow_dataset.Dataset) – Train/test pairs.
stage (int, optional, defaults to 1) – Unused in SEU.
- Returns:
The hits@1 score.
- Return type:
float
- prepare_dataloader(dataset, split='test', batch_size=1024, stage=1)¶
Prepare the pairs for both training and testing.
- Parameters:
dataset (datasets.arrow_dataset.Dataset) – Train/test pairs.
split (str, optional, defaults to “test”) –
batch_size (int, optional, defaults to 1024) –
stage (int, optional, defaults to 1) –
- Returns:
Train or test pairs.
- Return type:
np.array
- matchbench.model.entity_alignment.seu.load_aligned_pair(file_path, ratio=0.3)¶
- matchbench.model.entity_alignment.seu.load_triples(triples1, triples2, reverse=True)¶
- matchbench.model.entity_alignment.seu.test(sims, mode='sinkhorn', batch_size=1024)¶
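The mode names above (“hungarian”, “sinkhorn”) indicate that SEU casts alignment as an assignment problem over the similarity matrix. An illustrative Hungarian-mode sketch using SciPy (an assumed dependency for this example):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

sims = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.8, 0.1],
                 [0.0, 0.3, 0.7]])  # toy source-by-target similarity matrix

row_ind, col_ind = linear_sum_assignment(sims, maximize=True)
hits1 = (col_ind == np.arange(len(sims))).mean()
print(col_ind, hits1)  # [0 1 2] 1.0
```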