matchbench.model.entity_alignment package

Submodules

matchbench.model.entity_alignment.Dual_AMN module

class matchbench.model.entity_alignment.Dual_AMN.Dual_AMN(triple_size=353543, node_size=39654, new_node_size=39654, rel_size=4224, nearest_sample_num=128, alpha=0.1, beta=0.1, gamma=1.0, depth=2, node_hidden=128, rel_hidden=128, dropout_rate=0.3, ind_dropout_rate=0.3, device=device(type='cuda', index=0))

Bases: EAModel

The class for the Dual_AMN approach.

Class attributes:

Data/Feature generation:

triple_size (int): The number of triples in the KG.
node_size (int): The number of nodes in the KG.
new_node_size (int): The number of nodes computed from all triples.
rel_size (int): The number of relations in the KG.
nearest_sample_num (int): The number of candidates generated during negative sampling.

Model architecture / Loss:

alpha (float): The value of alpha.
beta (float): The value of beta.
gamma (float): The value of gamma.
depth (int): The layer depth of NR_GraphAttention.
node_hidden (int): The dimension of node representations.
rel_hidden (int): The dimension of relation representations.
dropout_rate (float): The dropout probability (nn.Dropout).

Others:

device (torch.device, optional, defaults to “cuda”): cuda or cpu.

encode(batch=None)

Convert a batch of entity token ids to a batch of entity vector embeddings. By default, Dual_AMN encodes all entities in the KG.

Parameters:

batch (a dict of torch.tensor) – Keys: input_ids and attention_mask.

Returns:

The entity embedding.

Return type:

torch.tensor

get_emb(loader, device=device(type='cuda'))

Convert a list of entity token ids to a list of embeddings after the encoder.

Parameters:
  • loader (torch.utils.data.DataLoader) –

  • device (torch.device, optional, defaults to “cuda”) – cuda or cpu.

Returns:

The output embeddings of the encoder.

Return type:

torch.tensor

load_source_target(dataset_src, dataset_tgt, stage=1)

Prepare source and target data and initialize the Dual_AMN model.

Parameters:
  • dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.

  • dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.

predict(dataloader, device=device(type='cuda'))

Predict the closest entities for each entity in the test set.

Parameters:

dataloader (torch.utils.data.DataLoader) – Dataloader for prediction.

Returns:

The result.

Return type:

numpy.array

prepare_dataloader(dataset, split='train', batch_size=128, stage=None, mid_file_dir=None)
For split train, prepare the dataloaders for training.

For split valid and test, prepare the dataloaders only for evaluation.

Parameters:
  • dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.

  • split (str, optional, defaults to “train”) – A string indicating which split the dataset is.

  • stage (int, optional, defaults to 1) – Dual_AMN has only one training stage.

  • batch_size (int, optional, defaults to 128) – Batch size for the subsequent negative sampling or training.

  • mid_file_dir (str, optional, defaults to “middle_file/”) – The directory where intermediate processed files are stored.

Returns:

For split train, return a torch.utils.data.Dataloader. For split valid and test, return a tuple of torch.utils.data.Dataloader.

run_step(train_pairs: Tensor, device)

The whole process training one batch (step).

Parameters:
  • train_pairs (torch.tensor) –

  • device (str, optional, defaults to “cuda”) – cuda or cpu.

Returns:

loss

Return type:

torch.tensor
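Putting the pieces together, a minimal training sketch (assuming dataset_src, dataset_tgt, and a train-pair dataset are already loaded, that run_step returns a plain loss tensor to be backpropagated by the caller, and that the optimizer choice is illustrative):

    import torch
    from matchbench.model.entity_alignment.Dual_AMN import Dual_AMN

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = Dual_AMN(device=device)

    # Build graph structures and initialize model internals from both KGs.
    model.load_source_target(dataset_src, dataset_tgt)

    # For split "train" this returns a torch.utils.data.DataLoader over pairs.
    train_loader = model.prepare_dataloader(train_pairs_dataset, split="train", batch_size=128)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # illustrative
    for epoch in range(10):
        for train_pairs in train_loader:
            optimizer.zero_grad()
            loss = model.run_step(train_pairs.to(device), device)  # batch loss
            loss.backward()
            optimizer.step()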

training: bool
class matchbench.model.entity_alignment.Dual_AMN.NR_GraphAttention(node_size, rel_size, triple_size, node_dim, depth=1, attn_heads=1, attn_heads_reduction='concat', use_bias=False)

Bases: Module

The class for the NR_GraphAttention encoder unit (shared with the RREA approach).

Class attributes:

node_size (int): The number of nodes in the KG.
rel_size (int): The number of relations in the KG.
triple_size (int): The number of triples in the KG.
node_dim (int): The node representation dimension of NR_GraphAttention.
depth (int): The layer depth of NR_GraphAttention.
attn_heads (int): The number of attention heads of NR_GraphAttention.
attn_heads_reduction (str): The reduction mode for attention heads.
use_bias (bool): Whether the model uses a bias term.

forward(inputs)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool

matchbench.model.entity_alignment.LargeEA module

matchbench.model.entity_alignment.bertint module

class matchbench.model.entity_alignment.bertint.Basic_Bert_Unit_model(plm_path, basic_input_dim, basic_output_dim, dropout=0.1, encoder_model=None)

Bases: Module

The class for the BertInt approach's Basic_Bert_Unit_model encoder unit.

Class attributes:

plm_path (str): The local pretrained language model path.
basic_input_dim (int): The input dimension of Basic_Bert_Unit_model.
basic_output_dim (int): The output dimension of Basic_Bert_Unit_model.
dropout (float): The dropout probability (nn.Dropout).
encoder_model (transformers.PreTrainedModel): A user-defined PLM model.

forward(batch)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class matchbench.model.entity_alignment.bertint.BertInt(if_neg_sample=True, if_neg_sample_2=False, max_seq_length=128, nearest_sample_num=128, test_topk=50, basic_input_dim=768, basic_output_dim=300, mlp_input_dim=85, mlp_hidden_dim=11, dropout=0.1, margin_1=3, margin_2=1, plm_path=None, tokenizer=None, encodermodel=None)

Bases: EAModel

The class for the BertInt approach.

Class attributes:

Data/Feature generation:

if_neg_sample (bool): Whether the approach needs negative sampling.
max_seq_length (int): The max length of token sequences fed into PLMs.
get_emb_batch (int): The batch size when getting the PLM embeddings of all entities.
nearest_sample_num (int): The number of candidates generated during negative sampling.
test_topk (int): Max hit rank considered when testing.

Model architecture / Loss:

basic_input_dim (int): The input dimension of Basic_Bert_Unit_model.
basic_output_dim (int): The output dimension of Basic_Bert_Unit_model.
mlp_input_dim (int): The input dimension of the MLP.
mlp_hidden_dim (int): The output dimension of the MLP.
dropout (float): The dropout probability (nn.Dropout).
margin_1 (int): The margin of MarginRankingLoss in stage 1.
margin_2 (int): The margin of MarginRankingLoss in stage 2.

Others:

plm_path (str): The local pretrained language model path.
tokenizer (transformers.Tokenizer): A user-defined PLM tokenizer.
encodermodel (transformers.PreTrainedModel): A user-defined PLM model.

calculate_loss(pos_score, neg_score, label)

Calculate the loss of the batch.

Parameters:
  • pos_score (torch.tensor) –

  • neg_score (torch.tensor) –

  • label (torch.tensor) –

Returns:

loss

Return type:

torch.tensor

chuliyixia(train, test)
compute_metric_stage_2(prediction)

Calculate the hits in stage 2.

Parameters:

prediction (torch.tensor) – The top indexes of each testing entity.

Returns:

hits@1

Return type:

float

encode(batch)

Convert a batch of entity token ids to a batch of entity vector embedding.

Parameters:

batch (a dict of torch.tensor) – Keys: input_ids and attention_mask.

Returns:

The entity embedding.

Return type:

torch.tensor

forward(batch, device=device(type='cuda'))

Convert a batch of four kinds of entity token ids to a batch of positive/negative scores.

Parameters:

batch (a dict of torch.tensor) – Keys: pos1, pos2, neg1, neg2.

Returns:

positive/negative score.

Return type:

tuple of torch.tensor

load_source_target(dataset_src, dataset_tgt)

Prepare source and target data.

Parameters:
  • dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.

  • dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.

matcher(feature)

Applied in stage 2: compute the final score from the relational features.

Parameters:

feature (torch.tensor) – Relational features.

Returns:

The final result.

Return type:

torch.tensor

negative_sample(block_loaders, device=device(type='cuda'), batch_size=24)

Negative sampling.

Parameters:
  • block_loaders (tuple of torch.utils.data.Dataloader) – The four loaders described in prepare_dataloader.

  • device (torch.device, optional, defaults to “cuda”) – cuda or cpu.

  • batch_size (int, optional, defaults to 24) – Batch size for the subsequent pairwise training.

Returns:

Dataloader for the subsequent pairwise training.

Return type:

torch.utils.data.Dataloader

predict_stage_2(dataloader, device=device(type='cuda'))

Predict the result in stage 2, overriding predict_stage_2 in base_model.

Parameters:

dataloader (torch.utils.data.DataLoader) – Dataloader for prediction.

Returns:

The result.

Return type:

List

prepare_dataloader(dataset, split='train', stage=1, batch_size=128, mid_file_dir='bertint_middle_file/')
In stage 1, for split train, prepare the dataloaders for negative sampling.

For split valid and test, prepare the dataloaders only for evaluation. In stage 2, prepare dataloaders for training and evaluation.

Parameters:
  • dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.

  • split (str, optional, defaults to “train”) – A string indicating which split the dataset is.

  • stage (int, optional, defaults to 1) – BertInt has two training stages. An integer indicating which stage the training process is in.

  • batch_size (int, optional, defaults to 128) – Batch size for the subsequent negative sampling or training.

  • mid_file_dir (str, optional, defaults to “bertint_middle_file/”) – The directory where intermediate processed files are stored.

Returns:

In stage 1, for split train, return two train loaders and two all-entity loaders; for split valid and test, return two valid/test loaders. In stage 2, return a RelationalDataloader for split train and a torch.utils.data.Dataloader for the others.

Return type:

tuple of torch.utils.data.Dataloader

run_step(batch, stage=1, device=device(type='cuda'))

The whole process training one batch (step).

Parameters:
  • batch (tuple of torch.tensor) –

  • stage (int, optional, defaults to 1) –

  • device (str, optional, defaults to “cuda”) – cuda or cpu.

Returns:

loss

Return type:

torch.tensor
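A two-stage training sketch (assuming dataset_src, dataset_tgt, and a train-pair dataset are already loaded; the PLM path is illustrative and optimizer steps are elided):

    import torch
    from matchbench.model.entity_alignment.bertint import BertInt

    device = torch.device("cuda")
    model = BertInt(plm_path="bert-base-multilingual-uncased")  # illustrative path
    model.load_source_target(dataset_src, dataset_tgt)

    # Stage 1: build the block loaders, negative-sample, then train pairwise.
    block_loaders = model.prepare_dataloader(train_pairs, split="train", stage=1)
    pair_loader = model.negative_sample(block_loaders, device=device, batch_size=24)
    for batch in pair_loader:
        loss = model.run_step(batch, stage=1, device=device)
        # optimizer step elided

    # Stage 2: train the MLP matcher on interaction features.
    rel_loader = model.prepare_dataloader(train_pairs, split="train", stage=2)
    for batch in rel_loader:
        loss = model.run_step(batch, stage=2, device=device)
        # optimizer step elided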

serialize(des)

Convert entity descriptions to token ids which can be directly fed into PLMs.

Parameters:

des (dict) – Key: entity name. Value: entity description.

Returns:

Key: entity id. Value: entity token ids.

Return type:

dict
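For instance, with a BertInt instance model (a hypothetical description dict; the exact token ids depend on the tokenizer):

    des = {"Paris": "Paris, the capital and largest city of France."}
    ent2tokens = model.serialize(des)
    # ent2tokens maps each entity id to its token ids, e.g. {0: [101, 8345, ...]}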

training: bool
class matchbench.model.entity_alignment.bertint.MLP(mlp_input_dim, mlp_hidden_dim)

Bases: Module

The class for the BertInt approach's MLP matcher unit.

Class attributes:

mlp_input_dim (int): The input dimension of the MLP.
mlp_hidden_dim (int): The output dimension of the MLP.

forward(features)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class matchbench.model.entity_alignment.bertint.PairwiseDataset(train_tups, ent2data)

Bases: Dataset

The Dataset class used for training in stage 1.

Class attributes:

train_tups (list): Training index pairs.
ent2data (dict): Mapping from entity id to tokenized entity data.

class matchbench.model.entity_alignment.bertint.RelationalDataloader(train_ill, train_candidate, entpair2f_idx, all_features, neg_num, batch_size)

Bases: object

The dataloader class used for training in stage 2.

train_pair_index_gene()

Generate training data (entity indexes).

matchbench.model.entity_alignment.bertint.all_entity_pairs_gene(candidate_dict_list, ill_pair_list)
matchbench.model.entity_alignment.bertint.attributeValue_emb_gene(l_set, model, Tokenizer, batch_size, max_length, device=device(type='cuda'))

Generate attributeValue embeddings with the basic BERT unit.

matchbench.model.entity_alignment.bertint.attributeView_interaction_F_gene(ent_pairs, value_emb_list, ent2valueids, value_pad_id, kernel_num=21, batch_size=512, device=device(type='cuda'))

Attribute-view interaction. Uses dual aggregation and attribute-view interaction to generate similarity features between entity pairs. Returns entity pairs and their features.

matchbench.model.entity_alignment.bertint.batch_dual_aggregation_feature_gene(batch_sim_matrix, mus, sigmas, attn_ne1, attn_ne2)

Dual aggregation: similarity matrix -> feature.

Parameters:
  • batch_sim_matrix – [B, ne1, ne2]

  • mus – [1, 1, k (kernel_num)]

  • sigmas – [1, 1, k]

  • attn_ne1 – [B, ne1, 1]

  • attn_ne2 – [B, ne2, 1]

Returns:

feature: [B, kernel_num * 2].
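A KNRM-style sketch of this kernel pooling, matching the documented shapes (the exact scaling and clamping constants in BertInt may differ):

    import torch

    def dual_aggregation_sketch(sim, mus, sigmas, attn_ne1, attn_ne2):
        # sim: [B, ne1, ne2]; mus/sigmas: [1, 1, k]; attn_ne*: [B, ne*, 1]
        # Row direction: soft-count, per Gaussian kernel, how well each
        # entity-1 token matches entity-2 tokens, then attention-pool.
        k1 = torch.exp(-((sim.unsqueeze(-1) - mus) ** 2) / (2 * sigmas ** 2))  # [B, ne1, ne2, k]
        f12 = (k1.sum(dim=2).clamp(min=1e-10).log() * attn_ne1).sum(dim=1)     # [B, k]
        # Column direction: transpose and repeat with entity-2 attention.
        k2 = torch.exp(-((sim.transpose(1, 2).unsqueeze(-1) - mus) ** 2) / (2 * sigmas ** 2))
        f21 = (k2.sum(dim=2).clamp(min=1e-10).log() * attn_ne2).sum(dim=1)     # [B, k]
        return torch.cat([f12, f21], dim=-1)                                   # [B, 2k]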

matchbench.model.entity_alignment.bertint.candidate_generate(ents1, ents2, ent_emb, candidate_topk=50, bs=32, device=device(type='cuda'))

Returns a dict: key = entity, value = candidates (entities likely to be aligned).

matchbench.model.entity_alignment.bertint.clean_attribute_data(dataset_src, dataset_tgt, mid_file_dir='bertint_middle_file/')
matchbench.model.entity_alignment.bertint.desornameView_interaction_F_gene(ent_pairs, e_emb_list, batch_size=512, device=device(type='cuda'))
matchbench.model.entity_alignment.bertint.dump_other_data(train_ents1, train_ents2, test_ents1, test_ents2, ent2data, mid_file_dir)
matchbench.model.entity_alignment.bertint.ent2attributeValues_gene(entid_list, att_datas, max_length, pad_value=None)

Get the attribute values of each entity. Returns a dict: key = entity, value = (padded) attribute values of the entity.

matchbench.model.entity_alignment.bertint.get_attributeValue_embedding(model, batch_size=256, mid_file_dir='bertint_middle_file/', device=device(type='cuda'))
matchbench.model.entity_alignment.bertint.get_attributeView_interaction_feature(model, mid_file_dir='bertint_middle_file/', device=device(type='cuda'))
matchbench.model.entity_alignment.bertint.get_attribute_value_type(value, value_type)
matchbench.model.entity_alignment.bertint.get_entity_embedding(model, batch_size=256, mid_file_dir='bertint_middle_file/', device=device(type='cuda'))
matchbench.model.entity_alignment.bertint.get_neighView_and_desView_interaction_feature(model, dataset_src, dataset_tgt, batch_size=256, mid_file_dir='bertint_middle_file/', device=device(type='cuda'))
matchbench.model.entity_alignment.bertint.get_tokens_of_value(vaule_list, Tokenizer, max_length)
matchbench.model.entity_alignment.bertint.kernel_mus(n_kernels)
matchbench.model.entity_alignment.bertint.kernel_sigmas(n_kernels)
matchbench.model.entity_alignment.bertint.neigh_ent_dict_gene(rel_triples, max_length, pad_id=None)

Get the one-hop neighbors of each entity. Returns a dict: key = entity, value = (padded) neighbors of the entity.
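A sketch of this padding behaviour, assuming (head, relation, tail) triples and symmetric neighbourhoods:

    from collections import defaultdict

    def neigh_dict_sketch(rel_triples, max_length, pad_id):
        neigh = defaultdict(list)
        for h, r, t in rel_triples:
            neigh[h].append(t)   # tail is a neighbour of head
            neigh[t].append(h)   # and vice versa
        # Pad (or truncate) every neighbour list to max_length.
        return {e: (ns + [pad_id] * max_length)[:max_length] for e, ns in neigh.items()}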

matchbench.model.entity_alignment.bertint.neighborView_interaction_F_gene(ent_pairs, ent_emb_list, neigh_dict, ent_pad_id, kernel_num=21, batch_size=512, device=device(type='cuda'))

Neighbor-view interaction. Uses dual aggregation and neighbor-view interaction to generate similarity features between entity pairs. Returns entity pairs and their features.

matchbench.model.entity_alignment.bertint.padding_to_longest(token_list, Tokenizer)
matchbench.model.entity_alignment.bertint.read_att_data(dataset)

Load the attribute triples file.

matchbench.model.entity_alignment.bertint.read_attribute_datas(kg1_att_file_name, kg2_att_file_name, entity_list, entity2index, add_name_as_attTriples=True)

Returns a list of attribute triples: [(entity_id, attribute, attributeValue, type of attributeValue)].

matchbench.model.entity_alignment.bertint.remove_one_to_N_att_data_by_threshold(ori_keep_data, ori_remove_data, one2N_threshold)

Filter noisy attribute triples based on a threshold.

matchbench.model.entity_alignment.bertint.sort_a(data_list)

Sort the given list.

matchbench.model.entity_alignment.bertint.test_read_emb(ent_emb, train_ill, test_ill, bs=128, candidate_topk=50)

matchbench.model.entity_alignment.kecg module

class matchbench.model.entity_alignment.kecg.GAT(n_units, n_heads, dropout, attn_dropout, instance_normalization, diag)

Bases: Module

forward(x, adj)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class matchbench.model.entity_alignment.kecg.MultiHeadGraphAttention(n_head, f_in, f_out, attn_dropout, diag=True, init=None, bias=False)

Bases: Module

forward(input, adj)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class matchbench.model.entity_alignment.kecg.SpecialSpmm

Bases: Module

forward(indices, values, shape, b)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class matchbench.model.entity_alignment.kecg.SpecialSpmmFunction

Bases: Function

A special function for a backpropagation layer that only computes gradients over the sparse region.

static backward(ctx, grad_output)

Defines a formula for differentiating the operation.

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by as many outputs as forward() returned, and it should return as many tensors as there were inputs to forward(). Each argument is the gradient w.r.t. the given output, and each returned value should be the gradient w.r.t. the corresponding input.

The context can be used to retrieve tensors saved during the forward pass. It also has an attribute ctx.needs_input_grad as a tuple of booleans representing whether each input needs gradient. E.g., backward() will have ctx.needs_input_grad[0] = True if the first input to forward() needs gradient computed w.r.t. the output.

static forward(ctx, indices, values, shape, b)

Performs the operation.

This function is to be overridden by all subclasses.

It must accept a context ctx as the first argument, followed by any number of arguments (tensors or other types).

The context can be used to store tensors that can be then retrieved during the backward pass.
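A sketch of such a sparse-region-only autograd function, following the widely used GAT-style implementation (the kecg version may differ in details):

    import torch

    class SpecialSpmmFunctionSketch(torch.autograd.Function):
        """Sparse @ dense matmul whose backward only yields gradients
        for the stored nonzero values of the sparse matrix."""

        @staticmethod
        def forward(ctx, indices, values, shape, b):
            a = torch.sparse_coo_tensor(indices, values, shape)
            ctx.save_for_backward(a, b)
            return torch.sparse.mm(a, b)

        @staticmethod
        def backward(ctx, grad_output):
            a, b = ctx.saved_tensors
            grad_values = grad_b = None
            if ctx.needs_input_grad[1]:
                # Dense gradient of a, then gather only the nonzero positions.
                grad_a_dense = grad_output.matmul(b.t())
                flat_idx = a._indices()[0] * b.shape[0] + a._indices()[1]
                grad_values = grad_a_dense.view(-1)[flat_idx]
            if ctx.needs_input_grad[3]:
                grad_b = torch.sparse.mm(a.t(), grad_output)
            return None, grad_values, None, grad_b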

matchbench.model.entity_alignment.lightea module

class matchbench.model.entity_alignment.lightea.LightEA(ent_dim=1024, depth=2, top_k=500, predict_epochs=10, using_name_features=True)

Bases: EAModel

The class for the LightEA approach.

Parameters:
  • ent_dim (int) – The entity embedding dimension.

  • top_k (int) – The number of closest entities considered.

  • predict_epochs (int) – The number of iterations when predicting.

  • using_name_features (bool) – Whether to use the translated English names.

batch_sparse_matmul(sparse_tensor, dense_tensor, batch_size=128, save_mem=False)
compute_metric(result, stage=1)

Calculate the hits.

Parameters:

result (torch.tensor) – The top indexes of each testing entity.

Returns:

hits@1

Return type:

float

get_features(train_pair, extra_feature=None)
load_source_target(dataset_src, dataset_tgt, mid_file_dir='middle_file/', has_mid_files=False)

Prepare source and target data.

Parameters:
  • dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.

  • dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.

  • mid_file_dir (str, optional, defaults to “middle_file/”) – The directory where intermediate processed files are stored.

  • has_mid_files (bool) – Whether the intermediate files have already been generated.

predict(dataset, stage=1, train_dataloader=None)
For the LightEA approach, the train and test processes are implemented together and cannot be separated.

Parameters:
  • dataset (datasets.arrow_dataset.Dataset) – Train/test pairs.

  • stage (int, optional, defaults to 1) – Unused in LightEA.

  • train_dataloader (optional) – Train pairs.

Returns:

The hits@1 score.

Return type:

float

prepare_dataloader(dataset, split='test', batch_size=32, stage=1)
For split train, prepare the pairs for training.

For split test, prepare the pairs for testing.

Parameters:
  • dataset (datasets.arrow_dataset.Dataset) – Train/test pairs.

  • split (str, optional, defaults to “test”) – A string indicating which split the dataset is.

  • batch_size (int, optional, defaults to 32) – Unused.

  • stage (int, optional, defaults to 1) – Unused.

Returns:

Train or test pairs.

Return type:

np.array
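A minimal usage sketch (assuming dataset_src, dataset_tgt and the pair datasets are already loaded; names are illustrative):

    from matchbench.model.entity_alignment.lightea import LightEA

    model = LightEA(ent_dim=1024, depth=2, top_k=500)
    model.load_source_target(dataset_src, dataset_tgt, mid_file_dir="middle_file/")

    # prepare_dataloader returns plain pair arrays for this approach.
    train_pairs = model.prepare_dataloader(train_dataset, split="train")

    # Training and prediction are fused: predict() runs the propagation
    # iterations over the train pairs and returns hits@1 on the test pairs.
    hits1 = model.predict(test_dataset, train_dataloader=train_pairs)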

random_projection(x, out_dim)
segment_sum(data, segment_ids)
sparse_sinkhorn_sims(left, right, features, top_k=500, iteration=15, mode='test', epoch=0)
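LightEA scores candidate alignments with Sinkhorn iterations over top-k sparse similarities; a dense sketch of the underlying normalisation (the sparse variant differs only in bookkeeping):

    import torch

    def sinkhorn_sketch(sims, iteration=15):
        # Alternately normalise rows and columns of exp(sims) so the
        # matrix approaches a doubly stochastic (soft assignment) matrix.
        m = torch.exp(sims)
        for _ in range(iteration):
            m = m / m.sum(dim=1, keepdim=True)
            m = m / m.sum(dim=0, keepdim=True)
        return m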
test(test_pair, features, top_k=500, iteration=15)
training: bool
matchbench.model.entity_alignment.lightea.load_aligned_pair(file_path, ratio=0.3)
matchbench.model.entity_alignment.lightea.load_graph(path)
matchbench.model.entity_alignment.lightea.load_name_features(dataset_src, dataset_tgt, vector_path, mode='word-level', mid_file_dir='middle_file/')

matchbench.model.entity_alignment.rrea module

class matchbench.model.entity_alignment.rrea.NR_GraphAttention(node_size, rel_size, triple_size, node_dim, depth=1, attn_heads=1, attn_heads_reduction='concat', use_bias=False, activation='relu')

Bases: Module

The class for the RREA approach's NR_GraphAttention encoder unit.

Class attributes:

node_size (int): The number of nodes in the KG.
rel_size (int): The number of relations in the KG.
triple_size (int): The number of triples in the KG.
node_dim (int): The node representation dimension of NR_GraphAttention.
depth (int): The layer depth of NR_GraphAttention.
attn_heads (int): The number of attention heads of NR_GraphAttention.
attn_heads_reduction (str): The reduction mode for attention heads.
use_bias (bool): Whether the model uses a bias term.
activation (str): The activation function of NR_GraphAttention.

forward(inputs)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class matchbench.model.entity_alignment.rrea.RREA(alpha=0.1, beta=0.1, gamma=3, depth=2, node_hidden=100, rel_hidden=100, dropout_rate=0.3, triple_size=353543, node_size=39654, new_node_size=39654, rel_size=4224, if_neg_sample=True, nearest_sample_num=128, device=device(type='cuda', index=0))

Bases: EAModel

The class for the RREA approach.

Class attributes:

Data/Feature generation:

triple_size (int): The number of triples in the KG.
node_size (int): The number of nodes in the KG.
new_node_size (int): The number of nodes computed from all triples.
rel_size (int): The number of relations in the KG.
if_neg_sample (bool): Whether the approach needs negative sampling.
nearest_sample_num (int): The number of candidates generated during negative sampling.

Model architecture / Loss:

alpha (float): The value of alpha.
beta (float): The value of beta.
gamma (float): The value of gamma.
depth (int): The layer depth of NR_GraphAttention.
node_hidden (int): The dimension of node representations.
rel_hidden (int): The dimension of relation representations.
dropout_rate (float): The dropout probability (nn.Dropout).

Others:

device (torch.device, optional, defaults to “cuda”): cuda or cpu.

encode(batch=None)

Convert a batch of entity token ids to a batch of entity vector embeddings. By default, RREA encodes all entities in the KG.

Parameters:

batch (a dict of torch.tensor) – Keys: input_ids and attention_mask.

Returns:

The entity embedding.

Return type:

torch.tensor

get_emb(loader, device=device(type='cuda'))

Convert a list of entity token ids to a list of embeddings after the encoder.

Parameters:
  • loader (torch.utils.data.DataLoader) –

  • device (torch.device, optional, defaults to “cuda”) – cuda or cpu.

Returns:

The output embeddings of the encoder.

Return type:

torch.tensor

load_source_target(dataset_src, dataset_tgt, stage=1)

Prepare source and target data and initialize the RREA model.

Parameters:
  • dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.

  • dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.

negative_sample(train_pair, batch_size=24, device=None)

Negative sampling.

Parameters:
  • train_pair (np.array) – The numpy array described in prepare_dataloader.

  • batch_size (int, optional, defaults to 24) – Batch size for the subsequent pairwise training.

  • device (torch.device, optional, defaults to “cuda”) – cuda or cpu.

Returns:

Dataloader for the subsequent pairwise training.

Return type:

torch.utils.data.Dataloader

predict(dataloader, device=device(type='cuda'), stage=None, train_dataloader=None)

Predict the result, overriding predict in base_model.

Parameters:
  • dataloader (torch.utils.data.DataLoader) – Dataloader for prediction.

  • device (torch.device, optional, defaults to “cuda”) – cuda or cpu.

Returns:

The result.

Return type:

numpy.array

prepare_dataloader(dataset, split='train', batch_size=128, stage=None, mid_file_dir=None)
For split train, prepare the dataloaders for training.

For split valid and test, prepare the dataloaders only for evaluation.

Parameters:
  • dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.

  • split (str, optional, defaults to “train”) – A string indicating which split the dataset is.

  • stage (int, optional, defaults to 1) – RREA has only one training stage.

  • batch_size (int, optional, defaults to 128) – Batch size for the subsequent negative sampling or training.

  • mid_file_dir (str, optional, defaults to “middle_file/”) – The directory where intermediate processed files are stored.

Returns:

For split train, return an np.array. For split valid and test, return a tuple of torch.utils.data.Dataloader.

run_step(train_pairs: Tensor, device)

The whole process training one batch (step).

Parameters:
  • train_pairs (torch.tensor) –

  • device (str, optional, defaults to “cuda”) – cuda or cpu.

Returns:

loss

Return type:

torch.tensor

training: bool

matchbench.model.entity_alignment.sdea module

class matchbench.model.entity_alignment.sdea.Basic_Bert_Unit_model(plm_path='/home/wangp/SDEA-main/pre_trained_models/bert-base-multilingual-uncased', basic_input_dim=768, basic_output_dim=300, dropout=0.1, encoder_model=None)

Bases: Module

The class for the SDEA approach's Basic_Bert_Unit_model unit.

Parameters:
  • plm_path (str) – The local pretrained language model path.

  • basic_input_dim (int) – The input dimension of Basic_Bert_Unit_model.

  • basic_output_dim (int) – The output dimension of Basic_Bert_Unit_model.

  • dropout (float) – The dropout probability (nn.Dropout).

  • encoder_model (transformers.PreTrainedModel) – A user-defined PLM model.

plm_path

The local pretrained language model path.

Type:

str

basic_input_dim

The input dimension of Basic_Bert_Unit_model.

Type:

int

basic_output_dim

The output dimension of Basic_Bert_Unit_model.

Type:

int

dropout

The dropout probability (nn.Dropout).

Type:

float

bert_model

The encoder model used in SDEA.

Type:

transformers.PreTrainedModel

forward(batch)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class matchbench.model.entity_alignment.sdea.BertDataLoader(dataset)

Bases: object

This class mainly tokenizes words and strings into token ids.

static line_to_feature(line: str, tokenizer: BertTokenizer)
load_data(line_solver)
static load_freq(dataset)
static load_saved_data(dataset)
run()
save_data(datas)
save_token_freq(datas) → dict
matchbench.model.entity_alignment.sdea.DBPpreprocess(dataset_att)
class matchbench.model.entity_alignment.sdea.Dataset(no, middle_file)

Bases: object

The class recording the suffixes of intermediate files.

static outputs_csv(name, no, path)
static outputs_python(name, no, path)
static outputs_tab(name, no, path)
class matchbench.model.entity_alignment.sdea.GRUAttnNet(embed_dim, hidden_dim, hidden_layers, dropout=0, device: device = 'cpu')

Bases: Module

attn_net_with_w(rnn_out, rnn_hn, neighbor_mask: Tensor, x)
Parameters:
  • rnn_out – [batch_size, seq_len, n_hidden * 2]

  • rnn_hn – [batch_size, num_layers * num_directions, n_hidden]


build_model(dropout)
forward(x, neighbor_mask)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class matchbench.model.entity_alignment.sdea.Highway(dim, device: device)

Bases: Module

forward(x)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

init_Linear(in_fea, out_fea, bias)
training: bool
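SDEA's Highway unit follows the highway-network pattern; a generic sketch of that pattern (the activation and initialization in the actual unit may differ):

    import torch
    import torch.nn as nn

    class HighwaySketch(nn.Module):
        """y = t * h(x) + (1 - t) * x, where t is a learned sigmoid gate."""

        def __init__(self, dim):
            super().__init__()
            self.transform = nn.Linear(dim, dim)  # h(x) path
            self.gate = nn.Linear(dim, dim)       # t(x) gate

        def forward(self, x):
            t = torch.sigmoid(self.gate(x))
            return t * torch.relu(self.transform(x)) + (1 - t) * x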
class matchbench.model.entity_alignment.sdea.KBStore(dataset=None, dataset_path: Optional[Dataset] = None, relation=1)

Bases: object

The class that generates all intermediate files used, including relation tokens, property tokens, etc.

add_fact(sbj_id, pred_id, obj_id, facts_list: dict) → None
add_item(name: str, names: list, ids: dict) → int
add_to_blocks(sbj_id, obj_id) → None
add_tuple(sbj: str, pred: str, obj: str, file_type: OEAFileType) → None
add_word_level_blocks(entity_id, words)
static calculate_func(r_names: list, r_ids: dict, facts_list: dict, sbj_ids: dict) → list
get_or_add_item(name: str, names: list, ids: dict) → int
get_property_table_line(lines)
load(file_type: OEAFileType) → None
load_entities()
load_facts()
load_kb() None
load_kb_from_saved()
load_literals()
load_properties()
load_relations()
save_base_info()
save_datas()
save_facts()
save_property_table()
save_seq_form(dicts, header)
class matchbench.model.entity_alignment.sdea.OEAFileType(value)

Bases: Enum

The triple types.

attr = 0
rel = 1
ttl_full = 2
class matchbench.model.entity_alignment.sdea.PairwiseDataset(train_tups, ent2data1, ent2data2)

Bases: Dataset

class matchbench.model.entity_alignment.sdea.RelationDataset(train_tups_r, fs1: KBStore, fs2: KBStore, ent2embed1: Tensor, ent2embed2: Tensor)

Bases: Dataset

static get_matrix(neighbors, nm_len, pad_idx)
static get_neighbor_matrix(facts: dict)
class matchbench.model.entity_alignment.sdea.RelationModel(rel_count1, rel_count2, all_embed1_size, all_embed2_size, score_distance_level=2, margin=1, device=device(type='cuda'))

Bases: Module

The class for the SDEA approach's RelationModel unit.

Parameters:
  • rel_count1 (int) – The relations count in source graph.

  • rel_count2 (int) – The relations count in target graph.

  • all_embed1_size (int) – Embedding size, i.e., the number of all entities.

  • all_embed2_size (int) – Embedding size, i.e., the number of all entities.

  • score_distance_level (int) – Distance level in pairwise distance.

  • margin (int) – Margin in MarginRankingLoss.

  • device (torch.device) – cuda or cpu.

rnn

Relation embedding module.

Type:

nn.Module

combiner

Combiner module.

Type:

nn.Module

ent_embedding1

Entity embedding module.

ent_embedding2

Entity embedding module.

case_study(batch, rel_embedding: Embedding, all_embed, mode)
forward(pe1s, pe2s, ne1s, ne2s, bpn1s, bpn2s, bnn1s, bnn2s, bpr1s, bpr2s, bnr1s, bnr2s)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_emb(batch, rel_embedding: Embedding, all_embed, mode)
get_ent_embedding(all_embed1, all_embed2)
static get_neighbors_batch(batch_facts, pad_idx, pad_idxr=None)
get_rel_embeds(batch_neighbors, batch_relations, batch_ent: Tensor, rel_embedding, all_embed: Embedding)
static pos_neg_count(y_true: Tensor, batch_size: int)
training: bool
class matchbench.model.entity_alignment.sdea.RelationValidDataset(ents, fs: KBStore, all_embeds, batch_size)

Bases: object

class matchbench.model.entity_alignment.sdea.SDEA(if_neg_sample=True, max_seq_len=128, nearest_sample_num=128, basic_input_dim=768, basic_output_dim=300, dropout=0.1, margin=1, score_distance_level=2, plm_path=None, tokenizer=None, encodermodel=None)

Bases: EAModel

The class for the SDEA approach.

Parameters:
  • if_neg_sample (bool) – Whether the approach needs negative sampling.

  • max_seq_len (int) – The max length of token sequences fed into PLMs.

  • nearest_sample_num (int) – The generated candidates number when negtive sampling.

Model architecture / Loss:

  • basic_input_dim (int) – The input dimension of Basic_Bert_Unit_model.

  • basic_output_dim (int) – The output dimension of Basic_Bert_Unit_model.

  • dropout (float) – The dropout probability (nn.Dropout).

  • margin (int) – The margin of MarginRankingLoss in stage 1.

Others:

  • plm_path (str) – The local pretrained language model path.

  • tokenizer (transformers.Tokenizer) – A user-defined PLM tokenizer.

  • encodermodel (transformers.PreTrainedModel) – A user-defined PLM model.

if_neg_sample_2

Whether the approach needs negative sampling in the second stage.

Type:

bool

all_embed1s

All entity embeddings.

Type:

tensor

all_embed2s

All entity embeddings.

Type:

tensor

entity_mode

Whether entity relations are given as ids or directly as names: either “name” or “id”.

Type:

str

a(all_embed1s_p, all_embed2s_p, device=None)
calculate_loss(pos_score, neg_score, label)

Calculate the loss of the batch.

Parameters:
  • pos_score (torch.tensor) –

  • neg_score (torch.tensor) –

  • label (torch.tensor) –

Returns:

loss

Return type:

torch.tensor

class_name_str()
encode(batch)

Convert a batch of entity token ids to a batch of entity vector embedding.

Parameters:

batch (a dict of torch.tensor) – Keys: input_ids and attention_mask.

Returns:

The entity embedding.

Return type:

torch.tensor

forward(batch, device=device(type='cuda'))

Convert a batch of four kinds of entity token ids to a batch of positive/negative scores.

Parameters:

batch (a dict of torch.tensor) – Keys: pos1, pos2, neg1, neg2.

Returns:

positive/negative score.

Return type:

tuple of torch.tensor

static get_tensor_data(ents: list, eid2data: dict)
load_source_target(dataset_src, dataset_tgt, mid_file_dir='middle_file/', has_mid_files=False)

Prepare source and target data.

Parameters:
  • dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.

  • dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.

  • mid_file_dir (str, optional, defaults to “middle_file/”) – The directory where intermediate processed files are stored.

  • has_mid_files (bool) – Whether the intermediate files have already been generated.

negative_sample(block_loaders, device=device(type='cuda'), batch_size=24, stage=1)

Negative sampling.

Parameters:
  • block_loaders (tuple of torch.utils.data.Dataloader) – The four loaders described in prepare_dataloader.

  • device (torch.device, optional, defaults to “cuda”) – cuda or cpu.

  • batch_size (int, optional, defaults to 24) – Batch size for the subsequent pairwise training.

  • stage (int, optional, defaults to 1) – An integer indicating which stage the training process is in.

Returns:

Dataloader for the subsequent pairwise training.

Return type:

torch.utils.data.Dataloader

oea_truth_line(line)
prepare_dataloader(dataset, split='train', batch_size=128, device=device(type='cuda'), stage=1)
For split train, prepare the dataloaders for negative sampling.

For split valid and test, prepare the dataloaders only for evaluation.

Parameters:
  • dataset (datasets.arrow_dataset.Dataset) – Train/valid/test pairs.

  • split (str, optional, defaults to “train”) – A string indicating which split the dataset is.

  • stage (int, optional, defaults to 1) – SDEA has two training stages. An integer indicating which stage the training process is in.

  • batch_size (int, optional, defaults to 128) – Batch size for the subsequent negative sampling.

Returns:

For split train, return two train loaders and two all-entity loaders. For split valid and test, return two valid/test loaders.

Return type:

tuple of torch.utils.data.Dataloader

static reduce_tokens(tids, max_len=200)
static reduce_tokens_with_freq(tids, freqs: dict, max_len=200)
run_step(batch, stage=1, device=device(type='cuda'))

The whole process training one batch (step).

Parameters:
  • batch (tuple of torch.tensor) –

  • stage (int, optional, defaults to 1) –

  • device (str, optional, defaults to “cuda”) – cuda or cpu.

Returns:

loss

Return type:

torch.tensor

training: bool
matchbench.model.entity_alignment.sdea.compress_uri(uri, step=1)
matchbench.model.entity_alignment.sdea.load_list(file: str)
matchbench.model.entity_alignment.sdea.load_list_p(file: str)
matchbench.model.entity_alignment.sdea.oea_attr_line(fact, step=1)
matchbench.model.entity_alignment.sdea.oea_rel_line(fact)
matchbench.model.entity_alignment.sdea.save_dict_p(dic: dict, file: str)
matchbench.model.entity_alignment.sdea.save_list(l, file)
matchbench.model.entity_alignment.sdea.save_list_p(l, file: str)
matchbench.model.entity_alignment.sdea.stripSquareBrackets(s)
matchbench.model.entity_alignment.sdea.strip_square_brackets(s)
matchbench.model.entity_alignment.sdea.text_to_word_sequence(text, filters='!"\'#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
matchbench.model.entity_alignment.sdea.ttl_no_compress_line(line)

matchbench.model.entity_alignment.seu module

class matchbench.model.entity_alignment.seu.SEU(depth=2, mode='hungarian')

Bases: EAModel

The class for the SEU approach.

Parameters:
  • depth (int) – The number of iterations.

  • mode (str) – The mode of the assignment operation, either Hungarian or Sinkhorn (see the sketch below).
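For the Hungarian mode, the alignment reduces to a linear assignment problem over a similarity matrix; a sketch with a stand-in matrix (SEU's actual matrix comes from its feature propagation):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    sims = np.random.rand(5, 5)              # stand-in (n_src x n_tgt) similarities
    row, col = linear_sum_assignment(-sims)  # negate to maximise total similarity
    alignment = dict(zip(row, col))          # source index -> target index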

cal_sims(test_pair, feature)
compute_metric(result)

Calculate the hits.

Parameters:

result (torch.tensor) – The top indexes of each testing entity.

Returns:

hits@1

Return type:

float

load_source_target(dataset_src, dataset_tgt, mid_file_dir='middle_file/', has_mid_files=False)

Prepare source and target data.

Parameters:
  • dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.

  • dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.

  • mid_file_dir (str, optional, defaults to “middle_file/”) – The directory where intermediate processed files are stored.

  • has_mid_files (bool) – Whether the intermediate files have already been generated.

predict(dataset, stage=1, train_dataloader=None)
For the SEU approach, the train and test processes are implemented together and cannot be separated.

Parameters:
  • dataset (datasets.arrow_dataset.Dataset) – Train/test pairs.

  • stage (int, optional, defaults to 1) – Unused in SEU.

  • train_dataloader (optional) – Train pairs.

Returns:

The hits@1 score.

Return type:

float

prepare_dataloader(dataset, split='test', batch_size=1024, stage=1)

Prepare the pairs for both training and testing.

Parameters:
  • dataset (datasets.arrow_dataset.Dataset) – Train/test pairs.

  • split (str, optional, defaults to “test”) – A string indicating which split the dataset is.

  • batch_size (int, optional, defaults to 1024) – Unused.

  • stage (int, optional, defaults to 1) – Unused.

Returns:

Train or test pairs.

Return type:

np.array

training: bool
matchbench.model.entity_alignment.seu.load_aligned_pair(file_path, ratio=0.3)
matchbench.model.entity_alignment.seu.load_triples(triples1, triples2, reverse=True)
matchbench.model.entity_alignment.seu.test(sims, mode='sinkhorn', batch_size=1024)

Module contents