matchbench.model.schema_matching package
Submodules
matchbench.model.schema_matching.embdi module
- class matchbench.model.schema_matching.embdi.Edge(node_from, node_to, weight_forward=1, weight_back=1)
Bases:
object
- class matchbench.model.schema_matching.embdi.EdgeList(df, prefixes=['3#__tn', '3$__tt', '5$__idx', '1$__cid'], info_file=None, smoothing_method='no', flatten=False)
Bases:
object
Edgelist class. This class contains all the methods and attributes needed to build a graph for EmbDI. It also includes a number of functions for the assignment of weights to the edges.
- static convert_cell_value(original_value)
Convert cell values to strings. Round float values.
- Parameters:
original_value (Any) – The value to convert to str.
- Returns:
The converted value.
- Return type:
str
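The conversion described above can be sketched as follows. This is only an illustration, not the library's implementation: the function name `convert_cell_value_sketch` and the `ndigits` parameter are hypothetical, and rounding floats to the nearest integer by default is an assumption, since the docstring does not state the precision.

```python
import math


def convert_cell_value_sketch(original_value, ndigits=0):
    """Hypothetical stand-in: turn a cell value into a string, rounding floats first."""
    if isinstance(original_value, float) and math.isfinite(original_value):
        rounded = round(original_value, ndigits)
        if ndigits == 0:
            # Assumption: whole-number rounding, so 3.0 becomes "3" rather than "3.0".
            return str(int(rounded))
        return str(rounded)
    # Non-float values (and NaN/inf) are stringified unchanged.
    return str(original_value)


print(convert_cell_value_sketch(3.7))
print(convert_cell_value_sketch("Paris"))
```

Normalising numeric cells this way means that values such as 3.0 and 3 map to the same string, and therefore to the same graph node.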
- convert_to_dict()
- convert_to_numeric()
- static evaluate_frequencies(flatten, df, intersection)
- static f_no_smoothing()
Uniform weights.
- static find_intersection_flatten(df, info_file)
- get_edgelist()
Return the list of edges.
- Returns:
List of edges in the format (node_1, node_2, weight_12, weight_21).
- Return type:
list
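To make the edge format concrete, the sketch below consumes a list of such tuples and expands it into a weighted adjacency map. The helper `build_adjacency` and the node names are hypothetical, and reading weight_12 as the weight from node_1 to node_2 (and weight_21 as the reverse) is an assumption based on the naming.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Expand (node_1, node_2, weight_12, weight_21) tuples into a nested dict."""
    adjacency = defaultdict(dict)
    for node_1, node_2, weight_12, weight_21 in edges:
        adjacency[node_1][node_2] = weight_12  # assumed direction: node_1 -> node_2
        adjacency[node_2][node_1] = weight_21  # assumed direction: node_2 -> node_1
    return dict(adjacency)


# Illustrative edges only; real node names depend on the configured prefixes.
edges = [
    ("idx__0", "tt__paris", 1.0, 0.5),
    ("idx__1", "tt__paris", 1.0, 0.5),
]
adjacency = build_adjacency(edges)
print(adjacency["tt__paris"])
```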
- get_prefixes()
- static inverse_freq(freq)
- static inverse_smooth(x, s)
- static log_freq(freq, base=10)
- static prepare_split(cell_value, flatten, intersection)
- static smooth_exp(x, eps=0.01, target=10, k=0.5)
Weight function that assigns weights based on a node’s frequency. Nodes with degree 1 receive the highest weight (1); nodes with higher degree receive a decreasing weight whose minimum value is k. The parameter “target” is the frequency beyond which each node receives the lowest weight. The function is defined so that f(target) ≈ k + eps.
- Parameters:
x (np.ndarray) – The vector of frequencies.
eps (float) – Internal parameter. Defaults to 0.01.
target (int) – Frequency value beyond which all nodes are assigned the minimum weight. Defaults to 10.
k (float) – The minimum weight. Defaults to 0.5.
- Returns:
Weight vector.
- Return type:
np.ndarray
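One exponential decay that satisfies the stated properties, f(1) = 1 and f(target) = k + eps, is sketched below. It is a reconstruction from the description rather than the library's confirmed code; the name `smooth_exp_sketch` is hypothetical, and plain Python lists are used in place of NumPy arrays to keep the example dependency-free.

```python
def smooth_exp_sketch(x, eps=0.01, target=10, k=0.5):
    """Map frequencies to weights that decay from 1 towards the floor k."""
    # Pick the base t so that (1 - k) * t ** (1 - target) equals eps,
    # which gives f(target) = k + eps.
    t = (eps / (1 - k)) ** (1 / (1 - target))
    # For freq = 1 the exponent is 0, so the weight is (1 - k) + k = 1.
    return [(1 - k) * t ** (1 - freq) + k for freq in x]


weights = smooth_exp_sketch([1, 5, 10, 50])
print(weights)
```

With the default arguments, the weight at frequency 10 is about 0.51, and larger frequencies approach 0.5 from above without going below it.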
- smooth_freq(freq, eps=0.01)
- class matchbench.model.schema_matching.embdi.Embdi(sentence_length: int = 60, backtrack: bool = True, repl_numbers: bool = False, repl_strings: bool = False)
Bases:
SMModel
- load_source_target(dataset_src, dataset_tgt, **kwargs)
Prepare source and target data.
- Parameters:
dataset_src (datasets.arrow_dataset.Dataset) – Source dataset.
dataset_tgt (datasets.arrow_dataset.Dataset) – Target dataset.
- predict()
Make predictions.
- Returns:
Match results.
- Return type:
list
- train(dimensions, window_size, training_algorithm, learning_method, workers, sampling_factor)
Train the embeddings on the given walks corpus. Several parameters are available to tune the training procedure. The resulting embedding file is saved to the given path for later use in the experimental phase.
- Parameters:
dimensions (int) – Number of dimensions to use when training the model.
window_size (int) – Size of the context window.
training_algorithm (str) – Either fasttext or word2vec.
learning_method (str) – Skipgram or CBOW.
workers (int) – Number of CPU workers to use during training. Default = mp.cpu_count().
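As a rough illustration of how these arguments might be handed to an embedding backend, the sketch below translates them into keyword arguments in the style of gensim 4's `Word2Vec` and `FastText` constructors (`vector_size`, `window`, `sg`, `workers`, `sample`). The helper name `embedding_kwargs_sketch` is hypothetical, and both the choice of gensim and the mapping of `sampling_factor` to `sample` are assumptions; the documentation above does not describe how `train` uses them internally.

```python
import multiprocessing as mp


def embedding_kwargs_sketch(dimensions, window_size, learning_method,
                            workers=None, sampling_factor=0.001):
    """Hypothetical translation of train() arguments into gensim-style kwargs."""
    method = learning_method.lower()
    if method not in ("skipgram", "cbow"):
        raise ValueError(f"Unknown learning method: {learning_method!r}")
    return {
        "vector_size": dimensions,
        "window": window_size,
        "sg": 1 if method == "skipgram" else 0,  # gensim: 1 = skip-gram, 0 = CBOW
        "workers": workers if workers is not None else mp.cpu_count(),
        "sample": sampling_factor,  # assumed to be the downsampling threshold
    }


kwargs = embedding_kwargs_sketch(300, 3, "skipgram", workers=4)
print(kwargs)
```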
- class matchbench.model.schema_matching.embdi.Graph(edgelist, prefixes, sim_list=None, flatten=[])
Bases:
object
- add_edge(node_from, node_to, weight_forward, weight_back=None)
- add_similarities(sim_list)
- compute_n_sentences(sentence_length, factor=1000)
Compute the default number of sentences according to the rule of thumb:
n_sentences = n_nodes * representation_factor // sentence_length
- Parameters:
sentence_length (int) – Target sentence length.
factor (int) – “Desired” number of occurrences of each node.
- Returns:
n_sentences
- Return type:
int
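The rule of thumb can be checked with a small standalone function. This sketch assumes that representation_factor in the formula corresponds to the factor argument, and it takes the node count as an explicit n_nodes parameter, whereas the method presumably reads it from the graph; the function name is hypothetical.

```python
def compute_n_sentences_sketch(n_nodes, sentence_length, factor=1000):
    """Rule of thumb: enough sentences for each node to appear about `factor` times."""
    return n_nodes * factor // sentence_length


# A graph with 1200 nodes and sentences of 60 tokens.
n_sentences = compute_n_sentences_sketch(n_nodes=1200, sentence_length=60)
print(n_sentences)  # 20000
```

Because the multiplication happens before the integer division, the result is rounded down only once, at the end.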
- get_graph()
- get_node_list()
- produce_intersection(intersecting_nodes)
- class matchbench.model.schema_matching.embdi.Node(name, type, node_class, numeric)
Bases:
object
Cell class used to describe the nodes that build the graph.
- add_neighbor(neighbor, weight)
- add_similar(other, distance)
- get_random_neighbor()
- get_random_replacement()
- get_random_start()
- normalize_neighbors()
- set_frequency(frequency)