matchbench.model.schema_matching package

Submodules

matchbench.model.schema_matching.embdi module

class matchbench.model.schema_matching.embdi.Edge(node_from, node_to, weight_forward=1, weight_back=1)

Bases: object

class matchbench.model.schema_matching.embdi.EdgeList(df, prefixes=['3#__tn', '3$__tt', '5$__idx', '1$__cid'], info_file=None, smoothing_method='no', flatten=False)

Bases: object

Edgelist class. This class contains all the methods and attributes needed to build a graph for EmbDI. It also includes a number of functions for the assignment of weights to the edges.

static convert_cell_value(original_value)

Convert cell values to strings, rounding float values.

Parameters:

original_value (Any): The value to convert to str.

Returns:

The converted value.

Return type:

str
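The conversion described above can be illustrated with a minimal sketch. It is not the library's implementation: the function name `cell_to_str`, the `ndigits` parameter and the choice to print whole floats without a decimal part are all assumptions made for this example, since the reference only states that floats are rounded and the result is a string.

```python
def cell_to_str(original_value, ndigits=2):
    """Hypothetical stand-in for a cell-to-string conversion.

    Assumption: floats are rounded to `ndigits` places; the actual
    rounding rule used by EdgeList.convert_cell_value is not documented here.
    """
    if isinstance(original_value, float):
        rounded = round(original_value, ndigits)
        # Print whole numbers such as 2.0 as "2" so they match integer cells.
        if rounded.is_integer():
            return str(int(rounded))
        return str(rounded)
    # Every other type falls back to its plain string form.
    return str(original_value)


examples = [cell_to_str(v) for v in (3.14159, 2.0, 7, "rome")]
```

Normalizing values this way means that the same quantity stored as `2` in one table and `2.0` in another maps to a single token.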

convert_to_dict()
convert_to_numeric()
static evaluate_frequencies(flatten, df, intersection)
static f_no_smoothing()

Uniform weights.

static find_intersection_flatten(df, info_file)
get_edgelist()

Return the list of edges.

Returns:

List that contains edges in the format (node_1, node_2, weight_12, weight_21).

Return type:

list
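As an illustration of the tuple format above, the following sketch turns such a list into a weighted adjacency mapping. The helper `build_adjacency`, the node names and the reading of `weight_12` as the weight from `node_1` to `node_2` (and `weight_21` as the reverse) are assumptions for this example; they are not taken from the library.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Hypothetical helper: map each node to {neighbor: weight}."""
    adjacency = defaultdict(dict)
    for node_1, node_2, weight_12, weight_21 in edges:
        # Assumed direction: weight_12 applies to the step node_1 -> node_2.
        adjacency[node_1][node_2] = weight_12
        adjacency[node_2][node_1] = weight_21
    return dict(adjacency)


# Made-up edges in the (node_1, node_2, weight_12, weight_21) format.
edges = [
    ("idx__0", "tt__rome", 1, 1),
    ("tt__rome", "cid__city", 1, 0.5),
]
adjacency = build_adjacency(edges)
```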

get_prefixes()
static inverse_freq(freq)
static inverse_smooth(x, s)
static log_freq(freq, base=10)
static prepare_split(cell_value, flatten, intersection)
static smooth_exp(x, eps=0.01, target=10, k=0.5)

Weight function that assigns weights based on a node’s frequency. Nodes with degree 1 have the highest weight (1); nodes with higher degree are assigned a decreasing weight with minimum value k. The parameter “target” is the frequency beyond which each node receives the lowest weight. The function is defined so that f(target) ≈ k + eps.

Parameters:

x (np.Array): The vector of frequencies.
eps (float): Internal parameter. Defaults to 0.01.
target (int): Frequency value beyond which all nodes are assigned the minimum weight. Defaults to 10.
k (float): The minimum weight. Defaults to 0.5.

Returns:

Weight vector.

Return type:

np.Array
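One function with the properties described above is an exponential decay from 1 toward k. The sketch below is only an assumption about the shape of such a function, not necessarily what `smooth_exp` does internally. It uses plain Python lists instead of NumPy arrays to stay dependency-free, and the name `exp_decay_weights` is hypothetical.

```python
import math


def exp_decay_weights(freqs, eps=0.01, target=10, k=0.5):
    """Hypothetical weight function with f(1) = 1 and f(target) = k + eps.

    f(x) = k + (1 - k) * exp(-rate * (x - 1)), where `rate` is chosen so
    that the decaying term equals eps exactly at x = target.
    """
    rate = math.log((1 - k) / eps) / (target - 1)
    return [k + (1 - k) * math.exp(-rate * (x - 1)) for x in freqs]


weights = exp_decay_weights([1, 2, 5, 10, 100])
```

With the default parameters, the weight starts at 1 for frequency 1, reaches k + eps = 0.51 at frequency 10, and approaches 0.5 beyond that.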

smooth_freq(freq, eps=0.01)
class matchbench.model.schema_matching.embdi.Embdi(sentence_length: int = 60, backtrack: bool = True, repl_numbers: bool = False, repl_strings: bool = False)

Bases: SMModel

load_source_target(dataset_src, dataset_tgt, **kwargs)

Prepare source and target data.

Parameters:

dataset_src (datasets.arrow_dataset.Dataset): Source dataset.
dataset_tgt (datasets.arrow_dataset.Dataset): Target dataset.

predict()

Make predictions.

Returns:

Match results.

Return type:

list

train(dimensions, window_size, training_algorithm, learning_method, workers, sampling_factor)

Train the embeddings on the given walks corpus. Several parameters are available to tune the training procedure. The resulting embedding file is saved to the given path for later use in the experimental phase.

Parameters:

dimensions (int): Number of dimensions to use when training the model.
window_size (int): Size of the context window.
training_algorithm (str): Either fasttext or word2vec.
learning_method (str): Skipgram or CBOW.
workers (int): Number of CPU workers to use during training. Default = mp.cpu_count().

training: bool
class matchbench.model.schema_matching.embdi.Graph(edgelist, prefixes, sim_list=None, flatten=[])

Bases: object

add_edge(node_from, node_to, weight_forward, weight_back=None)
add_similarities(sim_list)
compute_n_sentences(sentence_length, factor=1000)

Compute the default number of sentences according to the rule of thumb:

n_sentences = n_nodes * representation_factor // sentence_length

Parameters:

sentence_length (int): Target sentence length.
factor (int): “Desired” number of occurrences of each node.

Returns:

n_sentences

Return type:

int
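The rule of thumb above can be written out as a short sketch. It assumes that `representation_factor` in the formula corresponds to the `factor` argument, and it takes the node count as an explicit argument, whereas the method presumably reads it from the graph; the name `default_n_sentences` is hypothetical.

```python
def default_n_sentences(n_nodes, sentence_length, factor=1000):
    """Hypothetical restatement of the rule of thumb.

    Assumption: `factor` plays the role of `representation_factor`.
    """
    # Integer division: total desired node occurrences spread over
    # sentences of the given length.
    return n_nodes * factor // sentence_length


n_sentences = default_n_sentences(n_nodes=100, sentence_length=60)
```

For 100 nodes, a factor of 1000 and sentences of length 60, this gives 100000 // 60 = 1666 sentences.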

get_graph()
get_node_list()
produce_intersection(intersecting_nodes)
class matchbench.model.schema_matching.embdi.Node(name, type, node_class, numeric)

Bases: object

Cell class used to describe the nodes that build the graph.

add_neighbor(neighbor, weight)
add_similar(other, distance)
get_random_neighbor()
get_random_replacement()
get_random_start()
normalize_neighbors()
set_frequency(frequency)
class matchbench.model.schema_matching.embdi.RandomWalk(graph_nodes, starting_node_name, sentence_len, backtrack, uniform=True, repl_strings=True, repl_numbers=True, follow_replacement=False)

Bases: object

get_both_walks()
get_reversed_walk()
get_walk()
replace_numeric_value(value, nodes)
replace_string_value(value: Node)

Module contents