Preface
Reading notes on “Text-Augmented Knowledge Representation Learning Based on Convolutional Network”
Introduction Notes
Knowledge graphs (KGs) store relational triples (subject, relation, object), denoted as (h, r, t), which are useful for information retrieval tasks.
KG Completion or Link Prediction
Much research has focused on the knowledge graph completion (link prediction) task, which aims to predict missing triples in KGs.
- $\text{RESCAL model}$: Matrix factorization for knowledge representation learning.
- $\text{DistMult model}$: Restricts the relation matrix to a diagonal matrix.
- $\text{HolE model}$: Uses the “circular correlation” of the head and tail entity vectors to represent the entity pair.
- $\text{ConvE model}$: The first model to apply a convolutional neural network to the knowledge graph completion task.
- $\text{ConvKB model}$: Applies a convolutional neural network to compute the score of the input triples.
Importance of Text Descriptions
The semantic information expressed in entity descriptions can help identify true triples.
For example, consider the triple (Barack Hussein Obama Sr, parent of, Barack Obama):
- Indicates “Barack Hussein Obama Sr” is the parent of “Barack Obama”.
- We don’t know whether “parent of” refers to “father” or “mother”.
The text description helps confirm the true entity.
- In the text description of the true entity, the keywords are “he” and “his wife”.
- The text description of the false entity contains “she” and “mother”.
- Description information can therefore improve performance on tasks such as link prediction.
TA-ConvKB Model
Because the ConvKB model only learns from triples in the KG and does not utilize description information, the authors propose a text-augmented method based on ConvKB (TA-ConvKB). It combines the information of factual triples and descriptions to improve the accuracy of knowledge graph link prediction.
$\textbf{Steps}$
- Preprocess the entity descriptions and extract multiple keywords for each entity.
- Use fastText to encode those words into vectors.
- Input word embeddings into the A-BiLSTM encoder to extract distinctive features.
- Acquire a text vector that best expresses the entity information
- Use a gate mechanism (inspired by LSTM gate units) to integrate the structural embeddings obtained from the STransE model with the textual embeddings.
Contributions
- Introduce the TA-ConvKB model and use an A-BiLSTM to encode entity descriptions.
- Use a brand-new gate mechanism to combine the two types of embeddings.
- Achieve better link prediction results than the ConvKB model on the FB15k-237 and WN18RR datasets. Combining with other models such as TransH also yields good results.
Related Work Notes
TransE Model
The TransE model regards a relation as a translation from the head entity to the tail entity in the same low-dimensional space. The score function is $f(h, r, t) = \|h + r - t\|_{L_{n}}$, i.e., it uses the $L_{n}$ distance between the translated head entity $h + r$ and the tail entity $t$.
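As a quick illustration of this score, here is a minimal numpy sketch (my own toy example, not code from the paper); it treats the $L_{n}$ distance as an ordinary vector $p$-norm.

```python
import numpy as np

def transe_score(h: np.ndarray, r: np.ndarray, t: np.ndarray, p: int = 1) -> float:
    """TransE plausibility score: the L_p distance between the translated head
    (h + r) and the tail t. Lower scores indicate more plausible triples."""
    return float(np.linalg.norm(h + r - t, ord=p))

# Toy 4-dimensional embeddings, random for illustration only.
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 4))
print(transe_score(h, r, t, p=1))   # L1 distance
print(transe_score(h, r, t, p=2))   # L2 distance
```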
ConvKB Model
ConvKB uses a convolutional neural network to capture the global relationships and transition characteristics between entities and relations in the knowledge base.
$\textbf{Representation of triples}$
Each triple (head, relation, tail) is represented as a three-column matrix.
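A minimal numpy sketch of this representation, with toy dimensions (my own illustration, not the authors' implementation): the triple is stacked into a $d \times 3$ matrix and a single $1 \times 3$ filter is slid over its rows, producing one feature map of length $d$.

```python
import numpy as np

d = 5                                  # toy embedding dimension
rng = np.random.default_rng(1)
h, r, t = rng.normal(size=(3, d))      # head, relation and tail embeddings

A = np.stack([h, r, t], axis=1)        # d x 3 matrix, one column per triple element
omega = np.array([0.1, 0.1, -0.1])     # one 1 x 3 convolution filter

# Sliding the 1 x 3 filter over each row of A is a row-wise dot product,
# which yields a feature map of length d; ReLU is applied afterwards.
feature_map = np.maximum(A @ omega, 0.0)
print(feature_map.shape)               # (5,)
```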
Others
Other work introduces methods to enhance KG representation learning by utilizing text information in the knowledge base. In general, text-augmented embedding models achieve state-of-the-art performance by integrating knowledge and text.
Text-Augmented Knowledge Graph Representation Notes
Define a knowledge graph as $G = (E, R, T)$,
where E is the set of entities, R is the set of relation types, and T is the set of factual triples.
For each knowledge triple $(h, r, t) \in T$, we have
$\textbf{two kinds of entity representation types}$:
- structure representation $e_{s}$
- text representation $e_{d}$
For a given knowledge triple $(h, r, t) \in T$,
- $h_{s} \in e_{s}$ represents the structural representation of the head
- $t_{s} \in e_{s}$ represents the structural representation of the tail
- $h_{d} \in e_{d}$ represents the description text encoding of the head
- $t_{d} \in e_{d}$ represents the description text encoding of the tail
- $R^{d}$ denotes the space of the corresponding low-dimensional vectors of entities, relations, and descriptions. For example, the embeddings of $h, t \in E$ and $r \in R$ are $h, t, r \in R^{d}$ respectively, i.e., $d$-dimensional vectors.
Neural Network Text Encoding
$\textbf{Preprocess}$:
- Remove all stop words
- Mark all the words (and the phrases that are entity names in the training set) in the descriptions
- Extract multiple topic keywords for each entity as its description
- Use fastText to encode the keywords into word vectors as the input to the A-BiLSTM encoder (a sketch of this preprocessing follows this list)
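A rough sketch of this preprocessing, under my own assumptions: the keyword extraction here is a simple stop-word filter plus frequency count (the notes do not record the exact extraction method), and the fastText model path in the usage comment is a placeholder for a pretrained 100-dimensional model.

```python
import collections
import re

# A tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "was", "to", "in", "for", "on"}

def extract_keywords(description: str, k: int = 10) -> list[str]:
    """Very simple keyword extraction: drop stop words, keep the k most frequent tokens."""
    tokens = re.findall(r"[a-z]+", description.lower())
    counts = collections.Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(k)]

def encode_keywords(keywords: list[str], ft_model) -> list:
    """Encode each keyword as a fastText word vector (the A-BiLSTM input sequence)."""
    return [ft_model.get_word_vector(w) for w in keywords]

# Hypothetical usage with the official fasttext package (pip install fasttext):
#   import fasttext
#   ft = fasttext.load_model("fasttext-100d.bin")     # placeholder path
#   vecs = encode_keywords(extract_keywords(description_text, k=10), ft)
```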
$\textbf{BiLSTM Encoder}$:
To construct the BiLSTM, the author applies a forward and a backward LSTM network to each training sequence and connects both to the same output layer. This provides the complete contextual information of each sequence point to the output layer.
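A minimal PyTorch sketch of such a BiLSTM encoder (my own illustration; the 100-dimensional inputs and 50 hidden units per direction follow the settings reported later in these notes, everything else is an assumption):

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Runs a forward and a backward LSTM over the keyword vectors and exposes both
    directions at every position, so each output carries the full sequence context."""

    def __init__(self, input_dim: int = 100, hidden_dim: int = 50):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, word_vecs: torch.Tensor) -> torch.Tensor:
        # word_vecs: (batch, seq_len, input_dim)
        outputs, _ = self.lstm(word_vecs)
        return outputs                     # (batch, seq_len, 2 * hidden_dim)

# Example: one entity with 10 keyword vectors of dimension 100.
encoder = BiLSTMEncoder()
h = encoder(torch.randn(1, 10, 100))
print(h.shape)                             # torch.Size([1, 10, 100])
```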
$\textbf{Self Attentive BiLSTM Encoder}$:
The author introduces an attention mechanism into the BiLSTM, so that the encoding depends on the different relations of the context, in order to improve the text representations.
The author uses the attention mechanism to extract the salient features from the BiLSTM outputs $h_{i}$.
For each position $i$ of the text description, the attention weight given to the word representation $h_{i}$ is defined as $a_{i}$; in the standard additive self-attention form this is
$a_{i} = \frac{\exp\left(v_{a}^{\top}\tanh(W_{a}h_{i} + b_{a})\right)}{\sum_{j=1}^{n}\exp\left(v_{a}^{\top}\tanh(W_{a}h_{j} + b_{a})\right)}$
where $h_{i} \in R^{d}$ is the output of the BiLSTM at position $i$, $v_{a} \in R^{d}$ is a parameter vector, $W_{a} \in R^{d \times d}$ is a parameter matrix, and $b_{a} \in R^{d}$ is a bias vector.
Therefore, using the attention mechanism, we can see which parts of the embeddings are attended to. The final description embedding is $e_{d} = \sum_{i=1}^{n} a_{i} \cdot h_{i}$.
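A PyTorch sketch of this attentive pooling (my own illustration, following the additive form written above; the dimensions match the 100-dimensional BiLSTM outputs):

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Scores each BiLSTM output h_i with v_a^T tanh(W_a h_i + b_a), softmax-normalizes
    the scores into weights a_i, and returns the weighted sum as the text embedding e_d."""

    def __init__(self, dim: int = 100):
        super().__init__()
        self.proj = nn.Linear(dim, dim)             # W_a and b_a
        self.v_a = nn.Parameter(torch.randn(dim))   # scoring vector v_a

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, dim) -- the BiLSTM outputs
        scores = torch.tanh(self.proj(h)) @ self.v_a   # (batch, seq_len)
        a = torch.softmax(scores, dim=-1)              # attention weights a_i
        return (a.unsqueeze(-1) * h).sum(dim=1)        # e_d: (batch, dim)

pool = AttentivePooling()
e_d = pool(torch.randn(1, 10, 100))
print(e_d.shape)                                        # torch.Size([1, 100])
```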
Joint Structure and Text Encoding
To balance the two sources of information, the author combines $e_{s}$ and $e_{d}$ for an entity $e$.
Linear Interaction
$e = \alpha \cdot e_{s} + (1 - \alpha) \cdot e_{d}$, where $0 < \alpha < 1$ is a weighting coefficient whose best value is hard to determine.
Second-order Interaction
Referring to the gated mechanism in LSTM gate units:
where $\otimes$ denotes element-wise (pointwise) multiplication.
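The notes do not reproduce the exact gate equation, so the following is only a plausible sketch in the spirit of LSTM gates: a sigmoid gate vector $g$ is computed from both embeddings and mixes them element-wise as $g \otimes e_{s} + (1 - g) \otimes e_{d}$. This is my assumption, not the paper's formula.

```python
import torch
import torch.nn as nn

class GatedCombination(nn.Module):
    """Element-wise gate g in (0, 1)^d that decides, per dimension, how much of the
    structural embedding e_s versus the textual embedding e_d to keep (assumed form)."""

    def __init__(self, dim: int = 100):
        super().__init__()
        self.gate_layer = nn.Linear(2 * dim, dim)

    def forward(self, e_s: torch.Tensor, e_d: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_layer(torch.cat([e_s, e_d], dim=-1)))
        return g * e_s + (1.0 - g) * e_d   # element-wise ("point") multiplication

combine = GatedCombination()
e = combine(torch.randn(1, 100), torch.randn(1, 100))
print(e.shape)                             # torch.Size([1, 100])
```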
Triple Embedding
$e$ is the final entity representation. Because there are no descriptions for relations, the structural embeddings are also used as the textual embeddings for relations. The triple embeddings, consisting of entity embeddings and relation embeddings, are fed into a CNN to extract features for link prediction.
Training and Score Function
- $\kappa$ denotes the set of filters
- $\tau$ denotes the number of filters
- $\tau = |\kappa|$ results in $\tau$ feature maps
- Concatenate the $\tau$ feature maps into a vector $\in R^{\tau d \times 1}$
- For the triple $(h, r, t)$, the score function of TA-ConvKB is $f(h, r, t) = \mathrm{concat}\big(g([h, r, t] * \kappa)\big) \cdot w$, where $w \in R^{\tau d \times 1}$, $g$ is the ReLU activation function, and $*$ denotes the convolution operator.
- Use the Adam optimizer and the loss function $\mathcal{L} = \sum_{(h,r,t) \in G \cup G^{'}} \log\big(1 + \exp\big(l_{(h,r,t)} \cdot f(h, r, t)\big)\big) + \frac{\lambda}{2}\|w\|_{2}^{2}$, where $l_{(h,r,t)} = 1$ when $(h, r, t) \in G$ and $l_{(h,r,t)} = -1$ when $(h, r, t) \in G^{'}$; here $G^{'}$ is a collection of invalid triples generated by corrupting valid triples in $G$ (a sketch of this scorer and loss follows this list).
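Putting the pieces together, here is a hedged PyTorch sketch of a ConvKB-style scorer with the soft-margin loss above; the dimensions, initialization, and batch handling are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvKBScorer(nn.Module):
    """Stacks [h, r, t] into a d x 3 matrix, applies tau filters of shape 1 x 3,
    concatenates the tau feature maps into a (tau * d)-vector and projects it to a
    scalar score with the weight vector w."""

    def __init__(self, d: int = 100, tau: int = 40):
        super().__init__()
        self.conv = nn.Conv2d(1, tau, kernel_size=(1, 3))    # tau filters of shape 1 x 3
        self.w = nn.Linear(tau * d, 1, bias=False)           # weight vector w

    def forward(self, h, r, t):
        # h, r, t: (batch, d)  ->  A: (batch, 1, d, 3)
        A = torch.stack([h, r, t], dim=-1).unsqueeze(1)
        feature_maps = F.relu(self.conv(A))                   # (batch, tau, d, 1)
        v = feature_maps.flatten(start_dim=1)                 # (batch, tau * d)
        return self.w(v).squeeze(-1)                          # (batch,) triple scores

def soft_margin_loss(scores, labels, w_weight, lam=0.005):
    """L = sum log(1 + exp(l * f(h, r, t))) + (lam / 2) * ||w||^2, labels in {+1, -1}."""
    return F.softplus(labels * scores).sum() + 0.5 * lam * w_weight.pow(2).sum()

# Toy batch of 8 triples: 4 valid (label +1) and 4 corrupted (label -1).
scorer = ConvKBScorer()
scores = scorer(torch.randn(8, 100), torch.randn(8, 100), torch.randn(8, 100))
labels = torch.tensor([1., -1., 1., -1., 1., -1., 1., -1.])
print(soft_margin_loss(scores, labels, scorer.w.weight).item())
```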
Experiments Notes
Dataset
WN18RR and FB15k-237.
Assessment Strategy
$\textbf{KG link prediction task}$: the goal is to predict a missing entity given a relation and the other entity. The author ranks test triples by the scores of the score function to obtain the results. Corrupted triples are constructed by replacing one of the two entities in a triple.
$\textbf{Questions to ask}$
“We use the “Filtered” setting protocol, not taking any corrupted triples that appear in the KG into account.”
$\textbf{Three metrics}$:
- mean rank (MR): lower is better
- mean reciprocal rank (MRR): higher is better
- H@10 (i.e., the proportion of valid test triples ranked in the top 10 predictions): higher is better (a sketch of these metrics follows this list)
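A small sketch of how MR, MRR, and H@10 can be computed once the filtered rank of the true entity is known for each test triple (my own illustration; the ranks below are made-up numbers):

```python
import numpy as np

def ranking_metrics(ranks) -> dict:
    """Compute MR (lower is better), MRR and H@10 (higher is better) from the
    filtered ranks of the correct entities."""
    ranks = np.asarray(ranks, dtype=float)
    return {
        "MR": ranks.mean(),
        "MRR": (1.0 / ranks).mean(),
        "H@10": (ranks <= 10).mean(),
    }

# Toy example: filtered ranks of the true entity over six test triples.
print(ranking_metrics([1, 3, 12, 2, 57, 8]))
```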
Training Strategy
Text Embeddings
$\textbf{In FB15k-237}$
- 10 keywords are extracted for each entity description
- fastText is used to encode each keyword as a 100-dimensional word vector
- In the A-BiLSTM encoder, the embedding dimension is 100, the sequence length is 10, and the number of hidden-layer units is 50
- Finally, a 100-dimensional text vector is output (50 units per direction, concatenated)
$\textbf{In WN18RR}$
- Same as FB15k-237, except that only 3 keywords are extracted for each entity
Structural Embeddings
- Train STransE for 3,000 epochs
Hyperparameters
- 200 epochs
- Adam as optimizer
- ReLU as the activation function
- Batch size at 256
- L2-regularizer $\lambda$ at 0.005
- In WN18RR: k = 100, τ = 200, a truncated normal distribution for filter $w$ initialization, and an initial learning rate of 1e−4
- In FB15k-237: k = 100, τ = 40, [0.1, 0.1, −0.1] for filter $w$ initialization, and an initial learning rate of 5e−6
Main Experimental Results
On both datasets, TA-ConvKB achieves the highest H@10 score compared with the other models.
Future Work Notes
- Use more accurate word vector representations, such as the ELMo model
- Explore other kinds of combinations of structural embeddings and textual embeddings
- Integrate the image information of the entities