Preface
Reading notes on “Text-Augmented Knowledge Representation Learning Based on Convolutional Network”
Introduction Notes
Knowledge graphs (KGs) store relational triples (subject, relation, object), denoted as (h, r, t), which are useful for information retrieval tasks.
KG Completion or Link Prediction
Much research has focused on the knowledge graph completion (link prediction) task, which aims to predict missing triples in KGs.
- $\text{RESCAL model}$: Matrix factorization for knowledge representation learning.
- $\text{DistMult model}$: Restricts the relation matrix to a diagonal matrix.
- $\text{HolE model}$: Uses the “circular correlation” of the head and tail entity vectors to represent the entity pair.
- $\text{ConvE model}$: The first model to apply a convolutional neural network to the knowledge graph completion task.
- $\text{ConvKB model}$: Applies a convolutional neural network to compute the score of the input triples.
Importance of Text Descriptions
The semantic information expressed in entity descriptions can help identify true triples.
For example, consider the triple (Barack Hussein Obama Sr, parent of, Barack Obama):
- Indicates “Barack Hussein Obama Sr” is the parent of “Barack Obama”.
- We don’t know whether “parent of” refers to “father” or “mother”.
The text description helps confirm the true entity.
- In the text description of the true entity, the keywords are “he” and “his wife”.
- The text description of the false entity contains “she” and “mother”.
- Description information can therefore improve performance on tasks such as link prediction.
TA-ConvKB Model
Because the ConvKB model only learns from triples in the KG and does not utilize description information, the authors propose a text-augmented method based on ConvKB (TA-ConvKB). It combines the information of factual triples and descriptions to improve the accuracy of knowledge graph link prediction.
$\textbf{Steps}$
- Preprocess the entity descriptions and extract multiple keywords for each entity.
- Use fastText to encode those words into vectors.
- Input word embeddings into the A-BiLSTM encoder to extract distinctive features.
- Acquire a text vector that best expresses the entity information
- Use a gate mechanism (inspired by LSTM gate units) to integrate the structural embeddings obtained from the STransE model with the textual embeddings.
Contributions
- Introduce the TA-ConvKB model and use an A-BiLSTM to encode entity descriptions.
- Use a brand-new gate mechanism to combine the two types of embeddings.
- Achieve better link prediction results than the ConvKB model on the FB15k-237 and WN18RR datasets. Combining with other models such as TransH also yields good results.
Related Work Notes
TransE Model
The TransE model regards a relation as a translation from the head entity to the tail entity in the same low-dimensional space. The score function is $f(h, r, t) = \|h + r - t\|_{L_{n}}$, i.e., it uses the $L_{n}$ distance between the translated head entity $h + r$ and the tail entity $t$.
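As a quick illustration of this score, here is a minimal numpy sketch (my own toy example, not code from the paper); it treats the $L_{n}$ distance as an ordinary vector $p$-norm.

```python
import numpy as np

def transe_score(h: np.ndarray, r: np.ndarray, t: np.ndarray, p: int = 1) -> float:
    """TransE plausibility score: the L_p distance between the translated head
    (h + r) and the tail t. Lower scores indicate more plausible triples."""
    return float(np.linalg.norm(h + r - t, ord=p))

# Toy 4-dimensional embeddings, random for illustration only.
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 4))
print(transe_score(h, r, t, p=1))   # L1 distance
print(transe_score(h, r, t, p=2))   # L2 distance
```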
ConvKB Model
ConvKB uses a convolutional neural network to capture the global relationships and transition characteristics between entities and relations in the knowledge base.
$\textbf{Representation of triples}$
Each triple (head, relation, tail) is represented as a three-column matrix.
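A minimal numpy sketch of this representation, with toy dimensions (my own illustration, not the authors' implementation): the triple is stacked into a $d \times 3$ matrix and a single $1 \times 3$ filter is slid over its rows, producing one feature map of length $d$.

```python
import numpy as np

d = 5                                  # toy embedding dimension
rng = np.random.default_rng(1)
h, r, t = rng.normal(size=(3, d))      # head, relation and tail embeddings

A = np.stack([h, r, t], axis=1)        # d x 3 matrix, one column per triple element
omega = np.array([0.1, 0.1, -0.1])     # one 1 x 3 convolution filter

# Sliding the 1 x 3 filter over each row of A is a row-wise dot product,
# which yields a feature map of length d; ReLU is applied afterwards.
feature_map = np.maximum(A @ omega, 0.0)
print(feature_map.shape)               # (5,)
```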
Others
Other work introduces methods to enhance KG representation learning by utilizing text information in the knowledge base. In general, text-augmented embedding models achieve state-of-the-art performance by integrating knowledge and text.
Text-Augmented Knowledge Graph Representation Notes
Define a knowledge graph as $G = (E, R, T)$,
where E is the set of entities, R is the set of relation types, and T is the set of factual triples.
For each knowledge triple $(h, r, t) \in T$, we have
$\textbf{two kinds of entity representation types}$:
- structure representation $e_{s}$
- text representation $e_{d}$
For a given knowledge triple $(h, r, t) \in T$,
- $h_{s} \in e_{s}$ represents the structural representation of the head
- $t_{s} \in e_{s}$ represents the structural representation of the tail
- $h_{d} \in e_{d}$ represents the description text encoding of the head
- $t_{d} \in e_{d}$ represents the description text encoding of the tail
- $R^{d}$ denotes the space of the corresponding low-dimensional vectors of entities, relations, and descriptions. For example, the embeddings of $h, t \in E$ and $r \in R$ are $h, t, r \in R^{d}$ respectively, i.e., $d$-dimensional vectors.
Neural Network Text Encoding
$\textbf{Preprocess}$:
- Remove all stop words
- Mark all the words (and the phrases that are entity names in the training set) in the descriptions
- Extract multiple topic keywords for each entity as its description
- Use fastText to encode the keywords into word vectors as the input to the A-BiLSTM encoder (a sketch of this preprocessing follows this list)
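A rough sketch of this preprocessing, under my own assumptions: the keyword extraction here is a simple stop-word filter plus frequency count (the notes do not record the exact extraction method), and the fastText model path in the usage comment is a placeholder for a pretrained 100-dimensional model.

```python
import collections
import re

# A tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "was", "to", "in", "for", "on"}

def extract_keywords(description: str, k: int = 10) -> list[str]:
    """Very simple keyword extraction: drop stop words, keep the k most frequent tokens."""
    tokens = re.findall(r"[a-z]+", description.lower())
    counts = collections.Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(k)]

def encode_keywords(keywords: list[str], ft_model) -> list:
    """Encode each keyword as a fastText word vector (the A-BiLSTM input sequence)."""
    return [ft_model.get_word_vector(w) for w in keywords]

# Hypothetical usage with the official fasttext package (pip install fasttext):
#   import fasttext
#   ft = fasttext.load_model("fasttext-100d.bin")     # placeholder path
#   vecs = encode_keywords(extract_keywords(description_text, k=10), ft)
```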
$\textbf{BiLSTM Encoder}$:
To construct the BiLSTM, the author applies a forward and a backward LSTM network to each training sequence and connects both to the same output layer. This provides the complete contextual information of each sequence point to the output layer.
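A minimal PyTorch sketch of such a BiLSTM encoder (my own illustration; the 100-dimensional inputs and 50 hidden units per direction follow the settings reported later in these notes, everything else is an assumption):

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Runs a forward and a backward LSTM over the keyword vectors and exposes both
    directions at every position, so each output carries the full sequence context."""

    def __init__(self, input_dim: int = 100, hidden_dim: int = 50):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, word_vecs: torch.Tensor) -> torch.Tensor:
        # word_vecs: (batch, seq_len, input_dim)
        outputs, _ = self.lstm(word_vecs)
        return outputs                     # (batch, seq_len, 2 * hidden_dim)

# Example: one entity with 10 keyword vectors of dimension 100.
encoder = BiLSTMEncoder()
h = encoder(torch.randn(1, 10, 100))
print(h.shape)                             # torch.Size([1, 10, 100])
```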
$\textbf{Self Attentive BiLSTM Encoder}$:
The author introduces an attention mechanism into the BiLSTM, so that the encoding depends on the different relations of the context, in order to improve the text representations.
The author uses the attention mechanism to extract the salient features from the BiLSTM outputs $h_{i}$.
For each position $i$ of the text description, the attention weight given to the word representation $h_{i}$ is defined as $a_{i}$; in the standard additive self-attention form this is
$a_{i} = \frac{\exp\left(v_{a}^{\top}\tanh(W_{a}h_{i} + b_{a})\right)}{\sum_{j=1}^{n}\exp\left(v_{a}^{\top}\tanh(W_{a}h_{j} + b_{a})\right)}$
where $h_{i} \in R^{d}$ is the output of the BiLSTM at position $i$, $v_{a} \in R^{d}$ is a parameter vector, $W_{a} \in R^{d \times d}$ is a parameter matrix, and $b_{a} \in R^{d}$ is a bias vector.
Therefore, using the attention mechanism, we can see which parts of the embeddings are attended to. The final description embedding is $e_{d} = \sum_{i=1}^{n} a_{i} \cdot h_{i}$.
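A PyTorch sketch of this attentive pooling (my own illustration, following the additive form written above; the dimensions match the 100-dimensional BiLSTM outputs):

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Scores each BiLSTM output h_i with v_a^T tanh(W_a h_i + b_a), softmax-normalizes
    the scores into weights a_i, and returns the weighted sum as the text embedding e_d."""

    def __init__(self, dim: int = 100):
        super().__init__()
        self.proj = nn.Linear(dim, dim)             # W_a and b_a
        self.v_a = nn.Parameter(torch.randn(dim))   # scoring vector v_a

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, dim) -- the BiLSTM outputs
        scores = torch.tanh(self.proj(h)) @ self.v_a   # (batch, seq_len)
        a = torch.softmax(scores, dim=-1)              # attention weights a_i
        return (a.unsqueeze(-1) * h).sum(dim=1)        # e_d: (batch, dim)

pool = AttentivePooling()
e_d = pool(torch.randn(1, 10, 100))
print(e_d.shape)                                        # torch.Size([1, 100])
```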
Joint Structure and Text Encoding
To balance the two sources of information, the author combines $e_{s}$ and $e_{d}$ for an entity $e$.
Linear Interaction
$e = \alpha \cdot e_{s} + (1 - \alpha) \cdot e_{d}$, where $0 < \alpha < 1$ is a weighting coefficient whose best value is hard to determine.
Second-order Interaction
Referring to the gated mechanism in LSTM gate units:
where $\otimes$ denotes element-wise (pointwise) multiplication.
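The notes do not reproduce the exact gate equation, so the following is only a plausible sketch in the spirit of LSTM gates: a sigmoid gate vector $g$ is computed from both embeddings and mixes them element-wise as $g \otimes e_{s} + (1 - g) \otimes e_{d}$. This is my assumption, not the paper's formula.

```python
import torch
import torch.nn as nn

class GatedCombination(nn.Module):
    """Element-wise gate g in (0, 1)^d that decides, per dimension, how much of the
    structural embedding e_s versus the textual embedding e_d to keep (assumed form)."""

    def __init__(self, dim: int = 100):
        super().__init__()
        self.gate_layer = nn.Linear(2 * dim, dim)

    def forward(self, e_s: torch.Tensor, e_d: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_layer(torch.cat([e_s, e_d], dim=-1)))
        return g * e_s + (1.0 - g) * e_d   # element-wise ("point") multiplication

combine = GatedCombination()
e = combine(torch.randn(1, 100), torch.randn(1, 100))
print(e.shape)                             # torch.Size([1, 100])
```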
Triple Embedding
$e$ is the final entity representation. Because there are no descriptions for relations, the structural embeddings are also used as the textual embeddings for relations. The triple embeddings, consisting of entity embeddings and relation embeddings, are fed into a CNN to extract features for link prediction.
Training and Score Function
- $\kappa$ denotes the set of filters
- $\tau$ denotes the number of filters
- $\tau = |\kappa|$ results in $\tau$ feature maps
- Concatenate the $\tau$ feature maps into a vector $\in R^{\tau d \times 1}$
- For the triple $(h, r, t)$, the score function of TA-ConvKB is $f(h, r, t) = \mathrm{concat}\big(g([h, r, t] * \kappa)\big) \cdot w$, where $w \in R^{\tau d \times 1}$, $g$ is the ReLU activation function, and $*$ denotes the convolution operator.
- Use the Adam optimizer and the loss function $\mathcal{L} = \sum_{(h,r,t) \in G \cup G^{'}} \log\big(1 + \exp\big(l_{(h,r,t)} \cdot f(h, r, t)\big)\big) + \frac{\lambda}{2}\|w\|_{2}^{2}$, where $l_{(h,r,t)} = 1$ when $(h, r, t) \in G$ and $l_{(h,r,t)} = -1$ when $(h, r, t) \in G^{'}$; here $G^{'}$ is a collection of invalid triples generated by corrupting valid triples in $G$ (a sketch of this scorer and loss follows this list).
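Putting the pieces together, here is a hedged PyTorch sketch of a ConvKB-style scorer with the soft-margin loss above; the dimensions, initialization, and batch handling are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvKBScorer(nn.Module):
    """Stacks [h, r, t] into a d x 3 matrix, applies tau filters of shape 1 x 3,
    concatenates the tau feature maps into a (tau * d)-vector and projects it to a
    scalar score with the weight vector w."""

    def __init__(self, d: int = 100, tau: int = 40):
        super().__init__()
        self.conv = nn.Conv2d(1, tau, kernel_size=(1, 3))    # tau filters of shape 1 x 3
        self.w = nn.Linear(tau * d, 1, bias=False)           # weight vector w

    def forward(self, h, r, t):
        # h, r, t: (batch, d)  ->  A: (batch, 1, d, 3)
        A = torch.stack([h, r, t], dim=-1).unsqueeze(1)
        feature_maps = F.relu(self.conv(A))                   # (batch, tau, d, 1)
        v = feature_maps.flatten(start_dim=1)                 # (batch, tau * d)
        return self.w(v).squeeze(-1)                          # (batch,) triple scores

def soft_margin_loss(scores, labels, w_weight, lam=0.005):
    """L = sum log(1 + exp(l * f(h, r, t))) + (lam / 2) * ||w||^2, labels in {+1, -1}."""
    return F.softplus(labels * scores).sum() + 0.5 * lam * w_weight.pow(2).sum()

# Toy batch of 8 triples: 4 valid (label +1) and 4 corrupted (label -1).
scorer = ConvKBScorer()
scores = scorer(torch.randn(8, 100), torch.randn(8, 100), torch.randn(8, 100))
labels = torch.tensor([1., -1., 1., -1., 1., -1., 1., -1.])
print(soft_margin_loss(scores, labels, scorer.w.weight).item())
```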
Experiments Notes
Dataset
WN18RR and FB15k-237.
Assessment Strategy
$\textbf{KG link prediction task}$: the goal is to predict a missing entity given a relation and the other entity. The author ranks test triples by the scores of the score function to obtain the results. Corrupted triples are constructed by replacing one of the two entities in a triple.
$\textbf{Questions to ask}$
“We use the “Filtered” setting protocol, not taking any corrupted triples that appear in the KG into account.”
$\textbf{Three metrics}$:
- mean rank (MR): lower is better
- mean reciprocal rank (MRR): higher is better
- H@10 (i.e., the proportion of valid test triples ranked in the top 10 predictions): higher is better (a sketch of these metrics follows this list)
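A small sketch of how MR, MRR, and H@10 can be computed once the filtered rank of the true entity is known for each test triple (my own illustration; the ranks below are made-up numbers):

```python
import numpy as np

def ranking_metrics(ranks) -> dict:
    """Compute MR (lower is better), MRR and H@10 (higher is better) from the
    filtered ranks of the correct entities."""
    ranks = np.asarray(ranks, dtype=float)
    return {
        "MR": ranks.mean(),
        "MRR": (1.0 / ranks).mean(),
        "H@10": (ranks <= 10).mean(),
    }

# Toy example: filtered ranks of the true entity over six test triples.
print(ranking_metrics([1, 3, 12, 2, 57, 8]))
```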
Training Strategy
Text Embeddings
$\textbf{In FB15k-237}$
- 10 keywords are extracted for each entity description
- fastText is used to encode each keyword as a 100-dimensional word vector
- In the A-BiLSTM encoder, the embedding dimension is 100, the sequence length is 10, and the number of hidden-layer units is 50
- Finally, a 100-dimensional text vector is output (50 units per direction, concatenated)
$\textbf{In WN18RR}$
- Same as FB15k-237, except that only 3 keywords are extracted for each entity
Structural Embeddings
- Train STransE for 3,000 epochs
Hyperparameters
- 200 epochs
- Adam as optimizer
- ReLU as the activation function
- Batch size at 256
- L2-regularizer $\lambda$ at 0.005
- In WN18RR: k = 100, τ = 200, a truncated normal distribution for filter $w$ initialization, and an initial learning rate of 1e−4
- In FB15k-237: k = 100, τ = 40, [0.1, 0.1, −0.1] for filter $w$ initialization, and an initial learning rate of 5e−6
Main Experimental Results
On both datasets, TA-ConvKB achieves the highest H@10 score compared with the other models.
Future Work Notes
- Use more accurate word vector representations, such as the ELMo model
- Explore other kinds of combinations of structural embeddings and textual embeddings
- Integrate the image information of the entities