|
1. INTRODUCTION
Scientific and technological knowledge is developing rapidly, and understanding the technical information contained in patent data has an important impact on promoting entrepreneurship and innovation by enterprises and individuals. Patent data describes many cutting-edge technologies and has relatively high industrial value. Patents and the technologies they contain have become an important resource for enterprise progress and innovation: mastering the most advanced technology in an industry improves competitiveness, so tracking patent development across technical fields is of great significance to enterprises and even to countries. For science and technology staff at China’s power grid, access to the latest patent information currently relies mainly on patent search websites, which present results to users only through string matching, making it difficult to explore the connections and similarities between patents. When a patent is retrieved, power grid researchers would like to be recommended patents that are as similar to it as possible, to broaden their research ideas and give them a general picture of the latest patents in a subfield. To address these problems, this paper uses a knowledge graph to model and store power grid patent data, since a knowledge graph can better capture the connections between patents, and then calculates patent similarity and recommends patents based on the constructed graph.
2. KNOWLEDGE GRAPH CONSTRUCTION
2.1 Introduction to the knowledge graph
In 2012, Google first proposed the knowledge graph and immediately began using the technology to improve its core search engine [1].
A knowledge graph is essentially a semantic network that describes objective things in the form of a graph composed of nodes and edges. Each piece of knowledge is represented as a triple (Subject-Predicate-Object, SPO), which can also be written as HRT (Head-Relation-Tail) [2]. The knowledge graph consists of triples in the following two forms: ① “Entity-relationship-entity” describes the relationship between entities, such as “Guo Qilin-Father-Guo Degang”; ② “Entity-attribute-attribute value” describes the relationship between an entity and its attribute values, such as “Guo Qilin-Gender-male” [3]. The general construction flow of the knowledge graph is shown in Figure 1 [4]. At present, mainstream knowledge graphs are generally divided into two types. The first is the general-domain knowledge graph, such as DBpedia [5], Freebase, and Wikidata. These graphs are mainly oriented to general common-sense knowledge, and the search and question answering systems built on them can answer most general-domain questions. The second is the vertical (domain) knowledge graph, which aims to build a knowledge base for a specific field. Compared with the general knowledge graph, the domain knowledge graph requires a more rigorous definition of the data schema and requires the person who constructs the graph to have a clear understanding of the knowledge framework of the field. Domain knowledge graphs are generally used for vertical search and recommendation in a closed field; for example, the e-commerce knowledge graph built by Alibaba mainly serves product search and recommendation for Taobao users. At present, very few knowledge graphs are applied to the patent field, especially to power grid patents, and it is difficult for power grid researchers to find correlations within a large amount of patent data.
The construction of a knowledge graph is usually divided into the following steps: ontology definition → knowledge extraction → knowledge fusion → knowledge storage.
2.2 Patent ontology construction
Ontology construction is an important link in constructing a knowledge graph, and the patent-related professional terms defined in the ontology affect the quality of the graph [6]. The ontology provides an abstract representation of patent-related text data: it represents the relevant patent concepts and the relationships between them, so that patent data can be effectively correlated and the representation of entities in the data layer can be standardized. Considering that the scope of knowledge in the patent field is relatively fixed, this paper adopts a top-down approach to construct the patent ontology library [7]. After analyzing the importance of each entity attribute in the original patent text, the schema that defines the patent entities and relationships is shown in Table 1 and Table 2:
Table 1. Patent Entity Definition.
Table 2. Patent Relationship Definition.
After the schemas for entities and relationships are defined, the ontology can be instantiated by extracting knowledge from the original data.
2.3 Entity relationship extraction
The patent ID, patent applicant, inventor, and IPC classification number can be obtained by simple script processing of the original data. We design an algorithm to extract the keywords of each patent. The steps of the keyword extraction algorithm proposed in this paper for patent text are as follows: concatenate the title and abstract of the patent into a single piece of text, and remove meaningless stop words from the text using a customized stop word list.
Table 3. Examples of patent keywords.
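The preprocessing step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_patent_text` and the sample stop word list are hypothetical, and whitespace tokenization is used for simplicity, whereas Chinese patent text would first need a word segmenter.

```python
import string

def build_patent_text(title, abstract, stop_words):
    """Concatenate title and abstract, then drop stop words.

    Assumption: tokens are separated by whitespace; a real pipeline for
    Chinese patents would replace .split() with a word segmenter.
    """
    text = f"{title} {abstract}".lower()
    # Strip punctuation from the ends of each token only.
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in tokens if tok and tok not in stop_words]

# Hypothetical custom stop word list.
STOP_WORDS = {"a", "the", "of", "for", "and"}

tokens = build_patent_text(
    "A control method for charging",
    "The method of energy storage.",
    STOP_WORDS,
)
print(tokens)
```

The remaining tokens would then be passed to whatever keyword scoring the extraction algorithm applies.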
2.4 Knowledge storage
After the power grid patent knowledge is extracted, the resulting triples are stored in a Neo4j database. Neo4j is an open-source NoSQL graph database whose development started in 2003, using the Scala and Java languages, and which was released in 2007 [8]. It stores data and the associations between data items in the structure of a graph. A Neo4j graph database consists of three basic elements: nodes, relationships, and properties, each of which can be stored independently. Properties can be created on all nodes and relationships and are represented as key-value pairs, similar to the HashMap data structure [9]. Figure 2 takes patent CN108962515B as an example to show the entities extracted from the patent and the relationships between them. In the structure diagram of patent entity relationships, nodes of different colors represent the patent publication number, keywords, inventor, applicant, and IPC classification number, and the relationships between the entities are displayed intuitively.
3. OPTIMIZING NEGATIVE SAMPLING IN THE KNOWLEDGE GRAPH EMBEDDING ALGORITHM
3.1 Patent vectorized representation based on knowledge graph embedding
The vectorized representation of the knowledge graph, also known as knowledge representation learning or knowledge graph embedding, refers to methods that represent the entities and relationships in the knowledge graph as real-valued vectors with a certain semantic expression capability.
After the knowledge graph is stored, similarity calculation and recommendation between entities cannot be achieved by relying solely on the text of the triples; the entities and relationships need to be vectorized before the similarity between entities can be calculated. The TransE algorithm [10] is a classical knowledge graph embedding method. It regards a knowledge triple (h,r,t) as a translation from the head entity h through the relation r to the tail entity t; that is, the sum of the head entity vector and the relation vector should be as close as possible to the tail entity vector, $h + r \approx t$. The TransE algorithm can be conveniently implemented with the OpenKE toolkit and is accurate and efficient in representing 1-1 relationships. However, the original random negative sampling of TransE easily produces many false negative samples when facing 1-N and N-N relationships, which affects model training and the final entity and relation representations. Therefore, this paper proposes a TransE-SNS model that optimizes the negative sampling process of TransE. For triples whose relation is not one-to-one, the optimized negative sampling algorithm replaces entities with a certain probability. For each relationship in the patent knowledge graph, two statistics are computed from the existing triples: $N_{tp}$, the average number of tail entities corresponding to each head entity under this relationship, and $N_{hp}$, the average number of head entities corresponding to each tail entity under this relationship. The probability p of replacing the entity is then calculated from $N_{tp}$ and $N_{hp}$. In TransE-SNS, entities are no longer replaced uniformly at random, which avoids excessive false negative samples during negative sampling and retains the complex semantic correlations between the original correct triples, so that the vectors learned by the TransE model better reflect the actual graph.
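A minimal sketch of relation-aware negative sampling of this kind is given below. It rests on an assumption that the text does not confirm: the replacement probability is taken to be the Bernoulli-style ratio $p = N_{tp} / (N_{tp} + N_{hp})$, interpreted as the chance of corrupting the head rather than the tail. The function names and the toy relation label `invents` are hypothetical.

```python
import random
from collections import defaultdict

def replacement_probabilities(triples):
    """Per relation, return p = N_tp / (N_tp + N_hp).

    Assumption: p is the probability of replacing the head entity;
    the tail is replaced with probability 1 - p.
    """
    tails_of = defaultdict(set)  # (relation, head) -> set of tails
    heads_of = defaultdict(set)  # (relation, tail) -> set of heads
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)
    probs = {}
    for rel in {r for _, r, _ in triples}:
        tail_counts = [len(v) for (r, _), v in tails_of.items() if r == rel]
        head_counts = [len(v) for (r, _), v in heads_of.items() if r == rel]
        n_tp = sum(tail_counts) / len(tail_counts)  # avg tails per head
        n_hp = sum(head_counts) / len(head_counts)  # avg heads per tail
        probs[rel] = n_tp / (n_tp + n_hp)
    return probs

def corrupt(triple, entities, probs, known, rng):
    """Draw one negative triple that is not a known positive.

    Note: loops until a non-positive candidate is found, so it assumes
    at least one such candidate exists.
    """
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        candidate = (e, r, t) if rng.random() < probs[r] else (h, r, e)
        if candidate not in known:
            return candidate

# Toy 1-N relation: one inventor, three patents.
triples = [("i1", "invents", "p1"), ("i1", "invents", "p2"), ("i1", "invents", "p3")]
probs = replacement_probabilities(triples)
print(probs["invents"])  # 0.75 for this toy relation
negative = corrupt(triples[0], ["i1", "p1", "p2", "p3"], probs, set(triples), random.Random(0))
print(negative)
```

For a 1-N relation the head is corrupted more often, because replacing the tail of such a triple is more likely to produce another true triple, that is, a false negative.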
The improved algorithm flow is shown below:
Input: knowledge graph G=(E,R,P), E=(e1,e2,…,en), R=(r1,r2,…,rm), P=(p1,p2,…,pl)
Output: entity vectors and relation vectors
Parameters: entity and relation vector dimension d, learning rate λ, distance adjustment parameter γ, size b of each batch, number of training batches k.
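To make the role of the distance adjustment parameter γ concrete, the following sketch shows the standard TransE distance and margin-based ranking loss for one positive/negative pair. It illustrates the original TransE objective rather than the exact training loop of this paper; the function names are hypothetical and the L2 norm is an assumption.

```python
import math

def transe_distance(h, r, t):
    """L2 distance ||h + r - t||; a small value means a plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def margin_loss(positive, negative, gamma=2.0):
    """Hinge loss max(0, gamma + d(positive) - d(negative)).

    positive and negative are (h, r, t) tuples of equal-length vectors.
    """
    return max(0.0, gamma + transe_distance(*positive) - transe_distance(*negative))

h, r = [1.0, 0.0], [0.0, 1.0]
positive = (h, r, [1.0, 1.0])       # h + r equals t exactly, distance 0
far_negative = (h, r, [4.0, 5.0])   # distance 5, beyond the margin
near_negative = (h, r, [1.0, 2.0])  # distance 1, inside the margin

print(margin_loss(positive, far_negative))   # 0.0
print(margin_loss(positive, near_negative))  # 1.0
```

Only negatives that lie within γ of the positive triple contribute to the loss, which is why false negatives produced by poor sampling can distort training.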
To evaluate the quality of a knowledge graph vectorization algorithm, link prediction is commonly used as the test task, with hit@10 and MeanRank [11] as the evaluation indicators. The TransE-SNS model was used to represent the semantic information of patent entities in a vector space, and the original data in the experiment consisted of 1000 patents related to the power industry. From the patent knowledge graph constructed in Section 2, a total of 12,328 triples (h,r,t) covering the application, invention, and inclusion relationships were obtained: 1,985 application relationships between applicants and patents, 5,343 invention relationships between inventors and patents, and 5,000 inclusion relationships between patents and keywords. For TransE-SNS training, the 12,328 triples were split into training and test sets at a 4:1 ratio. The learning rate was set to λ=0.01, the margin to γ=2, and the entity embedding dimension to d={20,50,70,100,150}; the optimal embedding dimension was then determined according to the MeanRank and hit@10 values. To verify the effectiveness of the optimized negative sampling algorithm, both the TransE algorithm and the improved TransE-SNS algorithm were used to embed the patent entities and relationships, and the corresponding MeanRank and hit@10 values were calculated. The smaller the MeanRank value and the higher the hit@10 value, the better the model. The MeanRank and hit@10 values of the TransE model and the improved TransE-SNS model are shown in Table 4 and Table 5, and their trends are shown in Figure 3 and Figure 4.
Table 4. The MeanRank value of the TransE algorithm and the TransE-SNS algorithm in different embedding dimensions.
Table 5. The hit@10 values of the TransE algorithm and the TransE-SNS algorithm in different embedding dimensions.
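For reference, the two link-prediction indicators can be computed from the rank of each correct entity among all candidates. The sketch below is a generic illustration with hypothetical function names and toy numbers; it does not reproduce the values reported in the tables.

```python
def rank_of(true_entity, distances):
    """1-based rank of the true entity when candidates are sorted by
    ascending distance (ties are resolved in favour of the true entity)."""
    true_distance = distances[true_entity]
    return 1 + sum(1 for d in distances.values() if d < true_distance)

def mean_rank(ranks):
    """Average rank of the correct entities; lower is better."""
    return sum(ranks) / len(ranks)

def hits_at_k(ranks, k=10):
    """Fraction of test triples whose correct entity ranks in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)

# Toy candidate distances for one test triple.
print(rank_of("p3", {"p1": 0.2, "p2": 0.9, "p3": 0.5}))  # 2

# Toy ranks for four test triples.
ranks = [1, 3, 12, 40]
print(mean_rank(ranks))      # 14.0
print(hits_at_k(ranks, 10))  # 0.5
```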
The MeanRank and hit@10 results show that the training effect is best when the embedding dimension of the patent entity vector is 100, and that the improved negative sampling algorithm helps the model achieve better results.
4. PATENT RECOMMENDATION THAT INTEGRATES KNOWLEDGE GRAPH EMBEDDING AND WORD EMBEDDING
At present, patent recommendation mainly relies on content-based recommendation algorithms, which recommend patents by calculating the semantic similarity between patent texts. This approach does not consider the interrelationships between multiple patents, so it is difficult to recommend patents in similar fields. By integrating knowledge-graph-based and content-based recommendation, this paper can better model the connections between patents while retaining the semantic information of the patent text, and thus achieve more accurate patent recommendation.
4.1 Patent similarity calculation based on word embedding
Content-based recommendation algorithms have been widely applied in many fields. As a baseline, this paper calculates the semantic similarity between patents from their titles and abstracts and recommends patents accordingly. The text content must be vectorized before the similarity can be calculated. Word embedding [12] methods for converting natural language text into vectors are very mature, and the mainstream methods are distributed representations based on word2vec. This paper uses the tokenizer library in PaddleNLP, Baidu’s natural language processing framework built on PaddlePaddle, to obtain word vector embeddings of the patent title and abstract text, because the framework has been pre-trained on a large-scale Chinese corpus and can better identify entities in long text.
After setting the embedding dimension to 300, we obtain a word vector for the title and abstract text of each patent. We can then calculate the textual similarity between patents by cosine similarity:
$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
where A and B are the word vectors of any two patent texts. Taking the patent ‘an electric vehicle charging plug-and-play system control method equipped with energy storage battery’ as an example, whose IPC classification number is H02J7/02, the 10 patents whose titles and abstracts are most similar to it are shown in Table 6.
Table 6. Patent similarity calculation based on Word Embedding.
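The text-only ranking can be sketched as below: cosine similarity between the query patent's vector and every other patent's vector, sorted in descending order. The function names, patent identifiers, and two-dimensional toy vectors are hypothetical stand-ins for the 300-dimensional word vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_similar(query_id, vectors, k=10):
    """Return the k (patent_id, similarity) pairs most similar to the query."""
    query = vectors[query_id]
    scored = [
        (pid, cosine_similarity(query, vec))
        for pid, vec in vectors.items()
        if pid != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy text vectors keyed by hypothetical patent identifiers.
vectors = {"A": [1.0, 0.0], "B": [1.0, 0.1], "C": [0.0, 1.0]}
for pid, score in top_k_similar("A", vectors, k=2):
    print(pid, round(score, 3))
```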
The similarity obtained by this method largely reflects the words shared by the two patents. It considers only textual information and ignores the possible connections between patents and their latent attributes, so the recommended patents may have little connection with the original patent in content. For example, the technical field of a recommended patent can be roughly judged from its IPC number. Although most of the recommended patents belong to section H (electricity), two patents belong to sections A and G respectively, and some recommended patents do not match the H02J7 group of the original patent, indicating that their research fields are not the same and that the patented technologies are unlikely to be similar.
4.2 Patent similarity calculation combining knowledge graph embedding and word embedding
After the TransE-SNS model is used to vectorize the entities in the patent knowledge graph, a vectorized representation of each patent entity is obtained that better captures the rich semantic relationships in the graph. This paper proposes a patent recommendation algorithm that integrates knowledge graph embedding and word embedding by concatenating the knowledge graph embedding vector of the patent entity obtained in Section 3 with the word embedding vector of the patent text. The concatenated 400-dimensional vector contains both the text information and the entity relationship information from the knowledge graph, and similarity is then calculated on this vector to obtain a better recommendation effect. Using this algorithm, the top 10 similar patents for the example patent in the previous section are obtained, as shown in Table 7.
Table 7. Entity similarity of patent titles integrating knowledge graph embeddings and word embeddings.
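The fusion step can be illustrated with a small sketch: the knowledge graph embedding and the word embedding of each patent are concatenated, and cosine similarity is computed on the result. The function names and toy vectors are hypothetical; the toy example uses 2 + 3 dimensions in place of 100 + 300.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def fuse(kg_vector, text_vector):
    """Concatenate the knowledge graph embedding and the text embedding."""
    return list(kg_vector) + list(text_vector)

# The query and both candidates have equally similar text,
# but only candidate X shares the query's graph neighbourhood.
query = {"kg": [1.0, 0.0], "text": [1.0, 1.0, 0.0]}
cand_x = {"kg": [1.0, 0.0], "text": [1.0, 0.0, 1.0]}
cand_y = {"kg": [0.0, 1.0], "text": [1.0, 0.0, 1.0]}

for name, cand in (("X", cand_x), ("Y", cand_y)):
    text_only = cosine_similarity(query["text"], cand["text"])
    fused = cosine_similarity(
        fuse(query["kg"], query["text"]), fuse(cand["kg"], cand["text"])
    )
    print(name, round(text_only, 3), round(fused, 3))
```

In this toy case the two candidates are indistinguishable by text alone, and the graph component breaks the tie. One design point the text leaves open is whether the two parts are normalized before concatenation; without normalization, the part with the larger norm dominates the cosine score.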
As we can see, this patent recommendation algorithm recommends patents in similar fields more accurately while still considering the text similarity of patent titles. The algorithm fuses the text features of patent titles with the entity relationship features in the patent knowledge graph and can more accurately identify sets of patents with similar research fields and similar texts, which is of great help to researchers in querying and recommending similar patents. The experimental results have also been externally verified: after consulting with patent agents, the patent recommendation results obtained in this paper were judged reasonable, and the scope of technical fields can be determined given enough data. From this perspective, the method can save researchers’ time in searching for similar patents.
5. CONCLUSION
In this paper, we take electric power as the research field. We extract the entity and relationship data of electric power patents and store them in a knowledge graph in the form of triples, then fuse the text vector features of the patent data with the relationship vector features from the knowledge graph and calculate the similarity between patents, which yields better results than text features alone. The similar patent pairs calculated by this algorithm can be used for patent recommendation or real-time pushing, which saves researchers’ time in finding patents in similar fields. This method can also be used to construct patent knowledge graphs in other fields, not only in the field of electricity. In future work, we will consider increasing the types of entities and relationships in the patent knowledge graph to enrich the features of patents and thus obtain more accurate results.
REFERENCES
[1] Auer, S., Kovtun, V., Prinz, M., Kasprzik, A., Stocker, M., Vidal, M. E.,
“Towards a knowledge graph for science,” in Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, 1–6 (2018).
[2] Deng, L., Cao, C. G., “A method for building a patent knowledge graph,” Computer Science, 49(11) (2022).
[3] Kejriwal, M., Sequeda, J. F., Lopez, V., “Knowledge graphs: Construction, management and querying,” Semantic Web, 10(6), 961–962 (2019). https://doi.org/10.3233/SW-190370
[4] Lu, C. C., Liu, Y. S. Y., Gu, F., Gu, X. J., “Research on the construction of API international drug registration knowledge map based on Neo4j graph database,” Group Technology and Production Modernization, 37(04), 1–44 (2020).
[5] Yin, D. Y., “Research on the construction of police patrol knowledge graph,” Beijing: People’s Public Security University of China (2019).
[6] Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E., “A review of relational machine learning for knowledge graphs,” in Proceedings of the IEEE, 11–33 (2015).
[7] Suchanek, F. M., Kasneci, G., Weikum, G., “Yago: A large ontology from Wikipedia and WordNet,” Journal of Web Semantics, 6(3), 203–217 (2008). https://doi.org/10.1016/j.websem.2008.06.001
[8] Song, H. Y., “Research and application of power system knowledge graph based on graph database,” Master of Electronic Journal Publishing Information: Period: Issue, (08) (2021).
[9] Wang, F., Yi, M. Z., Tan, X., et al., “Research on the storage method of Tibet-related domain ontology based on Neo4j,” Journal of Zhengzhou University (Science edition), 51(2), 60–65 (2019).
[10] Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O., “Translating embeddings for modeling multi-relational data,” Advances in Neural Information Processing Systems, 26 (2013).
[11] Zhu, M., Zhen, D. S., Tao, R., Shi, Y. Q., Feng, X. Y., Wang, Q., “Top-N collaborative filtering recommendation algorithm based on knowledge graph embedding,” in Knowledge Management in Organizations: 14th International Conference, Springer International Publishing, 122–134 (2019).
[12] Mikolov, T., Chen, K., Corrado, G., Dean, J., “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781 (2013).
|