neo4j - data science - part 4.2

10 분 소요

link prediction

neo4j - data science - part 4에 작성된 link prediction 내용을 번역하여, 정리하였습니다.
이 모듈에서는 citation graph에서 co-authorship을 예측하는 머신러닝 분류기를 만듭니다. 따라서 다음을 배우게 되죠.
- graph data에 대해서 ML 모델을 세울 때 무엇이 문제가 되는지.
- neo4j로부터 나온 결과에 대해서 sk-learn 알고리즘을 적용하는 방법.

Exercise 2: Building a binary classifier

이전에 나온 방법은 graph Algorithm을 사용한 것이었고, 여기서는 binary classifier를 사용하여 link prediction을 예측해봅니다. 즉, 연결이 되어 있는 것은 True, 아니면 False로 표현하여 처리하여, 학습시키면 되죠.

Building a co-author graph

우선, 같은 페이퍼를 작성한 author 정보를 활용하여 co-author graph를 구축해봅시다. 기존 그래프에서는 author간의 관계는 존재하지 않습니다. 따라서, article과 Author의 관계를 찾고, 그 둘간에 새로운 관계인 CO_AUTHOR 라벨을 설정해줍니다
또한, 데이터베이스에 과부하가 걸릴 수 있으므로 작업관리를 위해 apoc.periodic.iterate를 사용하여 처리합니다.

// apoc.periodic.iterate는 일종의 작업관리(job management)로서, 
// 한번에 너무 많은 일들을 시키면 데이터베이스에 무리가 갈 수 있으므로, 
// 분할하여 일정한 batch size로 꾸준하게 처리해주는 것을 말합니다.
CALL apoc.periodic.iterate(
    // first statement: 적용대상을 찾는 쿼리, 
    "MATCH (a1)<-[:AUTHOR]-(paper)-[:AUTHOR]->(a2:Author)
    WITH a1, a2, paper
    ORDER BY a1, paper.year
    RETURN a1, a2, collect(paper)[0].year AS year, count(*) AS collaborations",
    // second statement: first statement의 결과에 무엇을 적용할 것인가? 
    "MERGE (a1)-[coauthor:CO_AUTHOR {year: year}]-(a2)
    SET coauthor.collaborations = collaborations", 
    // config: batchSize 몇개나 적용할 것인가? 
    {batchSize: 100}
)

Train and Test split

본 작업의 목적은 link prediction입니다. 그런데, 이게 좀 애매한 것은, ‘시간의 흐름’을 반영하는 것이냐, 반영하지 않는 것에 따라서 적용방식이 다르다는 것이죠. link prediction 자체는 ‘아마 이 둘은 이후에 연결될 것이다’를 예측하이 위한 것인데요. 그렇다면, 하나의 시간 축에서 (과거의 그래프)를 X로 (미래의 그래프)를 Y로 구축하여, 머신러닝 모델을 구축하는 것이 타당합니다. 또한, 각각의 X, Y도 Train/Test 로 구분되어야 하는 것은 당연하죠.
물론, 이 마저도 완벽한 것이라고 보기는 어렵습니다. 사실 그냥 과거/ 미래 이렇게 둘로 나누는 것 자체가 좀 말이 안되거든요. 이 시간축이 T로 구분되어 있다면, T+한달, T+12달 의 차이가 반영되지 않아요. 부정확하죠.
이런저런 한계들이 있습니다만, 본문에서는 그러한 한계를 모두 무시하고, 비교적 기본적인 방법으로 접근합니다. 2006년까지의 모든 데이터를 활용해 그래프를 만들고, 이 그래프에서 존재하는 Edge가 Positive, 존재하지 않는 Edge가 Negative로 보는 것이죠. 즉, 시간의 축을 무시하고, 현재의 그래프에서 link가 존재하는 것, 존재하지 않는 것을 구분하여 학습하는 것인데, 이는 “하나의 그래프 내에서 Edge에는 공통된 성질이 있을 것이다, 그것을 ML을 통해 학습한다”가 맞죠. Link Prediction이 아닌, Link Existence 라는 말이 더 적합하게 느껴져요.
어쨌거나, 본문에서는 그저 2006년이라는 시점을 기준으로 하여 데이터를 구분합니다. 따라서, 2006년 이전의 CO_AUTHOR관계에는 CO_AUTHOR_EARLY를 붙여주고, 2006년 및 그 이후의 관계에는 CO_AUTHOR_LATE를 붙여줍니다. 그리고 각각이 Train-Set, Test-Set이 됩니다.

query = """
MATCH (a)-[r:CO_AUTHOR]->(b) 
where r.year < 2006
MERGE (a)-[:CO_AUTHOR_EARLY {year: r.year}]-(b);
"""
print(graph.run(query).stats())
print("CO_AUTHOR_EARLY labeled")

query = """
MATCH (a)-[r:CO_AUTHOR]->(b) 
where r.year >= 2006
MERGE (a)-[:CO_AUTHOR_LATE {year: r.year}]-(b);
"""

print(graph.run(query).stats())
print("CO_AUTHOR_LATE labeled")

그리고, CO_AUTHOR_EARLY와 CO_AUTHOR_LATE에 각각 몇개의 관계들이 존재하는지를 보기 위해 다음과 같은 간단한 쿼리를 실행해봅니다. 각각, 81,096개, 74,128개이며, 대충 적절하네요.

query = """
MATCH ()-[:CO_AUTHOR_EARLY]->()
RETURN count(*) AS count
"""

print(graph.run(query).to_data_frame())

query = """
MATCH ()-[:CO_AUTHOR_LATE]->()
RETURN count(*) AS count
"""

print(graph.run(query).to_data_frame())

Positive and Negative examples

Negative Examples은 False, 즉, link가 없는 경우를 말합니다. 그래프에서 가능한 모든 edge의 수는 (# of nodes)^2가 되죠. 그래서, negative example은 가능한 모든 edge 들에서, 이미 존재하는 edge들을 빼주면 됩니다.
하지만, 이는 class imbalance 문제를 가져오게 되죠. 쉽게 설명하기 위해서 암 발병율을 예측하기 위한 데이터를 모았다고 하면, 암을 발병한 데이터와, 암이 발병하지 않은 데이터가 비슷하게 있어야 합니다. 하지만, 보통 암이 발별하지 않은 데이터가 훨씬 많고, 따라서, 데이터에 균형이 깨지죠.
같은 맥락으로, 존재할 수 있는 모든 edge에 대해서 negative example을 만든다면, class imbalance 문제가 발생하게 되고 따라서, 효과적으로 ML 모델을 학습시킬 수 없습니다. 보통 이 경우에는 두 가지 정도의 테크닉을 쓰게 되는데, 1) 데이터가 많은 쪽에서 샘플링을 수행하거나, 2) 데이터가 적은 쪽에서 데이터를 늘리거나. 둘다 대충 두 데이터의 수를 일정하게 맞추는 전략인 셈이죠. 이를 위해 여기서는 아래 함수를 사용하여 데이터를 샘플링합니다.

def down_sample(df):
    """
    - 이 함수는 우선 input_data_frame의 `label(or class)`가 항상 0과 1로만 구성되어 있는 것을 알 고 있으며, 
    - 1이 훨씬 많으므로, 그 차이만큼을 랜덤하게 뽑아서 삭제해주는 함수입니다. 
    - 따라서, 이 방식으로 noise imbalance 를 처리해줍니다.
    """
    # df: 'label' 칼럼을 가지고 있어야 하며, 
    # label 칼럼은 0과 1만으로 구성되어 있고, 0이 1보다 많아야 함.
    df_cp = df.copy() 
    # label 칼럼의 값들(0, 1로 구성)을 개수로 Counter로 구축.
    binary_count_dict = Counter(df_cp['label'].values)
    # 0, 즉, False의 개수 : 1, 즉, True의 개수
    # 0이 1보다 훨씬 많으므로 그 수만큼을 없애야 함.
    zero_count = binary_count_dict[0]
    one_count = binary_count_dict[1]
    # False인 것들을 찾고, n개 만큼 샘플링하여, index를 넘겨서 드롭해줌.
    df_cp = df_cp.drop(
        df_cp[df_cp['label'] == 0].sample(
            n=zero_count - one_count, random_state=1
        ).index
    )
    # 1.0
    return df_cp.sample(frac=1.0)

또한 쿼리에서도, negative example을 가져올 때, node by node로 모두 가져오는 것이 아닌, “노드 간에 2 ~ 3의 거리를 가진 거리 중에서 연결되지 않은 edge”를 False로 가져왔습니다. 즉, 말도 안되게 멀리 있는 edge들은 제외한 것이죠.
다만, 이를 다르게 말하면, 너무 가까이 있는 edge만을 학습하는 식으로 과적합되어다고 할 수 있는 것은 아닐까, 싶어요. 하지만, 일단 하라고 하니까 합니다.

# 우선, Train set에서 POSITIVE한 edge들을 읽어서 가져옵니다.
if TRAIN_POSITIVE_READ:
    print(f"== Train_POSITIVE Read {datetime.datetime.now()}")
    #start_time = time.time()
    #print(datetime.datetime.now())
    train_existing_links = graph.run("""
    MATCH (author:Author)-[:CO_AUTHOR_EARLY]->(other:Author)
    RETURN id(author) AS node1, id(other) AS node2, 1 AS label
    """).to_data_frame()
    #print(train_existing_links.describe())
    train_existing_links.to_pickle('TRAIN_POSITIVE.pkl')
    print(f"== Train_POSITIVE complete {datetime.datetime.now()}")
else:
    train_existing_links = pd.read_pickle("TRAIN_POSITIVE.pkl")
print("== Train True Desc")
#print(train_existing_links.describe())
print("=="*30)

# Train set에서 NEGATIVE한 edge들을 읽어서 가져옵니다.
# 앞서 말한 바와 같이, 2-3단계로 접근가능한 Edge만을 읽어서 False로 넣어줍니다.
# 속도가 참으로 오래 거리므로, 가능하면 로컬에 저장하시는 걸 추천드립니다.
if TRAIN_NEGATIVE_READ:
    print(f"== Train_NEGATIVE Read {datetime.datetime.now()}")
    train_missing_links = graph.run("""
    // 누군가와 CO_AUTHOR_EARLY 관계를 맺고 있는 author를 찾고
    MATCH (author:Author)
    WHERE (author)-[:CO_AUTHOR_EARLY]-()
    // author 2, 3단계 로 CO_AUTHOR_EARLY 관계를 맺고 있는 other_author를 찾고,
    MATCH (author)-[:CO_AUTHOR_EARLY*2..3]-(other_author)
    // author, other_author는 직접적으로 CO_AUTHOR_EARLY를 맺고 있지는 않고, 
    WHERE not((author)-[:CO_AUTHOR_EARLY]-(other_author))
    // author, other_author의 id를 리턴한다. 
    // 여기서, 0을 label로 리턴하는 것은 이것이 negative example 즉, False이기 때문.
    RETURN id(author) AS node1, id(other_author) AS node2, 0 AS label
    """).to_data_frame()

    train_missing_links = train_missing_links.drop_duplicates()
    train_missing_links.to_pickle('TRAIN_NEGATIVE.pkl')
    print(f"== Train_NEGATIVE complete {print(datetime.datetime.now())}")
else:
    train_missing_links = pd.read_pickle("TRAIN_NEGATIVE.pkl")
# 이제 train_existing_links, train_missing_links 각각에 원하는 데이터가 들어갔으며, 
# DB에 부하를 걸지 않기 위해 로컬에 피클로 저장하였다.
print("== Train False Desc")
#print(train_missing_links.describe())
print("=="*30)

위 코드를 실행하고 나면, TRAIN_SET의 POSITIVE(선이 존재하는 경우), NEGATIVE(선이 존재하지 않는 경우)에 대해서 각각 train_existing_links, train_missing_links에 값들이 저장됩니다.
그리고 아래와 같은 방식으로 TEST에 대해서도 똑같이 진행을 해줍니다.

if TEST_POSITIVE_READ:
    print(f"== Test_POSITIVE Read {datetime.datetime.now()}")
    test_existing_links = graph.run("""
    MATCH (author:Author)-[:CO_AUTHOR_LATE]->(other:Author)
    RETURN id(author) AS node1, id(other) AS node2, 1 AS label
    """).to_data_frame()

    test_existing_links.to_pickle('TEST_POSITIVE.pkl')
    print(f"== Test_POSITIVE complete {datetime.datetime.now()}")
else:
    test_existing_links = pd.read_pickle("TEST_POSITIVE.pkl")
print("== Test True Desc")
#print(train_missing_links.describe())
print("=="*30)
if TEST_NEGATIVE_READ:
    print(f"== Test_NEGATIVE Read {datetime.datetime.now()}")
    test_missing_links = graph.run("""
    MATCH (author:Author)
    WHERE (author)-[:CO_AUTHOR_LATE]-()
    MATCH (author)-[:CO_AUTHOR_LATE*2..3]-(other)
    WHERE not((author)-[:CO_AUTHOR_LATE]-(other))
    RETURN id(author) AS node1, id(other) AS node2, 0 AS label
    """).to_data_frame()
    test_missing_links = test_missing_links.drop_duplicates()
    test_missing_links.to_pickle('TEST_NEGATIVE.pkl')
    print(f"== Test_NEGATIVE complete {datetime.datetime.now()}")
else:
    test_missing_links = pd.read_pickle("TEST_NEGATIVE.pkl")

그리고, 다음으로 각각을 하나의 dataframe에 합쳐줍니다.

training_df = train_missing_links.append(train_existing_links, ignore_index=True)
test_df = test_missing_links.append(test_existing_links, ignore_index=True)
#training_df['label'] = training_df['label'].astype('category')

training_df = down_sample(training_df)
test_df = down_sample(test_df)

Generating graphy features

자, 이제 link prediction을 예측하는 모델을 세울겁니다. 그전에, 생각해보면 우리한테는 어떤 지표도 없습니다. 각 Node가 가지고 있는 값들이 있고, 이 값들을 사용해서 Y(선이 연결되어 있느냐 아니냐)를 예측해야 하죠. 우리는 지금 아무 값도 없기 때문에, 이 값을 생성해줘야 합니다.
Graph적인 특성으로 보았을 때, Degree, Centrality, PageRank 등 다양한 알고리즘들이 있습니다. 물론, Networkx로 쓰면 훨씬 쉽게 처리할 수 있습니다만, 제가 참고한 링크에서는 이를 굳이 cypher를 사용해서 처리하였으므로 이 부분을 cypher를 처리해서 정리해줍니다.

def apply_graphy_features(input_df, rel_type):
    # input_df: 'node1', 'node2', 'label'을 갖춘 dataframe 
    # rel_type : "CO_AUTHOR_EARLY" or "CO_AUTHOR_LATE"
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           algo.linkprediction.commonNeighbors(
               p1, p2, {relationshipQuery: $relType}
            ) AS cn,
           algo.linkprediction.preferentialAttachment(
               p1, p2, {relationshipQuery: $relType}
            ) AS pa,
           algo.linkprediction.totalNeighbors(
               p1, p2, {relationshipQuery: $relType}
            ) AS tn
    """
    pairs = [
        {"node1": node1, "node2": node2}  for node1, node2 in input_df[["node1", "node2"]].values.tolist()
    ]
    return pd.merge(
        input_df, 
        graph.run(query, {"pairs": pairs, "relType": rel_type}).to_data_frame(), 
        on = ["node1", "node2"]
    )

training_df = apply_graphy_features(training_df, "CO_AUTHOR_EARLY")
test_df = apply_graphy_features(test_df, "CO_AUTHOR_LATE")

fitting binary classifier

이제 classifier에 데이터를 학습시키고, 그 결과를 확인합니다.

classifier = RandomForestClassifier(n_estimators=30, max_depth=10, random_state=0)

## apply graphy features
training_df = apply_graphy_features(training_df, "CO_AUTHOR_EARLY")
print(training_df.head(5))
print("== training_df add graphy featues ends")
test_df = apply_graphy_features(test_df, "CO_AUTHOR_LATE")
print(test_df.head(5))
print("== test_df add graphy featues ends")


## fitting 
columns = ["cn", "pa", "tn"]

X_train, y_train = training_df[columns], training_df["label"]
X_test, y_test = test_df[columns], test_df['label']

classifier.fit(X_train, y_train)
print("== ML RandomClassifier fitting start")
print("== ML RandomClassifier fitting ends")

y_test_pred = classifier.predict(X_test)

eval_df = evaluate_model(y_test_pred, y_test)
print(eval_df)

wrap-up

저는 원래, networkx를 주로 사용해서 데이터로부터 graphy feature를 뽑아냅니다. cypher에서 쓰고 있는 대부분의 알고리즘들은 당연히, networkx를 통해서도 적용이 가능한데요, 저에게는 cypher가 아직은 좀 낯설게 느껴지네요.
그래서 좀 이해하다가 좀 귀찮기도 해서 뒤쪽은 약간 대충 마무리했습니다. 흐름은 대충 알겠는데 너무 꼼꼼하게 다 볼 필요는 없다고 생각했거든요.
다만, 간단하게라도, link prediction을 분석하려면 어떤 방법으로 진행하면 좋은지에 대해서 정리할 수 있어서 좋았던 것 같습니다. 그리고, 원래 본문에는 이 외에도 좀더 다양한 알고리즘들을 사용하기는 했지만, 저는 여기에서 일단 마무리하게로 했습니다. 다른 알고리즘들은 그냥 networkx에서 같은 라이브러리를 써서 테스트해보는 것이 훨씬 효율적인 것 같아요

raw-code

from py2neo import Graph
import pandas as pd

import time
import datetime

import pickle

import matplotlib 
import matplotlib.pyplot as plt

#plt.style.use('fivethirtyeight')
#pd.set_option('display.float_format', lambda x: '%.3f' % x)

import pandas as pd
from collections import Counter

#import sklearn
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

def down_sample(df):
    # df: 'label' 칼럼을 가지고 있어야 하며, 
    # label 칼럼은 0과 1만으로 구성되어 있고, 0이 1보다 많아야 함.
    df_cp = df.copy() 
    # label 칼럼의 값들(0, 1로 구성)을 개수로 Counter로 구축.
    binary_count_dict = Counter(df_cp['label'].values)
    # 0, 즉, False의 개수 : 1, 즉, True의 개수
    # 0이 1보다 훨씬 많으므로 그 수만큼을 없애야 함.
    zero_count = binary_count_dict[0]
    one_count = binary_count_dict[1]
    # False인 것들을 찾고, n개 만큼 샘플링하여, index를 넘겨서 드롭해줌.
    df_cp = df_cp.drop(
        df_cp[df_cp['label'] == 0].sample(
            n=zero_count - one_count, random_state=1
        ).index
    )
    # 1.0
    return df_cp.sample(frac=1.0)
    
def apply_graphy_features(input_df, rel_type):
    # input_df: 'node1', 'node2', 'label'을 갖춘 dataframe 
    # rel_type : "CO_AUTHOR_EARLY" or "CO_AUTHOR_LATE"
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           algo.linkprediction.commonNeighbors(
               p1, p2, {relationshipQuery: $relType}
            ) AS cn,
           algo.linkprediction.preferentialAttachment(
               p1, p2, {relationshipQuery: $relType}
            ) AS pa,
           algo.linkprediction.totalNeighbors(
               p1, p2, {relationshipQuery: $relType}
            ) AS tn
    """
    pairs = [{"node1": node1, "node2": node2}  for node1, node2 in input_df[["node1", "node2"]].values.tolist()]
    features_df = graph.run(query, {"pairs": pairs, "relType": rel_type}).to_data_frame()
    return pd.merge(input_df, features_df, on = ["node1", "node2"])
def evaluate_model(predictions, actual):
    return pd.DataFrame({
        "Measure": ["Accuracy", "Precision", "Recall"],
        "Score": [accuracy_score(actual, predictions), 
                  precision_score(actual, predictions), 
                  recall_score(actual, predictions)]
    })
# Change the line of code below to use the IP Address, Bolt Port, and Password of your Sandbox.
# graph = Graph("bolt://<IP Address>:<Bolt Port>", auth=("neo4j", "<Password>")) 

########################## 
graph = Graph("bolt://18.207.236.58:35863", auth=("neo4j", "<password>"))

SET_COAUTHOR_REL = False
if SET_COAUTHOR_REL:
    query = """
    // apoc.periodic.iterate는 일종의 작업관리(job management)로서, 
    // 한번에 너무 많은 일들을 시키면 데이터베이스에 무리가 갈 수 있으므로, 
    // 분할하여 일정한 batch size로 꾸준하게 처리해주는 것을 말합니다.
    CALL apoc.periodic.iterate(
        // first statement: 적용대상을 찾는 쿼리, 
        "MATCH (a1)<-[:AUTHOR]-(paper)-[:AUTHOR]->(a2:Author)
        WITH a1, a2, paper
        ORDER BY a1, paper.year
        RETURN a1, a2, collect(paper)[0].year AS year, count(*) AS collaborations",
        // second statement: first statement의 결과에 무엇을 적용할 것인가? 
        "MERGE (a1)-[coauthor:CO_AUTHOR {year: year}]-(a2)
        SET coauthor.collaborations = collaborations", 
        // config: batchSize 몇개나 적용할 것인가? 
        {batchSize: 100}
    )
    """
    print(graph.run(query).stats())
    print("== coauthor relationship updated")
    print("=="*30)

LABEL_EARLY_LATE = False
if LABEL_EARLY_LATE==True: 
    query = """
    MATCH (a)-[r:CO_AUTHOR]->(b) 
    where r.year < 2006
    MERGE (a)-[:CO_AUTHOR_EARLY {year: r.year}]-(b);
    """
    print(graph.run(query).stats())
    print("CO_AUTHOR_EARLY labeled")
    query = """
    MATCH (a)-[r:CO_AUTHOR]->(b) 
    where r.year >= 2006
    MERGE (a)-[:CO_AUTHOR_LATE {year: r.year}]-(b);
    """

    print(graph.run(query).stats())
    print("CO_AUTHOR_LATE labeled")

# TRAIN SET: CO_AUTHOR_EARLY
# TEST  SET: CO_AUTHOR_LATE
# train_set에서 Positive Sample을 뽑는다.

TRAIN_POSITIVE_READ = False
TRAIN_NEGATIVE_READ = False
TEST_POSITIVE_READ = False
TEST_NEGATIVE_READ = False

if TRAIN_POSITIVE_READ:
    print(f"== Train_POSITIVE Read {datetime.datetime.now()}")
    #start_time = time.time()
    #print(datetime.datetime.now())
    train_existing_links = graph.run("""
    MATCH (author:Author)-[:CO_AUTHOR_EARLY]->(other:Author)
    RETURN id(author) AS node1, id(other) AS node2, 1 AS label
    """).to_data_frame()
    #print(train_existing_links.describe())
    train_existing_links.to_pickle('TRAIN_POSITIVE.pkl')
    print(f"== Train_POSITIVE complete {datetime.datetime.now()}")
else:
    train_existing_links = pd.read_pickle("TRAIN_POSITIVE.pkl")
print("== Train True Desc")
#print(train_existing_links.describe())
print("=="*30)



# 속도가 참으로 오래 거리므로, 가능하면 로컬에 저장하시는 걸 추천드립니다.
if TRAIN_NEGATIVE_READ:
    print(f"== Train_NEGATIVE Read {datetime.datetime.now()}")
    train_missing_links = graph.run("""
    // 누군가와 CO_AUTHOR_EARLY 관계를 맺고 있는 author를 찾고
    MATCH (author:Author)
    WHERE (author)-[:CO_AUTHOR_EARLY]-()
    // author 2, 3단계 로 CO_AUTHOR_EARLY 관계를 맺고 있는 other_author를 찾고,
    MATCH (author)-[:CO_AUTHOR_EARLY*2..3]-(other_author)
    // author, other_author는 직접적으로 CO_AUTHOR_EARLY를 맺고 있지는 않고, 
    WHERE not((author)-[:CO_AUTHOR_EARLY]-(other_author))
    // author, other_author의 id를 리턴한다. 
    // 여기서, 0을 label로 리턴하는 것은 이것이 negative example 즉, False이기 때문.
    RETURN id(author) AS node1, id(other_author) AS node2, 0 AS label
    """).to_data_frame()

    train_missing_links = train_missing_links.drop_duplicates()
    train_missing_links.to_pickle('TRAIN_NEGATIVE.pkl')
    print(f"== Train_NEGATIVE complete {print(datetime.datetime.now())}")
else:
    train_missing_links = pd.read_pickle("TRAIN_NEGATIVE.pkl")
# 이제 train_existing_links, train_missing_links 각각에 원하는 데이터가 들어갔으며, 
# DB에 부하를 걸지 않기 위해 로컬에 피클로 저장하였다.
print("== Train False Desc")
#print(train_missing_links.describe())
print("=="*30)



if TEST_POSITIVE_READ:
    print(f"== Test_POSITIVE Read {datetime.datetime.now()}")
    test_existing_links = graph.run("""
    MATCH (author:Author)-[:CO_AUTHOR_LATE]->(other:Author)
    RETURN id(author) AS node1, id(other) AS node2, 1 AS label
    """).to_data_frame()

    test_existing_links.to_pickle('TEST_POSITIVE.pkl')
    print(f"== Test_POSITIVE complete {datetime.datetime.now()}")
else:
    test_existing_links = pd.read_pickle("TEST_POSITIVE.pkl")
print("== Test True Desc")
#print(train_missing_links.describe())
print("=="*30)
if TEST_NEGATIVE_READ:
    print(f"== Test_NEGATIVE Read {datetime.datetime.now()}")
    test_missing_links = graph.run("""
    MATCH (author:Author)
    WHERE (author)-[:CO_AUTHOR_LATE]-()
    MATCH (author)-[:CO_AUTHOR_LATE*2..3]-(other)
    WHERE not((author)-[:CO_AUTHOR_LATE]-(other))
    RETURN id(author) AS node1, id(other) AS node2, 0 AS label
    """).to_data_frame()
    test_missing_links = test_missing_links.drop_duplicates()
    test_missing_links.to_pickle('TEST_NEGATIVE.pkl')
    print(f"== Test_NEGATIVE complete {datetime.datetime.now()}")
else:
    test_missing_links = pd.read_pickle("TEST_NEGATIVE.pkl")

print("== Test False Desc")
#print(train_missing_links.describe())
print("=="*30)

training_df = train_missing_links.append(train_existing_links, ignore_index=True)
test_df = test_missing_links.append(test_existing_links, ignore_index=True)
#training_df['label'] = training_df['label'].astype('category')

training_df = down_sample(training_df)
test_df = down_sample(test_df)

## chooosing machine learning algorithm

classifier = RandomForestClassifier(n_estimators=30, max_depth=10, random_state=0)

training_df = apply_graphy_features(training_df, "CO_AUTHOR_EARLY")
print(training_df.head(5))
print("== training_df add graphy featues ends")
test_df = apply_graphy_features(test_df, "CO_AUTHOR_LATE")
print(test_df.head(5))
print("== test_df add graphy featues ends")



columns = ["cn", "pa", "tn"]

X_train, y_train = training_df[columns], training_df["label"]
X_test, y_test = test_df[columns], test_df['label']

classifier.fit(X_train, y_train)
print("== ML RandomClassifier fitting start")
print("== ML RandomClassifier fitting ends")

y_test_pred = classifier.predict(X_test)

eval_df = evaluate_model(y_test_pred, y_test)
print(eval_df)

Twitter Facebook LinkedIn

frhyme

neo4j - data science - part 4.2

link prediction

Exercise 2: Building a binary classifier

Building a co-author graph

Train and Test split

Positive and Negative examples

Generating graphy features

fitting binary classifier

wrap-up

raw-code

공유하기

댓글남기기