Jerry's DevLog

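Hello! This is Jerry!

Last time, I set up MongoDB Atlas so that I could try out Vector Search! (previous post)
But... I figured some readers might look at that post and think:

Uhm... so how do I do this with my own data?

So this time, I decided to write up the whole flow, from loading data into MongoDB to actually running a Vector Search!

As before, the code is in the following repository, so feel free to refer to it!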
https://github.com/jjerry-k/atlas_local

Here is a quick outline of the example.

1. Add data
2. Register an Atlas Search index for Vector Search
3. Test Vector Search

1. Add data

Whether you are doing vector search or anything else, the DB needs to have data in it first!
So let's start by adding some data. I used the following Python code to do it.

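import numpy as np
from pymongo import MongoClient

# Connect to the local deployment with the connection string used in this post.
client = MongoClient("mongodb://root:root@localhost:27778/?directConnection=true")
db = client.sample
collection = db["example"]

dimensions = 128

# Insert nine documents with random vectors (IDs 0 through 8).
for i in range(9):
    collection.insert_one({
        "ID": i,
        "vector": np.random.random(dimensions).tolist()
    })

# Insert one document whose vector is all ones (ID 9).
# The search in step 3 queries with the same vector, so this is a known exact match.
collection.insert_one({
    "ID": 9,
    "vector": np.ones(dimensions).tolist()
})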

In short, this adds 10 documents to the example collection of the sample DB, each with a vector field holding a list of 128 float values.

If you run the example, documents similar to the following are stored.

{
  "_id": {
    "$oid": "65d4bcb767093fdae0fe7ecc"
  },
  "ID": 3,
  "vector": [
    0.8295848971397636,
    0.2224359780682411,
    0.8689565429640854,
    0.0003077118196677109,
    0.8275545366948728,
    ...
  ]
}
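For readers who want to load their own vectors instead of random ones, here is a minimal sketch of how embeddings could be turned into documents of the same shape. The helper name to_documents and the stand-in data my_vectors are hypothetical and not part of the repository; the sketch only assumes that every vector must have the same length as the dimensions value used for the index.

```python
import math


def to_documents(vectors, dimensions=128):
    """Turn a list of embeddings into documents shaped like the ones in this post."""
    documents = []
    for i, vector in enumerate(vectors):
        # Every vector must match the dimension declared for the index.
        if len(vector) != dimensions:
            raise ValueError(f"vector {i} has {len(vector)} values, expected {dimensions}")
        values = [float(x) for x in vector]
        # Reject NaN and infinity early instead of storing unusable vectors.
        if not all(math.isfinite(x) for x in values):
            raise ValueError(f"vector {i} contains NaN or infinity")
        documents.append({"ID": i, "vector": values})
    return documents


# Hypothetical stand-in for your own embeddings: three constant vectors.
my_vectors = [[0.1 * (i + 1)] * 128 for i in range(3)]
docs = to_documents(my_vectors)
print(len(docs), len(docs[0]["vector"]))

# With a live connection, the documents could then be loaded in one call:
# collection.insert_many(docs)
```

Checking the length before inserting is a design choice for this sketch: a vector with the wrong number of dimensions would not match the index definition, so it seems safer to fail before anything is written.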

2. Create an Atlas Search index

To use vector search, you need to create a data structure called an Atlas Search index. (reference)
In this step you declare which field of which collection in which DB will be used for vector search. An example of the format looks like this.

{
    "database": "sample",
    "collectionName": "example",
    "mappings": {
        "fields": {
            "vector": {
                "type": "knnVector",
                "dimensions": 128,
                "similarity": "euclidean"
            }
        }
    },
    "name": "vectorSearchIndex"
}

Simply put, this defines an index that measures Euclidean similarity on the vector field of the example collection in the sample DB, and names it vectorSearchIndex.
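One possible way to register this definition from Python is sketched below. This is an assumption rather than the method used in the repository: it relies on Collection.create_search_index, which PyMongo provides from version 4.5, and on the deployment accepting search index creation through the driver. The helper names split_index_spec and register_index are hypothetical.

```python
def split_index_spec(spec):
    """Split the JSON spec from this post into (database, collection, model)."""
    model = {
        "name": spec["name"],
        # PyMongo expects the index body under the "definition" key.
        "definition": {"mappings": spec["mappings"]},
    }
    return spec["database"], spec["collectionName"], model


def register_index(client, spec):
    """Create the index through PyMongo; needs a running deployment, so it is not called here."""
    database, collection_name, model = split_index_spec(spec)
    # Assumes PyMongo 4.5 or later, where create_search_index is available.
    return client[database][collection_name].create_search_index(model)


index_spec = {
    "database": "sample",
    "collectionName": "example",
    "mappings": {
        "fields": {
            "vector": {
                "type": "knnVector",
                "dimensions": 128,
                "similarity": "euclidean",
            }
        }
    },
    "name": "vectorSearchIndex",
}

database, collection_name, model = split_index_spec(index_spec)
print(database, collection_name, model["name"])
```

With a MongoClient connected as in step 1, calling register_index(client, index_spec) would submit the definition. Index builds can take a moment, so a search issued immediately afterwards may not see the index yet.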

3. Test Vector Search

This is the last step!
Let's actually run a Vector Search.

import numpy as np
from pymongo import MongoClient

client = MongoClient("mongodb://root:root@localhost:27778/?directConnection=true")
db = client.sample

similar_docs = db["example"].aggregate([
    {
        "$vectorSearch": {
            "index": "vectorSearchIndex",          # name of the index created in step 2
            "path": "vector",                      # field that holds the vectors
            "queryVector": np.ones(128).tolist(),  # same all-ones vector as the document with ID 9
            "numCandidates": 10,
            "limit": 10
        }
    },
    {
        "$project": {
            "_id": 0,
            "ID": 1,
            # "vector": 1,
            "score": {"$meta": "vectorSearchScore"}
        }
    }
])

# Print each match with its ID and similarity score.
for doc in similar_docs:
    print(doc)

Again, keeping it simple!
This runs a search on the example collection of the sample DB, using the index named vectorSearchIndex and the vector field, and returns up to limit documents whose vectors are most similar to queryVector. numCandidates is the number of nearest-neighbor candidates considered during the search. (Atlas Vector Search is based on ANN search!)
Once results come back, the $project stage hides the _id value and outputs only ID and score.
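To make the relationship between limit and numCandidates explicit, the pipeline can be wrapped in a small builder. This is only a sketch: the function name build_vector_search_pipeline is hypothetical, and the checks reflect the documented $vectorSearch constraints as understood here (numCandidates must not be smaller than limit and must not exceed 10000), so please confirm them against the version you run.

```python
def build_vector_search_pipeline(query_vector, limit, num_candidates,
                                 index="vectorSearchIndex", path="vector"):
    """Build the same two-stage pipeline as in this post, with basic sanity checks."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # Assumed constraint: limit <= numCandidates <= 10000.
    if not limit <= num_candidates <= 10000:
        raise ValueError("numCandidates must be between limit and 10000")
    return [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "ID": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline([1.0] * 128, limit=3, num_candidates=10)
print(pipeline[0]["$vectorSearch"]["limit"], pipeline[0]["$vectorSearch"]["numCandidates"])
```

The returned list can be passed directly to db["example"].aggregate(pipeline). Because the search is approximate, a numCandidates value larger than limit generally gives the search more room to find the true nearest neighbors, at some cost in latency.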

Running this example prints results like the following.

{'ID': 9, 'score': 1.0}
{'ID': 8, 'score': 0.02444259263575077}
{'ID': 4, 'score': 0.023382186889648438}
{'ID': 5, 'score': 0.02316288836300373}
{'ID': 6, 'score': 0.022944210097193718}
{'ID': 2, 'score': 0.0225890651345253}
{'ID': 1, 'score': 0.022497698664665222}
{'ID': 7, 'score': 0.022399913519620895}
{'ID': 3, 'score': 0.021212652325630188}
{'ID': 0, 'score': 0.019664492458105087}
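The scores above look consistent with a normalization of 1 / (1 + squared Euclidean distance): the all-ones document matches the query exactly and scores 1.0, while a uniformly random vector in [0, 1) has an expected squared distance of about 128 / 3 ≈ 42.7 from the all-ones vector, which gives roughly 0.023. This formula is inferred from the numbers in this post, not confirmed from the server implementation, so treat the sketch below as an illustration under that assumption. It uses only the standard library, and the function name euclidean_score is hypothetical.

```python
import random


def euclidean_score(a, b):
    """Assumed normalization: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared_distance)


query = [1.0] * 128

random.seed(0)  # arbitrary seed, only to make the run repeatable
random_vector = [random.random() for _ in range(128)]

exact_match_score = euclidean_score(query, query)
random_score = euclidean_score(query, random_vector)

print(exact_match_score)  # identical vectors have distance 0, so the score is 1.0
print(random_score)       # expected to land near the 0.02 range seen in the output
```

If the assumption holds, it also explains why the nine random documents are so tightly clustered: in 128 dimensions their squared distances to the query concentrate around the same value.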

If you also want to retrieve the vector values, uncomment the "vector": 1 line.

That's it for this post!
If you have any questions or spot anything wrong, please open an issue on the repository or leave a comment!

Thank you!

Getting Started with MongoDB Atlas - (2)
