## Identifier Normalizer v0.1.0

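### Summary
- Finds identifiers in the given text and replaces each one with a unique ID code
- The input is always assumed to be raw text (the text portion extracted from an HTML page with bs4 or a similar tool)
- Tested with: Python 3.6+
- Dependencies: [spacy==3.0.1](https://pypi.org/project/spacy/3.0.1/) (for tokenization), [cryptoaddress](https://pypi.org/project/cryptoaddress/) (for identifying cryptocurrency addresses)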
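
The raw-text assumption above can be illustrated with a minimal sketch. It uses the standard library's `html.parser` as a stand-in for bs4, and the names `TextExtractor` and `html_to_raw_text` are hypothetical helpers, not part of this project.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_raw_text(html):
    """Hypothetical helper: returns the visible text of an HTML string."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.get_text()


html = (
    "<html><body><h1>Payment Address</h1>"
    "<p>Send 0.01 BTC</p><script>var x = 1;</script></body></html>"
)
raw_text = html_to_raw_text(html)
print(raw_text)
```

The resulting string is the kind of input the normalizer expects; with bs4 installed, `BeautifulSoup(html, "html.parser").get_text()` would serve the same purpose.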


### Target identifiers

|Type|Code|Description|Example|
|------|---|---|---|
|Email|ID_EMAIL|||
|Onion URL|ID_ONION_URL|||
|Normal URL|ID_NORMAL_URL|||
|IP address|ID_IP_ADDRESS|||
|Bitcoin address|ID_BTC_ADDRESS|||
|Ethereum address|ID_ETH_ADDRESS|||
|Litecoin address|ID_LTC_ADDRESS|||
|Cryptocurrency amount|ID_CRYPTO_MONEY|||
|Fiat currency amount|ID_GENERAL_MONEY|||
|Length|ID_LENGTH|||
|Weight|ID_WEIGHT|||
|Volume|ID_VOLUME|||
|Percentage|ID_PERCENTAGE|||
|Version|ID_VERSION|||
|File name|ID_FILENAME|||
|File size|ID_FILESIZE|||
|Time|ID_TIME|||
|Brand/trademark name|ID_BRAND_NAME|||
|Number|ID_NUMBER|||
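
To make the idea of replacing identifiers with codes concrete, here is a simplified sketch covering four of the codes in the table. The regular expressions and the `replace_identifiers` function are illustrative assumptions, not the project's actual rules; the real `normalizer_main.preprocess` also lowercases, lemmatizes, and strips punctuation, which this sketch does not do.

```python
import re

# Hypothetical, simplified patterns. More specific identifiers come first
# so that the generic number rule does not consume parts of them.
PATTERNS = [
    ("ID_EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("ID_IP_ADDRESS", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("ID_PERCENTAGE", re.compile(r"\b\d+(?:\.\d+)?%")),
    ("ID_NUMBER", re.compile(r"\b\d+(?:\.\d+)?\b")),
]


def replace_identifiers(text):
    """Replace each matched identifier with its code, in PATTERNS order."""
    for code, pattern in PATTERNS:
        text = pattern.sub(code, text)
    return text


sample = "Contact admin@example.com from 127.0.0.1, 35% bonus, port 41538"
print(replace_identifiers(sample))
# Contact ID_EMAIL from ID_IP_ADDRESS, ID_PERCENTAGE bonus, port ID_NUMBER
```

The ordering matters in this sketch: if the number rule ran first, it would break IP addresses and percentages into separate `ID_NUMBER` tokens.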


### Installation and usage

Setup
```
python3 -m venv ./venv
source venv/bin/activate
pip install spacy==3.0.1 cryptoaddress
python -m spacy download en_core_web_sm
```

Run (demo)
```python
import spacy
import normalizer_main
texts = [
    "To participate, you just need to send from 0.01 BTC to 20 BTC to the contribution address and"
    " we will immediately send you back 0.2 BTC to 40 BTC to the address you sent it from. (x2 back)\n\n"
    "SPECIAL OFFER:\n"
    "If you send 5+ BTC, you will be airdropped 10 BTC back +35% bonus\n\n"
    "Payment Address\n"
    "You can send BTC to the following address:\n"
    "164auQnEcxJQs5ea1WVAtpYFfaKDbDek6T"
    ,
    
    "iPhone 11 Pro Max 64 GB - $749\n"
    "iPhone 11 Pro Max 256 GB - $899\n"
    "iPhone 11 Pro Max 512 GB - $999"
    ,
    
    "The wallets have a balance between 10 ₿ and 0.01 ₿, depending on how much I want to get rid of.\n"
    "The price is always 50% of the balance.\n"
    "This wallet has a value of 1.6 BTC and received its balance on 03/20/2020."
    ,
    
    "Self: /index.php\n"
    "MyURL: http://mgioamqnhbbxkos4.onion:80//index.php\n"
    "Server Address: [127.0.0.1:80]\n"
    "Server Name: 'mgioamqnhbbxkos4.onion'\n"
    "Remote Address: [127.0.0.1] Port 41538"
    ,
    
    "Specifications:\n"
    "Caliber: 7,62x51mm NATO"
    "Operation: Gas operated rotating bolt\n"
    "Magazine Capacity: 5 - 10 - 20 rounds\n"
    "Length: 1029 mm\n"
    "Barrel Length: 457 mm\n"
    "Weight: 5,440 kg\n"
    "Price on market 13000$"
    ,
    
    "Jambler.io Partner BTC Mixer Bitcoin\n"
    "Official TOR Mirror:\n"
    "overtsgjd4xmgu25uegho7p3ez47solhiri5xpylcgm2tlofbafrzwid.onion"
]

# Lemmatization needs the "tagger"; "parser" and "ner" are not needed
spacy_nlp_model = spacy.load('en_core_web_sm', exclude=["parser", "ner"])

for text in texts:
    print('------- Original Text -------')
    print(text)
    text = normalizer_main.preprocess(text, spacy_nlp=spacy_nlp_model)
    print('------- Normalized Text --------')
    print(text)
    print()

```

Output
```
------- Original Text -------
To participate, you just need to send from 0.01 BTC to 20 BTC to the contribution address and we will immediately send you back 0.2 BTC to 40 BTC to the address you sent it from. (x2 back)

SPECIAL OFFER:
If you send 5+ BTC, you will be airdropped 10 BTC back +35% bonus

Payment Address
You can send BTC to the following address:
164auQnEcxJQs5ea1WVAtpYFfaKDbDek6T
------- Normalized Text --------
to participate you just need to send from ID_CRYPTO_MONEY to ID_CRYPTO_MONEY to the contribution address and we will immediately send you back ID_CRYPTO_MONEY to ID_CRYPTO_MONEY to the address you send it from x2 back

special offer
if you send ID_CRYPTO_MONEY you will be airdrop ID_CRYPTO_MONEY back + ID_PERCENTAGE bonus

payment address
you can send btc to the following address
ID_BTC_ADDRESS

------- Original Text -------
iPhone 11 Pro Max 64 GB - $749
iPhone 11 Pro Max 256 GB - $899
iPhone 11 Pro Max 512 GB - $999
------- Normalized Text --------
iphone ID_NUMBER pro max ID_FILESIZE ID_GENERAL_MONEY
iphone ID_NUMBER pro max ID_FILESIZE ID_GENERAL_MONEY
iphone ID_NUMBER pro max ID_FILESIZE ID_GENERAL_MONEY

------- Original Text -------
The wallets have a balance between 10 ₿ and 0.01 ₿, depending on how much I want to get rid of.
The price is always 50% of the balance.
This wallet has a value of 1.6 BTC and received its balance on 03/20/2020.
------- Normalized Text --------
the wallet have a balance between ID_CRYPTO_MONEY and ID_CRYPTO_MONEY depend on how much i want to get rid of
the price be always ID_PERCENTAGE of the balance
this wallet have a value of ID_CRYPTO_MONEY and receive its balance on ID_TIME

------- Original Text -------
Self: /index.php
MyURL: http://mgioamqnhbbxkos4.onion:80//index.php
Server Address: [127.0.0.1:80]
Server Name: 'mgioamqnhbbxkos4.onion'
Remote Address: [127.0.0.1] Port 41538
------- Normalized Text --------
self /ID_FILENAME
myurl ID_NORMAL_URL
server address ID_IP_ADDRESS
server name ID_ONION_URL
remote address ID_IP_ADDRESS port ID_NUMBER

------- Original Text -------
Specifications:
Caliber: 7,62x51mm NATOOperation: Gas operated rotating bolt
Magazine Capacity: 5 - 10 - 20 rounds
Length: 1029 mm
Barrel Length: 457 mm
Weight: 5,440 kg
Price on market 13000$
------- Normalized Text --------
specification
caliber ID_NUMBER,62xID_LENGTH natooperation gas operate rotate bolt
magazine capacity ID_NUMBER ID_NUMBER round
length ID_LENGTH
barrel length ID_LENGTH
weight ID_WEIGHT
price on market ID_GENERAL_MONEY

------- Original Text -------
Jambler.io Partner BTC Mixer Bitcoin
Official TOR Mirror:
overtsgjd4xmgu25uegho7p3ez47solhiri5xpylcgm2tlofbafrzwid.onion
------- Normalized Text --------
ID_NORMAL_URL partner btc mixer bitcoin
official tor mirror
ID_ONION_URL
```

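### Document similarity using normalization and Jaccard similarity

- Normalize each of the two given texts so that values that differ between them, such as file sizes and monetary amounts, are unified into the same identifier code
- Build the token set of each normalized text, then compute the Jaccard similarity between the two token sets (a value between 0 and 1)
- Jaccard similarity reference: https://wikidocs.net/24654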
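
As a minimal sketch of the measure itself: the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The function `jaccard_similarity` below is a hypothetical stand-in written for illustration; it is not the implementation of `normalizer_main.compute_jaccard_similarity_between_two_token_sets`, and returning 0.0 for two empty sets is an assumption of this sketch.

```python
def jaccard_similarity(set_a, set_b):
    """Return |A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty (assumed)."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Whitespace tokens of the two un-normalized example texts used in this section.
tokens_1 = set("iPhone 11 Pro Max 64 GB - $749 iPhone 11 Pro Max 256 GB - $899".split())
tokens_2 = set("iPhone 12 Pro Max 256 GB - $1099 iPhone 12 Pro Max 512 GB - $1299".split())

# 6 shared tokens out of 14 distinct tokens in total.
print(f"{jaccard_similarity(tokens_1, tokens_2):.2f}")  # 0.43
```

After normalization both texts reduce to the same six tokens, so the intersection equals the union and the similarity becomes 1.00.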

```python
import spacy
import normalizer_main

spacy_nlp_model = spacy.load('en_core_web_sm', exclude=["parser", "ner"])

text1 = (
    "iPhone 11 Pro Max 64 GB - $749\n"
    "iPhone 11 Pro Max 256 GB - $899\n"
)
text2 = (
    "iPhone 12 Pro Max 256 GB - $1099\n"
    "iPhone 12 Pro Max 512 GB - $1299\n"
)

orig_text1_token_set = set(text1.split())
orig_text2_token_set = set(text2.split())
orig_jaccard_sim = normalizer_main.compute_jaccard_similarity_between_two_token_sets(orig_text1_token_set, orig_text2_token_set)

print('------- Text 1 (original) --------')
print(text1)
print('------- Text 2 (original) --------')
print(text2)
print(f'Jaccard similarity: {orig_jaccard_sim:.2f}')
print('\n')

norm_text1 = normalizer_main.preprocess(text1, spacy_nlp=spacy_nlp_model)
norm_text2 = normalizer_main.preprocess(text2, spacy_nlp=spacy_nlp_model)
norm_text1_token_set = set(norm_text1.split())
norm_text2_token_set = set(norm_text2.split())
norm_jaccard_sim = normalizer_main.compute_jaccard_similarity_between_two_token_sets(norm_text1_token_set, norm_text2_token_set)

print('------- Text 1 (normalized) --------')
print(norm_text1)
print('------- Text 2 (normalized) --------')
print(norm_text2)
print(f'Jaccard similarity: {norm_jaccard_sim:.2f}')
print()
```

Output
```
------- Text 1 (original) --------
iPhone 11 Pro Max 64 GB - $749
iPhone 11 Pro Max 256 GB - $899

------- Text 2 (original) --------
iPhone 12 Pro Max 256 GB - $1099
iPhone 12 Pro Max 512 GB - $1299

Jaccard similarity: 0.43


------- Text 1 (normalized) --------
iphone ID_NUMBER pro max ID_FILESIZE ID_GENERAL_MONEY
iphone ID_NUMBER pro max ID_FILESIZE ID_GENERAL_MONEY

------- Text 2 (normalized) --------
iphone ID_NUMBER pro max ID_FILESIZE ID_GENERAL_MONEY
iphone ID_NUMBER pro max ID_FILESIZE ID_GENERAL_MONEY

Jaccard similarity: 1.00

```
