RNN Word Classifier

A language classifier powered by a recurrent neural network (RNN), implemented in Python without AI libraries.

Features

The classifier labels a single word as English, Spanish, Finnish, Dutch, or Polish, and is correct approximately 85% of the time.
It is implemented purely with numpy and Python's built-in libraries.

Model Architecture

  • Input Layer: 47 nodes representing 47 different characters
  • First Hidden Layer: 100 nodes
  • Second Hidden Layer: 100 nodes
  • Output Layer: 5 nodes representing 5 languages

The technique used in this project is a recurrent neural network (RNN):

[Figure: RNN diagram]

Here, an RNN is used to encode the word “cat” into a fixed-size vector h3.
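
The encoding step described above can be sketched roughly as follows. This is a minimal illustration that assumes the layer sizes from the architecture list (47 characters, 100 hidden units, 5 languages). The parameter names (W_x, W_h, W_out), the ReLU nonlinearity, the random weights, and the a=0, b=1, c=2 character indexing are all assumptions made for the example, not the project's actual code.

```python
import numpy as np

NUM_CHARS, HIDDEN, NUM_LANGS = 47, 100, 5  # sizes from the architecture list

rng = np.random.default_rng(0)
# Hypothetical parameter names; small random values stand in for trained weights.
W_x = rng.normal(0.0, 0.1, (NUM_CHARS, HIDDEN))    # character -> hidden
W_h = rng.normal(0.0, 0.1, (HIDDEN, HIDDEN))       # previous hidden -> hidden
W_out = rng.normal(0.0, 0.1, (HIDDEN, NUM_LANGS))  # final hidden -> language scores


def one_hot(index, size=NUM_CHARS):
    """Return a vector of zeros with a single 1 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v


def encode_word(char_indices):
    """Fold a sequence of character indices into one fixed-size hidden vector."""
    h = np.zeros(HIDDEN)
    for idx in char_indices:
        # Each step mixes the current character with the summary so far.
        h = np.maximum(0.0, one_hot(idx) @ W_x + h @ W_h)  # ReLU is an assumption
    return h


def language_scores(char_indices):
    """Map the final hidden vector to one unnormalized score per language."""
    return encode_word(char_indices) @ W_out


# "cat" under the assumed a=0, b=1, c=2, ... indexing of the character set.
cat = [2, 0, 19]
h3 = encode_word(cat)          # hidden state after the third character
scores = language_scores(cat)  # one score per language
```

Whatever the word length, the loop always ends with a vector of the same size, which is what lets a fixed output layer score words of different lengths.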

Sample Run

Training runs until validation accuracy reaches a target level:

epoch 1 iteration 24 validation-accuracy 43.0%
shaking English ( 22.4%) Pred: Dutch |en 22%|es 20%|fi 18%|nl 26%|pl 14%
relaxing English ( 23.7%) Pred: Dutch |en 24%|es 20%|fi 18%|nl 25%|pl 13%
prophecy English ( 17.6%) Pred: Spanish |en 18%|es 24%|fi 24%|nl 16%|pl 19%
tiroteo Spanish ( 25.8%) |en 21%|es 26%|fi 18%|nl 18%|pl 17%
vientre Spanish ( 24.2%) |en 17%|es 24%|fi 21%|nl 21%|pl 17%
estupenda Spanish ( 31.4%) |en 16%|es 31%|fi 18%|nl 19%|pl 16%
osti Finnish ( 21.2%) Pred: Polish |en 15%|es 19%|fi 21%|nl 20%|pl 25%
veljensä Finnish ( 19.8%) Pred: Spanish |en 21%|es 22%|fi 20%|nl 20%|pl 18%
aikoinaan Finnish ( 22.3%) |en 15%|es 21%|fi 22%|nl 21%|pl 21%
betwijfel Dutch ( 22.8%) Pred: English |en 24%|es 23%|fi 15%|nl 23%|pl 15%
merkte Dutch ( 17.1%) Pred: Spanish |en 17%|es 22%|fi 22%|nl 17%|pl 21%
beseffen Dutch ( 24.5%) |en 21%|es 19%|fi 21%|nl 25%|pl 15%
kończę Polish ( 21.5%) Pred: Spanish |en 17%|es 23%|fi 20%|nl 18%|pl 21%
firmy Polish ( 20.7%) Pred: Finnish |en 15%|es 22%|fi 23%|nl 19%|pl 21%
decyzje Polish ( 16.2%) Pred: Dutch |en 19%|es 22%|fi 20%|nl 23%|pl 16%

.
.
.

epoch 6 iteration 153 validation-accuracy 84.2%
shaking English ( 86.4%) |en 86%|es 0%|fi 1%|nl 12%|pl 1%
relaxing English ( 84.6%) |en 85%|es 0%|fi 0%|nl 15%|pl 0%
prophecy English ( 54.2%) |en 54%|es 0%|fi 0%|nl 4%|pl 41%
tiroteo Spanish ( 38.9%) |en 12%|es 39%|fi 36%|nl 6%|pl 8%
vientre Spanish ( 43.4%) |en 19%|es 43%|fi 2%|nl 29%|pl 7%
estupenda Spanish ( 75.2%) |en 1%|es 75%|fi 15%|nl 2%|pl 7%
osti Finnish ( 75.7%) |en 1%|es 1%|fi 76%|nl 3%|pl 20%
veljensä Finnish ( 81.7%) |en 0%|es 1%|fi 82%|nl 17%|pl 0%
aikoinaan Finnish ( 99.9%) |en 0%|es 0%|fi100%|nl 0%|pl 0%
betwijfel Dutch ( 98.7%) |en 1%|es 0%|fi 0%|nl 99%|pl 1%
merkte Dutch ( 71.9%) |en 10%|es 1%|fi 6%|nl 72%|pl 10%
beseffen Dutch ( 96.6%) |en 2%|es 0%|fi 0%|nl 97%|pl 0%
kończę Polish (100.0%) |en 0%|es 0%|fi 0%|nl 0%|pl100%
firmy Polish ( 29.4%) Pred: English |en 59%|es 5%|fi 2%|nl 5%|pl 29%
decyzje Polish ( 87.7%) |en 1%|es 1%|fi 0%|nl 10%|pl 88%

Test Results:

test set accuracy is: 83.800000%
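
For reference, an accuracy figure like the one above is the share of test words whose predicted language matches the true one. The sketch below assumes labels are stored as integer indices. The helper name `accuracy` and the sample labels are made up for illustration and are not taken from the project.

```python
import numpy as np


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted == actual))


# Made-up labels (indices into en, es, fi, nl, pl) for illustration only.
y_true = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
y_pred = [0, 1, 2, 3, 4, 3, 1, 2, 3, 0]
print("test set accuracy is: %f%%" % (100.0 * accuracy(y_pred, y_true)))
```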

User Input:

word: tervetuloa # welcome
predicted language is: Finnish, with a confidence of 80.011147%

word: ciudades # cities
predicted language is: Spanish, with a confidence of 88.442353%

word: właź # hatch
predicted language is: Polish, with a confidence of 99.979566%

word: algorithm
predicted language is: English, with a confidence of 79.893499%

word: resolution
predicted language is: English, with a confidence of 94.786443%

word: ademt # breathe
predicted language is: Dutch, with a confidence of 47.399565%

word: invitar # invite
predicted language is: Spanish, with a confidence of 93.986880%
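
A confidence value like those above is typically obtained by passing the five output scores through a softmax and reporting the largest probability. The sketch below assumes that approach and the en, es, fi, nl, pl ordering seen in the training output. The function names and the example scores are hypothetical.

```python
import numpy as np

LANGUAGES = ["English", "Spanish", "Finnish", "Dutch", "Polish"]


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()


def predict(scores):
    """Return (language, confidence in percent) for one score vector."""
    probs = softmax(np.asarray(scores, dtype=float))
    best = int(np.argmax(probs))
    return LANGUAGES[best], 100.0 * probs[best]


# Made-up scores for illustration only.
language, confidence = predict([0.1, 0.3, 2.5, 0.2, -0.4])
print(f"predicted language is: {language}, with a confidence of {confidence:f}%")
```

Because the probabilities sum to 1, a low confidence such as the Dutch example above means the remaining probability is spread over the other four languages.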

Dependencies

You will need numpy for this project:

pip install numpy

How To Use

Clone this project or download the zip file, then run:

py run.py

Improvements To Make

  • support saving and loading models
  • classify more languages
  • improve accuracy
  • classify a sentence or paragraph instead of words

Reference

The dataset lang_id.npz, the image illustrating the RNN, and the project skeleton are from cs188.ml.