Here is an implementation of the cosine similarity of two vectors (vec1, vec2) in Java.
------------------------------------------------------------------------------------------
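public class cosine {
    public static void main(String[] args) {
        int vec1[] = {1, 2, 5, 0, 2, 3};
        int vec2[] = {2, 1, 3, 2, 0, 1};
        double cos_sim = cosine_similarity(vec1, vec2);
        System.out.println("Cosine Similarity=" + cos_sim);
    }

    // cos(theta) = (vec1 . vec2) / (|vec1| * |vec2|)
    private static double cosine_similarity(int[] vec1, int[] vec2) {
        if (vec1.length != vec2.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dp = dot_product(vec1, vec2);
        double magnitudeA = find_magnitude(vec1);
        double magnitudeB = find_magnitude(vec2);
        // The result is NaN if either vector is all zeros.
        return dp / (magnitudeA * magnitudeB);
    }

    // Euclidean length of the vector: square root of the sum of squares.
    private static double find_magnitude(int[] vec) {
        double sum_mag = 0;
        for (int i = 0; i < vec.length; i++) {
            sum_mag = sum_mag + (double) vec[i] * vec[i];
        }
        return Math.sqrt(sum_mag);
    }

    // Sum of the products of corresponding components.
    private static double dot_product(int[] vec1, int[] vec2) {
        double sum = 0;
        for (int i = 0; i < vec1.length; i++) {
            sum = sum + (double) vec1[i] * vec2[i];
        }
        return sum;
    }
}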
Tuesday, March 26, 2013
Cosine similarity and vector space model
Vector space model: The representation of a set of documents as vectors in a common vector space is known as the vector space model. It is fundamental to a host of information retrieval operations, including scoring documents on a query, document classification, and document clustering.
The set of documents in a collection may then be viewed as a set of vectors in a vector space in which there is one axis for each term.
Cosine similarity: There are different approaches to quantifying the similarity between two documents in the vector space. The first is to take the magnitude of the difference between the document vectors. Its drawback is that if one document is much longer than the other, the difference will be large even though the two have similar content.
To overcome this drawback, the second approach is to compute the cosine similarity, which compensates for document length.
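The cosine of the angle between two vectors can be derived from the Euclidean dot product formula:
$$a \cdot b = \|a\| \, \|b\| \cos\theta$$
so the cosine similarity of two documents d1 and d2 is
$$\text{cos\_sim}(d_1, d_2) = \frac{v(d_1) \cdot v(d_2)}{\|v(d_1)\| \, \|v(d_2)\|}$$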
For text matching, the vectors a and b are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison.
In information retrieval, the cosine similarity of two documents ranges from 0 to 1, since the term frequencies (or tf-idf weights) cannot be negative.
Let's consider the following example:
Document1
- Gilbert: 3
- Hurricane: 2
- Rains: 1
- Storm: 2
- Winds: 2
Document2
- Gilbert: 2
- Hurricane: 1
- Rains: 0
- Storm: 1
- Winds: 2
We want to know how similar these documents are, purely in terms of word counts (ignoring word order).
The two vectors are:
a: [3,2,1,2,2]
b: [2,1,0,1,2]
The cosine of the angle between them is about 0.9439.
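The value quoted above can be checked by hand with the dot product formula. The arithmetic below uses only the two vectors a and b from the example; the rounding to four decimal places is mine.

```latex
\begin{aligned}
a \cdot b &= 3\cdot 2 + 2\cdot 1 + 1\cdot 0 + 2\cdot 1 + 2\cdot 2 = 14 \\
\|a\| &= \sqrt{9 + 4 + 1 + 4 + 4} = \sqrt{22} \approx 4.6904 \\
\|b\| &= \sqrt{4 + 1 + 0 + 1 + 4} = \sqrt{10} \approx 3.1623 \\
\cos\theta &= \frac{14}{\sqrt{22}\,\sqrt{10}} = \frac{14}{\sqrt{220}} \approx 0.9439
\end{aligned}
```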
By measuring the angle between the vectors, we get a good idea of their similarity. Taking the cosine of this angle makes things even easier: it gives a value from 0 to 1 (or from -1 to 1 when vector components can be negative) that indicates this similarity. The smaller the angle, the larger the cosine (closer to 1) and the greater the similarity.
This gives us a similarity metric in which higher values mean more similar and lower values mean less similar. Therefore, if we compute the cosine similarity between the query vector and each document vector, sort the documents in descending order of similarity, and select the top ones, we obtain a ranked list of documents relevant to the query.
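The ranking step described above can be sketched as follows. This is a minimal illustration, not the author's code: the class name RankByCosine, the method names, and the sample query and document vectors are all assumptions made for the example, and an all-zero vector is treated as having similarity 0.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RankByCosine {
    // Cosine similarity of two equal-length term-frequency vectors.
    static double cosine(int[] a, int[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Assumption: an all-zero vector is treated as similar to nothing.
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns document indices ordered from most to least similar to the query.
    static List<Integer> rank(int[] query, int[][] docs) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble((Integer i) -> cosine(query, docs[i])).reversed());
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical query and documents over the same five terms.
        int[] query = {2, 1, 0, 1, 2};
        int[][] docs = {
            {3, 2, 1, 2, 2},
            {0, 0, 4, 0, 0},
            {2, 1, 0, 1, 2}
        };
        for (int i : rank(query, docs)) {
            System.out.println("doc " + i + " similarity=" + cosine(query, docs[i]));
        }
    }
}
```

With these sample vectors, the document identical to the query comes first, and the document that shares no terms with it comes last.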
Jaccard coefficient and implementation
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets.
The Jaccard coefficient measures the similarity between two sample sets and is defined as the size of their intersection divided by the size of their union:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Here is a simple implementation of the Jaccard coefficient in Java. It compares two words by the sets of their character trigrams.
--------------------------------------------------------------------
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;
import java.util.Set;

class jaccard {
    public static void main(String[] args) {
        Scanner scan = new Scanner(System.in);
        System.out.println("Enter 1st word ");
        String s1 = scan.next();
        System.out.println("Enter 2nd word ");
        String s2 = scan.next();
        scan.close();
        jaccard_coefficient(s1, s2);
    }

    private static void jaccard_coefficient(String s1, String s2) {
        // Pad each word with '$' so its first and last letters also form trigrams.
        List<String> j1 = trigrams("$" + s1 + "$");
        List<String> j2 = trigrams("$" + s2 + "$");
        print_trigrams(j1);
        print_trigrams(j2);

        // Union: every distinct trigram from either word.
        Set<String> union = new HashSet<String>(j1);
        union.addAll(j2);
        // Intersection: trigrams that occur in both words.
        Set<String> intersection = new HashSet<String>(j1);
        intersection.retainAll(j2);

        System.out.println("Union=" + union.size());
        System.out.println("Intersection=" + intersection.size());
        double j_coefficient = ((double) intersection.size()) / ((double) union.size());
        System.out.println("Jaccard coefficient=" + j_coefficient);
    }

    // Returns all substrings of length 3, in order of occurrence.
    private static List<String> trigrams(String s) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    private static void print_trigrams(List<String> grams) {
        for (String element : grams) {
            System.out.print(element + " ");
        }
        System.out.println();
    }
}
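As a worked example of what the program above computes, suppose the two input words are night and nacht (hypothetical inputs chosen for illustration). After padding with $, each word yields five trigrams, of which only one is shared:

```latex
\begin{aligned}
A &= \{\texttt{\$ni},\ \texttt{nig},\ \texttt{igh},\ \texttt{ght},\ \texttt{ht\$}\} \\
B &= \{\texttt{\$na},\ \texttt{nac},\ \texttt{ach},\ \texttt{cht},\ \texttt{ht\$}\} \\
|A \cap B| &= 1, \qquad |A \cup B| = 9 \\
J(A, B) &= \frac{|A \cap B|}{|A \cup B|} = \frac{1}{9} \approx 0.111
\end{aligned}
```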
Saturday, February 9, 2013
How to execute xquery files ? (A simple approach)
First, install the Saxon XQuery processor from the Ubuntu Software Center or the Synaptic package manager.
You need to install two separate packages:
- Saxon XSLT Processor (libsaxon-java)
- Saxon-B XSLT Processor (libsaxonb-java)
- After installing both packages, open a terminal (shortcut: Ctrl+Alt+T).
- Change to the directory that contains your XML and XQuery (.xqy) files.
- To run an XQuery file, run the command below in the terminal with your sample.xqy (XQuery file).
thank you for reading
cheers !!!
sGk
Friday, February 8, 2013
How to validate xml file with xml schema ?
You can validate an XML file against an XML schema in many ways, but on a Linux system one simple way is to use xmllint.
If you don't have xmllint, install it from the terminal (on Ubuntu it is provided by the libxml2-utils package):
$ sudo apt-get install libxml2-utils
To validate an XML file called test.xml against the corresponding XML schema file test.xsd, type the following in the terminal:
$ xmllint --noout --schema test.xsd test.xml
Output:
test.xml validates
thank you for reading
cheers !!!
sGk
Tuesday, February 5, 2013
How to record your session in linux ?
You can record your session in Linux using the "script" command.
It records the terminal session in a file, which is useful for keeping everything shown on the terminal: the commands you type, their output, and error messages.
$ script
Script started, file is typescript
$
.
.
.
Finally, you can end the recording by entering "exit":
$ exit
Script done, file is typescript
$
Now you can view this file using the cat command:
$ cat typescript
Each time you start a session with script, it overwrites the file from the previous session. To append to the existing file instead, use:
$ script -a
thank you for reading !!!
cheers !
sGk
Tuesday, January 29, 2013
Sample Shell Scripts
To output the successive integers 1 to 2i to stdout.
#!/bin/bash
echo -n "Enter input i : "
read i
i=`expr 2 \* $i`
for j in `seq 1 $i`
do
echo "$j"
done
The first line gives the path to the bash shell. Bash, which stands for Bourne-again shell, is a Unix shell: a command processor that executes scripts. The second and third lines take input from the user: echo displays the message on stdout, and read reads the user's input. The fourth line doubles the value with expr. To output the successive integers 1 to 2i to stdout, a for loop with the seq command is used; seq prints a sequence of numbers.
--------------------------------------------------------------------------------------------------------------------------------------
List all executable files in the current directory.
#!/bin/bash
# Keep only regular files whose owner execute bit is set
ls -l | grep -- '^-..x'
To list all executable files in the current directory, the grep command is used. First, ls -l lists all the files in the current directory. The output of ls is passed as input to grep through a pipe. grep keeps only the lines for regular files whose owner execute bit is set and filters out the rest.
--------------------------------------------------------------------------------------------------------------------------------------
To modify the file name with the current date.
#!/bin/bash
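# Current date as YYYYMMDD
d=$(date +%Y%m%d)
# Split the file name given as $1 into its name and extension
str1=$(echo "$1" | awk -F"." '{print $1}')
str2=$(echo "$1" | awk -F"." '{print $2}')
name="$str1-$d.$str2"
mv "$1" "$name"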
The first line gives the path to the bash shell. The second line gets the current date in the specified format (YYYYMMDD). The awk command splits the file name, passed as the first argument, into its name and extension. These are then concatenated with the date in the form name-date.extension, and mv renames the file accordingly.
--------------------------------------------------------------------------------------------------------------------------------------
To replace a word with another word in the file.
#!/bin/bash
touch "$1"
string="horse is an animal,horse runs very fast"
echo "$string" > "$1"
sed -i 's/horse/elephant/g' "$1"
touch -t 201301011200.00 "$1"
The first line gives the path to the bash shell. The touch command creates a new file; the file name is passed as a command-line argument. Some data is then written to the file. The sed command replaces every occurrence of the word horse with elephant. Finally, touch -t sets the file's access and modification times (here to 12:00:00 on 1 January 2013).
Monday, January 28, 2013
How to use gdb debugger ? (Debugging with GDB)
Steps to use gdb :
1) Compile your source code (xyz.c) with the -g option as follows:
$ gcc -g xyz.c
2) Load the resulting a.out executable into gdb as follows:
$ gdb ./a.out
3) You will get the gdb command-line environment. Before running the program, set a breakpoint where you want debugging to start:
(gdb) b xyz.c:15
Here 15 is the line number in your source file.
4) After setting the breakpoint, run the program as follows:
(gdb) r
If the program needs command-line arguments, enter them after r:
(gdb) r arg1 arg2
5) The program now starts executing and stops at the given breakpoint.
6) Next, you can run the program line by line as follows:
(gdb) n
or
(gdb) s
Both n (next) and s (step) execute the next line, but s also steps into function calls, so use s if you want to trace the program completely.
7) Whenever you get a segmentation fault, you can trace where it came from with the "bt" (backtrace) command:
(gdb) bt
8) You can also print the value of any variable at any stage using the "p" (print) command. For example, if you have a variable named 'i':
(gdb) p i
or
(gdb) print i
There are many more commands you can try in the same way. For more information, refer to the gdb man page:
$ man gdb
cheers !
sGk
Sunday, January 27, 2013
linux basic commands
Explore Linux commands
1) To get full hardware configuration details and information about each component, enter
$ sudo lshw
2) To see your RAM usage (in megabytes)
$ free -m
3) To see system information such as the kernel version and machine architecture
$ uname -a
4) To see the CPU architecture
$ lscpu
5) To unpack a tar.gz file
$ tar -xvf filename.tar.gz
6) To install a .deb (Debian package) file
$ sudo dpkg -i filename.deb