SentiSense is a project Daniel Jiang and I built for CalHacks 3.0. SentiSense takes in a body of text, such as input from an Android or iPhone keyboard or all of a user's outgoing emails, and uses it to predict the sentence you may want to write based on the beginning of that sentence.
SentiSense was a pretty fun project to work on. Daniel and I settled on the idea and thought through the functionality and libraries during the week before, but all of the implementation was done in a little over 24 hours. We were moving very quickly. Figuring out how to apply scikit-learn's machine learning was the most fun part. With some careful thought about structure, we were able to get impressive results fairly quickly with minimal tuning.
Representing Sentences as Data
The first challenge we faced was how to represent sentences as data for the machine learning. After looking around, our original plan was to use the bag-of-words method, which only counts how often each word occurs. N-grams (runs of n consecutive words) turned out to be a better option because they retain the order of words within sentences.
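To illustrate the difference, here is a minimal sketch in plain Python. It is not our hackathon code, and the helper name word_ngrams is made up for this example. It shows two sentences that look identical as bags of words but differ once you look at bigrams.

```python
from collections import Counter


def word_ngrams(sentence, n):
    """Return the list of n-word tuples that appear in order in the sentence."""
    words = sentence.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


a = "dog bites man"
b = "man bites dog"

# Bag of words: only word counts survive, so the two sentences look the same.
bag_a = Counter(a.split())
bag_b = Counter(b.split())
print(bag_a == bag_b)  # True

# Bigrams: word order is kept, so the two sentences differ.
print(word_ngrams(a, 2))  # [('dog', 'bites'), ('bites', 'man')]
print(word_ngrams(b, 2))  # [('man', 'bites'), ('bites', 'dog')]
```

In practice a vectorizer would turn these n-grams into count columns, but the idea is the same: each feature covers a short run of words rather than a single word.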
Determining a Machine Learning Algorithm
Next, we had to decide what type of machine learning to use. We chose the scikit-learn library because it is both beginner-friendly and extensive. Our goal was to identify similar sentences that appear often, since these sentences would make good candidates for suggestions. Thus, we were looking for clustering, a form of unsupervised learning. scikit-learn offers multiple methods for clustering, displayed below.
Chart produced by scikit-learn
The chart above shows the results of using each method on a variety of different 2-dimensional data sets. Our data most resembled the third row. The vectorized sentences would be tightly clustered if they were similar, and otherwise relatively distant from other sentences. Only three methods correctly identified three distinct clusters.
We used DBSCAN (density-based spatial clustering of applications with noise). The algorithm is most appropriate for data in high-density clusters separated by areas of low density. DBSCAN is also able to determine the number of clusters in the data by itself. An additional perk is that DBSCAN had the fastest runtime of the three methods on the example dataset. With it, we were able to successfully identify groups of similar sentences.
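The clustering step can be sketched roughly as follows. This is a small illustration rather than our actual hackathon code: the sample sentences, the unigram-plus-bigram CountVectorizer setup, the cosine metric, and the eps and min_samples values are all assumptions chosen to make the toy example work.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import CountVectorizer

# Toy input: four similar sentences and two unrelated ones (made up for this sketch).
sentences = [
    "I'll see you at 2.",
    "I'll see you at two.",
    "I'll see you at three o'clock.",
    "I'll see you later.",
    "The quarterly report is attached.",
    "Please buy milk tomorrow.",
]

# Represent each sentence by counts of its unigrams and bigrams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences).toarray()

# eps and min_samples are illustrative guesses, not tuned values.
# Cosine distance is an assumption; it keeps sentence length from dominating.
clustering = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit(X)
labels = clustering.labels_

# Sentences in the same group share a label; -1 marks noise (no group).
for label, sentence in zip(labels, sentences):
    print(label, sentence)
```

Note that the number of clusters is never passed in. DBSCAN grows clusters from dense regions and leaves isolated sentences labeled as noise, which suits a message history where most sentences are one-offs.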
One of the unexpected challenges of the project was identifying the structure shared by the sentences within one group. There were two main types of groups. Static groups were composed of identical sentences that appeared multiple times. The shared structure was trivial to identify here.
Dynamic groups were composed of sentences that contained variation. Here’s an example that our clustering was able to identify:
I'll see you at 2. I'll see you at two. I'll see you at three o'clock. I'll see you at the end. I'll see you later.
These sentences all had different endings, but a different sentence group could have had different beginnings or middles. Sentences could also have had multiple segments that changed.
Determining the common sentence structure between these sentences was a much more difficult task than expected. During the hackathon, I spent too much time on this problem considering all the possible edge cases. In the end, we decided to concentrate on the most likely scenarios and implement a solution that pulled from the common order and common words between sentences.
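One way to approximate "common words in a common order" is sketched below. It is a stand-in built on Python's difflib, not the solution we wrote at the hackathon, and the function names common_in_order and template are hypothetical. SequenceMatcher is a heuristic, so it does not guarantee the longest possible common subsequence.

```python
from difflib import SequenceMatcher
from functools import reduce


def common_in_order(a, b):
    """Keep the words of a that SequenceMatcher aligns with b, in order."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    common = []
    for block in matcher.get_matching_blocks():
        common.extend(a[block.a:block.a + block.size])
    return common


def template(sentences):
    """Fold the pairwise comparison over a whole group of sentences."""
    token_lists = [s.rstrip(".").split() for s in sentences]
    return reduce(common_in_order, token_lists)


group = [
    "I'll see you at 2.",
    "I'll see you at two.",
    "I'll see you at three o'clock.",
    "I'll see you at the end.",
    "I'll see you later.",
]

print(template(group))  # ["I'll", 'see', 'you']
```

For the example group above, the shared skeleton is "I'll see you", because the last sentence drops "at". This sketch ignores the edge cases mentioned earlier, such as groups where several separate segments vary.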
We wrapped up our template sentences in a Flask server. This let us make requests containing the start of a sentence and receive a suggestion with a possible ending for it.
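A server along these lines could look like the sketch below. The /suggest route, the start query parameter, the TEMPLATES list, and the simple prefix matching are all assumptions made for illustration; our real endpoint and matching logic may have differed.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical template sentences; in the real project these came from clustering.
TEMPLATES = [
    "I'll see you later",
    "Thanks for your help",
]


def suggest(start):
    """Return the first template that begins with the given text, or None."""
    start_lower = start.lower()
    if not start_lower:
        return None
    for sentence in TEMPLATES:
        if sentence.lower().startswith(start_lower):
            return sentence
    return None


@app.route("/suggest")
def suggest_route():
    # Read the beginning of the sentence from the query string.
    start = request.args.get("start", "")
    return jsonify({"start": start, "suggestion": suggest(start)})
```

With the server running, a request to /suggest with start set to "I'll see" would return JSON whose suggestion field is "I'll see you later", and a start that matches no template would return a null suggestion.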