Showing posts with label Elasticsearch. Show all posts

Thursday, January 28, 2016

NoSQL Confession: I miss “joins”

I have been working with ElasticSearch and Couchbase this past year and a half, and I have a confession to make. Yes, it has been amazing to see how to build a flexible, scalable and fast database that integrates sophisticated text searches with more traditional value-matching queries. At Berkery Noyes, we have built, on NoSQL technology, an amazing tool to sift through millions of records of business intelligence, including web visits, landing pages, emails, phone calls, merger and acquisition activity and even changes to company personnel. We use this tool to focus our efforts, and preliminary usage shows that it can identify solid leads. But…


To do this, we have needed to de-normalize a large amount of data into our data documents. The NoSQL “join” features in both ElasticSearch and Couchbase are still too primitive to use effectively. I keep hitting problems like the inability to sort on “joined” documents, or severe latencies in updates to indices. On top of that, there is the headache of ensuring that de-normalized data is up to date. Oh for the old days of Primary Keys and Foreign Keys. However, for us, there is no turning back: the new power that our search app has is so useful that we deal with it. I guess I miss the old days of 2013 when we could have all our data tied up in a neat algebraic package on a SQL database. We live in a world that can be a little messy, so perhaps our data should reflect that.
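To make the de-normalization headache concrete, here is a minimal sketch of what keeping copied data up to date involves. It uses plain in-memory dictionaries instead of a real Couchbase or ElasticSearch client, and every name in it (`companies`, `contacts`, `rename_company`, the field names) is hypothetical, not taken from our actual system.

```python
# Hypothetical stand-ins for two document types. A real system would read
# and write these through the database's own client library.
companies = {
    "c1": {"name": "Acme Corp"},
    "c2": {"name": "Other Inc"},
}

# Each contact carries a COPY of its company's name, so that searching and
# sorting by company name needs no join.
contacts = {
    "p1": {"name": "Jane Doe", "company_id": "c1", "company_name": "Acme Corp"},
    "p2": {"name": "John Roe", "company_id": "c1", "company_name": "Acme Corp"},
    "p3": {"name": "Ann Poe", "company_id": "c2", "company_name": "Other Inc"},
}


def rename_company(company_id, new_name):
    """Update the source record, then every document holding a copy of it."""
    companies[company_id]["name"] = new_name
    updated = []
    for contact_id, doc in contacts.items():
        if doc["company_id"] == company_id:
            doc["company_name"] = new_name  # the step a foreign key made unnecessary
            updated.append(contact_id)
    return sorted(updated)
```

In a SQL database the rename is a single-row update and the join does the rest. Here the same change fans out to every document that holds a copy, and forgetting one of those copies is exactly how de-normalized data goes stale.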

Monday, June 15, 2015

NoSQL - Open source equals more innovation?

As I have moved my team to using more NoSQL data stores, I found an interesting side topic to the SQL vs. NoSQL issues. The vast majority of NoSQL technology is open source. Open source has been around for a while: it got its start in 1985 with the Free Software Foundation and is now fostered by the Open Source Initiative.

But we can look at the roots of the concept going back to the Founding Fathers with the creation of the U.S. patent system, which may seem to be anathema to our current views of open source. At that point, the U.S. government wanted to spur technical innovation by allowing inventors to share their inventions and in return have the ability to license their shared inventions for a limited time (currently 20 years, which is forever in software time frames). After that time, the invention enters the public domain. The alternative to a patent was to keep the invention secret and not share it at all. This is what Coca-Cola did, and it is generally called a trade secret.

So it is interesting that patents have recently come to be seen as inhibiting innovation. NoSQL has been the most significant recent innovation in data technology, and almost all the main players are open source: MongoDB, Couchbase, CouchDB, Cassandra, Redis, Elasticsearch, Lucene, Hadoop... Here is one list from last year.

So what is it that makes open source drive innovation? I found my entry into this world quite intriguing. I jumped on the forums to learn about the new software I was testing. Soon I found I was poking around GitHub to understand how the software worked. I noted that one of the software packages did not work with the JVM I was using, so I tweaked some Java code (I am not a Java programmer). This tweak was then submitted on GitHub, where the moderators incorporated it into the next release. Similarly, for another product, I found I needed some connection pooling; working with the community, we came up with C# code to pool the connections, and those modifications were rolled into the next release.

I really did not spend that much time on these code changes, but open source took my minor efforts and improved the overall code base in a way that proprietary code would never have done, and if thousands of programmers are doing this, you can see how powerful open source is in driving innovation and software development.

As a side note, I did just get a patent grant on some 3D integral photography software I created with a friend. It took almost 5 years, and in those 5 years a lot of software innovation has occurred!



Saturday, June 13, 2015

NoSQL Schema Design: "The Questions you have" vs. "The Answers I can give"

Migrating to NoSQL from SQL has been a real, continuing challenge for us. SQL third normal form provided a very clear set of rules on how to design databases. I had long mastered how to design SQL schemas and had become fairly adept at making indexes to optimize queries. When these tools seemed to fail, I would build some selective de-normalizations into our databases. However, with NoSQL we are given the choice of a flexible schema, and anything goes.

Where to start? When given a complete blank slate with no fixed rules or methodologies, I was not sure what to do. Fortunately, I had already moved our application architecture to a Service Oriented Architecture and my APIs already generated structured objects which could be easily ported to our new Couchbase NoSQL database, and that's what we did. So now we were finally dipping our toes into the NoSQL world. We then attached our Couchbase cluster to an ElasticSearch cluster, and then we could text search absolutely everything in our universe of data with one simple query. It was amazing. We could search all the people, companies, deals, emails, telephone numbers, notes, phone logs, email logs, street addresses, web addresses...everything with just one simple query. In SQL, this would be enormously difficult. You would have to create text indexes on all the numerous tables and then write a very complex union query to look at all the tables, and then who knew what the performance would be like. My users found this new ability interesting, but then I got these questions...

  • I would like to see only my stuff, or what my co-workers have
  • I would like to see stuff from a certain time period
  • I would like to see only companies within a certain revenue range or other criteria
  • I would like to see counts of what I want before I see the results, so that if there isn't anything there I don't have to bother looking at the results
  • I spoke to someone but I can't remember their name, but we talked about "blah blah" in the last year or maybe two years ago
I then realized that the object models I had used from my API were not going to work. They were really based on doing CRUD (Create, Read, Update, Delete) operations on a SQL data store, and I had to change the model.
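The questions in the list above map fairly directly onto a filtered search with counts. As a rough sketch, here is how such a request body might be assembled for a recent version of the Elasticsearch Query DSL (a `bool` query with `filter` clauses, plus a `terms` aggregation). The function name and all field names (`owner`, `created`, `revenue`, `doc_type`) are assumptions made up for illustration; they are not our real mapping, and the code only builds a dictionary rather than calling a cluster.

```python
def build_search_body(text, owner=None, date_from=None, date_to=None,
                      revenue_min=None, revenue_max=None, count_only=False):
    """Build a search request body from free text plus optional filters.

    All field names are hypothetical; adjust them to the actual mapping.
    """
    filters = []

    # "only my stuff, or what my co-workers have"
    if owner is not None:
        filters.append({"term": {"owner": owner}})

    # "stuff from a certain time period"
    date_range = {}
    if date_from is not None:
        date_range["gte"] = date_from
    if date_to is not None:
        date_range["lte"] = date_to
    if date_range:
        filters.append({"range": {"created": date_range}})

    # "companies within a certain revenue range"
    revenue_range = {}
    if revenue_min is not None:
        revenue_range["gte"] = revenue_min
    if revenue_max is not None:
        revenue_range["lte"] = revenue_max
    if revenue_range:
        filters.append({"range": {"revenue": revenue_range}})

    body = {
        "query": {
            "bool": {
                # free-text part: 'we talked about "blah blah"'
                "must": [{"simple_query_string": {"query": text}}],
                "filter": filters,
            }
        },
        # "counts of what I want before I see the results"
        "aggs": {"by_type": {"terms": {"field": "doc_type"}}},
    }
    if count_only:
        body["size"] = 0  # return totals and aggregation buckets, no hits
    return body
```

For example, `build_search_body("blah blah", owner="me", date_from="2014-01-01", count_only=True)` produces a body that asks only for counts per document type. Note that every filter requires the field to exist on the document being searched, which is why the CRUD-shaped objects were not enough.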

I read online extensively. I posted on forums asking how to design a schema for NoSQL. I really did not see anything as definitive as third normal form. Finally, I saw on Stack Overflow an idea which cleared up everything. I paraphrase the idea:
With SQL databases you design your schema based on the answers you have, and with NoSQL databases you design your schema based on the questions you have.
I wish I could credit where I saw this idea, but it cleared everything up for me. Searching for the third normal form equivalent in NoSQL was pointless. Those rules were built to optimize space and simplify queries. They were based on relational algebra, and they were perfect for a world of smaller, clearly defined data. However, in this new world the questions became more important than the answers. The users expected information to be more loosely correlated, and they wanted their simple questions to be answered simply and quickly even though the underlying information can be quite complex. And so I begin now to restructure my schemas based not on the information I have but on the questions my users have. Wish me luck.
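As an illustration of designing from the question rather than from the data, take the user question "I spoke to someone but I can't remember their name, but we talked about blah blah." A minimal sketch follows, with entirely hypothetical tables and field names: the normalized rows are flattened into one "conversation" document that carries everything that question needs to filter and search on.

```python
# Hypothetical normalized source data, shaped like SQL tables.
people = {1: {"name": "Jane Doe", "company_id": 10}}
companies = {10: {"name": "Acme Corp", "revenue": 5_000_000}}
call_logs = [
    {"person_id": 1, "owner": "alice", "date": "2015-03-02",
     "notes": "discussed blah blah"},
]


def build_conversation_doc(log):
    """Flatten one call log into a document shaped by the user's question."""
    person = people[log["person_id"]]
    company = companies[person["company_id"]]
    return {
        "doc_type": "conversation",
        "owner": log["owner"],                  # "only my stuff"
        "date": log["date"],                    # "in the last year or maybe two"
        "notes": log["notes"],                  # 'we talked about "blah blah"'
        "person_name": person["name"],          # the name the user forgot
        "company_name": company["name"],
        "company_revenue": company["revenue"],  # "within a certain revenue range"
    }
```

The resulting document repeats data that third normal form would keep in three tables, but a single search over `notes`, filtered by `owner` and `date`, now answers the question directly.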

Friday, June 12, 2015

Ventures into NoSQL

Last November, my team was sitting around our conference table trying to figure out how to take our search function to the next level. We wanted to build a mini-Google to effectively search and mine our data in any way possible. We had millions of records of data covering over 100,000 companies in our sectors.

At that point, we had already started to migrate to a Service Oriented Architecture, and so we wanted the supreme search service. Also at that point, all our data was stored in an MS SQL Server database which had served us well for 7 years. This server was starting to slow down noticeably, so we were ready to migrate to at least a newer version of SQL Server.

Back to our meeting: one of our guys said, "How does Google do it?" I said that they have a massive index stored on a grid of servers distributed all over the world. At that point, we realized we had to build a massive index of our data, on a much smaller scale, and it seemed that SQL was not the appropriate tool. So began our dive into the world of NoSQL.

There is a bewildering array of technologies with key-value stores, document stores, wide column stores, graph databases, as well as several indexing engines based on Apache Lucene.

At first we looked at graph databases, since our data tracked companies involved with mergers and acquisitions, where companies folded into other companies and then spun off again. In addition, people were often serial entrepreneurs, or they would be hired CEOs who would prepare companies for sale. A social graph of this world seemed appropriate. However, graph databases seemed to be highly tuned for finding many levels of relationships, and it was not so obvious how they could tackle text searching and other types of searches.

Next we had a consultant build a small application in NodeJS using MongoDB in the cloud as a data store. MongoDB has the best reputation as a document store, and I knew a few others who were using it with their projects. However, we wanted to test it out, and our aged Windows infrastructure at that point could only support Couchbase. So we figured that with Couchbase we could at least get a proof of concept up and running in our current environment. In addition, Couchbase had a plugin to connect it to ElasticSearch, which is a really powerful text and data indexing engine based on Lucene. And so we started...