Thursday, January 28, 2016

NoSQL Confession: I miss “joins”

I have been working with ElasticSearch and Couchbase this past year and a half, and I have a confession to make. Yes, it has been amazing to see how to build a flexible, scalable and fast database that integrates sophisticated text searches with more traditional value-matching queries. At Berkery Noyes, we have built, on NoSQL technology, an amazing tool to sift through millions of records of business intelligence including web visits, landing pages, emails, phone calls, merger and acquisition activity and even changes to company personnel. We use this tool to focus our efforts, and preliminary usage shows the search tool to be able to identify solid leads. But…


To do this, we have needed to de-normalize a large amount of data into our data documents. The NoSQL “join” features in both ElasticSearch and Couchbase are still too primitive to use effectively. I keep hitting problems like the inability to sort on “joined” documents, or severe latencies in updates to indices. There is also the headache of ensuring that de-normalized data is up to date. Oh for the old days of Primary Keys and Foreign Keys. However, for us, there is no turning back; the new power that our search app has is so useful that we deal with it. I guess I miss the old days of 2013 when we could have all our data tied up in a neat algebraic package on a SQL database. We live in a world that can be a little messy, so perhaps our data should reflect that.
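To make the trade-off concrete, here is a minimal Python sketch that uses plain dictionaries in place of real ElasticSearch or Couchbase documents. The company and contact records, the field names and the `denormalize` helper are all hypothetical and not taken from our application. The point is only that copying the company name into each contact makes sorting trivial, at the cost of a second copy that has to be kept up to date.

```python
# Hypothetical records; plain dicts stand in for stored documents.
companies = {
    "c1": {"name": "Zenith Media"},
    "c2": {"name": "Acme Publishing"},
}
contacts = [
    {"id": "p1", "person": "Ann", "company_id": "c1"},
    {"id": "p2", "person": "Bob", "company_id": "c2"},
]


def denormalize(contact, companies):
    """Return a copy of the contact with the company name embedded."""
    doc = dict(contact)
    doc["company_name"] = companies[contact["company_id"]]["name"]
    return doc


# Each document now carries its own copy of the company name,
# so sorting needs no join at query time.
docs = [denormalize(c, companies) for c in contacts]
by_company = sorted(docs, key=lambda d: d["company_name"])
print([d["person"] for d in by_company])
```

If a company is later renamed, every contact document that embeds the old name has to be rewritten, which is exactly the headache described above.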

Friday, July 31, 2015

Control Z in a NoSQL world

I was going on a trip with my 13-year-old son, and out of the blue, he said "What if you can hit 'Command Z' in the real world?" "What?" "What if you spilled your drink and you hit 'Command Z' and it never happened?" Silence, indicating my mind blowing. As a millennial raised in the age of the Internet and cloud computing, he was wondering if our real world actions can be as transient as our virtual world. In the laws of physics, entropy says that energy and structure will break down, and processes tend not to be reversible. But have we defied the laws of physics in our virtual world? I remember how awesome it was to use a word processor and learn how to fix typos. In high school, typos were fixed with correction tape and there was always a little smudge to indicate that a mistake was made. Meanwhile, 'Control Z' just makes your actions go away as if they never happened. My son's comments made me wonder whether I wanted to 'Control Z' anything I have done in my life. The reality is that, like everyone, I have regretted actions and decisions I have made and the resulting consequences. But these same actions may have also had some wonderful consequences, or at least lessons. So I don't think I would have used 'Control Z'. I would live with the smudges in my life.

In our virtual world, 'Control Z' is not as powerful as it seems. Our email is archived and backed up to other systems. Gmail recently came out with a message unsend feature, but this only works for a limited period after hitting the send button. Facebook posts can be shared and propagated pretty soon after you have posted, and even if you retract a post, it lives on in some archive.

What does this mean in a NoSQL world? NoSQL databases are powering the internet these days, and though it is not required, these data stores often contain denormalized data, where the same fact is stored in multiple places. For instance, a user name may be stored in a user profile and also in all that user's posts, comments and likes. Let's say a user gets married and, as a traditional person, she changes her last name. Then all her posts, comments and likes need to be updated. SQL databases usually solve this problem by storing the name once and using an immutable primary key for the user that is stored in the posts, comments and likes. Then the two tables are joined to show the name. NoSQL usually does not join its objects, hence the need to denormalize or repeat the data. This makes 'Control Z' much harder and reminds us that we do live in a world where actions leave a mark and those marks are hard to erase, and we have to live with the smudges that life gives us.
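The name change can be sketched in a few lines of Python. This is only an illustration under assumptions: the table layout, the document fields and the `rename_user` helper are made up for the example, SQLite stands in for any SQL database, and a list of dictionaries stands in for a document store.

```python
import sqlite3

# Normalized (SQL) version: the name is stored once and joined in at read time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        user_id INTEGER REFERENCES users(id),
                        body TEXT);
    INSERT INTO users VALUES (1, 'Jane Smith');
    INSERT INTO posts VALUES (10, 1, 'Hello'), (11, 1, 'Big news');
""")
# The rename touches exactly one row.
conn.execute("UPDATE users SET name = 'Jane Jones' WHERE id = 1")
sql_names = [row[0] for row in conn.execute(
    "SELECT u.name FROM posts p JOIN users u ON u.id = p.user_id ORDER BY p.id")]

# Denormalized (document) version: every post carries its own copy of the name.
profile = {"user_id": 1, "name": "Jane Smith"}
post_docs = [
    {"post_id": 10, "user_id": 1, "user_name": "Jane Smith", "body": "Hello"},
    {"post_id": 11, "user_id": 1, "user_name": "Jane Smith", "body": "Big news"},
]


def rename_user(profile, docs, new_name):
    """Update the profile and every document holding a copy; return writes made."""
    profile["name"] = new_name
    touched = 1  # the profile itself
    for doc in docs:
        if doc["user_id"] == profile["user_id"]:
            doc["user_name"] = new_name
            touched += 1
    return touched


writes = rename_user(profile, post_docs, "Jane Jones")
print(sql_names, writes)
```

In the SQL version one write is enough and the join picks up the new name everywhere. In the document version the number of writes grows with the number of posts, comments and likes, and any copy that is missed keeps the old name.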

Sunday, July 26, 2015

What is NoSQL?

I recently read Paul Ford's article, "What is code?", and it is a very compelling look into the ideas and culture of software and the people who create it. He describes the religious adherence programmers have for the many competing technologies they use. Corporate vs. Startup, East Coast vs. West Coast, Windows vs. Linux vs. Mac, C vs. Ruby vs. Python vs. PHP vs. C# vs. Java, Django vs. Rails... There is also a bit of a divide between SQL and NoSQL. SQL is more definitive: there is an ANSI standard for SQL (Structured Query Language), schemas are defined, and it is based on relational algebra. NoSQL is not defined. In fact, there isn't even a real tight definition other than NOT SQL.
The site that is top ranked by Google for the word NoSQL, nosql-database.org, has a definition:

Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable.

It is an interesting definition. SQL databases are very elegant and, until the advent of Web 2.0, seemed to handle any data problem you could throw at them. Web 2.0, and specifically social networks, caused data to explode, since this was when the users of a given site were not only consuming (reading or downloading the data) but also contributing to the content of the site. Currently Facebook has over 1.4 billion users and most of those users are posting and creating content, and this was where SQL databases started to fail. Elegant SQL became a bottleneck, and none of the SQL databases out there could scale to what is now called Internet Scale. So new technologies had to be developed to serve data at Internet Scale, and that is pretty much what NoSQL is. Some of the other features are mostly just side effects of the scaling issue.

Interestingly enough, there are cases where SQL databases have been used for Internet Scale sites. A good example is Instagram. In this posting, Instagram co-founder, Mike Krieger, explains how they used PostgreSQL scaled out to handle Instagram data. However, they had to employ a lot of other NoSQL products like Memcached and Redis, and a fair amount of engineering, to get it to work and handle the millions of users creating content every day.

Due to its success with Internet Scale projects, programmers have been using NoSQL for everything, and not necessarily with good results. I know on my recent projects we are starting to experiment with NoSQL to see if we get the speeds and optimizations that we need. I will not pretend that these projects are even approaching Internet Scale, with maybe 150,000 users at best. However, after years of slow and complicated queries and indexes in SQL, we do get the speed we need. Wish me luck.

Wednesday, June 17, 2015

NoSQL- structured vs unstructured what's better?

I recently had a conversation with a friend of mine, Dan Torres, who is one of the most brilliant SQL data gurus I know and has taught me most of the meaningful SQL knowledge I have. We were talking about how I was adopting NoSQL, and he was concerned. He has built some extremely large and meaningful data sets with SQL, and he said he had trouble with the lack of data structure in most applications of NoSQL.

I do agree that with structured data you can get some very meaningful insight out of your data, and that NoSQL does allow for looser data structure. But I think that with NoSQL you can build a structure that evolves as your data evolves. I have more than once found that data modeling decisions I made on day one of a project need to be changed, and depending on what the change is, it can be quite painful to make in a SQL database.
In addition, often your inbound data is not structured and it can be quite challenging to convert that unstructured data into structured data. My team has dedicated data entry people who are looking at RSS feeds and extracting and entering deal information field by field into our database. This is not an automated process, though we have put some automated helpers in place. 
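One way to picture a structure that evolves with the data is a reader that accepts documents written at different stages of a project and upgrades old ones on the fly. The sketch below is a hypothetical illustration in Python, not our actual deal model: the field names, the `schema` version marker and the `upgrade` helper are assumptions made for the example.

```python
# Hypothetical deal documents written at different stages of a project.
deals = [
    # Day-one shape: buyer is a plain string, price is free text.
    {"buyer": "Acme", "price": "12.5"},
    # Later shape: buyer became an object, price a number in USD millions.
    {"schema": 2, "buyer": {"name": "Zenith"}, "price_usd_m": 40.0},
]


def upgrade(doc):
    """Return the document in the newest shape, leaving the input untouched."""
    if doc.get("schema", 1) >= 2:
        return doc
    return {
        "schema": 2,
        "buyer": {"name": doc["buyer"]},
        "price_usd_m": float(doc["price"]),
    }


current = [upgrade(d) for d in deals]
print([d["buyer"]["name"] for d in current])
```

Old and new documents sit side by side in the same store, and the modeling change is absorbed in code at read time. In a SQL database the same change would typically mean altering the table and migrating every existing row before the new shape could be used.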

So part of my exploration of NoSQL is to see how it can help us tease structure and meaning out of unstructured data. We have built semantic technologies to see if we can help our research group classify deals and companies from unstructured text with some success, and I am hoping NoSQL can help us be more successful. 

Looking at Google and Yahoo!, we see they are addressing the same problem at a much greater scale. They are taking all the web pages on the web, which are extremely unstructured, and are gleaning structure and meaning from them so that we can find them in their search applications. And I know that they are not using SQL to structure their data. 

Monday, June 15, 2015

NoSQL - Open source equals more innovation?

As I have moved my team to using more NoSQL data stores, I found an interesting side topic to the SQL vs. NoSQL issues. The vast majority of NoSQL technology is open source. Open source has been around for a while: its roots go back to the founding of the Free Software Foundation in 1985, and it is now fostered by the Open Source Initiative.

But we can look at the roots of the concept going back to the Founding Fathers with the creation of the U.S. Patent system, which may seem to be anathema to our current views of open source. At that point, the U.S. government wanted to spur technical innovation by allowing inventors to share their inventions and in return have the ability to license their shared inventions for a limited time (20 years, which is forever in software time frames). After that time, the invention enters the public domain. The alternative to a patent was to keep the invention secret, and not share it at all. This is what Coca-Cola did, and it is generally called a trade secret.

So it is interesting that patents have recently come to be seen as inhibiting innovation. NoSQL has been the most significant innovation in data technology and almost all the main players are open source: MongoDB, Couchbase, CouchDB, Cassandra, Redis, Elasticsearch, Lucene, Hadoop... Here is one list from last year.

So what is it that makes open source drive innovation? I found my entry into this world quite intriguing. I jumped on the forums to learn about the new software I was testing. Soon I found I was poking around GitHub to understand how the software works. I noted that one of the software packages did not work with the JVM I was using, and I tweaked some Java code (I am not a Java programmer). This tweak was then rolled up into GitHub, where the moderators incorporated it into the next release. Similarly, for another product, I found I needed some connection pooling, and working with the community we came up with C# code to pool the connections. Those modifications were rolled into the next release.

I really did not spend that much time on these code changes, but open source took my minor efforts and improved the overall code base in a way that proprietary code would never have done, and if thousands of programmers are doing this, you can see how powerful open source is in driving innovation and software development.

As a side note, I did just get a patent grant on some 3D integral photography software I created with a friend. It took almost 5 years, and in those 5 years a lot of software innovation has occurred!



Sunday, June 14, 2015

NoSQL - Old Masters Oil painting technique and Software development

Years ago I devoted a substantial amount of time to oil painting. It was a deep dive for 6 years, and I learned a lot about the art of oil painting including some of the old master techniques commonly used by Rembrandt, Vermeer, Goya, Velasquez and more. 

Though I don't use the skills I learned from that period today, I recently realized I use the attitude and approach I learned, especially in working with NoSQL. 

To generalize, the Old Masters technique was to work from thin to fat. The artist would start a painting with very thin paint, using lots of turpentine and medium to thin the color. With the thin paint, the artist would then cover the whole canvas with color but in a very general way, not specifying any details but setting up the top level structure of the composition. Then the artist would wait a day or two for it to dry. Note that a productive studio would have many canvases in process. The painter would then add another layer of paint, a little bit fatter and with less turpentine, defining pictorial elements a little more. Then the painter would let that dry. With each session, the painter would progressively use fatter paint and paint more definitively until the painting was completed.

How is this similar to building a NoSQL application? I would say that with NoSQL you should start your design very broadly and the flexible schema allows you to do so. You can then go fatter and refine and add details as you go because of the flexible schema. The result is a growing and living application that is agile enough to adjust to new needs and ideas. 

In contrast, I would say that SQL design is more akin to ink drawing where you make permanent, definitive and detailed decisions up front. Then you have to live with them for the rest of the life of your application. You could change the design later but not without some heartache. 

So if you are using NoSQL, let's see your masterpiece!

NoSQL Evangelism - Are we drinking the Koolaid?

Recently I went to a workshop to learn more about Neo4j, a NoSQL graph database. I sat at a table with a bunch of other techies, as we heard all about the benefits of graph databases, and how the whole world can be modeled as a set of graph relationships. They say graphs are everywhere. One of my neighbors at the table asked "Have you drunk the Koolaid yet?".

This expression got me thinking about how technology is evangelized. Technology companies need developers and systems engineers to adopt their technology in order to be successful, and they need to convince them to consider their technology. This is especially difficult when developers have already gotten comfortable with other technologies.

I unfortunately remember the origin of the "Koolaid" reference. Back in the 1970's, there were a lot of cults, and at Jonestown, Guyana, more than 900 people from Jim Jones's Peoples Temple died in a mass homicide-suicide from drinking poison-laced Koolaid.

As gruesome as the origin is, we still use the expression today to mean blindly adopting beliefs and taking actions without sensibly acknowledging the consequences. And in the 1970's many institutions had broken down, and many people were looking for new institutions to join to feel community and structure. Unfortunately, some of these new institutions were not so benign.

Back to technology evangelism...

I believe we are in a Golden Age of information technology. The speed at which new technologies are being developed is breathtaking. But how do tech leaders and architects, whose job is to adopt the appropriate new technologies, know which ones to pick? We listen to all the pitches from all the up and coming companies about how great their technologies are, and how they will make the software we are creating even better. But in order to learn if this is all true, we have to invest a fair amount of time and money to determine if such and such technology will work for us and improve our own software. What do we do?

We trust our instincts and drink the Koolaid.