Wednesday, June 17, 2015

NoSQL- structured vs unstructured what's better?

I recently had a conversation with a friend of mine, Dan Torres who is one of the most brilliant SQL data gurus I know and has taught me most of the meaningful SQL knowledge I have. We were talking about how I was adopting NoSQL, and he was concerned. He has built some extremely large and meaningful data sets with SQL and he said he had trouble with the lack of data structure in most applications of NoSQL.

I do agree that with structured data you can get some very meaningful insight out of your data, and that NoSQL does allow for looser data structure. But I think that with NoSQL you can build an evolutionary structure that evolves as your data evolves. I have more than once found that data modeling decisions I have made on day one of a project need to be changed and depending on what the change is, it can be quite painful to change in a SQL database.
In addition, often your inbound data is not structured and it can be quite challenging to convert that unstructured data into structured data. My team has dedicated data entry people who are looking at RSS feeds and extracting and entering deal information field by field into our database. This is not an automated process, though we have put some automated helpers in place. 

So part of my exploration of NoSQL is to see how it can help us tease structure and meaning out of unstructured data. We have built semantic technologies to see if we can help our research group classify deals and companies from unstructured text with some success, and I am hoping NoSQL can help us be more successful. 

Looking at Google and Yahoo!, we see they are addressing the same problem at a much greater scale. They are taking all the web pages on the web, which are extremely unstructured, and are gleaning structure and meaning from them so that we can find them in their search applications. And I know that they are not using SQL to structure their data. 

Monday, June 15, 2015

NoSQL - Open source equals more innovation?

As I have moved my team to using more NoSQL data stores, I found an interesting side topic to the SQL vs. NoSQL issues. The vast majority of NoSQL technology is open source. Open source has been around for a while. Started in 1985 by the Free Software Foundation and now fostered by the Open Source Initiative.

But we can look at the roots of the concept going back to the Founding Fathers with the creation of the U.S. Patent system, which may seem to be an anathema to our current views of open source. At that point, the U.S. government wanted to spur technical innovation, by allowing inventors to share their inventions and in return have the ability to license their shared inventions for a limited time (20 years which is forever in software time frames). After that time, the invention then enters the public domain. The alternative to a patent was to keep the invention secret, and not share it at all. This is what Coca-cola did, and it is generally called a trade secret.

So it is interesting, that recently patents have now taken on the light of inhibiting innovation. NoSQL has been the most significant innovation in data technology and almost all the main players are open source: MongoDB, Couchbase, CouchDB, Cassandra, Redis, Elasticsearch, Lucene, Hadoop... Here is one list from last year.

So what is it that makes open source drive innovation? I found in my entry into the world quite intriguing. I jumped on the forums to learn about how the new software I was testing. Soon I found I was poking around GitHub to understand how the software works. I noted that one of software packages did not work with the JVM I was using, and I tweaked some Java code (I am not a Java programmer). This tweak was then rolled up into GitHub where the moderates incorporated it into the next release. Similarly for another product, I found I needed some connection pooling and working with the community we came up with C# code to pool the connections and those modifications were rolled into the next release.

I really did not spend that much time on these code changes, but open source took my minor efforts and improved the overall code base in a way that proprietary code would never have done, and if thousands of programmers are doing this, you can see how powerful open source is in driving innovation and software development.

As I side note, I did just get a patent grant on some 3D integral photography software, I created with a friend. It took almost 5 years, and in those 5 years a lot software innovation has occurred!



Sunday, June 14, 2015

NoSQL - Old Masters Oil painting technique and Software development

Years ago I devoted a substantial amount of time to oil painting. It was a deep dive for 6 years, and I learned a lot about the art of oil painting including some of the old master techniques commonly used by Rembrandt, Vermeer, Goya, Velasquez and more. 

Though I don't use the skills I learned from that period today, I recently realized I use the attitude and approach I learned, especially in working with NoSQL. 

To generalize, the Old Masters technique was to work from thin to fat. The artist would start a painting using very thin paint using lots a turpentine and medium to thin the color. With the thin paint, the artist would then cover the whole canvas with color but in a very general way, not specifying any details but setting up the top level structure of the composition. Then the artist would wait a day or two for it to dry. Note that a productive studio would have many canvases in process. The painter then would add another layer of paint which was a little bit fatter and less turpentine defining pictorial elements a little more. Then the painter would let that dry. The painter then with each session would progressively paint fatter paint and more definitively until the painting was completed. 

How is this similar to building a NoSQL application? I would say that with NoSQL you should start your design very broadly and the flexible schema allows you to do so. You can then go fatter and refine and add details as you go because of the flexible schema. The result is a growing and living application that is agile enough to adjust to new needs and ideas. 

In contrast, I would say that SQL design is more akin to ink drawing where you make permanent, definitive and detailed decisions up front. Then you have to live with them for the rest of the life of your application. You could change the design later but not without some heartache. 

So if you are using NoSQL, let's see your masterpiece!

NoSQL Evangelism - Are we drinking the Koolaid?

Recently I went to a workshop to learn more about Neo4j, a NoSQL graph database. I sat at table with a bunch of other techies, as we heard all about the benefits of graph databases, and how the whole world can be modeled as a set of graph relationships. They say graphs are everywhere. One of my neighbors at the table asked "Have you drunk the Koolaid yet?".

This expression got me thinking about how technology is evangelized. Technology companies need developers and systems engineers to adopt their technology in order to be successful, and they need to convince them to consider their technology, and it especially difficult when developers have already gotten comfortable with other technologies.

I unfortunately remember the origin of the "Koolaid" reference. Back in the 1970's, there were a lot of cults, and at Jonestown, Guyana, 500 people from Jim Jones People's Temple died in a mass homocide-suicide from drink poison laced Koolaid.

As gruesome as the origin is, we still use the expression today to mean blindly adopting beliefs and taking actions without sensibly acknowledging the consequences. And in the 1970's many institutions had broken down, and many people were looking for new institutions to join to feel community and structure. Unfortunately, some of these new institutions were not so benign.

Back to technology evangelism...

I believe we are in a Golden Age of information technology. The speed at which new technologies are being developed is breathtaking. But, how do tech leaders and architects who's job is to adopt the appropriate new technologies and to know which ones to pick. So, we listen to all the pitches from all the up and coming companies about how great their technologies are, and how they will make the software we are creating even better. But, in order to learn if this is all true, we have to invest a fair amount time and money to determine if such and such technology will work for us, and improve our own software. What do we do?

We trust our instincts and drink the Koolaid.

Saturday, June 13, 2015

NoSQL Schema Design: "The Questions you have" vs. "The Answers I can give"

Migrating to NoSQL from SQL has been a real continuing challenge for us. SQL third normal form provided a very clear set of rules on how to design databases, and I had long mastered how to design SQL schema, and had become fairly adept at making indexes to optimize queries. When these tools seemed to fail then I would build some selective de-normalizations into our databases. However, with NoSQL we are given a choice to have a flexible schema and anything goes.

Where to start? When given a complete blank slate with no fixed rules or methodologies, I was not sure what to do. Fortunately, I had already moved our application architecture to a Service Oriented Architecture and my APIs already generated structured objects which could be easily ported to our new Couchbase NoSQL database, and that's what we did. So now we were finally dipping our toes into the NoSQL world. We then attached our Couchbase cluster to an ElasticSearch cluster, and then we could text search absolutely everything in our universe of data with one simple query. It was amazing. We could search all the people, companies, deals, emails, telephone numbers, notes, phone logs, email logs, street addresses, web addresses...everything with just one simple query. In SQL, this would be enormously difficult. You would have to create text indexes on all the numerous tables and then write a very complex union query to look at all the tables, and then who knew what the performance would be like. My users found this new ability interesting, but then I got these questions...

  • I would like to see only my stuff, or what my co-workers have
  • I would like to see stuff from a certain time period
  • I would like to see only companies within a certain revenue range or other criteria
  • I would like to see counts of what I want before I see the results, so that if there isn't anything there I don't want to bother seeing the results
  • I spoke to someone but I can't remember their name, but we talked about "blah blah" in the last year or maybe two years ago
I then realized that the object models I had used from my API were not going to work, and I realized that these object models were really based on doing CRUD (Create, Read, Update, Delete) operations on a SQL data store, and that I had to change the model.

I read online extensively. I posted on forums asking how to design a schema for NoSQL. I really did not see anything as definitive as third normal form. Finally, I saw on Stack Overflow an idea which cleared up everything. I paraphrase the idea:
With SQL databases you design your schema based on the answers you have, and with NoSQL databases you design your schema based on the questions you have.
I wish I could credit where I saw this idea, but it cleared everything up for me. Searching for the third normal form equivalent in NoSQL was pointless. Those rules were built to optimize space and simplify queries. It was based on linear algebra. It was perfect for a world of smaller clearly defined data. However, in this new world the questions became more important than the answers. The users expected information to be more loosely correlated, and they wanted their simple questions to be answered simply and quickly even though the underlying information can be quite complex. And so I begin now to restructure my schemas based not on the information I have but based on the questions my users have. Wish me luck.

Friday, June 12, 2015

Ventures into NoSQL

Last November, my team was sitting around our conference table trying to figure out how to take our search function to the next level. We wanted to build a mini-Google to effectively search and mine our data in any way possible. We had millions of records of data covering over 100,000 companies in our sectors.

At that point, we had already started to migrate to a Service Oriented Architecture, and so we wanted the supreme search service. Also at that point all our data was stored in a MS SQL server database which had served us well for 7 years. This server was starting to slow down noticeably. So we were ready to migrate to at least a newer version of SQL server. 

Back to our meeting, one of our guys said "how does Google do it?" And I said that they have a massive index stored on a grid of servers distributed all over the world. And at that point, we realized we have to build a massive index of our data on a much smaller scale, and it seemed that SQL was not the appropriate tool. So began our dive into the world of NoSQL. 

There is a bewildering array of technologies with key-value stores, document stores, wide column stores, graph databases, as well as several indexing engines based on Apache Lucene.

 At first we looked at graph databases, since our data tracked companies involved with mergers and acquisitions where companies folded into other companies and then spun off again. In addition, people were often serial entrepreneurs or they would be hired CEOs who would prepare companies for sale. A social graph of this world seemed appropriate. However it seemed that graph databases are highly tuned for find many levels of relationships it was not so obvious how they could tackle text searching and other types of searches.

Next we had a consultant build a small application in NodeJS using MongoDB in the cloud as a data store. MongoDB has the best reputation as a document store, and I knew a few others who were using it with their projects. However, we wanted to test it out, and our aged windows infrastructure at that point could only support Couchbase. So we figured that we could at least get a proof of concept up and running on our current environment. In addition, it had a plugin to connect it to ElasticSearch which is really powerful text and data indexing engine based on Lucene. And so we started...

Thursday, June 11, 2015

Scrum revisionism

On one of the CTO mailing lists I am on, there was a thread on scrum and whether it was just another useless or harmful fad in software development. The seed of this conversation was a post by Michael Church https://michaelochurch.wordpress.com/2015/06/06/why-agile-and-especially-scrum-are-terrible/  which severely critiques scrum and agile methodologies. Some of his points was that agile infantilizes programmers and inhibits their creativity. Timeboxing, sprints also tend to focus on short term goals and deters long visions, and is insulting to senior programmers. 

This conversation led me to review my team's adoption of scrum and agile management. The first thought I had was whether we view agile as a method or as a framework.  Given our team's history, I would say we use it as a framework. Our business has a very horizontal organization more like a law firm, and Church's critique stems from agile being used as a top down management technique. Since we tend to work more as peers, we use agile as a framework to take our mutual long term vision and break it down to manageable pieces to make progress. In fact we have so much creativity going on, we need scrum in order to channel it and prioritize the fire hose of ideas. These are the benefits we are are seeing. Church sees benefits only to management in scrum's ability to track programmers work and hence see if they are slacking. I think Church's real critique is about management attitude and organizational cultures, rather than the Agile methods themselves. 


Thursday, June 4, 2015

Introducing Agile Management in my world

As you may know, agile programming and agile project management methodologies are now quite popular. There are a variety of different methods that have come out in the last ten years. Some of the more popular ones I have looked at include Kanban and Scrum, and some of the associated software products I have seen include Trello, Jira, and Personal Kanban. I have seen some interesting articles on how to use these software management ideas to manage your personal life. See personal kanban 101. In my home, my wife and I have set up our own Kanban board just to manage our personal goals and projects. Someone I know at Nike says the use scrums in their shoe innovation teams.

As CTO at an Investment Bank, I am in charge of all development efforts for new software. Our CIO is a font of ideas who bubbles up features, ideas and projects at an outstanding pace. There was no way to implement all of these ideas. I find myself picking and choosing which of these ideas to implement, since it was impossible to implement them all. Meanwhile the research team using our software had their list of improvements and bugs to be addressed. Until 2013, I found that our system of managing all of these requests was to informal and was not fostering a team approach to setting our technology goals. The CIO was frustrated that I was the dragon at the gate filtering all innovation. The research team in frustration would only submit there most urgent requests and bug reports. In addition, we outsourced some software projects which were late and went over budget. As a result, I was feeling overwhelmed by requests, and felt unable to respond effectively to people who wanted help.

The solution we found was to adopt scrum and use Jira. This allowed us to log all of the technology requests and into a backlog. Then as a team we could review all the requests and set up focused Sprints to address specific problems. Instead of having myself prioritize what needed to be done, the team could set the priorities, with each member advocating for features they wanted.

Some of the hiccups we encountered in this transition included the Jira Software. Though it does effectively cover all you need to run scrums and sprints and to maintain a backlog, some of our users really hated the interface. The heart of Jira are the issues database, and in this software you can see this issues from many different views, boards, search results, etc. The problem is that you can't always edit an issue item depending on the view you see it in, and this became frustrating for users. I also used Trello which was a lot more friendlier in its layout and edit-ability, but was better suited for Kanban, than scrum.

Since some of the users balked at Jira, we found the key to making adoption all the scrum method work was to have a scrum master. We picked one of our researchers who was interested in software development to take a 2 day scrum master course. He then became our scrum master, and championed the adoption. He made sure we had our meetings, and made sure that issues that were brought up in meetings or informally were put into the back log, and also to keep us disciplined. He recently left, and we have now picked another of our research team to take up the task.

Before this adoption, we did manage to create quite a bit of useful software. However, as our systems became more mature and complex, we got bogged down by the maturity of our software which made any new innovations difficult since they had to marry into an existing complex system. We then needed scrum to become agile again.

Scrum became so successful, that Banking side of our business started to adopt scrums too!

Longtime absence and more to come

Hello everyone. I just wanted to say that I am back. I have been on hiatus for awhile. New changes in my personal life including a new baby daughter, Adeline aka Ada (partial named after Ada Lovelace) and also a new house in the 'burbs of New York. More to come...