Kyle Lahnakoski

Just One More Layer of Indirection
(Trying to achieve stable orbit with sufficient architecture)

Archive for the ‘Coding’ Category

Missing Step Zero

Wednesday, December 9th, 2009

Why do instructions seem to miss the step before the first? I will call this missing step “Missing Step Zero”, if you will.

Maybe I am just bad at Googling. Maybe I have a problem with directions. Or, maybe I have a non-standard setup, but the missing step zero has plagued me several times in my life.

Now, I am too forgetful to remember all the Missing Step Zeros I had to hunt down but her are two:

“How do I write a file in Oracle’s PL/SQL?”

But none of them work!

After hours of investigation I found the missing step zero:

Make sure sysdba grants you permission to execute the UTL_FILE package

GRANT EXECUTE ON UTL_FILE TO <username>

I wish I read *this* first

“How do I turn on WCF logging?”

And there are hosts of other sites with various logging options. But none of them work!

After a few days, and help from a friend, I found the sneaky step zero:

Only the app.config file in the executable subproject is used to configure logging. All other subproject config files are ignored.

Seriously? How about generating an error when an app.config file is not going to be used?

Are Ad Servers Bogging Down the Web?

Monday, November 30th, 2009

Slashdot brings up a point I complain about: Ad servers are slowing down the web.

I do not use web applications because they are slow. I do not know what people do to pass the time when they wait for each page to load. Using web mail, and adding an attachment makes you feel like you wasted precious time.

The web is mostly slow because of server latency. Especially “waiting for …” whatever ad server has been bogged down. I particularly dislike the sites that also use the slow Google Analytics servers.

Type Transformation Library. In Java!

Saturday, November 28th, 2009

I have just read an interesting post on LtU, which asks for a type-class transformation library. And it reminded me of wanting the same thing. I had not considered making these features into a stand-alone project. This is perfect for a project:

  1. The features are definitely useful, I have had to build some portions of this library for myself. I would have been happy to have a library that did this for me.
  2. The projects has a finite size: As long as we limit the number of forms we can transform between, the number of transformations are finite. Certainly, choosing the top four, or five commons forms will make a useful library.
  3. Adding forms is perfect for the open source community to contribute: The overall structure of the API would be clearly defined, and people can add their own transformations without knowing the details of the bigger project. Limiting scope of the task, and making it manageable.
  4. Much of the heavy lifting has been done: In various personal libraries of code, and in the open source community, these transformations exist already. All that remains is patching the disparate parts into a normalized, clean API
  5. The type transform API should be normalized and complete (any type to any other type) so it is easy to learn. This may demand us to implement non-useful transformations, or worse, annotate forms that can not support the richness that some forms can.

Try an Index instead of Changing Your Infrastructure

Sunday, November 15th, 2009

One thing that disturbs me is the proliferation evil agents who love key-value stores. Especially those that love key-value stores in a latency-infested cloud. What upsets me more are the infinitely confused people who believe a database is *worse* than their key-value storage.

Here is one where Ian prefers Cassandra over proper database indexes:

For some reason, Ian has compared his terrible query to his optimized Cassandra implementation. The query (and schema) are so bad, I suspect it’s a straw man.

Ian does not provide the SQL which makes him conclude that “Computing the intersection with a JOIN is much too slow in MySQL, so we have to do it in PHP.”. Any statement that implies a join is done faster outside the database should set of warning bells: The database should have all the information required to make your queries fast. If this is not the case, then something is seriously wrong with your indexes.

An all-database solution, even if it is a stored procedure, will be faster than a networked solution just because of latency. Personally, I have found returning a few hundred extra rows from a single “close enough” query significantly faster than issuing two queries with perfect results: Latency is your biggest enemy.

Let’s look at the Digg schema provided:

CREATE TABLE 'Diggs' (
  'id'      INT(11),
  'itemid'  INT(11),
  'userid'  INT(11),
  'digdate' DATETIME,
  PRIMARY KEY ('id'),
  KEY 'user'  ('userid'),
  KEY 'item'  ('itemid')
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
 

CREATE TABLE 'Friends' (
  'id'           INT(10) AUTO_INCREMENT,
  'userid'       INT(10),
  'username'     VARCHAR(15),
  'friendid'     INT(10),
  'friendname'   VARCHAR(15),
  'mutual'       TINYINT(1),
  'date_created' DATETIME,
  PRIMARY KEY                ('id'),

  UNIQUE KEY 'Friend_unique' ('userid','friendid'),
  KEY        'Friend_friend' ('friendid')

) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Some changes to the indexes would help:

  1. KEY ‘user’ (‘userid’) – does not help much when a user has digged many items: The index will help by pointing to all the actual ‘Diggs’ records, but the database will have to load every one of those blocks from disk to get that information (very likely one block per record). I would have suggested UNIQUE KEY ‘user’ (‘userid’, ‘itemid’, ‘digdate’) – This would have allowed the query to simply use the index, and not have to go back to the massive, unsorted, ‘Diggs’ table.
  2. UNIQUE KEY ‘Friend_unique’ (‘userid’,‘friendid’) – Seems to be the correct index to for “Query Friends for all my friends.”; this should be a single block lookup. There is no reason this should take 1.5seconds.
  3. KEY ‘Friend_friend’ (‘friendid’) – Maybe instead, Ian intended to have a list of all users that made ‘me’ a friend, rather than all users ‘I’ have befriended. This certainly explains the 1.5sec response time. In this case, the index should be expanded so the table blocks do not need to be loaded: UNIQUE KEY ‘Friend_friend’ (‘friendid’, ‘userid’).
  4. Maybe MySQL is poor database and loads the original records during a query even if the columns are not needed

Anyone who may be complaining about the extra disk space required to write these indexes should note that Ian’s Cassandra implementation consumes much more space than I am advocating here.

Even *IF* the Digg database is so big that the index lookups take too long, we should realize that we can pre-compute query results in the database, just like in Ian’s Cassandra implementation. If the database does not have materialized views, we can always add triggers to do the job ourselves. The former is still a limited technology, and the latter is quite messy, but both are better than changing your whole platform.

Finally, it seems Ian is trying to optimize for the worst case: “Kevin Rose, for example, has 40,000 followers”. I disagree with changing your infrastructure for a single use case for a minority of users, but that is a business decision that involves more issues than Ian’s blog entry can be expected to consider.

In conclusion, I am angry that the human race has lost another soul to the legion of key-value fanatics. I am further incensed that apparently 298 other nameless souls have followed Ian into the pits of hell. (298 diggs at time of writing).

PI

Thursday, October 29th, 2009

I have a small programming project, called YAY, which I work on occasionally.  The objective of YAY is to be a type-safe and easy to use parser-generator-and-compiler.  The parser-generator is the easy part.  The compiler portion is more difficult.  Specifically, I am adding namespace processing so that the parsers specified are able to generate general graphs, and not just trees.  In theory, YAY should be able to parse simple “languages” like XML and HTML, including URLs and XML namespaces, with no post-processing.

Programming languages, like Java, should require macro definitions to become full compilers. Unfortunately, macro definitions have not been added to YAY yet.

Today I have discovered π (PI), which is very much like YAY.  This is good news because I do not particularly enjoy building YAY, I only like using it.  If π (PI) can replace YAY, then I can have someone else do the hard work, while I play at a higher abstraction level.  π (PI) looks interesting because it seems to have avoided YAYs intermediate parse tree representation, and seems to go directly to macro (re)writing. 

But, I am suspicious whether it works. 

Now, I could be wrong, but YAY is complicated for a reason:  It must allow identical language constructs inside different contexts to mean different things.  For example

    A: for (Object o : MyList){

      for (Object p : MyOtherList){

        if (something) continue A;

      }//for

    }//for

In this case the continue A; refers to scope that is ‘far’ from itself, and the same sequence of bytes can also refer to a different exit point of another loop later in the program. I am also thinking that exception handing scope can be more complicated.

It is not obvious how π (PI) achieves this non-local syntax specification.

In any case, language specification is one of my favorite subjects.  I am compelled to review the π (PI) implementation despite it being in a pre-alpha state.

SQL Databases Could Scale

Sunday, October 4th, 2009

Adam Wiggins says that SQL Databases Don’t Scale.   Some of the comments there mention Oracle RAC being able to scale quite well, but I do not know if RAC still has a scaling limit, albeit higher.   Maybe Oracle’s RAC limit is effectively infinity (much like no one will need more than 640K ram), in which case RAC solves our problems, and the discussion is complete.

But, let’s assume SQL databases do not scale now, I believe they can be scaled without changes to the application logic.

The solution can be found in Sharding; which means partitioning the data between servers according to access patterns.  I propose automatic sharding which will take the database requests, at the client end, and redirect those requests to the machine(s) with the required shard of data.

Automatic sharding should be completely possible:  Database constraints reveal the strongly connected data, but also reveal the natural break lines in that data.  Application access patterns (from profiling) can provide evidence of about what tables do not change often, and what data is ripe for replication.

For example, the automatic sharder should “see” that partition by user id is effective because the data dependencies between users is quite small.   Furthermore, the mutually dependent portion of the database will consist of lookup tables, and other rarely changed data; which can be replicated given the few times it changes.

The relational database was designed around the independence of rows.  This row independence is necessary for highly parallel operations, which is exactly what sharding needs.   If it is true that the database community has been “trying to solve a problem for twenty years and still haven’t managed to come up with an obvious solution”, then I am dismayed:  After sharding a database or two, it should be obvious how to automate the sharding.

ASCIIMathML

Monday, August 10th, 2009

Finally, a simple mathematical presentation tool for the web: ASCIIMathML

Check out the awesomeness:

`"ERROR" = sum_("over all Just Notes") (( "note" ^("round"(ln( "note" )/ln( "spacing" ))) - "note") / "note" /( "max"^|"power"\|))^2`

Thanks again to WordPress to make using this Javascript library just a little bit harder.

Page’s Law

Monday, June 1st, 2009

Finally, a name for what I am interested in defeating:  Page’s Law.  I advocate a proliferation of domain specific languages, and higher level languages to help the programmer become more productive.  Unfortunatly, these high-level laguages make less (compile-time) assumptions and run much slower.

I do not know what Google has planned, but my plan is to focus on optimizing the abstract high level set-operations, and optimize the common compiler and interpreter operations.  This is nothing new, just they have not been done in a single system, during runtime, before.

Awesome Weekend!

Sunday, May 17th, 2009

Last time I spent a weekend hacking was a few years ago, in a tent. Back then, I converted a diff algorithm from C to Java, and put a friendly diff/merge API on top of it. This weekend is even better, because it is almost three days of no interruption, and no time was wasted traveling to and from a tent.

And, serendipitously: Finding The Coding Zone: Your Perfect Trifecta?

This weekend I am working on the next version of YAY, a parser specification language. This next version will be able to generate general parse graphs, as opposed to parse trees. This is good for the declarative aspect of some languages.

General parse graphs are generated from parse trees using some namespace as a reference to connect nodes of that tree. Essentially, any node can be given a name, and that name can be used by other nodes to refer to it.

The namespace is not flat, but rather can be transformed by the nodes to reflect the varying contexts represented by the parse tree. As you can imagine “this” in one part of the tree refers to a very different object then “this” in another part of the tree. Namespace transformations are used to portray changes in scope

I must also mention that the namespaces are limited to the context of the source code. This means namespaces can not be used to refer to objects that exist only in the runtime. I hope that this will not be too limiting: Classes, methods, and properties are all realized in the source code, so can be “named”. Futhermore, with a little trickery, prototypical instances can also be named; by assuming they occupy some part of the original source code: eg “this”.

Each node has access to namespace of local nodes to help derive it’s own namespace. The geeky conclusion is that the namespace dependencies must go though a dependency analysis so all nodes are calculated in the correct order. Yay! Dependency analysis!

I Promise …

Saturday, May 16th, 2009

I promise to tone down the political and economic comments.  After all, they have a limited lifespan.  I will post my comments to other blogs where appropriate, just like everyone else.

Also, I promise to increase the amount of code blogging.  These blogs will just be geeky thing I have done, but at least they will not age as fast as a political blog.

Finally, I have changed the name of my blog to better reflect my passion:  To build software so complicated I can not understand it!

kyle@arcavia.com