\documentclass{article} \setlength{\textwidth}{6.75in} % 1 inch side margins \setlength{\textheight}{9in} % ~1 inch top and bottom margins \setlength{\headheight}{0in} \setlength{\topmargin}{0in} \setlength{\headsep}{0in} \setlength{\oddsidemargin}{0in} \setlength{\evensidemargin}{0in} \title{Botan Internals} \author{Jack Lloyd (lloyd@randombit.net)} \date{August 20, 2006} \newcommand{\filename}[1]{\texttt{#1}} \newcommand{\manpage}[2]{\texttt{#1}(#2)} \newcommand{\function}[1]{\textbf{#1}} \newcommand{\type}[1]{\texttt{#1}} \renewcommand{\arg}[1]{\textsl{#1}} \begin{document} \maketitle \tableofcontents \parskip=5pt \section{Introduction} This document is intended to document some of the trickier and/or more complicated parts of Botan. This is not going to be terribly useful if you just want to use the library, but for people wishing to understand how it works, or contribute new code to it, it will hopefully prove helpful. I've realized that a lot of things Botan does internally are pretty hard to understand, and that a lot of things are only inside my head, which is a bad place for them to be (things tend to get lost in there, not to mention the possibility that I'll get hit by a truck next week). This document is currently very incomplete. I'll be working on it as I have time. \pagebreak \section{Filter} \type{Filter} is one of the core abstractions of the library. It is used to represent any sort of transformation. Nearly all \type{Filter}s are linear; they take input from a single source and send their output (if any) to another single \type{Filter}. The one exception is \type{Fanout\_Filter}, which uses friend access to \type{Filter} in order to allow for multiple \type{Filter}s to attach to its output. This special access is used by the Chain and Fork filters; Chain encapsulates one or more \type{Filter}s into a single Filter, and Fork sends its input to a set of several \type{Filter} objects. The majority of the relations between filters is maintained by the \type{Pipe} object which ``owns'' the \type{Filter}s. \section{Pipe} \type{Pipe} is, conceptually, a tree structure of \type{Filter} objects. There is a single unique top, and an arbitrary number of leaves (which are \type{SecureQueue} objects). \type{SecureQueue} is a simple \type{Filter} that buffers its input. Writing into the pipe writes into the top of the tree. The filter at the top of the tree writes its output into the next \type{Filter}, and so on until eventually data trickles down into the bottommost \type{Filter}s, where the data is stored for later retrieval. When a new message is started, \type{Pipe} searches through the tree of \type{Filter}s and finds places where the \arg{next} field of the \type{Filter} is NULL. This implies that it was the lowest layer of the \type{Filter} tree that the user added. It then adds \type{SecureQueue} objects onto these \type{Filter}s. These queues are also stored in an deque; this is so \type{Pipe} can read from them later without doing a tree traversal each time. \type{Pipe} will, if asked, destroy the existing tree structure, in order to create a new one. However, the queue objects are not deleted, because \type{Pipe} might be asked to read from them later (while \type{Pipe} could delete all the messages in this case, the principle of least astonishment suggested keeping them). What I wrote about \type{Pipe} keeing the queues in a deque is a lie. Sort of. It keeps them in an object called \type{Output\_Buffers}, which keeps them in a deque. \type{Output\_Buffers} is intended to abstract away how message queues are stored from \type{Pipe}. After a queue has been added to the output buffers object, \type{Pipe} keeps no references to it whatsoever; all access is mediated by the \type{Output\_Buffers}. This allows queues which have been read to be deleted, rather than leaving empty queue objects all over the place. \section{Library Initialization} WRITEME \section{Lookup Mechanism} Most objects know their name, and they know how to create a new copy of themselves. We build mapping tables that map from an algorithm name into a single instance of that algorithm. The tables themselves can be found in \filename{src/lookup.cpp}. There are a set of functions named \function{add\_algorithm} that can be used to populate the tables. We get something out of the table with \function{retrieve\_x}, where x is the name of a type (\texttt{block\_cipher}, \texttt{hash}, etc). This returns a const pointer to the single unique instance of the algorithm that the lookup tables know about. If it doesn't know about it, it falls back on calling a function called \function{try\_to\_get\_x}. These functions live in \filename{src/algolist.cpp}. They are mostly used to handle algorithms which need (or at least can have) arguments passed to them, like \type{HMAC} and \type{SAFER\_SK}. It will return NULL if it can't find the algorithm at all. When it's asked for an algorithm it doesn't know about (ie, isn't in the mapping tables), the retrieval functions will ask the try-to-get functions if \emph{they} know about it. If they do, then the object returned will be stored into the table for later retrieval. The functions \function{get\_x} call the retrieval functions. If we get back NULL, an exception is thrown. Otherwise it will call the \function{clone} method to get a new copy of the algorithm, which it returns. The various functions like \function{output\_length\_of} call the retrieval function for each type of object that the parameter in question (in this case, \texttt{OUTPUT\_LENGTH}) might be meaningful for. If it manages to get back an object, it will return (in this case) the \texttt{OUTPUT\_LENGTH} field of the object. No allocations are required to call this function: all of its operations work directly on the copies living in the lookup tables. \section{Allocators} A big (slow) mess. \section{BigInt} Read ``Handbook of Applied Cryptography''. \section{PEM/BER Identification} We have a specific algorithm for figuring out if something is PEM or BER. Previous versions (everything before 1.3.0) requried that the caller specify which one it was, and they had to be right. Now we use a hueristic (aka, an algorithm that sometimes doesn't work right) to figure it out. If the first character is not 0x30 (equal to ASCII '0'), then it can't possibly be BER (because everything we care about is enclosed in an ASN.1 SEQUENCE, which for BER/DER is encoded as beginning with 0x30). Roughly 99.9% of PEM blocks \emph{won't} have a random 0 character in front of them, so we are mostly safe (unless someone does it on purpose, in which case, please hit them for me). But to be sure, if there is a 0, then we search the first \emph{N} bytes of the block for the string ``-----BEGIN ``, which marks the typical start of a PEM block. The specific \emph{N} depends on the variable ``base/pem\_search'', which defaults to 4 kilobytes. So, you can actually fool it either way: that a PEM file is really BER, or that a BER file is actually PEM. To fool it that a BER file is PEM, just have the string ``-----BEGIN `` somewhere (I can't imagine this string shows up in certificates or CRLs too often, so if it is there it means somebody is being a jerk). If a file starts with 0 and has at least ``base/pem\_search'' byte more junk in the way, it won't notice that its PEM at all. In either case, of course, the loading will fail, and you'll get a nice exception saying that the decoding failed. \end{document}