merge in other stuff real quick

This commit is contained in:
Brian Picciano 2013-10-16 20:21:40 -04:00
commit 98245f1639
7 changed files with 453 additions and 71 deletions

@ -1,72 +1,9 @@
# Lagom
This is my here blog. It's not much at the moment (one post? booyah!), but maybe it'll grow.
> #### *Lagom* is a Swedish word with no direct English equivalent, meaning "just the right amount"
Maybe not
Lagom, a [Jekyll][j] blog theme with just the right amount of style.
* [Erlang, tcp sockets, and active true](erlang-tcp-socket-pull-pattern.md) (originally posted March 9, 2013)
* [go+](goplus.md) (originally posted July 11, 2013)
* [Generations](generations.md) (originally posted October 8, 2013)
Extracted lovingly from [http://mdswanson.com][mds] for your enjoyment!
* Responsive, based on [Skeleton][skeleton]
* [Font Awesome][font-awesome] for icons
* Open Sans from [Google web fonts][gfonts]
* Built-in Atom RSS feed
## Action Shots
![](http://i.imgur.com/Pmzk4j1.png)
![](http://i.imgur.com/CT2Xvug.png)
![](http://i.imgur.com/XisjqW1.jpg)
## Installation
- Install Jekyll: `gem install jekyll`
- [Fork this repository][fork]
- Clone it: `git clone https://github.com/YOUR-USER/lagom`
- Run the jekyll server: `jekyll serve`
You should have a server up and running locally at <http://localhost:4000>.
## Customization
Next you'll want to change a few things. Most of them can be changed directly in
[_config.yml][config]. That's where you can add your social links, change the accent
color, stuff like that.
There's a few other places that you'll want to change, too:
- [CNAME][cname]: If you're using this on GitHub Pages with a custom domain name,
you'll want to change this to be the domain you're going to use. All that should
be in here is a domain name on the first line and nothing else (like: `example.com`).
- [favicon.png][favicon]: This is the icon in your browser's address bar. You should
change it to whatever you'd like.
- [logo.png][logo]: A square-ish image that appears in the upper-left corner
## Deployment
You should deploy with [GitHub Pages][pages] - it's just easier.
All you should have to do is rename your repository on GitHub to be
`username.github.io`. Since everything is on the `gh-pages` branch, you
should be able to see your new site at <http://username.github.io>.
## Licensing
[MIT](https://github.com/swanson/lagom/blob/master/LICENSE) with no
added caveats, so feel free to use this on your site without linking back to
me or using a disclaimer or anything silly like that.
## Contact
I'd love to hear from you at [@_swanson][twitter]. Feel free to open issues if you
run into trouble or have suggestions. Pull Requests always welcome.
[j]: http://jekyllrb.com/
[mds]: http://mdswanson.com
[skeleton]: http://www.getskeleton.com/
[font-awesome]: http://fortawesome.github.io/Font-Awesome/
[gfonts]: http://www.google.com/fonts/specimen/Open+Sans
[fork]: https://github.com/swanson/lagom/fork
[config]: https://github.com/swanson/lagom/blob/master/_config.yml
[cname]: https://github.com/swanson/lagom/blob/master/CNAME
[favicon]: https://github.com/swanson/lagom/blob/master/favicon.png
[logo]: https://github.com/swanson/lagom/blob/master/logo.png
[pages]: http://pages.github.com
[twitter]: https://twitter.com/_swanson
That's all folks!

@ -0,0 +1,252 @@
# Erlang, tcp sockets, and active true
If you don't know erlang then [you're missing out][0]. If you do know erlang,
you've probably at some point done something with tcp sockets. Erlang's highly
concurrent model of execution lends itself well to server programs where a high
number of active connections is desired. Each process can autonomously handle its
single client, greatly simplifying the logic of the whole application while
still retaining [great performance characteristics][1].
# Background
For an erlang process which owns a single socket there are three different ways
to receive data off of that socket. These all revolve around the `active`
[setopts][2] flag. A socket can be set to one of:
* `{active,false}` - All data must be obtained through [recv/2][3] calls. This
  amounts to synchronous socket reading.
* `{active,true}` - All data on the socket gets sent to the controlling process
  as a normal erlang message. It is the process's responsibility to keep up
  with the buffered data in the message queue. This amounts to asynchronous
  socket reading.
* `{active,once}` - When set the socket is placed in `{active,true}` for a
  single packet. That is, once set the process can expect a single message to
  be sent to it when data comes in. To receive any more data off of the socket
  the socket must either be read from using [recv/2][3] or be put in
  `{active,once}` or `{active,true}`.
# Which to use?
Many (most?) tutorials advocate using `{active,once}` in your application
\[0]\[1]\[2]. This has to do with usability and security. When in `{active,true}`
it's possible for a client to flood the connection faster than the receiving
process will process those messages, potentially eating up a lot of memory in
the VM. However, if you want to be able to receive both tcp data messages as
well as other messages from other erlang processes at the same time you can't
use `{active,false}`. So `{active,once}` is generally preferred because it
deals with both of these problems quite well.
# Why not to use `{active,once}`
Here's what your classic `{active,once}` enabled tcp socket implementation will
probably look like:
```erlang
-module(tcp_test).
-compile(export_all).

-define(TCP_OPTS, [
    binary,
    {packet, raw},
    {nodelay,true},
    {active, false},
    {reuseaddr, true},
    {keepalive,true},
    {backlog,500}
]).

%Start listening
listen(Port) ->
    {ok, L} = gen_tcp:listen(Port, ?TCP_OPTS),
    ?MODULE:accept(L).

%Accept a connection
accept(L) ->
    {ok, Socket} = gen_tcp:accept(L),
    ?MODULE:read_loop(Socket),
    io:fwrite("Done reading, connection was closed\n"),
    ?MODULE:accept(L).

%Read everything it sends us
read_loop(Socket) ->
    inet:setopts(Socket, [{active, once}]),
    receive
        {tcp, _, _} ->
            do_stuff_here,
            ?MODULE:read_loop(Socket);
        {tcp_closed, _}-> donezo;
        {tcp_error, _, _} -> donezo
    end.
```
This code isn't actually usable for a production system; it doesn't even spawn a
new process for the new socket. But that's not the point I'm making. If I run it
with `tcp_test:listen(8000)`, and in another window run:
```bash
while [ 1 ]; do echo "aloha"; done | nc localhost 8000
```
We'll be flooding the server with data pretty well. Using [eprof][4] we can
get an idea of how our code performs, and where the hang-ups are:
```erlang
1> eprof:start().
{ok,<0.34.0>}
2> P = spawn(tcp_test,listen,[8000]).
<0.36.0>
3> eprof:start_profiling([P]).
profiling
4> running_the_while_loop.
running_the_while_loop
5> eprof:stop_profiling().
profiling_stopped
6> eprof:analyze(procs,[{sort,time}]).
****** Process <0.36.0> -- 100.00 % of profiled time ***
FUNCTION                         CALLS       %      TIME  [uS / CALLS]
--------                         -----     ---      ----  [----------]
prim_inet:type_value_2/2             6    0.00         0  [      0.00]
....snip....
prim_inet:enc_opts/2                 6    0.00         8  [      1.33]
prim_inet:setopts/2           12303599    1.85   1466319  [      0.12]
tcp_test:read_loop/1          12303598    2.22   1761775  [      0.14]
prim_inet:encode_opt_val/1    12303599    3.50   2769285  [      0.23]
prim_inet:ctl_cmd/3           12303600    4.29   3399333  [      0.28]
prim_inet:enc_opt_val/2       24607203    5.28   4184818  [      0.17]
inet:setopts/2                12303598    5.72   4533863  [      0.37]
erlang:port_control/3         12303600   77.13  61085040  [      4.96]
```
eprof shows us where our process is spending the majority of its time. The `%`
column indicates the percentage of profiled time the process spent inside
each function. We can pretty clearly see that the vast majority of time was spent
inside `erlang:port_control/3`, the BIF that `inet:setopts/2` uses to switch the
socket to `{active,once}` mode. Amongst the functions which were called on every
loop, it takes up by far the most time. In addition, all of those other
calls, apart from `read_loop` itself, are also part of `inet:setopts/2`.
I'm gonna rewrite our little listen server to use `{active,true}`, and we'll do
it all again:
```erlang
-module(tcp_test).
-compile(export_all).

-define(TCP_OPTS, [
    binary,
    {packet, raw},
    {nodelay,true},
    {active, false},
    {reuseaddr, true},
    {keepalive,true},
    {backlog,500}
]).

%Start listening
listen(Port) ->
    {ok, L} = gen_tcp:listen(Port, ?TCP_OPTS),
    ?MODULE:accept(L).

%Accept a connection
accept(L) ->
    {ok, Socket} = gen_tcp:accept(L),
    inet:setopts(Socket, [{active, true}]), %Well this is new
    ?MODULE:read_loop(Socket),
    io:fwrite("Done reading, connection was closed\n"),
    ?MODULE:accept(L).

%Read everything it sends us
read_loop(Socket) ->
    %inet:setopts(Socket, [{active, once}]),
    receive
        {tcp, _, _} ->
            do_stuff_here,
            ?MODULE:read_loop(Socket);
        {tcp_closed, _}-> donezo;
        {tcp_error, _, _} -> donezo
    end.
```
And the profiling results:
```erlang
1> eprof:start().
{ok,<0.34.0>}
2> P = spawn(tcp_test,listen,[8000]).
<0.36.0>
3> eprof:start_profiling([P]).
profiling
4> running_the_while_loop.
running_the_while_loop
5> eprof:stop_profiling().
profiling_stopped
6> eprof:analyze(procs,[{sort,time}]).
****** Process <0.36.0> -- 100.00 % of profiled time ***
FUNCTION                         CALLS       %      TIME  [uS / CALLS]
--------                         -----     ---      ----  [----------]
prim_inet:enc_value_1/3              7    0.00         1  [      0.14]
prim_inet:decode_opt_val/1           1    0.00         1  [      1.00]
inet:setopts/2                       1    0.00         2  [      2.00]
prim_inet:setopts/2                  2    0.00         2  [      1.00]
prim_inet:enum_name/2                1    0.00         2  [      2.00]
erlang:port_set_data/2               1    0.00         2  [      2.00]
inet_db:register_socket/2            1    0.00         3  [      3.00]
prim_inet:type_value_1/3             7    0.00         3  [      0.43]
.... snip ....
prim_inet:type_opt_1/1              19    0.00         7  [      0.37]
prim_inet:enc_value/3                7    0.00         7  [      1.00]
prim_inet:enum_val/2                 6    0.00         7  [      1.17]
prim_inet:dec_opt_val/1              7    0.00         7  [      1.00]
prim_inet:dec_value/2                6    0.00        10  [      1.67]
prim_inet:enc_opt/1                 13    0.00        12  [      0.92]
prim_inet:type_opt/2                19    0.00        33  [      1.74]
erlang:port_control/3                3    0.00        59  [     19.67]
tcp_test:read_loop/1          20716370  100.00  12187488  [      0.59]
```
This time our process spent almost no time at all (according to eprof, 0%)
fiddling with the socket opts. Instead it spent all of its time in the
read_loop doing the work we actually want to be doing.
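As a rough back-of-the-envelope check, the TIME and CALLS columns in the two profiles give a per-message cost for each version. The sketch below sums the TIME of the rows that run on every loop in the `{active,once}` profile and divides by the number of `read_loop` calls, then does the same for the `{active,true}` profile. The helper name `per_call_us` is made up for this example, the snipped rows are ignored, and eprof's own tracing overhead is baked into these numbers, so treat the result as an order-of-magnitude comparison only.

```bash
#!/bin/sh
# per_call_us TOTAL_US CALLS: average microseconds per call, two decimals.
# (Hypothetical helper for this example only.)
per_call_us() {
    awk -v t="$1" -v c="$2" 'BEGIN { printf "%.2f\n", t / c }'
}

# {active,once}: TIME (uS) of every row that runs on each loop, summed.
once_total=$((1466319 + 1761775 + 2769285 + 3399333 + 4184818 + 4533863 + 61085040))
per_call_us "$once_total" 12303598   # divided by read_loop iterations

# {active,true}: read_loop is the only row with any meaningful time.
per_call_us 12187488 20716370
```

Under those assumptions the `{active,once}` loop comes out at roughly 6.4 uS per message against roughly 0.6 uS for `{active,true}`, about a tenfold difference.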
# So what does this mean?
I'm by no means advocating never using `{active,once}`. The security concern is
still a completely valid concern and one that `{active,once}` mitigates quite
well. I'm simply pointing out that this mitigation has some fairly serious
performance implications which have the potential to bite you if you're not
careful, especially in cases where a socket is going to be receiving a large
amount of traffic.
# Meta
These tests were done using R15B03, but I've done similar ones in R14 and found
similar results. I have not tested R16.
* \[0] http://learnyousomeerlang.com/buckets-of-sockets
* \[1] http://www.erlang.org/doc/man/gen_tcp.html#examples
* \[2] http://erlycoder.com/25/erlang-tcp-server-tcp-client-sockets-with-gen_tcp
[0]: http://learnyousomeerlang.com/content
[1]: http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1
[2]: http://www.erlang.org/doc/man/inet.html#setopts-2
[3]: http://www.erlang.org/doc/man/gen_tcp.html#recv-2
[4]: http://www.erlang.org/doc/man/eprof.html

95
generations.md Normal file

@ -0,0 +1,95 @@
# Generations
A simple file distribution strategy for very large scale, high-availability
file-services.
# The problem
I work at a shop where we have millions of different files, any of which could
be arbitrarily chosen to serve to a client at any given time. These files are
uploaded by users of the app and retrieved by others.
Scaling such a system is no easy task. The chosen solution involves shuffling
files around on a nearly constant basis, making sure that files which are more
"popular" are on fast drives, while at the same time making sure that no drives
are at capacity and that all files, even newly uploaded ones, are stored
redundantly.
The problem with this solution is one of coordination. At any given moment the
app needs to be able to "find" a file so it can give the client a link to
download the file from one of the servers that it's on. Fulfilling this simple
requirement means that all datastores/caches holding information about where a
file lives need to be up-to-date at all times, and even then there are
race-conditions and network failures to contend with, while at all times the
requirements of the app evolve and change.
# A simpler solution
Let's say you want all files which get uploaded to be replicated in triplicate
in some capacity. You buy three identical hard-disks, and put each on a separate
server. As files get uploaded by clients, each file gets put on each drive
immediately. When the drives are filled (which should be at around the same
time), you stop uploading to them.
That was generation 0.
You buy three more drives, and start putting all files on them instead. This is
going to be generation 1. Repeat until you run out of money.
That's it.
## That's it?
It seems simple and obvious, and maybe it's the standard thing which is done,
but as far as I can tell no-one has written about it (though I'm probably not
searching for the right thing, let me know if this is the case!).
## Advantages
* It's so simple to implement, you could probably do it in a day if you're
starting a project from scratch
* By definition of the scheme all files are replicated in multiple places.
* Minimal information about where a file "is" needs to be stored. When a file is
uploaded all that's needed is to know what generation it is in, and then what
nodes/drives are in that generation.
* Drives don't need to "know" about each other. What I mean by this is that
whatever is running as the receive point for file-uploads on each drive doesn't
have to coordinate with its siblings running on the other drives in the
generation. In fact it doesn't need to coordinate with anyone. You could
literally rsync files onto your drives if you wanted to. I would recommend using
[marlin][0] though :)
* Scaling is easy. When you run out of space you can simply start a new
generation. If you don't like playing that close to the chest there's nothing to
say you can't have two generations active at the same time.
* Upgrading is easy. As long as a generation is not marked-for-upload, you can
easily copy all files in the generation into a new set of bigger, badder drives,
add those drives into the generation in your code, remove the old ones, then
mark the generation as uploadable again.
* Distribution is easy. You just copy a generation's files onto a new drive in
Europe or wherever you're getting an uptick in traffic from and you're good to
go.
* Management is easy. It's trivial to find out how many times a file has been
replicated, or how many countries it's in, or what hardware it's being served
from (given you have easy access to information about specific drives).
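A minimal sketch of the scheme in shell, assuming each drive is just a directory and a generation is a directory holding its drives. The layout (`$GEN_ROOT/<generation>/<drive>/`) and the function names `gen_create`, `gen_store` and `gen_locate` are invented for illustration; in a real setup each drive would live on a separate server and the copy would go over the network.

```bash
#!/bin/sh
# Hypothetical layout: $GEN_ROOT/<generation>/<drive>/<file>
GEN_ROOT=${GEN_ROOT:-/tmp/generations}

# Create a generation made up of the given drive names.
gen_create() {
    gen=$1; shift
    for drive in "$@"; do
        mkdir -p "$GEN_ROOT/$gen/$drive"
    done
}

# Copy a file onto every drive of a generation, then print the generation.
# That generation number is all the app has to remember for the file.
gen_store() {
    gen=$1; file=$2
    for drive in "$GEN_ROOT/$gen"/*/; do
        cp "$file" "$drive" || return 1
    done
    echo "$gen"
}

# Print the path of every copy of a file, given its generation and name.
gen_locate() {
    gen=$1; name=$2
    for drive in "$GEN_ROOT/$gen"/*/; do
        [ -f "$drive$name" ] && echo "$drive$name"
    done
    return 0
}
```

With this in place, `gen_create 0 a b c` sets up generation 0, `gen_store 0 some-file` replicates a file onto all three drives, and `gen_locate 0 some-file | wc -l` reports how many replicas exist. None of the functions need the drives to coordinate with each other.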
## Caveats
The big caveat here is that this is just an idea. It has NOT been tested in
production. But we have enough faith in it that we're going to give it a shot at
cryptic.io. I'll keep this page updated.
The second caveat is that this scheme does not inherently support caching. If a
file suddenly becomes super popular the world over your hard-disks might not be
able to keep up, and it's probably not feasible to have an FIO drive in *every*
generation. I think that [groupcache][1] may be the answer to this problem,
assuming your files are reasonably small, but again I haven't tested it yet.
[0]: https://github.com/cryptic-io/marlin
[1]: https://github.com/golang/groupcache

73
goplus.md Normal file

@ -0,0 +1,73 @@
# Go and project root
Compared to other languages go has some strange behavior regarding its project
root settings. If you import a library called `somelib`, go will look for a
`src/somelib` folder in all of the folders in the `$GOPATH` environment
variable. This works nicely for globally installed packages, but it makes
encapsulating a project with a specific version, or modified version, rather
tedious. Whenever you go to work on this project you'll have to add its path to
your `$GOPATH`, or add the path permanently, which could break other projects
which may use a different version of `somelib`.
My solution is in the form of a simple script I'm calling go+. go+ will search
in the current directory and all of its parents for a file called `GOPROJROOT`. If
it finds that file in a directory, it prepends that directory's absolute path to
your `$GOPATH` and stops the search. Regardless of whether or not `GOPROJROOT`
was found, go+ will pass through all arguments to the actual go call. The
modification to `$GOPATH` will only last the duration of the call.
As an example, consider the following:
```
/tmp
    /hello
        GOPROJROOT
        /src
            /somelib/somelib.go
        /hello.go
```
If `hello.go` depends on `somelib`, as long as you run go+ from `/tmp/hello` or
one of its children your project will still compile.
Here is the source code for go+:
```bash
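#!/bin/sh

SEARCHING_FOR=GOPROJROOT
ORIG_DIR=$(pwd)

STOPSEARCH=0
SEARCH_DIR=$ORIG_DIR
while [ $STOPSEARCH = 0 ]; do

    # Look for the marker file directly inside the current search directory
    RES=$( find "$SEARCH_DIR" -maxdepth 1 -type f -name "$SEARCHING_FOR" | \
           grep -P "$SEARCHING_FOR$" | \
           head -n1 )

    if [ "$RES" = "" ]; then
        # Not found: give up at the filesystem root, otherwise move up a level
        if [ "$SEARCH_DIR" = "/" ]; then
            STOPSEARCH=1
        fi
        cd ..
        SEARCH_DIR=$(pwd)
    else
        export GOPATH=$SEARCH_DIR:$GOPATH
        STOPSEARCH=1
    fi
done

cd "$ORIG_DIR"
# Quote "$@" so arguments containing spaces reach go intact
exec go "$@"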
```
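The upward search can also be written without `find` and `grep`, using only a file test and `dirname`. This is a sketch of that idea rather than a replacement for the script above; the function name `find_proj_root` is made up for the example.

```bash
#!/bin/sh
# Print the nearest ancestor of $1 (default: the current directory) that
# contains a GOPROJROOT file. Returns non-zero if there is none.
# find_proj_root is a hypothetical helper name, not part of go+.
find_proj_root() {
    # Resolve to an absolute path in a subshell so the caller's cwd is untouched
    dir=$(cd "${1:-.}" && pwd) || return 1
    while :; do
        if [ -f "$dir/GOPROJROOT" ]; then
            echo "$dir"
            return 0
        fi
        # Reached the filesystem root without finding the marker
        [ "$dir" = "/" ] && return 1
        dir=$(dirname "$dir")
    done
}
```

One way to use it is `root=$(find_proj_root) && GOPATH="$root:$GOPATH" go build`, which sets `$GOPATH` for that single call only. Unlike go+, that one-liner does not run go at all when no `GOPROJROOT` is found.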
# UPDATE: Goat
I'm leaving this post for posterity, but go+ has some serious flaws in it. For
one, it doesn't allow for specifying the version of a dependency you want to
use. To this end, I wrote [goat][0] which does all the things go+ does, plus
real dependency management, PLUS it is built in a way that if you've been
following go's best-practices for code organization you shouldn't have to change
any of your existing code AT ALL. It's cool, check it out.
[0]: http://github.com/mediocregopher/goat

Binary file not shown.


@ -1,2 +0,0 @@
_site/
.DS_Store

27
res/go+ Executable file

@ -0,0 +1,27 @@
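#!/bin/sh

SEARCHING_FOR=GOPROJROOT
ORIG_DIR=$(pwd)

STOPSEARCH=0
SEARCH_DIR=$ORIG_DIR
while [ $STOPSEARCH = 0 ]; do

    # Look for the marker file directly inside the current search directory
    RES=$( find "$SEARCH_DIR" -maxdepth 1 -type f -name "$SEARCHING_FOR" | \
           grep -P "$SEARCHING_FOR$" | \
           head -n1 )

    if [ "$RES" = "" ]; then
        # Not found: give up at the filesystem root, otherwise move up a level
        if [ "$SEARCH_DIR" = "/" ]; then
            STOPSEARCH=1
        fi
        cd ..
        SEARCH_DIR=$(pwd)
    else
        export GOPATH=$SEARCH_DIR:$GOPATH
        STOPSEARCH=1
    fi
done

cd "$ORIG_DIR"
# Quote "$@" so arguments containing spaces reach go intact
exec go "$@"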