commit
98245f1639
@ -1,72 +1,9 @@ |
||||
# Lagom |
||||
This is my here blog. It's not much at the moment (one post? booyah!), but maybe it'll grow. |
||||
|
||||
> #### *Lagom* is a Swedish word with no direct English equivalent, meaning "just the right amount" |
||||
Maybe not |
||||
|
||||
Lagom, a [Jekyll][j] blog theme with just the right amount of style. |
||||
* [Erlang, tcp sockets, and active true](erlang-tcp-socket-pull-pattern.md) (originally posted March 9, 2013) |
||||
* [go+](goplus.md) (originally posted July 11, 2013) |
||||
* [Generations](generations.md) (originally posted October 8, 2013) |
||||
|
||||
Extracted lovingly from [http://mdswanson.com][mds] for your enjoyment! |
||||
|
||||
* Responsive, based on [Skeleton][skeleton] |
||||
* [Font Awesome][font-awesome] for icons |
||||
* Open Sans from [Google web fonts][gfonts] |
||||
* Built-in Atom RSS feed |
||||
|
||||
## Action Shots |
||||
![](http://i.imgur.com/Pmzk4j1.png) |
||||
![](http://i.imgur.com/CT2Xvug.png) |
||||
![](http://i.imgur.com/XisjqW1.jpg) |
||||
|
||||
## Installation |
||||
|
||||
- Install Jekyll: `gem install jekyll` |
||||
- [Fork this repository][fork] |
||||
- Clone it: `git clone https://github.com/YOUR-USER/lagom` |
||||
- Run the jekyll server: `jekyll serve` |
||||
|
||||
You should have a server up and running locally at <http://localhost:4000>. |
||||
|
||||
## Customization |
||||
|
||||
Next you'll want to change a few things. Most of them can be changed directly in |
||||
[_config.yml][config]. That's where you can add your social links, change the accent |
||||
color, stuff like that. |
||||
|
||||
There's a few other places that you'll want to change, too: |
||||
|
||||
- [CNAME][cname]: If you're using this on GitHub Pages with a custom domain name, |
||||
you'll want to change this to be the domain you're going to use. All that should |
||||
be in here is a domain name on the first line and nothing else (like: `example.com`). |
||||
- [favicon.png][favicon]: This is the icon in your browser's address bar. You should |
||||
change it to whatever you'd like. |
||||
- [logo.png][logo]: A square-ish image that appears in the upper-left corner |
||||
|
||||
## Deployment |
||||
|
||||
You should deploy with [GitHub Pages][pages] - it's just easier. |
||||
|
||||
All you should have to do is rename your repository on GitHub to be |
||||
`username.github.io`. Since everything is on the `gh-pages` branch, you |
||||
should be able to see your new site at <http://username.github.io>. |
||||
|
||||
## Licensing |
||||
|
||||
[MIT](https://github.com/swanson/lagom/blob/master/LICENSE) with no |
||||
added caveats, so feel free to use this on your site without linking back to |
||||
me or using a disclaimer or anything silly like that. |
||||
|
||||
## Contact |
||||
I'd love to hear from you at [@_swanson][twitter]. Feel free to open issues if you |
||||
run into trouble or have suggestions. Pull Requests always welcome. |
||||
|
||||
[j]: http://jekyllrb.com/ |
||||
[mds]: http://mdswanson.com |
||||
[skeleton]: http://www.getskeleton.com/ |
||||
[font-awesome]: http://fortawesome.github.io/Font-Awesome/ |
||||
[gfonts]: http://www.google.com/fonts/specimen/Open+Sans |
||||
[fork]: https://github.com/swanson/lagom/fork |
||||
[config]: https://github.com/swanson/lagom/blob/master/_config.yml |
||||
[cname]: https://github.com/swanson/lagom/blob/master/CNAME |
||||
[favicon]: https://github.com/swanson/lagom/blob/master/favicon.png |
||||
[logo]: https://github.com/swanson/lagom/blob/master/logo.png |
||||
[pages]: http://pages.github.com |
||||
[twitter]: https://twitter.com/_swanson |
||||
That's all folks! |
||||
|
@ -0,0 +1,252 @@ |
||||
# Erlang, tcp sockets, and active true |
||||
|
||||
If you don't know erlang then [you're missing out][0]. If you do know erlang, |
||||
you've probably at some point done something with tcp sockets. Erlang's highly |
||||
concurrent model of execution lends itself well to server programs where a high |
||||
number of active connections is desired. Each thread can autonomously handle its |
||||
single client, greatly simplifying the logic of the whole application while |
||||
still retaining [great performance characteristics][1]. |
||||
|
||||
# Background |
||||
|
||||
For an erlang thread which owns a single socket there are three different ways |
||||
to receive data off of that socket. These all revolve around the `active` |
||||
[setopts][2] flag. A socket can be set to one of: |
||||
|
||||
* `{active,false}` - All data must be obtained through [recv/2][3] calls. This |
||||
amounts to synchronous socket reading. |
||||
|
||||
* `{active,true}` - All data on the socket gets sent to the controlling thread |
||||
as a normal erlang message. It is the thread's |
||||
responsibility to keep up with the buffered data in the |
||||
message queue. This amounts to asynchronous socket reading. |
||||
|
||||
* `{active,once}` - When set the socket is placed in `{active,true}` for a |
||||
single packet. That is, once set the thread can expect a |
||||
single message to be sent to it when data comes in. To receive |
||||
any more data off of the socket the socket must either be |
||||
read from using [recv/2][3] or be put in `{active,once}` or |
||||
`{active,true}`. |
||||
|
||||
# Which to use? |
||||
|
||||
Many (most?) tutorials advocate using `{active,once}` in your application |
||||
\[0]\[1]\[2]. This has to do with usability and security. When in `{active,true}` |
||||
it's possible for a client to flood the connection faster than the receiving |
||||
process will process those messages, potentially eating up a lot of memory in |
||||
the VM. However, if you want to be able to receive both tcp data messages as |
||||
well as other messages from other erlang processes at the same time you can't |
||||
use `{active,false}`. So `{active,once}` is generally preferred because it |
||||
deals with both of these problems quite well. |
||||
|
||||
# Why not to use `{active,once}` |
||||
|
||||
Here's what your classic `{active,once}` enabled tcp socket implementation will |
||||
probably look like: |
||||
|
||||
```erlang |
||||
-module(tcp_test). |
||||
-compile(export_all). |
||||
|
||||
-define(TCP_OPTS, [ |
||||
binary, |
||||
{packet, raw}, |
||||
{nodelay,true}, |
||||
{active, false}, |
||||
{reuseaddr, true}, |
||||
{keepalive,true}, |
||||
{backlog,500} |
||||
]). |
||||
|
||||
%Start listening |
||||
listen(Port) -> |
||||
{ok, L} = gen_tcp:listen(Port, ?TCP_OPTS), |
||||
?MODULE:accept(L). |
||||
|
||||
%Accept a connection |
||||
accept(L) -> |
||||
{ok, Socket} = gen_tcp:accept(L), |
||||
?MODULE:read_loop(Socket), |
||||
io:fwrite("Done reading, connection was closed\n"), |
||||
?MODULE:accept(L). |
||||
|
||||
%Read everything it sends us |
||||
read_loop(Socket) -> |
||||
inet:setopts(Socket, [{active, once}]), |
||||
receive |
||||
{tcp, _, _} -> |
||||
do_stuff_here, |
||||
?MODULE:read_loop(Socket); |
||||
{tcp_closed, _}-> donezo; |
||||
{tcp_error, _, _} -> donezo |
||||
end. |
||||
``` |
||||
|
||||
This code isn't actually usable for a production system; it doesn't even spawn a |
||||
new process for the new socket. But that's not the point I'm making. If I run it |
||||
with `tcp_test:listen(8000)`, and in another window do: |
||||
|
||||
```bash |
||||
while [ 1 ]; do echo "aloha"; done | nc localhost 8000 |
||||
``` |
||||
|
||||
We'll be flooding the server with data pretty well. Using [eprof][4] we can |
||||
get an idea of how our code performs, and where the hang-ups are: |
||||
|
||||
```erlang |
||||
1> eprof:start(). |
||||
{ok,<0.34.0>} |
||||
|
||||
2> P = spawn(tcp_test,listen,[8000]). |
||||
<0.36.0> |
||||
|
||||
3> eprof:start_profiling([P]). |
||||
profiling |
||||
|
||||
4> running_the_while_loop. |
||||
running_the_while_loop |
||||
|
||||
5> eprof:stop_profiling(). |
||||
profiling_stopped |
||||
|
||||
6> eprof:analyze(procs,[{sort,time}]). |
||||
|
||||
****** Process <0.36.0> -- 100.00 % of profiled time *** |
||||
FUNCTION CALLS % TIME [uS / CALLS] |
||||
-------- ----- --- ---- [----------] |
||||
prim_inet:type_value_2/2 6 0.00 0 [ 0.00] |
||||
|
||||
....snip.... |
||||
|
||||
prim_inet:enc_opts/2 6 0.00 8 [ 1.33] |
||||
prim_inet:setopts/2 12303599 1.85 1466319 [ 0.12] |
||||
tcp_test:read_loop/1 12303598 2.22 1761775 [ 0.14] |
||||
prim_inet:encode_opt_val/1 12303599 3.50 2769285 [ 0.23] |
||||
prim_inet:ctl_cmd/3 12303600 4.29 3399333 [ 0.28] |
||||
prim_inet:enc_opt_val/2 24607203 5.28 4184818 [ 0.17] |
||||
inet:setopts/2 12303598 5.72 4533863 [ 0.37] |
||||
erlang:port_control/3 12303600 77.13 61085040 [ 4.96] |
||||
``` |
||||
|
||||
eprof shows us where our process is spending the majority of its time. The `%` |
||||
column indicates the percentage of profiled time the process spent inside |
||||
each function. We can pretty clearly see that the vast majority of time was spent |
||||
inside `erlang:port_control/3`, the BIF that `inet:setopts/2` uses to switch the |
||||
socket to `{active,once}` mode. Amongst the calls which were called on every |
||||
loop, it takes up by far the most time. In addition, all of those other |
||||
calls are also related to `inet:setopts/2`. |
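
As a rough check on that reading, the `%` values of the rows that run on every loop as part of `inet:setopts/2` can be added up. This is only a sketch: it reuses the figures already printed by eprof above, the choice of rows is my own reading of the table, and `setopts_share` is a name made up for the example.

```bash
#!/bin/sh
# Sum the % column for the per-loop rows that belong to inet:setopts/2.
# The figures are copied from the eprof output above; nothing is re-measured.
setopts_share() {
    awk '{ total += $2 } END { printf "%.2f\n", total }' <<'EOF'
prim_inet:setopts/2 1.85
prim_inet:encode_opt_val/1 3.50
prim_inet:ctl_cmd/3 4.29
prim_inet:enc_opt_val/2 5.28
inet:setopts/2 5.72
erlang:port_control/3 77.13
EOF
}

setopts_share
```

That comes to roughly 97.8 % of the profiled time, against 2.22 % for `tcp_test:read_loop/1` itself.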
||||
|
||||
I'm gonna rewrite our little listen server to use `{active,true}`, and we'll do |
||||
it all again: |
||||
|
||||
```erlang |
||||
-module(tcp_test). |
||||
-compile(export_all). |
||||
|
||||
-define(TCP_OPTS, [ |
||||
binary, |
||||
{packet, raw}, |
||||
{nodelay,true}, |
||||
{active, false}, |
||||
{reuseaddr, true}, |
||||
{keepalive,true}, |
||||
{backlog,500} |
||||
]). |
||||
|
||||
%Start listening |
||||
listen(Port) -> |
||||
{ok, L} = gen_tcp:listen(Port, ?TCP_OPTS), |
||||
?MODULE:accept(L). |
||||
|
||||
%Accept a connection |
||||
accept(L) -> |
||||
{ok, Socket} = gen_tcp:accept(L), |
||||
inet:setopts(Socket, [{active, true}]), %Well this is new |
||||
?MODULE:read_loop(Socket), |
||||
io:fwrite("Done reading, connection was closed\n"), |
||||
?MODULE:accept(L). |
||||
|
||||
%Read everything it sends us |
||||
read_loop(Socket) -> |
||||
%inet:setopts(Socket, [{active, once}]), |
||||
receive |
||||
{tcp, _, _} -> |
||||
do_stuff_here, |
||||
?MODULE:read_loop(Socket); |
||||
{tcp_closed, _}-> donezo; |
||||
{tcp_error, _, _} -> donezo |
||||
end. |
||||
``` |
||||
|
||||
And the profiling results: |
||||
|
||||
```erlang |
||||
1> eprof:start(). |
||||
{ok,<0.34.0>} |
||||
|
||||
2> P = spawn(tcp_test,listen,[8000]). |
||||
<0.36.0> |
||||
|
||||
3> eprof:start_profiling([P]). |
||||
profiling |
||||
|
||||
4> running_the_while_loop. |
||||
running_the_while_loop |
||||
|
||||
5> eprof:stop_profiling(). |
||||
profiling_stopped |
||||
|
||||
6> eprof:analyze(procs,[{sort,time}]). |
||||
|
||||
****** Process <0.36.0> -- 100.00 % of profiled time *** |
||||
FUNCTION CALLS % TIME [uS / CALLS] |
||||
-------- ----- --- ---- [----------] |
||||
prim_inet:enc_value_1/3 7 0.00 1 [ 0.14] |
||||
prim_inet:decode_opt_val/1 1 0.00 1 [ 1.00] |
||||
inet:setopts/2 1 0.00 2 [ 2.00] |
||||
prim_inet:setopts/2 2 0.00 2 [ 1.00] |
||||
prim_inet:enum_name/2 1 0.00 2 [ 2.00] |
||||
erlang:port_set_data/2 1 0.00 2 [ 2.00] |
||||
inet_db:register_socket/2 1 0.00 3 [ 3.00] |
||||
prim_inet:type_value_1/3 7 0.00 3 [ 0.43] |
||||
|
||||
.... snip .... |
||||
|
||||
prim_inet:type_opt_1/1 19 0.00 7 [ 0.37] |
||||
prim_inet:enc_value/3 7 0.00 7 [ 1.00] |
||||
prim_inet:enum_val/2 6 0.00 7 [ 1.17] |
||||
prim_inet:dec_opt_val/1 7 0.00 7 [ 1.00] |
||||
prim_inet:dec_value/2 6 0.00 10 [ 1.67] |
||||
prim_inet:enc_opt/1 13 0.00 12 [ 0.92] |
||||
prim_inet:type_opt/2 19 0.00 33 [ 1.74] |
||||
erlang:port_control/3 3 0.00 59 [ 19.67] |
||||
tcp_test:read_loop/1 20716370 100.00 12187488 [ 0.59] |
||||
``` |
||||
|
||||
This time our process spent almost no time at all (according to eprof, 0%) |
||||
fiddling with the socket opts. Instead it spent all of its time in the |
||||
read_loop doing the work we actually want to be doing. |
||||
|
||||
# So what does this mean? |
||||
|
||||
I'm by no means advocating never using `{active,once}`. The security concern is |
||||
still a completely valid concern and one that `{active,once}` mitigates quite |
||||
well. I'm simply pointing out that this mitigation has some fairly serious |
||||
performance implications which have the potential to bite you if you're not |
||||
careful, especially in cases where a socket is going to be receiving a large |
||||
amount of traffic. |
||||
|
||||
# Meta |
||||
|
||||
These tests were done using R15B03, but I've done similar ones in R14 and found |
||||
similar results. I have not tested R16. |
||||
|
||||
* \[0] http://learnyousomeerlang.com/buckets-of-sockets |
||||
* \[1] http://www.erlang.org/doc/man/gen_tcp.html#examples |
||||
* \[2] http://erlycoder.com/25/erlang-tcp-server-tcp-client-sockets-with-gen_tcp |
||||
|
||||
[0]: http://learnyousomeerlang.com/content |
||||
[1]: http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-1 |
||||
[2]: http://www.erlang.org/doc/man/inet.html#setopts-2 |
||||
[3]: http://www.erlang.org/doc/man/gen_tcp.html#recv-2 |
||||
[4]: http://www.erlang.org/doc/man/eprof.html |
@ -0,0 +1,95 @@ |
||||
# Generations |
||||
|
||||
A simple file distribution strategy for very large scale, high-availability |
||||
file-services. |
||||
|
||||
# The problem |
||||
|
||||
I work at a shop where we have millions of different files, any of which could |
||||
be arbitrarily chosen to serve to a client at any given time. These files are |
||||
uploaded by users of the app and retrieved by others. |
||||
|
||||
Scaling such a system is no easy task. The chosen solution involves shuffling |
||||
files around on a nearly constant basis, making sure that files which are more |
||||
"popular" are on fast drives, while at the same time making sure that no drives |
||||
are at capacity and at the same time that all files, even newly uploaded ones, |
||||
are stored redundantly. |
||||
|
||||
The problem with this solution is one of coordination. At any given moment the |
||||
app needs to be able to "find" a file so it can give the client a link to |
||||
download the file from one of the servers that it's on. Fulfilling this simple |
||||
requirement means that all datastores/caches where information about where a |
||||
file lives need to be up-to-date at all times, and even then there are |
||||
race-conditions and network failures to contend with, while at all times the |
||||
requirements of the app evolve and change. |
||||
|
||||
# A simpler solution |
||||
|
||||
Let's say you want all files which get uploaded to be replicated in triplicate |
||||
in some capacity. You buy three identical hard-disks, and put each on a separate |
||||
server. As files get uploaded by clients, each file gets put on each drive |
||||
immediately. When the drives are filled (which should be at around the same |
||||
time), you stop uploading to them. |
||||
|
||||
That was generation 0. |
||||
|
||||
You buy three more drives, and start putting all files on them instead. This is |
||||
going to be generation 1. Repeat until you run out of money. |
||||
|
||||
That's it. |
||||
|
||||
## That's it? |
||||
|
||||
It seems simple and obvious, and maybe it's the standard thing which is done, |
||||
but as far as I can tell no-one has written about it (though I'm probably not |
||||
searching for the right thing, let me know if this is the case!). |
||||
|
||||
## Advantages |
||||
|
||||
* It's so simple to implement, you could probably do it in a day if you're |
||||
starting a project from scratch |
||||
|
||||
* By definition of the scheme all files are replicated in multiple places. |
||||
|
||||
* Minimal information about where a file "is" needs to be stored. When a file is |
||||
uploaded all that's needed is to know what generation it is in, and then what |
||||
nodes/drives are in that generation. |
||||
|
||||
* Drives don't need to "know" about each other. What I mean by this is that |
||||
whatever is running as the receive point for file-uploads on each drive doesn't |
||||
have to coordinate with its siblings running on the other drives in the |
||||
generation. In fact it doesn't need to coordinate with anyone. You could |
||||
literally rsync files onto your drives if you wanted to. I would recommend using |
||||
[marlin][0] though :) |
||||
|
||||
* Scaling is easy. When you run out of space you can simply start a new |
||||
generation. If you don't like playing that close to the chest there's nothing to |
||||
say you can't have two generations active at the same time. |
||||
|
||||
* Upgrading is easy. As long as a generation is not marked-for-upload, you can |
||||
easily copy all files in the generation into a new set of bigger, badder drives, |
||||
add those drives into the generation in your code, remove the old ones, then |
||||
mark the generation as uploadable again. |
||||
|
||||
* Distribution is easy. You just copy a generation's files onto a new drive in |
||||
Europe or wherever you're getting an uptick in traffic from and you're good to |
||||
go. |
||||
|
||||
* Management is easy. It's trivial to find out how many times a file has been |
||||
replicated, or how many countries it's in, or what hardware it's being served |
||||
from (given you have easy access to information about specific drives). |
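
To make the "minimal information" point from the list above concrete, here is a small sketch of the lookup: given a file's generation, list the places it can be served from. The host names, the hard-coded table, and both function names are hypothetical; a real setup would keep the generation table wherever the app keeps its configuration.

```bash
#!/bin/sh
# Hypothetical generation table: generation number -> hosts holding a full copy.
# The host names are made up for illustration.
hosts_for_generation() {
    case "$1" in
        0) echo "gen0-a.example.com gen0-b.example.com gen0-c.example.com" ;;
        1) echo "gen1-a.example.com gen1-b.example.com gen1-c.example.com" ;;
        *) return 1 ;;  # unknown generation
    esac
}

# Print one candidate download URL per host in the file's generation.
urls_for_file() {
    gen=$1
    name=$2
    hosts=$(hosts_for_generation "$gen") || return 1
    for host in $hosts; do
        echo "http://$host/$name"
    done
}

urls_for_file 1 cat.png
```

The only per-file state in this sketch is the generation number; everything else is derived from the table.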
||||
|
||||
## Caveats |
||||
|
||||
The big caveat here is that this is just an idea. It has NOT been tested in |
||||
production. But we have enough faith in it that we're going to give it a shot at |
||||
cryptic.io. I'll keep this page updated. |
||||
|
||||
The second caveat is that this scheme does not inherently support caching. If a |
||||
file suddenly becomes super popular the world over your hard-disks might not be |
||||
able to keep up, and it's probably not feasible to have an FIO drive in *every* |
||||
generation. I think that [groupcache][1] may be the answer to this problem, |
||||
assuming your files are reasonably small, but again I haven't tested it yet. |
||||
|
||||
[0]: https://github.com/cryptic-io/marlin |
||||
[1]: https://github.com/golang/groupcache |
@ -0,0 +1,73 @@ |
||||
# Go and project root |
||||
|
||||
Compared to other languages go has some strange behavior regarding its project |
||||
root settings. If you import a library called `somelib`, go will look for a |
||||
`src/somelib` folder in all of the folders in the `$GOPATH` environment |
||||
variable. This works nicely for globally installed packages, but it makes |
||||
encapsulating a project with a specific version, or modified version, rather |
||||
tedious. Whenever you go to work on this project you'll have to add its path to |
||||
your `$GOPATH`, or add the path permanently, which could break other projects |
||||
which may use a different version of `somelib`. |
||||
|
||||
My solution is in the form of a simple script I'm calling go+. go+ will search |
||||
in the current directory and all of its parents for a file called `GOPROJROOT`. If |
||||
it finds that file in a directory, it prepends that directory's absolute path to |
||||
your `$GOPATH` and stops the search. Regardless of whether or not `GOPROJROOT` |
||||
was found, go+ will pass through all arguments to the actual go call. The |
||||
modification to `$GOPATH` will only last the duration of the call. |
||||
|
||||
As an example, consider the following: |
||||
``` |
||||
/tmp |
||||
/hello |
||||
GOPROJROOT |
||||
/src |
||||
/somelib/somelib.go |
||||
/hello.go |
||||
``` |
||||
|
||||
If `hello.go` depends on `somelib`, as long as you run go+ from `/tmp/hello` or |
||||
one of its children your project will still compile. |
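
The upward search can be tried on its own with a small sketch like the one below. It is not the go+ script itself: `find_proj_root` is a name made up for this example, it uses plain `[ -f ... ]` tests rather than `find`, and a scratch directory from `mktemp -d` stands in for `/tmp/hello`.

```bash
#!/bin/sh
# Walk up from a starting directory until a GOPROJROOT file is found.
# Prints the directory that contains it, or returns 1 once "/" has been checked.
# find_proj_root is a hypothetical helper, not part of go+.
find_proj_root() {
    dir=$(cd "$1" && pwd) || return 1
    while :; do
        if [ -f "$dir/GOPROJROOT" ]; then
            echo "$dir"
            return 0
        fi
        [ "$dir" = "/" ] && return 1
        dir=$(dirname "$dir")
    done
}

# Recreate the example layout in a scratch directory (stand-in for /tmp/hello).
demo=$(mktemp -d)
mkdir -p "$demo/src/somelib"
touch "$demo/GOPROJROOT"

# Searching from a child directory still finds the project root.
find_proj_root "$demo/src/somelib"
```

The last command prints the scratch directory, which is the path go+ would prepend to `$GOPATH` before calling `go`.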
||||
|
||||
Here is the source code for go+: |
||||
|
||||
```bash |
||||
#!/bin/sh |
||||
|
||||
SEARCHING_FOR=GOPROJROOT |
||||
ORIG_DIR=$(pwd) |
||||
|
||||
STOPSEARCH=0 |
||||
SEARCH_DIR=$ORIG_DIR |
||||
while [ $STOPSEARCH = 0 ]; do |
||||
|
||||
RES=$( find $SEARCH_DIR -maxdepth 1 -type f -name $SEARCHING_FOR | \ |
||||
grep -P "$SEARCHING_FOR$" | \ |
||||
head -n1 ) |
||||
|
||||
if [ "$RES" = "" ]; then |
||||
if [ "$SEARCH_DIR" = "/" ]; then |
||||
STOPSEARCH=1 |
||||
fi |
||||
cd .. |
||||
SEARCH_DIR=$(pwd) |
||||
else |
||||
export GOPATH=$SEARCH_DIR:$GOPATH |
||||
STOPSEARCH=1 |
||||
fi |
||||
done |
||||
|
||||
cd "$ORIG_DIR" |
||||
exec go "$@" |
||||
``` |
||||
|
||||
# UPDATE: Goat |
||||
|
||||
I'm leaving this post for posterity, but go+ has some serious flaws in it. For |
||||
one, it doesn't allow for specifying the version of a dependency you want to |
||||
use. To this end, I wrote [goat][0] which does all the things go+ does, plus |
||||
real dependency management, PLUS it is built in a way that if you've been |
||||
following go's best-practices for code organization you shouldn't have to change |
||||
any of your existing code AT ALL. It's cool, check it out. |
||||
|
||||
[0]: http://github.com/mediocregopher/goat |
Binary file not shown.
@ -1,2 +0,0 @@ |
||||
_site/ |
||||
.DS_Store |
@ -0,0 +1,27 @@ |
||||
#!/bin/sh |
||||
|
||||
SEARCHING_FOR=GOPROJROOT |
||||
ORIG_DIR=$(pwd) |
||||
|
||||
STOPSEARCH=0 |
||||
SEARCH_DIR=$ORIG_DIR |
||||
while [ $STOPSEARCH = 0 ]; do |
||||
|
||||
RES=$( find $SEARCH_DIR -maxdepth 1 -type f -name $SEARCHING_FOR | \ |
||||
grep -P "$SEARCHING_FOR$" | \ |
||||
head -n1 ) |
||||
|
||||
if [ "$RES" = "" ]; then |
||||
if [ "$SEARCH_DIR" = "/" ]; then |
||||
STOPSEARCH=1 |
||||
fi |
||||
cd .. |
||||
SEARCH_DIR=$(pwd) |
||||
else |
||||
export GOPATH=$SEARCH_DIR:$GOPATH |
||||
STOPSEARCH=1 |
||||
fi |
||||
done |
||||
|
||||
cd "$ORIG_DIR" |
||||
exec go "$@" |