Data analysis extension for PostgreSQL

I was trying to build an in-database recommendation system using collaborative filtering, and PostgreSQL was appealing because of its support for array types. But I quickly found myself in need of even basic linear algebra functions. I only needed summation (both in-line and aggregate), scalar multiplication, and the dot product. I wrote these in PL/Python just to see whether my concept was working (it was!), but, as you can guess, it was quite slow.
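
For reference, here is a minimal sketch of those operations as plain Python functions, roughly what the body of a PL/Python function could contain. The function names are hypothetical, and I am reading "in-line" as adding two arrays element by element and "aggregate" as summing many arrays into one; this is not the original code.

```python
def vec_add(a, b):
    """In-line sum: element-wise addition of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x + y for x, y in zip(a, b)]


def vec_sum(vectors):
    """Aggregate sum: element-wise total over any number of vectors."""
    total = None
    for v in vectors:
        # Start from a copy of the first vector, then keep adding.
        total = list(v) if total is None else vec_add(total, v)
    return total


def scalar_mul(k, a):
    """Multiply every element of vector a by the scalar k."""
    return [k * x for x in a]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))
```

Pure-Python loops like these run element by element in the interpreter, which is a plausible reason for the slowness mentioned above.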

A quick search revealed MADlib, an extension that can do a lot more than basic linear algebra. It also does descriptive and inferential statistics, linear and logistic regression, k-means clustering and a lot more.

You can check the code on GitHub, and there is an RPM binary package for CentOS. (I work on Arch Linux, so I just needed to extract the package with rpmextract and then copy it to my root.) After installation, look for the bin/madpack binary, which deploys MADlib to your database.
published: 97 months ago. tags: data-analysis, machine-learning, postgresql, statistics

Some performance tips when dealing with data in Python

One of the main obstacles to Python dominating the machine learning / data mining field is probably the talk of it not being efficient enough. There are, however, ways of achieving better performance if you're careful enough. Below are some excellent suggestions. Some of them I have personally tried (and learned the hard way), such as using namedtuples instead of classes, or parsing CSV by hand with int/float instead of using the csv parser from the standard library, as well as some of NumPy's more obscure routines for searching in arrays. Also, once you get used to profiling, you easily become addicted.

Expensive lessons in Python performance tuning
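
To make a few of these tips concrete, here is a small sketch using only the standard library. The `Rating` record, the sample data, and the `parse_ratings` helper are hypothetical names made up for illustration. The hand-rolled parser assumes plain comma-separated values with no quoting, and any real speedup should be confirmed by profiling your own data.

```python
import cProfile
import io
import pstats
from collections import namedtuple

# Hypothetical record type: a namedtuple instead of a full class.
Rating = namedtuple("Rating", ["user_id", "item_id", "score"])

# Hypothetical sample data: user id, item id, score.
RAW = "1,10,4.5\n1,11,3.0\n2,10,5.0\n"


def parse_ratings(text):
    """Parse simple CSV lines by hand with int/float.

    Assumes no quoted fields or embedded commas; for such input
    the csv module is the safer choice.
    """
    rows = []
    for line in text.splitlines():
        user_id, item_id, score = line.split(",")
        rows.append(Rating(int(user_id), int(item_id), float(score)))
    return rows


ratings = parse_ratings(RAW)

# Profile the parser on a larger input and capture the report as text.
profiler = cProfile.Profile()
profiler.runcall(parse_ratings, RAW * 1000)
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

Printing `report` shows where the time goes, which is usually the first thing to look at before changing any code.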
published: 109 months ago. tags: data-mining, machine-learning, optimization, pandas, python, scipy

machine learning notes