Archive for the ‘statistics’ Category


August 10, 2007

quite possibly the most effective weapon ever developed during the great blog wars (2003-….).

outliers – is it the data or the theory?

July 19, 2007

During my brief(?) stint at the Treasury, my coworkers and I frequently excluded outliers from data because their position on the scatterplot didn’t mesh with our understanding of community finance.  We simply assumed the data was invalid in some way.  It certainly made for a *slightly* more coherent understanding of an exceptionally chaotic set of data.

what you really shouldn’t do—especially when the cases are in other respects quite similar, such as all being functioning, rich capitalist democracies—is label entire countries as “outliers” in order to remove them from your analysis, and then pretend that this has made them disappear from the face of the earth, too.

[Outliers, at crooked timber]

The problem, to me, occurs where the statistical trend between a limited number of variables is completely insufficient to examine a data set whose causal picture is, in fact, incredibly complex.  Answering the question recently discussed, whether more taxes (counterintuitively) produce less revenue, is much like predicting weather changes based on the activity of the butterflies in my backyard.  No doubt there is an effect; good luck creating the regression.
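To make the worry concrete, here is a rough sketch of how much a single excluded point can move a fitted line. The numbers are invented for illustration (they are not Treasury data), and `fit_line` is a hypothetical helper doing plain least squares.

```python
def fit_line(points):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# made-up data: five points on y = 2x, plus one point that "doesn't mesh"
data = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10), (6, 0)]

_, slope_all = fit_line(data)            # fit with every point kept
_, slope_trimmed = fit_line(data[:-1])   # fit after dropping the last point

print(slope_all, slope_trimmed)
```

Dropping the point makes the fit tidy, but the tidy slope describes the five points that agreed with the theory, not the six that were observed.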

9-11, Derrida, and the West Wing

August 10, 2006

some 30% of americans cannot say “in what year the September 11, 2001 terrorist attacks against New York’s World Trade Center and the Pentagon in Washington took place”.

what is fascinating to me is that September 11th is intimately associated with time, though never year. from philosophy in a time of terror:

“Something” took place, we have the feeling of not having seen it coming, and certain consequences undeniably follow upon the “thing.” But this very thing, the place and meaning of this “event,” remains ineffable, like an intuition without concept, like a unicity with no generality on the horizon or with no horizon at all, out of range for a language that admits its powerlessness and so is reduced to pronouncing mechanically a date, repeating it endlessly, as a kind of ritual incantation, a conjuring poem, a journalistic litany or rhetorical refrain that admits to not knowing what it’s talking about. We do not in fact know what we are saying or naming in this way: September 11, le 11 septembre, September 11. The brevity of the appellation (September 11, 9/11) stems not only from an economic or rhetorical necessity. The telegram of this metonymy—a name, a number—points out the unqualifiable by recognizing that we do not recognize or even cognize that we do not yet know how to qualify, that we do not know what we are talking about.

the event was unspeakable, literally. so it was associated with a time. so, if they’ve forgotten the year (and, as with most polls that show this, i often wonder at the question) – what impression did the event leave? is it eternally recurring? a floating point in the past?

even now, nearly 5 years later, we have no other word (beyond time). you’ll notice the great lengths the AP went to, without recourse to just “9/11”.


census estimates

July 24, 2006

on july 21, 2006 dc got 31,528 new people.  well, at least if you’re the US census.  the methodology used?  the washington post article notes

the city submitted building permits from 1999 to 2005, Phillips said, in addition to school enrollment figures showing that public charter schools had absorbed much of a decline in the number of students attending public schools. The city also submitted data showing that the number of people filing taxes in the District had remained steady between 2004 and 2005 and that Pepco was serving an increasing number of residential units.

but you’d think that if this was done in DC, it’d be standard practice for all cities.  i’m actually kinda surprised the census didn’t incorporate these figures already.  but the .pdf i could find didn’t mention anything.

the more i look at these figures, the less i trust them.  lies, damn lies,…

more statistics!

July 24, 2006

statNotes: topics in multivariate analysis

statistical resources at the UMichigan

statistics jokes!

county and city data tables
oh my, i love coupling – “i’m not that easy, but show me a muscular blond who can control the weather and this girl is on all fours”

poverty & the middle class

July 24, 2006

the la times has a piece on the growing gap between rich and poor, and the shrinking neighborhoods of middle-class families.

“The retailers in the two neighborhoods are very different,”… “It’s the difference between a Whole Foods and a corner grocer, or Citibank and the local check casher. They’re not competing, and in the end, you have higher prices for all basic goods and services.”

urban centers are becoming the domain of the richest and the poorest, as the middle class is increasingly moved to the suburbs.  this limits the upward mobility of the poor, cutting off the chance to move into a better neighborhood, go to better schools, or maintain social contacts.

you can bet i’ll be reading the brookings report later on.

[update:  a rich discussion over at feministe]

textMap, so so cool – but how does it work?

July 22, 2006

i am the absolute worst when it comes to methodologies and titles (titles, you probably guessed from these posts).  but it is becoming increasingly apparent that these are at the core of great statistics / information display / research.  take textmap, an engine to analyze the geographic and temporal distribution of news.  it is really quite cool, and something i’ve wanted to do for a while, but it always seemed like there were too many problems to be overcome before the idea became workable. so i was psyched to find the site.

-but a problem-

playing with the ‘function of location’ charts has me worried.  montana has relatively few news sources, and therefore never shows a strong reading.  the east coast, however – particularly the metropolitan corridor – is a sea of red (more news sources in the area).  so, there is variation in both areas, but it isn’t entirely clear what the map is measuring, because comparing across regions is no longer intuitive.  i couldn’t find the methodology on the site (boo!) – and so it isn’t clear what intensity of red indicates.
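one way the regions could be made comparable, as a sketch: divide mentions by the number of sources in each region. the counts below are invented, and i have no idea whether textmap does anything like this.

```python
# hypothetical counts, made up for illustration (not textmap data)
regions = {
    "montana": {"mentions": 12, "sources": 4},
    "east coast corridor": {"mentions": 300, "sources": 150},
}

def mentions_per_source(region):
    """Raw mentions divided by the number of news sources in the region."""
    return region["mentions"] / region["sources"]

rates = {name: mentions_per_source(r) for name, r in regions.items()}

for name, rate in rates.items():
    print(name, rate)
```

with these made-up numbers the corridor wins on raw mentions, but montana comes out ahead per source. which of the two the red is showing is exactly what the missing methodology would settle.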

this isn’t to say the site isn’t worthwhile – the mexico map shows an intuitive trend

Mexico TextMap

but i have to wonder if this is an artifact of paper coverage – why the band between n. florida and s. georgia? – and wonder about coverage in relation to associated thoughts.  (what is the unit of analysis, btw – census tract?)

there is also the old baseline problem: what is the ‘noise’ associated with a given concept (background usage not associated with events)? – and what is the median frequency of ‘related’ terms – it’s cool if mexico usage went up, but whether that was a function of world cup news or a function of immigration news makes a big difference.
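a minimal sketch of the baseline idea, assuming you had weekly mention counts for a term: measure the current week against the mean and spread of the quiet weeks. the counts and the `spike_score` name are made up, and this is just one possible definition of ‘noise’.

```python
from statistics import mean, stdev

def spike_score(baseline_counts, current_count):
    """Baseline standard deviations between the current count and the baseline mean."""
    return (current_count - mean(baseline_counts)) / stdev(baseline_counts)

# invented weekly counts of 'mexico' mentions during quiet weeks
weekly_mexico_mentions = [40, 44, 38, 42, 36]

score = spike_score(weekly_mexico_mentions, 70)
print(round(score, 2))
```

this only says that usage jumped above the background. it does not say whether the jump was world cup or immigration, which would need the same comparison for each ‘related’ term.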

hm… actually, with this data set, you could probably look at news conglomeration::variety of media sources, if the answers to above were clear… ooh, shiny.

data makes me do the happy dance

July 21, 2006
datamining has an interactive map of the blogosphere. the map layout is a “variant of the force layout approach to graph layout. There certainly is meaning to the location of nodes in the image: proximity indicates a tendancy for mutual citation.” meaning: the map is more than just a pretty face. the place of nodes has actual social meaning.

but this is even more sexy, as a suggestion:

Time stability is an interesting problem. One way to do this is to fix nodes in location (or certain nodes). Alternatively, you could allow nodes to become more lethargic in movement according to how long they have been there. This seems like a good idea. Are you going for some form of animated representation?
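the ‘lethargic’ suggestion could look something like this: one update step where a node’s movement shrinks with its age on the map. the 1/(1+age) decay, the step size, and the `damped_step` name are all my assumptions; the comment doesn’t say how the slowdown should work.

```python
def damped_step(position, force, age, step=0.1):
    """Move a node along its net force, scaled down by how long it has been on the map."""
    mobility = 1.0 / (1.0 + age)  # assumed decay curve, not from the source
    x, y = position
    fx, fy = force
    return (x + step * mobility * fx, y + step * mobility * fy)

# same force on a brand-new node and on one that has sat still for 9 updates
new_node = damped_step((0.0, 0.0), (1.0, 0.0), age=0)
old_node = damped_step((0.0, 0.0), (1.0, 0.0), age=9)

print(new_node, old_node)
```

old nodes barely budge, so the map keeps its shape between snapshots while newcomers still settle into place.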

dangit, where is my programming computer when i need it!

[update 1]: ok, i heart datamining. this visualization method is pretty darn inspiring, and pretty straightforward to understand (compared to other methods i’ve read)

we start by giving some amount of money to some user (initiator) in LJ network telling him to evenly distribute it among his friends, then his friends are performing the same action among their friends and so on. Obviously, if these guys are the members of some clique it will not take too long until all of them have an equal amount of money (thanks to small-world property), meanwhile only some small part of the initial amount will leave this community. So the amount of money of a particular user defines his thermodynamic distance from the initiator. If we have two initiators – we can plot the figure like the one shown here.
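here is how i read that procedure, as a sketch on a toy friend graph. the graph is invented, `spread_money` is a hypothetical name, and i am assuming each user hands over everything they hold every round (the description doesn’t say whether they keep a share).

```python
def spread_money(friends, initiator, rounds, amount=1.0):
    """Each round, every user splits all the money they hold evenly among their friends."""
    money = {user: 0.0 for user in friends}
    money[initiator] = amount
    for _ in range(rounds):
        after = {user: 0.0 for user in friends}
        for user, cash in money.items():
            if not friends[user]:
                after[user] += cash  # nobody to give to, so the money stays put
                continue
            share = cash / len(friends[user])
            for friend in friends[user]:
                after[friend] += share
        money = after
    return money

# toy network: a, b, c form a clique; d hangs off c by a single link
friends = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

money = spread_money(friends, initiator="a", rounds=3)
print(money)
```

most of the money stays circulating inside the clique and only a small part reaches d, which is the ‘distance from the initiator’ idea in miniature.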

expect more updates as i read through the whole archives this weekend.

[update 2]: don’t run too far through the links. i accidentally made it to ‘linked’, a book that makes me angry. hulk angry


June 29, 2006

via crooked timber and jim gibbon, i stumbled into gapminder, a neat data-visualization package available online (alternate link). so so cool, and not just for us geeks.

while i’m at it…

neat link on the monte carlo method
we all use it, i just need to store links: mathworld
possibly going on my sidebar: social science stat blog. too cool
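and since i brought up the monte carlo method, the standard toy example: estimate pi by throwing random points at the unit square and counting how many land inside the quarter circle. the sample size and seed are arbitrary choices of mine.

```python
import random

def estimate_pi(samples, seed=0):
    """Estimate pi from the fraction of random points that fall inside the quarter circle."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

approx = estimate_pi(100_000)
print(approx)
```

more samples tighten the estimate, but slowly: the error shrinks roughly with the square root of the sample count.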