?

Log in

No account? Create an account

Previous Entry Share Next Entry
Querying nomadic data objects with LINQ
tedstuartav

I 'm employing US census data
to consult the gauged populations of the metropolises and towns running elmcity
hubs. The dataset is only apparent old CSV ( comma-separated variable ), a formatting that Holds more popular than ever thanks in constituent to a new wave of web-based data services like DabbleDB
ManyEyes
, and others.


For my designs, simple shape matching was plenty to consult the population of a metropolis and province. But I 'd been intending to attempt out LINQtoCSV
, the Cyberspace equivalent of my old friend, Python 's csv
module. As happens lately
, I was struck by the convergence of the languages. Here Holds a adjacent comparing of Python and C # applying their respective CSV faculties to question for the population of Keene, NH:


Python C #
 
 
i_name = 5
i_statename = 6
 
i_pop2008 = 17
 
 
handle = urllib.urlopen(url)
 
 
 
 
 
 
 
 
reader = csv.reader(
  handle, delimiter=',')
 
 
rows = itertools.ifilter(lambda x :
  x[i_name].startswith('Keene') and
  x[i_statename] == 'New Hampshire',
    reader)
 
found_rows = list(rows)
 
 
 
count = len(found_rows)
 
if ( count > 0 ):
  pop = int(found_rows[0][i_pop2008])

public class USCensusPopulationData
  {
  public string NAME;
  public string STATENAME;
  ... etc. ...
  public string POP_2008;
  }
 
var csv = new WebClient().
  DownloadString(url);
 
var stream = new MemoryStream(
  Encoding.UTF8.GetBytes(csv));
var sr = new StreamReader(stream);
var cc = new CsvContext();
var fd = new CsvFileDescription { };
 
var reader =
  cc.Read<USCensusPopulationData>(sr, fd);
 
 
var rows = reader.ToList();
 
 
 
 
var found_rows = rows.FindAll(row =>
  row.name.StartsWith('Keene') &&
  row.statename == 'New Hampshire');
 
var count = rows.Count;
 
if ( count > 0 )
  pop = Convert.ToInt32(
    found_rows[0].pop_2008)


Things make n't line upwardly rather equally neatly as in my earlier representative, or as in the A/B comparing ( since a long time ago in 2005 ) between my first LINQ example
and SAM Ruby 's Ruby equivalent
But the two illustrations share a common attack based on iterators and filters.


This thought of running inquiries over simple document is something I foremost ran into lang syne in the descriptor of the ODBC Text driver, which supplies SQL questions over comma-separated data. I 've e'er loved this mode of data admission, and it stays improbably handy. Yes, some data sets are immense. But the 80, 000 rows of that nosecount file add upward to simply 8MB. The file is n't turning quickly, and it can say a good deal of narratives. Here Holds one:


2000 - 2008 population loss in NH

-8.09% Berlin city
-3.67% Coos County
-1.85% Portsmouth city
-1.85% Plaistow town
-1.78% Balance of Coos County
-1.43% Claremont city
-1.02% Lancaster town
-0.99% Rye town
-0.81% Keene city
-0.23% Nashua city

In both Python and C # you can work straightly with the iterators returned by the CSV faculties to action this kinda question. Here Holds a Python version:


import urllib, itertools, csv

i_name = 5
i_statename = 6
i_pop2000 = 9
i_pop2008 = 17

def make_reader():
  handle = open('pop.csv')
  return csv.reader(handle, delimiter=',')

def unique(rows):
  dict = {}
  for row in rows:
    key = "%s %s %s %s" % (i_name, i_statename,
      row[i_pop2000], row[i_pop2008])
    dict[key] = row
  list = []
  for key in dict:
    list.append( dict[key] )
  return list

def percent(row,a,b):
  pct = - (  float(row[a]) / float(row[b]) * 100 - 100 )
  return pct

def change(x,state,minpop=1):
  statename = x[i_statename]
  p2000 = int(x[i_pop2000])
  p2008 = int(x[i_pop2008])
  return (  statename==state and
            p2008 > minpop   and
            p2008 < p2000 )

state = New Hampshire

reader = make_reader()
reader.next() # skip fieldnames

rows = itertools.ifilter(lambda x :
  change(x,state,minpop=3000), reader)

l = list(rows)
l = unique(l)
l.sort(lambda x,y: cmp(percent(x,i_pop2000,i_pop2008),
  percent(y,i_pop2000,i_pop2008)))

for row in l:
  print "%2.2f%% %s" % (
       percent(row,i_pop2000,i_pop2008),
       row[i_name] )

A real C # version could make notwithstanding things in the same shipways: Convert the iterator into a listing, utilise a lexicon to take duplicate, filtrate the listing with a lambda mapping, sieved the listing with another lambda mapping.


As questions turn more complex, though, you lean to desire a more declaratory way. To make that in Python, you 'd likely import the CSV file into a SQL database possibly SQLite in order to rest true to the lightweight nature of this illustration. So you 'd send enquiries to the database in the descriptor of SQL statements. But you 're covering a chasm when you make that. The database 's type system is n't the same as Python 's. And database 's internal language for inditing mappings wo n't be Python either. In the example of SQLite, there wo n't even be an internal language.


With LINQ there Holds no chasm to traverse. Here Holds the LINQ codification that produces the same outcome:


var census_rows = make_reader();

var distinct_rows = census_rows.Distinct(new CensusRowComparer());

var threshold = 3000;

var rows =
  from row in distinct_rows
  where row.STATENAME == statename
      && Convert.ToInt32(row.POP_2008) > threshold
      && Convert.ToInt32(row.POP_2008) < Convert.ToInt32(row.POP_2000)
  orderby percent(row.POP_2000,row.POP_2008)
  select new
    {
    name = row.NAME,
    pop2000 = row.POP_2000,
    pop2008 = row.POP_2008
    };

 foreach (var row in rows)
   Console.WriteLine("{0:0.00}% {1}",
     percent(row.pop2000,row.pop2008), row.name );

You can see the supporting pieces below. There are a figure of facets to this attack that I 'm basking. It Holds utile, for instance, that every row of information gets an object whose holdings are available to the editor and the debugger. But what verily pleases me is the fashion that the inquiry context and the issues context part the same environment, even as in the Python illustration above. Therein ( slightly planned ) illustration I 'm employing the pct mapping in both contexts.


With LINQ to CSV I 'm now applying four flavors of LINQ in my undertaking. Two are constructed into the Cyberspace Model: LINQ to XML, and LINQ to native Net objects. And two are extensions: LINQ to CSV, and LINQ to JSON
All told four instances, I 'm querying some rather roving data object: an RSS provender, a binary Internet object regained from the Cerulean blob shop, a JSON response, and now a CSV file.


Six geezerhood ago I was component of a delegacy from InfoWorld that seed Microsoft for a trailer of engineerings in the grapevine. At a dinner I sat with Anders Hejslberg and listened to him lay out his vision for what would get LINQ. There were two primal ends. Foremost, a individual environment for question and issues. 2nd, a common attack to many flavors of data.


I believe he nailed both passably goodly. And it Holds timely because the cloud is n't but an ecosystem of services, it Holds too an ecosystem of wandering data objects that come in a assortment of flavors.


Related posts:
Nilulutong Cha-cha-cha: ANG Ibidinsiya!

Ignite Poets: Two Countries, One Fire, Oct 7 2009 @ Confederation

Chesterton Turns 135 on May 29