ODI: intro to data science notes

  • Discover open data science
  • Machine learning and classification
  • Visualisation and communication

Course – http://training.theodi.org/ods/#/id/co-0

Key Skills

  • Venn Diagram (DS skills)
  • Ethics – de-anonymisation?
  • IBM advert – What are the values?
  • Proxy measure – TfL – load on axle
  • Chapter impact – Food Agency – view browsing history

“Part Analyst, part artist” – Anjul Bhambhri (VP of big data at IBM)



Data Science skills are in demand

8 key areas

  1. Big Data (80-90% mention for DS jobs)
  2. Data Collection and Analysis
  3. Machine Learning prediction
  4. Maths and Statistics
  5. Interpretation and Visualisation
  6. Advanced computing and programming
  7. Business Intelligence and Domain Expertise
  8. Open Source Tools and Concepts & Open Innovation

Big Data

Filtering and processing (6 million rows of data in 5 minutes https://socrata.com/ )

Pivot tables on big data in the cloud : https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2/data
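
To make the filtering and pivot-table idea concrete, here is a minimal local sketch in Python. The rows are made up for illustration and are not taken from the Chicago dataset; `pivot_counts` and `filter_rows` are hypothetical helper names, not part of Socrata or any portal API.

```python
from collections import Counter

# Hypothetical rows shaped like a crime log: (year, primary type).
# The values are invented for illustration only.
rows = [
    (2014, "THEFT"),
    (2014, "BATTERY"),
    (2014, "THEFT"),
    (2015, "THEFT"),
    (2015, "BATTERY"),
    (2015, "BATTERY"),
]


def filter_rows(records, year):
    """Keep only the records for one year."""
    return [rec for rec in records if rec[0] == year]


def pivot_counts(records):
    """Count records per (row key, column key) pair, like a pivot table."""
    counts = Counter(records)
    row_keys = sorted({r for r, _ in counts})
    col_keys = sorted({c for _, c in counts})
    # Nested dict: table[row][column] -> count (0 when the pair is absent).
    return {r: {c: counts.get((r, c), 0) for c in col_keys} for r in row_keys}


table = pivot_counts(rows)
for year, by_type in table.items():
    print(year, by_type)
```

The same shape (filter, then group and count) is what a hosted pivot table does, only at a much larger scale.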

Commodity scalable cloud computing is ready to be used. Build and test small, then scale on demand. Amazon EC2, Heroku and Cloudflare are all great examples. Socrata and Tableau are others.

Explain and use “data on the web” and the “web of data”

Finding data

Data.gov.XX – Advanced search – site: link: related: filetype:

File types

  • XLS – tabular
  • CSV – tabular
  • TSV – tabular
  • XML – hierarchical
  • JSON – hierarchical
  • YAML
  • PDF

Tabular – table form

Hierarchical – one way relation

Network – social – multiple directions
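
A short sketch of the three data shapes, using only the Python standard library. The sample station and name values are invented for illustration, and the adjacency map is just one possible way to hold network data.

```python
import csv
import io
import json

# Tabular: every record has the same columns (sample data is made up).
csv_text = "station,zone\nBank,1\nStratford,3\n"
table = list(csv.DictReader(io.StringIO(csv_text)))

# Hierarchical: parents contain children, a one-way relation.
json_text = (
    '{"line": "Central", "stations": ['
    '{"name": "Bank", "zone": 1}, {"name": "Stratford", "zone": 3}]}'
)
tree = json.loads(json_text)

# Network: links can run in several directions, so keep an adjacency map.
edges = [("Ann", "Bob"), ("Bob", "Cat"), ("Cat", "Ann")]
network = {}
for a, b in edges:
    network.setdefault(a, set()).add(b)
    network.setdefault(b, set()).add(a)

print(table[0]["station"])           # a cell looked up by row and column
print(tree["stations"][1]["name"])   # a value reached by walking down the tree
print(sorted(network["Bob"]))        # neighbours reached in any direction
```

Note that CSV values arrive as strings, while JSON keeps numbers as numbers.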


Portal Aggregators

Transport API – http://www.transportapi.com/

Enigma.io – google for open data



PDFTABLES.COM https://pdftables.com/pricing – free for only 50 pages

magic.import.io – hosted version can handle sites that need a credentialed log-in. Terms of use – for research or reporting


Elasticsearch

Document and data

Webpages http://bbc.co.uk/news/

RSS feed – reduced due to advertising revenue http://feeds.bbci.co.uk/news/rss.xml
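
An RSS feed is the data view of a web page: the same headlines, but as XML that a program can read. The sketch below parses a hand-written RSS 2.0 fragment with the Python standard library. The feed content and the `read_items` helper are invented for illustration; a real feed would be downloaded first and may carry more fields.

```python
import xml.etree.ElementTree as ET

# A hand-written RSS 2.0 fragment; the items are invented for illustration.
rss_text = """<rss version="2.0">
  <channel>
    <title>Example news</title>
    <item>
      <title>First headline</title>
      <link>http://example.org/news/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.org/news/2</link>
    </item>
  </channel>
</rss>
"""


def read_items(text):
    """Return the title and link of every <item> in an RSS document."""
    root = ET.fromstring(text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]


items = read_items(rss_text)
for entry in items:
    print(entry["title"], entry["link"])
```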



  • ISBN
  • Postcode
  • MAC address

Cool identifier

Instead of html

Data browsers

Add extension .xml

RDF browser – Q&D RDF browser



View using Postman



Doc – html website

Data – building data

Query – building


5-star http://5stardata.info/

  • Open license
  • Readable
  • Open format

Machine learning and Prediction

  • Decision Tree – Decision tree for audio guide
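
A minimal sketch of how a decision tree reaches an answer: start at the root, answer one question per node, and stop at a leaf. The questions and outcomes below are hypothetical and are not the audio guide tree from the course; the tree is written by hand here, whereas a machine learning method would learn it from examples.

```python
# Hypothetical tree: the questions and outcomes are invented for illustration.
tree = {
    "question": "first_visit",
    "yes": {
        "question": "has_headphones",
        "yes": "offer app guide",
        "no": "offer handset guide",
    },
    "no": "no guide",
}


def classify(node, visitor):
    """Walk from the root to a leaf, following the answer to each question."""
    while isinstance(node, dict):
        answer = visitor[node["question"]]
        node = node["yes"] if answer else node["no"]
    return node


print(classify(tree, {"first_visit": True, "has_headphones": False}))
print(classify(tree, {"first_visit": False, "has_headphones": True}))
```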

Types of ML

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning


  • Clustering
  • Regression
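
A small sketch of the two techniques in plain Python, assuming toy numbers chosen for illustration. Regression is shown as an ordinary least-squares line fit (supervised: each x comes with a known y). Clustering is shown as a very small one-dimensional k-means (unsupervised: no labels are given). The function names are hypothetical, and a real project would more likely use a library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def kmeans_1d(values, centres, rounds=10):
    """Very small k-means on single numbers, starting from given centres."""
    centres = list(centres)
    for _ in range(rounds):
        groups = [[] for _ in centres]
        for v in values:
            # Assign each value to its nearest centre.
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


# Regression: toy points that lie on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)

# Clustering: two obvious groups of toy values, with two starting centres.
centres, groups = kmeans_1d([1, 2, 3, 10, 11, 12], centres=[0, 5])
print(centres, groups)
```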


D3 – Visualisation, Crossfilter


CodePen online – experiment