Graph analysis with NetworkX

This notebook introduces basic graph analysis with NetworkX, a Python package for creating, manipulating, and studying the structure, dynamics, and functions of complex networks. It uses examples from the NetworkX documentation as well as datasets built into Databricks.

NetworkX is preinstalled in Databricks Runtime for Machine Learning.

What is graph analysis?

Graphs excel at representing and analyzing relationships between entities. They are useful for understanding and analyzing problems in many domains, such as transportation, logistics, communications networks, product recommendations, and social networks.

In a graph, the nodes or vertices represent entities that are linked in some way, and edges represent the links, or relationships, between entities. This notebook includes several examples.

Requirements

This notebook requires Databricks Runtime for Machine Learning.

Citation

  • NetworkX. Aric A. Hagberg, Daniel A. Schult and Pieter J. Swart, “Exploring network structure, dynamics, and function using NetworkX”, in Proceedings of the 7th Python in Science Conference (SciPy2008), Gaël Varoquaux, Travis Vaught, and Jarrod Millman (Eds), (Pasadena, CA USA), pp. 11–15, Aug 2008.

Part 1. Basic examples

This section demonstrates basic graph properties using graphs built into NetworkX. The first example is a simulated internet graph. The nodes represent hosts (computers or other devices), and the edges represent links between the hosts.

Create the network
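The original code cell is not preserved in this export. A minimal sketch of how such a graph could be created, assuming the NetworkX generator `random_internet_as_graph` was used: the node count of 50 is inferred from the outputs later in this section (nodes 0 through 49), and the seed is an arbitrary assumption, so results will differ from the outputs shown in this notebook.

```python
import networkx as nx

# n=50 is inferred from the outputs in this notebook (nodes 0-49).
# seed=42 is an assumption, used only to make this sketch reproducible.
G = nx.random_internet_as_graph(50, seed=42)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```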

    Draw the network using NetworkX and Matplotlib
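One possible way to draw the graph, sketched under the assumption that a force-directed layout is acceptable. The layout, figure size, and styling values are illustrative choices, not taken from the original cell.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Recreate the example graph (seed is an assumption, see above).
G = nx.random_internet_as_graph(50, seed=42)

# Compute a force-directed layout once, with a fixed seed for reproducibility.
pos = nx.spring_layout(G, seed=42)

# Draw nodes, edges, and node labels onto an explicit Matplotlib axis.
fig, ax = plt.subplots(figsize=(8, 6))
nx.draw_networkx(G, pos=pos, ax=ax, node_size=200, font_size=8)
ax.set_axis_off()
```

In a notebook the figure renders automatically at the end of the cell; in a script, call `plt.show()` or `fig.savefig(...)`.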

    Some basic network analysis

    The degree of a node in an undirected network is the number of edges connected to that node. The following command displays the degree of each node in the network.
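A sketch of a command that produces a listing in the same shape as the output below: a list of (node, degree) pairs sorted from highest to lowest degree. Because the seed here is an assumption, the actual numbers will not match the output shown.

```python
import networkx as nx

G = nx.random_internet_as_graph(50, seed=42)  # seed is an assumption

# G.degree iterates over (node, degree) pairs; sort by degree, descending.
degrees = sorted(G.degree, key=lambda pair: pair[1], reverse=True)
print(degrees)
```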

      [(7, 14), (8, 14), (6, 13), (2, 11), (3, 10), (9, 8), (5, 7), (11, 7), (1, 6), (4, 6), (0, 5), (10, 5), (12, 2), (13, 2), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1)]

      Neighbors of a node

      The all_neighbors function returns an iterator over every node that is directly linked to the specified node. The output below lists the neighbors of node 6.
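A minimal sketch of the call, again on a graph generated with an assumed seed, so the neighbor list will differ from the one shown here.

```python
import networkx as nx

G = nx.random_internet_as_graph(50, seed=42)  # seed is an assumption

# all_neighbors returns an iterator, so wrap it in list() to display it.
neighbors_of_6 = list(nx.all_neighbors(G, 6))
print(neighbors_of_6)
```

For an undirected graph this is the same set of nodes as `G.neighbors(6)`; for a directed graph, `all_neighbors` includes both predecessors and successors.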

        [3, 2, 22, 24, 25, 30, 39, 40, 49, 4, 5, 9, 8]

        Radius, diameter, eccentricity and other measures

        See the Databricks documentation (AWS | Azure | GCP) for details of the following properties.
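These properties can be computed with the corresponding NetworkX functions. The sketch below is an illustration rather than the original cell. It assumes a seed, and it defensively falls back to the largest connected component, because the distance-based measures are only defined on a connected graph.

```python
import networkx as nx

G = nx.random_internet_as_graph(50, seed=42)  # seed is an assumption

# Distance measures need a connected graph; use the largest component if not.
if not nx.is_connected(G):
    G = G.subgraph(max(nx.connected_components(G), key=len)).copy()

ecc = nx.eccentricity(G)          # max distance from each node to any other
print("radius:", nx.radius(G))    # smallest eccentricity
print("diameter:", nx.diameter(G))  # largest eccentricity
print("eccentricity:", ecc)
print("center:", nx.center(G))        # nodes whose eccentricity equals the radius
print("periphery:", nx.periphery(G))  # nodes whose eccentricity equals the diameter
print("density:", nx.density(G))      # fraction of possible edges that exist
```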

        radius: 3
        diameter: 4
        eccentricity: {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3, 10: 3, 11: 3, 12: 4, 13: 4, 14: 4, 15: 4, 16: 4, 17: 4, 18: 4, 19: 4, 20: 4, 21: 4, 22: 4, 23: 4, 24: 4, 25: 4, 26: 4, 27: 4, 28: 4, 29: 4, 30: 4, 31: 4, 32: 4, 33: 4, 34: 4, 35: 4, 36: 4, 37: 4, 38: 4, 39: 4, 40: 4, 41: 4, 42: 4, 43: 4, 44: 4, 45: 4, 46: 4, 47: 4, 48: 4, 49: 4}
        center: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
        periphery: [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
        density: 0.05959183673469388

        Shortest path

        The shortest path algorithm finds the shortest path between two nodes. The result shows the nodes along the path.
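A sketch of the call using `nx.shortest_path`. The source and target nodes (33 and 47) are taken from the endpoints of the path in the output below; the seed is an assumption, so the intermediate nodes will differ.

```python
import networkx as nx

G = nx.random_internet_as_graph(50, seed=42)  # seed is an assumption

# With no edge weights given, the path minimizes the number of hops.
path = nx.shortest_path(G, source=33, target=47)
print(path)
```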

          [33, 7, 2, 8, 47]

          Small world networks

          Many real-world networks are "small world" networks, characterized by densely connected subnetworks and a low average shortest path distance. The omega function in NetworkX measures how closely a graph resembles a small world: values near 0 indicate small-world structure, values near -1 indicate a lattice-like graph, and values near 1 indicate a random graph. For more information, see the NetworkX omega documentation.
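A sketch of computing the coefficient with `nx.omega`. The `niter` and `nrand` values are illustrative assumptions chosen to keep the run short; the function builds randomized and latticized reference graphs, so the result depends on the seed and can be slow on larger graphs.

```python
import networkx as nx

G = nx.random_internet_as_graph(50, seed=42)  # seed is an assumption

# niter: edge rewirings per edge; nrand: number of reference graphs to average.
# Both values here are illustrative, not taken from the original notebook.
omega_value = nx.omega(G, niter=5, nrand=5, seed=42)
print(omega_value)
```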

          -0.002180774748923975

          Degree distribution

          The degree distribution of a network is the number of nodes of each degree. The degree distribution is an important characteristic of a network and indicates if the network
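As an illustration of counting nodes per degree, the sketch below uses `nx.degree_histogram`, which returns a list whose index is the degree and whose value is the number of nodes with that degree. The seed is an assumption, and the original cell may have plotted the distribution instead of printing it.

```python
import networkx as nx

G = nx.random_internet_as_graph(50, seed=42)  # seed is an assumption

# hist[d] is the number of nodes whose degree is exactly d.
hist = nx.degree_histogram(G)
for degree, count in enumerate(hist):
    if count:
        print(f"degree {degree}: {count} node(s)")
```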

          Part 2. Analyzing flight data

          The rest of this notebook analyzes flight data using a dataset that is built into Databricks.

          Load and view data

          Create the graph

          For flight data such as this dataset, the direction is important. A flight from San Francisco to Los Angeles is not equivalent to a flight from Los Angeles to San Francisco. To represent this dataset, you must create a directed graph, called a DiGraph in NetworkX.

          Use the from_pandas_edgelist method to create a NetworkX DiGraph. For details, see the NetworkX documentation.
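A minimal sketch of the call on a small, made-up DataFrame. The column names `origin` and `destination` and the rows are hypothetical stand-ins; substitute the column names of the actual flight data.

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-in for the flight data; column names are assumptions.
flights = pd.DataFrame(
    {
        "origin": ["SFO", "LAX", "SFO", "ATL", "ATL", "ORD", "ATL"],
        "destination": ["LAX", "SFO", "ATL", "SFO", "ORD", "ATL", "LAX"],
    }
)

# create_using=nx.DiGraph keeps edge direction: origin -> destination.
flight_graph = nx.from_pandas_edgelist(
    flights, source="origin", target="destination", create_using=nx.DiGraph
)
print(flight_graph.number_of_nodes(), "airports,",
      flight_graph.number_of_edges(), "routes")
```

Because the graph is directed, the ATL to LAX route in this toy data exists without a matching LAX to ATL edge.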

          Degree centrality

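          Degree centrality measures the fraction of other nodes that a node is connected to. In a directed graph, NetworkX counts incoming and outgoing edges together and divides by n - 1 (where n is the number of nodes), so values can exceed 1, as the value for ATL does below. The results from the following cell show that busy hub airports such as Atlanta (ATL), Chicago O'Hare (ORD), and Dallas-Fort Worth (DFW) have the highest degree centrality.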
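To illustrate how `nx.degree_centrality` behaves on a directed graph, here is a sketch on a small hypothetical route graph (the edges are made up, not taken from the dataset). Each value is (in-degree + out-degree) / (n - 1).

```python
import networkx as nx

# Hypothetical toy route graph; edges point from origin to destination.
toy = nx.DiGraph()
toy.add_edges_from([
    ("SFO", "LAX"), ("LAX", "SFO"),
    ("SFO", "ATL"), ("ATL", "SFO"),
    ("ATL", "ORD"), ("ORD", "ATL"),
    ("ATL", "LAX"),
])

centrality = nx.degree_centrality(toy)

# Sort airports from most to least central for display.
ranked = dict(sorted(centrality.items(), key=lambda kv: kv[1], reverse=True))
print(ranked)
```

In this toy graph ATL has 2 incoming and 3 outgoing edges among 4 nodes, so its centrality is 5/3, which is greater than 1.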

            {'ATL': 1.197411003236246, 'ORD': 0.9676375404530745, 'DFW': 0.854368932038835, 'MSP': 0.8381877022653722, 'DEN': 0.7669902912621359, 'CVG': 0.7508090614886732, 'SLC': 0.7443365695792881, 'IAH': 0.7443365695792881, 'DTW': 0.7411003236245955, 'LAS': 0.5922330097087379, 'EWR': 0.5889967637540453, 'LAX': 0.5792880258899676, 'PHX': 0.56957928802589, 'MCO': 0.5372168284789645, 'MEM': 0.49838187702265374, 'CLT': 0.46925566343042074, 'JFK': 0.46601941747572817, 'CLE': 0.4563106796116505, 'SFO': 0.44012944983818775, 'BWI': 0.4336569579288026, 'IAD': 0.4336569579288026, 'BOS': 0.41423948220064727, 'LGA': 0.4077669902912622, 'TPA': 0.4045307443365696, 'SEA': 0.3786407766990291, 'MDW': 0.3754045307443366, 'PHL': 0.37216828478964403, 'DCA': 0.3592233009708738, 'AUS': 0.34304207119741104, 'SAN': 0.33980582524271846, 'FLL': 0.33333333333333337, 'STL': 0.3300970873786408, 'MCI': 0.31715210355987056, 'MIA': 0.31067961165048547, 'BNA': 0.3009708737864078, 'RDU': 0.29449838187702265, 'PDX': 0.27184466019417475, 'MSY': 0.27184466019417475, 'SMF': 0.2621359223300971, 'PIT': 0.2621359223300971, 'SAT': 0.255663430420712, 'ABQ': 0.2524271844660194, 'MKE': 0.2524271844660194, 'IND': 0.2394822006472492, 'ONT': 0.23624595469255666, 'OAK': 0.23624595469255666, 'JAX': 0.23624595469255666, 'CMH': 0.22977346278317154, 'HOU': 0.21035598705501618, 'RSW': 0.21035598705501618, 'BDL': 0.20711974110032363, 'BHM': 0.19741100323624597, 'SJC': 0.19093851132686085, 'SDF': 0.1877022653721683, 'TUS': 0.18446601941747573, 'COS': 0.18446601941747573, 'ANC': 0.18446601941747573, 'OKC': 0.18122977346278318, 'PBI': 0.1779935275080906, 'SNA': 0.17475728155339806, 'TUL': 0.17475728155339806, 'HNL': 0.16181229773462785, 'SJU': 0.15533980582524273, 'MSN': 0.15210355987055016, 'BUF': 0.1488673139158576, 'OMA': 0.14563106796116507, 'RNO': 0.14563106796116507, 'BOI': 0.1423948220064725, 'ORF': 0.1423948220064725, 'PVD': 0.13592233009708737, 'ELP': 0.13592233009708737, 'GSO': 0.13592233009708737, 'LIT': 
0.13268608414239483, 'RIC': 0.12944983818770228, 'OGG': 0.1262135922330097, 'GRR': 0.12297734627831716, 'MHT': 0.1197411003236246, 'ROC': 0.1197411003236246, 'GEG': 0.11326860841423948, 'ALB': 0.11003236245954694, 'PSP': 0.11003236245954694, 'TYS': 0.11003236245954694, 'SAV': 0.10679611650485438, 'XNA': 0.10679611650485438, 'GJT': 0.10679611650485438, 'DAL': 0.10355987055016182, 'CHS': 0.10355987055016182, 'DAY': 0.10032362459546926, 'SYR': 0.10032362459546926, 'DSM': 0.10032362459546926, 'FAT': 0.10032362459546926, 'LGB': 0.10032362459546926, 'PWM': 0.0970873786407767, 'SRQ': 0.09385113268608415, 'GSP': 0.09061488673139159, 'BUR': 0.08737864077669903, 'JAN': 0.08737864077669903, 'EGE': 0.08414239482200647, 'SBA': 0.08414239482200647, 'SGF': 0.08414239482200647, 'PIH': 0.08414239482200647, 'HPN': 0.08090614886731393, 'CAK': 0.08090614886731393, 'LEX': 0.07766990291262137, 'HSV': 0.07766990291262137, 'TWF': 0.07766990291262137, 'CAE': 0.0744336569579288, 'MYR': 0.0744336569579288, 'MLI': 0.0744336569579288, 'MDT': 0.07119741100323625, 'KOA': 0.07119741100323625, 'MRY': 0.07119741100323625, 'ICT': 0.06796116504854369, 'MTJ': 0.06796116504854369, 'CID': 0.06796116504854369, 'HDN': 0.06796116504854369, 'STT': 0.06796116504854369, 'BFL': 0.06472491909385114, 'BTV': 0.06472491909385114, 'FNT': 0.06472491909385114, 'FSD': 0.06148867313915858, 'ISP': 0.05825242718446602, 'BZN': 0.05825242718446602, 'PIA': 0.05825242718446602, 'PNS': 0.05501618122977347, 'BTR': 0.05501618122977347, 'ABE': 0.05501618122977347, 'JAC': 0.05501618122977347, 'IDA': 0.05501618122977347, 'MAF': 0.05177993527508091, 'GPT': 0.05177993527508091, 'EUG': 0.05177993527508091, 'GRB': 0.05177993527508091, 'MFR': 0.05177993527508091, 'SWF': 0.05177993527508091, 'LBB': 0.04854368932038835, 'SHV': 0.04854368932038835, 'BGR': 0.04854368932038835, 'DAB': 0.045307443365695796, 'VPS': 0.045307443365695796, 'LIH': 0.045307443365695796, 'AVP': 0.045307443365695796, 'FCA': 0.045307443365695796, 'LNK': 
0.045307443365695796, 'TVC': 0.045307443365695796, 'JNU': 0.045307443365695796, 'AMA': 0.042071197411003236, 'CHA': 0.042071197411003236, 'FWA': 0.042071197411003236, 'SBN': 0.042071197411003236, 'ATW': 0.042071197411003236, 'MSO': 0.042071197411003236, 'RFD': 0.042071197411003236, 'AVL': 0.03883495145631068, 'TLH': 0.03883495145631068, 'CRW': 0.03883495145631068, 'ASE': 0.03883495145631068, 'RAP': 0.03883495145631068, 'ROA': 0.03883495145631068, 'LAN': 0.03883495145631068, 'SBP': 0.03883495145631068, 'BIL': 0.03883495145631068, 'YUM': 0.03559870550161812, 'AZO': 0.03559870550161812, 'FAR': 0.03559870550161812, 'RDM': 0.03559870550161812, 'FAI': 0.03559870550161812, 'CRP': 0.03236245954692557, 'MOB': 0.03236245954692557, 'PHF': 0.03236245954692557, 'PSC': 0.03236245954692557, 'EVV': 0.03236245954692557, 'MGM': 0.02912621359223301, 'MFE': 0.02912621359223301, 'DRO': 0.02912621359223301, 'ACY': 0.02912621359223301, 'MLB': 0.02912621359223301, 'CPR': 0.02912621359223301, 'GTF': 0.02912621359223301, 'TRI': 0.02912621359223301, 'PSE': 0.02912621359223301, 'MQT': 0.02912621359223301, 'HRL': 0.025889967637540454, 'LFT': 0.025889967637540454, 'BMI': 0.025889967637540454, 'HLN': 0.025889967637540454, 'KTN': 0.025889967637540454, 'ILM': 0.022653721682847898, 'PFN': 0.022653721682847898, 'SGU': 0.022653721682847898, 'CEC': 0.022653721682847898, 'MOD': 0.022653721682847898, 'AGS': 0.022653721682847898, 'BGM': 0.022653721682847898, 'BQN': 0.022653721682847898, 'AEX': 0.01941747572815534, 'MLU': 0.01941747572815534, 'GRK': 0.01941747572815534, 'ERI': 0.01941747572815534, 'CWA': 0.01941747572815534, 'MBS': 0.01941747572815534, 'TTN': 0.01941747572815534, 'SCE': 0.01941747572815534, 'ACV': 0.01941747572815534, 'RDD': 0.01941747572815534, 'SPI': 0.01941747572815534, 'STX': 0.01941747572815534, 'MCN': 0.01941747572815534, 'FSM': 0.01941747572815534, 'RST': 0.01941747572815534, 'SIT': 0.01941747572815534, 'CLD': 0.016181229773462785, 'ITO': 0.016181229773462785, 'LWS': 
0.016181229773462785, 'SMX': 0.016181229773462785, 'BIS': 0.016181229773462785, 'OGD': 0.016181229773462785, 'SCC': 0.016181229773462785, 'BRW': 0.016181229773462785, 'AKN': 0.016181229773462785, 'CYS': 0.016181229773462785, 'LRD': 0.012944983818770227, 'CLL': 0.012944983818770227, 'COD': 0.012944983818770227, 'GUC': 0.012944983818770227, 'CHO': 0.012944983818770227, 'HTS': 0.012944983818770227, 'SUN': 0.012944983818770227, 'EKO': 0.012944983818770227, 'BLI': 0.012944983818770227, 'IPL': 0.012944983818770227, 'CDC': 0.012944983818770227, 'PVU': 0.012944983818770227, 'TOL': 0.012944983818770227, 'CMI': 0.012944983818770227, 'LSE': 0.012944983818770227, 'DLH': 0.012944983818770227, 'PLN': 0.012944983818770227, 'OTZ': 0.012944983818770227, 'OME': 0.012944983818770227, 'WRG': 0.012944983818770227, 'PSG': 0.012944983818770227, 'YAK': 0.012944983818770227, 'CDV': 0.012944983818770227, 'ADK': 0.012944983818770227, 'PMD': 0.012944983818770227, 'ACK': 0.012944983818770227, 'PIR': 0.012944983818770227, 'CIC': 0.00970873786407767, 'IYK': 0.00970873786407767, 'OXR': 0.00970873786407767, 'ABI': 0.00970873786407767, 'SUX': 0.00970873786407767, 'DLG': 0.00970873786407767, 'EAU': 0.00970873786407767, 'HHH': 0.00970873786407767, 'SLE': 0.00970873786407767, 'ROW': 0.00970873786407767, 'LCH': 0.006472491909385114, 'BPT': 0.006472491909385114, 'BRO': 0.006472491909385114, 'FLG': 0.006472491909385114, 'TEX': 0.006472491909385114, 'BTM': 0.006472491909385114, 'EYW': 0.006472491909385114, 'GTR': 0.006472491909385114, 'BQK': 0.006472491909385114, 'CSG': 0.006472491909385114, 'DHN': 0.006472491909385114, 'GNV': 0.006472491909385114, 'LYH': 0.006472491909385114, 'ABY': 0.006472491909385114, 'MEI': 0.006472491909385114, 'FAY': 0.006472491909385114, 'APF': 0.006472491909385114, 'ILG': 0.006472491909385114, 'VLD': 0.006472491909385114, 'ISO': 0.006472491909385114, 'OAJ': 0.006472491909385114, 'TUP': 0.006472491909385114, 'FLO': 0.006472491909385114, 'TYR': 0.006472491909385114, 'SPS': 
0.006472491909385114, 'TXK': 0.006472491909385114, 'LAW': 0.006472491909385114, 'ACT': 0.006472491909385114, 'GGG': 0.006472491909385114, 'SJT': 0.006472491909385114, 'DBQ': 0.006472491909385114, 'GFK': 0.006472491909385114, 'MOT': 0.006472491909385114, 'ELM': 0.006472491909385114, 'CMX': 0.006472491909385114, 'RHI': 0.006472491909385114, 'ALO': 0.006472491909385114, 'BET': 0.006472491909385114, 'ADQ': 0.006472491909385114, 'MTH': 0.006472491909385114, 'SOP': 0.006472491909385114, 'LWB': 0.006472491909385114, 'GLH': 0.006472491909385114, 'MKC': 0.006472491909385114, 'PUB': 0.006472491909385114, 'EWN': 0.006472491909385114, 'WYS': 0.006472491909385114, 'YKM': 0.006472491909385114, 'INL': 0.006472491909385114, 'BJI': 0.006472491909385114, 'GST': 0.006472491909385114, 'BFF': 0.003236245954692557}

            In-degree and out-degree

            Directed graphs use in-degree (the number of edges pointing to a node) and out-degree (the number of edges pointing away from a node). NetworkX provides functions to directly view the in-degree and out-degree of each node.
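A sketch of both views on a small hypothetical route graph (made-up edges). `in_degree` and `out_degree` are available on any NetworkX `DiGraph`; the sorting step is an illustrative choice to mirror the descending order of the output below.

```python
import networkx as nx

# Hypothetical toy route graph; edges point from origin to destination.
toy = nx.DiGraph()
toy.add_edges_from([
    ("SFO", "LAX"), ("LAX", "SFO"),
    ("SFO", "ATL"), ("ATL", "SFO"),
    ("ATL", "ORD"), ("ORD", "ATL"),
    ("ATL", "LAX"),
])

# in_degree / out_degree yield (node, count) pairs; sort by count, descending.
in_deg = dict(sorted(toy.in_degree(), key=lambda kv: kv[1], reverse=True))
out_deg = dict(sorted(toy.out_degree(), key=lambda kv: kv[1], reverse=True))

print("in-degree: ", in_deg)
print("out-degree:", out_deg)
```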

            {'ATL': 186, 'ORD': 153, 'DFW': 132, 'MSP': 130, 'SLC': 122, 'DEN': 122, 'IAH': 117, 'DTW': 115, 'CVG': 115, 'LAS': 92, 'EWR': 92, 'LAX': 91, 'PHX': 88, 'MCO': 82, 'MEM': 79, 'CLT': 73, 'CLE': 71, 'JFK': 70, 'SFO': 70, 'IAD': 69, 'BWI': 68, 'TPA': 62, 'BOS': 62, 'SEA': 59, 'MDW': 58, 'PHL': 58, 'LGA': 58, 'AUS': 57, 'SAN': 53, 'DCA': 53, 'FLL': 52, 'STL': 51, 'MCI': 48, 'MIA': 47, 'BNA': 46, 'PIT': 45, 'RDU': 45, 'MKE': 44, 'PDX': 43, 'SMF': 41, 'MSY': 41, 'SAT': 41, 'ABQ': 39, 'OAK': 38, 'ONT': 37, 'IND': 36, 'JAX': 35, 'CMH': 34, 'BDL': 33, 'HOU': 32, 'RSW': 32, 'SDF': 30, 'SJC': 29, 'BHM': 29, 'TUS': 29, 'ANC': 29, 'OKC': 28, 'TUL': 28, 'SNA': 27, 'PBI': 27, 'HNL': 25, 'OMA': 24, 'SJU': 24, 'BUF': 23, 'BOI': 23, 'GSO': 23, 'COS': 23, 'ELP': 22, 'RNO': 22, 'ORF': 22, 'MSN': 22, 'RIC': 22, 'LIT': 21, 'PVD': 21, 'MHT': 19, 'GEG': 19, 'ROC': 19, 'OGG': 19, 'XNA': 18, 'PSP': 18, 'ALB': 17, 'SAV': 17, 'CHS': 17, 'TYS': 17, 'LGB': 17, 'DAL': 16, 'GRR': 16, 'SYR': 16, 'DSM': 16, 'BUR': 15, 'SRQ': 15, 'DAY': 15, 'FAT': 15, 'SBA': 15, 'GSP': 14, 'PWM': 14, 'SGF': 14, 'JAN': 13, 'MDT': 13, 'LEX': 13, 'CAK': 13, 'EGE': 13, 'HPN': 12, 'HSV': 12, 'FSD': 12, 'CAE': 11, 'MYR': 11, 'HDN': 11, 'KOA': 11, 'MRY': 11, 'MLI': 11, 'BZN': 11, 'ICT': 10, 'BTV': 10, 'MFR': 10, 'FNT': 10, 'ISP': 9, 'BFL': 9, 'PNS': 9, 'ABE': 9, 'MTJ': 9, 'CID': 9, 'JAC': 9, 'AVP': 9, 'MSO': 9, 'STT': 9, 'MAF': 8, 'BTR': 8, 'GPT': 8, 'CRW': 8, 'RAP': 8, 'EUG': 8, 'FCA': 8, 'SWF': 8, 'LBB': 7, 'DAB': 7, 'SHV': 7, 'VPS': 7, 'CHA': 7, 'BGR': 7, 'LIH': 7, 'SBP': 7, 'ACY': 7, 'PSC': 7, 'IDA': 7, 'BIL': 7, 'FAR': 7, 'LNK': 7, 'TVC': 7, 'JNU': 7, 'AMA': 6, 'AVL': 6, 'TLH': 6, 'ASE': 6, 'GRB': 6, 'LAN': 6, 'YUM': 6, 'ATW': 6, 'AZO': 6, 'RDM': 6, 'PIA': 6, 'FAI': 6, 'CRP': 5, 'MOB': 5, 'FWA': 5, 'ROA': 5, 'SBN': 5, 'PHF': 5, 'MLB': 5, 'SGU': 5, 'GTF': 5, 'EVV': 5, 'TRI': 5, 'HRL': 4, 'LFT': 4, 'MGM': 4, 'MFE': 4, 'BMI': 4, 'ILM': 4, 'DRO': 4, 'GJT': 4, 'TTN': 4, 'PFN': 4, 'HLN': 4, 'LWS': 4, 'CEC': 4, 
'MOD': 4, 'PSE': 4, 'MQT': 4, 'KTN': 4, 'AEX': 3, 'MLU': 3, 'GRK': 3, 'ERI': 3, 'CLD': 3, 'CWA': 3, 'ITO': 3, 'MBS': 3, 'SCE': 3, 'SUN': 3, 'EKO': 3, 'CPR': 3, 'BLI': 3, 'ACV': 3, 'RDD': 3, 'BIS': 3, 'RFD': 3, 'STX': 3, 'AGS': 3, 'BGM': 3, 'BQN': 3, 'FSM': 3, 'RST': 3, 'SIT': 3, 'BRW': 3, 'AKN': 3, 'LRD': 2, 'CLL': 2, 'COD': 2, 'GUC': 2, 'CHO': 2, 'HTS': 2, 'TWF': 2, 'CIC': 2, 'IPL': 2, 'IYK': 2, 'OXR': 2, 'SPI': 2, 'TOL': 2, 'CMI': 2, 'LSE': 2, 'DLH': 2, 'PLN': 2, 'OTZ': 2, 'OME': 2, 'SCC': 2, 'WRG': 2, 'PSG': 2, 'YAK': 2, 'CDV': 2, 'ADK': 2, 'PMD': 2, 'ACK': 2, 'SLE': 2, 'PIR': 2, 'LCH': 1, 'BPT': 1, 'BRO': 1, 'FLG': 1, 'TEX': 1, 'PIH': 1, 'BTM': 1, 'SMX': 1, 'EYW': 1, 'GTR': 1, 'BQK': 1, 'CSG': 1, 'DHN': 1, 'GNV': 1, 'MCN': 1, 'LYH': 1, 'ABY': 1, 'MEI': 1, 'FAY': 1, 'APF': 1, 'ILG': 1, 'VLD': 1, 'ISO': 1, 'OAJ': 1, 'TUP': 1, 'FLO': 1, 'TYR': 1, 'ABI': 1, 'SPS': 1, 'TXK': 1, 'LAW': 1, 'ACT': 1, 'GGG': 1, 'SJT': 1, 'DBQ': 1, 'GFK': 1, 'MOT': 1, 'SUX': 1, 'ELM': 1, 'CMX': 1, 'RHI': 1, 'ALO': 1, 'BET': 1, 'ADQ': 1, 'DLG': 1, 'MTH': 1, 'EAU': 1, 'SOP': 1, 'HHH': 1, 'LWB': 1, 'GLH': 1, 'MKC': 1, 'EWN': 1, 'WYS': 1, 'YKM': 1, 'INL': 1, 'BJI': 1, 'GST': 1, 'ROW': 1, 'CDC': 0, 'OGD': 0, 'PVU': 0, 'CYS': 0, 'BFF': 0, 'PUB': 0}

            Some data pre-processing

            In this section, you process the data to group flights by route and select a subset to create a smaller table for example purposes.
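The original cells for this step are not preserved here and appear to have run as SQL against a Spark table. As a stand-in, the sketch below shows the same idea (group flights by route, then keep a subset) in pandas. The DataFrame, its column names, and the cutoff of three routes are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in data: one row per flight.
flights = pd.DataFrame(
    {
        "origin": ["SFO", "SFO", "SFO", "LAX", "ATL", "ATL"],
        "destination": ["LAX", "LAX", "ATL", "SFO", "ORD", "ORD"],
    }
)

# Group by route (origin, destination) and count the flights on each route.
routes = (
    flights.groupby(["origin", "destination"])
    .size()
    .reset_index(name="num_flights")
    .sort_values("num_flights", ascending=False)
)

# Keep only the busiest routes to get a smaller table (cutoff is arbitrary).
top_routes = routes.head(3)
print(top_routes)
```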

            As noted in the results of the previous cell, the cell results have been stored in the PySpark DataFrame _sqldf. The next cell uses that DataFrame.
